Tokenizer
This class provides methods for tokenization specific to the Russian locale. In particular, it has a stemmer and a stop word remover (used mainly for word cloud creation). The stemmer is a modification (with bug fixes) of Dennis Kreminsky's stemmer from http://snowball.tartarus.org/otherlangs/russian_php5.txt. Given a word, its stem is the part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, and tallest. A stemmer takes a word and tries to produce its stem.
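As a rough illustration of what "common to all its inflected variants" means, the sketch below takes the longest common prefix of a list of ASCII words. The function name `commonPrefixSketch` is hypothetical and is not part of this class; the actual stemmer strips endings by rule and never sees the other variants.

```php
<?php
// Hypothetical helper (not part of Tokenizer): longest common prefix of a
// list of ASCII words, used here only to illustrate the idea of a stem.
function commonPrefixSketch(array $words): string
{
    if ($words === []) {
        return "";
    }
    $prefix = array_shift($words);
    foreach ($words as $word) {
        // Shorten the prefix until the current word starts with it
        while ($prefix !== "" &&
            strncmp($word, $prefix, strlen($prefix)) !== 0) {
            $prefix = substr($prefix, 0, -1);
        }
    }
    return $prefix;
}

echo commonPrefixSketch(["tall", "taller", "tallest"]), "\n"; // prints "tall"
```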
Table of Contents
- CHAR_LENGTH = 2
- Number of bytes in a Russian Unicode character (Cyrillic letters take 2 bytes in UTF-8).
- $no_stem_list : array<string|int, mixed>
- Words we don't want to be stemmed
- $stop_words : mixed
- A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries
- segment() : string
- Stub function which could be used for a word segmenter.
- stem() : string
- Computes the stem of a Russian word
- stopwordsRemover() : mixed
- Removes the stop words from the page (used for Word Cloud generation)
- rv() : array<string|int, mixed>
- Compute the RV region of a word. RV is the region after the first vowel, or the end of the word if it contains no vowel.
- step1() : string
- Search for a PERFECTIVE GERUND ending. If one is found, remove it, and that is the end of step 1. Otherwise try to remove a REFLEXIVE ending, and then search in turn for (1) an ADJECTIVAL, (2) a VERB or (3) a NOUN ending.
- step2() : string
- If the word ends with и (i), remove it.
- step3() : string
- Search for a DERIVATIONAL ending in R2 (i.e. the entire ending must lie in R2), and if one is found, remove it.
- step4() : string
- (1) Undouble н (n), or (2) if the word ends with a SUPERLATIVE ending, remove it and undouble н (n), or (3) if the word ends with ь (') (soft sign), remove it.
Constants
CHAR_LENGTH
Number of bytes in a Russian Unicode character (Cyrillic letters take 2 bytes in UTF-8).
public
mixed
CHAR_LENGTH
= 2
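The following sketch shows why a byte count of 2 matters when handling Russian strings with PHP's byte-oriented string functions. It assumes the text is UTF-8 encoded; the local variable `$char_length` simply mirrors the constant's value and is not how the class itself necessarily uses it.

```php
<?php
// Assumes this file and its string literals are UTF-8 encoded.
$word = "книги";
$char_length = 2; // same value as the CHAR_LENGTH constant

echo strlen("и"), "\n";                     // prints 2: bytes, not letters
echo strlen($word) / $char_length, "\n";    // prints 5: letters in the word
// Drop the last letter by removing its two bytes
echo substr($word, 0, -$char_length), "\n"; // prints "книг"
```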
Properties
$no_stem_list
Words we don't want to be stemmed
public
static array<string|int, mixed>
$no_stem_list
= []
$stop_words
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries
public
static mixed
$stop_words
= ["и", "в", "во", "не", "что", "он", "на", "я", "с", "со", "как", "а", "то", "все", "она", "так", "его", "но", "да", "ты", "к", "у", "же", "вы", "за", "бы", "по", "только", "ее", "мне", "было", "вот", "от", "меня", "еще", "нет", "о", "из", "ему", "теперь", "когда", "даже", "ну", "вдруг", "ли", "если", "уже", "или", "ни", "быть", "был", "него", "до", "вас", "нибудь", "опять", "уж", "вам", "сказал", "ведь", "там", "потом", "себя", "ничего", "ей", "может", "они", "тут", "где", "есть", "надо", "ней", "для", "мы", "тебя", "их", "чем", "была", "сам", "чтоб", "без", "будто", "человек", "чего", "раз", "тоже", "себе", "под", "жизнь", "будет", "ж", "тогда", "кто", "этот", "говорил", "того", "потому", "этого", "какой", "совсем", "ним", "здесь", "этом", "один", "почти", "мой", "тем", "чтобы", "нее", "кажется", "сейчас", "были", "куда", "зачем", "сказать", "всех", "никогда", "сегодня", "можно", "при", "наконец", "два", "об", "другой", "хоть", "после", "над", "больше", "тот", "через", "эти", "нас", "про", "всего", "них", "какая", "много", "разве", "сказала", "три", "эту", "моя", "впрочем", "хорошо", "свою", "этой", "перед", "иногда", "лучше", "чуть", "том", "нельзя", "такой", "им", "более", "всегда", "конечно", "всю", "между", 'http', 'https']
Methods
segment()
Stub function which could be used for a word segmenter.
public
static segment(string $pre_segment) : string
Given the input "thisisabunchofwords", such a segmenter would output "this is a bunch of words".
Parameters
- $pre_segment : string
- the string before segmentation
Return values
string —the input with its words separated by spaces; this stub currently performs no segmentation
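Since segment() is only a stub here, the sketch below shows one way a segmenter of the kind described could work: greedy longest-match against a caller-supplied dictionary. The function `segmentSketch` and its `$dictionary` parameter are hypothetical, the code handles ASCII only, and greedy matching can split some inputs incorrectly.

```php
<?php
// Hypothetical greedy longest-match segmenter (not part of Tokenizer).
// ASCII only: positions and lengths are counted in bytes.
function segmentSketch(string $pre_segment, array $dictionary): string
{
    $known = array_flip($dictionary);
    $max_len = 0;
    foreach ($dictionary as $entry) {
        $max_len = max($max_len, strlen($entry));
    }
    $out = [];
    $pos = 0;
    $len = strlen($pre_segment);
    while ($pos < $len) {
        $match = $pre_segment[$pos]; // fall back to a single character
        // Try the longest candidate first, then shorter ones
        for ($size = min($max_len, $len - $pos); $size > 1; $size--) {
            $candidate = substr($pre_segment, $pos, $size);
            if (isset($known[$candidate])) {
                $match = $candidate;
                break;
            }
        }
        $out[] = $match;
        $pos += strlen($match);
    }
    return implode(" ", $out);
}

$dictionary = ["this", "is", "a", "bunch", "of", "words"];
echo segmentSketch("thisisabunchofwords", $dictionary), "\n";
// prints "this is a bunch of words"
```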
stem()
Computes the stem of a Russian word
public
static stem(string $word) : string
Parameters
- $word : string
- the string to stem
Return values
string —the stem of $word
stopwordsRemover()
Removes the stop words from the page (used for Word Cloud generation)
public
static stopwordsRemover(mixed $data) : mixed
Parameters
- $data : mixed
- either a string or an array of strings to remove stop words from
Return values
mixed —$data with no stop words
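A minimal sketch of stop word removal that accepts either a string or an array of strings, as the signature above suggests. The function `removeStopWordsSketch` is hypothetical and takes the stop list as a parameter instead of reading `$stop_words`; how the real method treats punctuation and letter case is not stated here, so this version simply drops punctuation and compares case-sensitively.

```php
<?php
// Hypothetical stand-in for stopwordsRemover(); the stop list is passed in.
function removeStopWordsSketch($data, array $stop_words)
{
    if (is_array($data)) {
        $cleaned = [];
        foreach ($data as $key => $text) {
            $cleaned[$key] = removeStopWordsSketch($text, $stop_words);
        }
        return $cleaned;
    }
    $stop = array_flip($stop_words);
    // Split on anything that is not a letter or digit (Unicode aware)
    $tokens = preg_split('/[^\p{L}\p{N}]+/u', $data, -1,
        PREG_SPLIT_NO_EMPTY);
    $kept = [];
    foreach ($tokens as $token) {
        if (!isset($stop[$token])) {
            $kept[] = $token;
        }
    }
    return implode(" ", $kept);
}

echo removeStopWordsSketch("он читал книгу и газету", ["он", "и"]), "\n";
// prints "читал книгу газету"
```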
rv()
Compute the RV region of a word. RV is the region after the first vowel, or the end of the word if it contains no vowel.
private
static rv(string $word) : array<string|int, mixed>
Parameters
- $word : string
- word to compute rv regions for
Return values
array<string|int, mixed> —a pair: the part of the word before the RV region, followed by the RV region
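The RV split described above can be sketched with a single regular expression. This is an illustration under assumptions, not the class's code: `rvSketch` is a hypothetical name, the input is assumed to be UTF-8, the vowel set is the one used by the Snowball Russian stemmer (а, е, и, о, у, ы, э, ю, я), and the layout of the returned pair is a guess.

```php
<?php
// Hypothetical RV split: returns [part up to and including the first vowel,
// RV region after it]. With no vowel, RV is empty (the end of the word).
function rvSketch(string $word): array
{
    if (preg_match('/^(.*?[аеиоуыэюя])(.*)$/u', $word, $matches)) {
        return [$matches[1], $matches[2]];
    }
    return [$word, ""];
}

list($start, $rv) = rvSketch("книга");
echo $start, " | ", $rv, "\n"; // prints "кни | га"
```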
step1()
Search for a PERFECTIVE GERUND ending. If one is found, remove it, and that is the end of step 1. Otherwise try to remove a REFLEXIVE ending, and then search in turn for (1) an ADJECTIVAL, (2) a VERB or (3) a NOUN ending.
private
static step1(string $word) : string
As soon as one of the endings (1) to (3) is found remove it, and terminate step 1.
Parameters
- $word : string
- word to stem
Return values
string —$word after step
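The control flow of step 1 can be sketched as below. This is a simplified illustration, not the class's implementation: `step1Sketch` is a hypothetical function that works on the RV part of a word, the VERB and NOUN ending lists are abbreviated samples, and participle handling inside the ADJECTIVAL group is left out.

```php
<?php
// Hypothetical sketch of step 1 applied to the RV part of a word.
function step1Sketch(string $rv): string
{
    // PERFECTIVE GERUND: в/вши/вшись after а or я, or the ив/ыв forms
    $gerund = '/(?:(?<=[ая])(?:вшись|вши|в)'
        . '|(?:ившись|ывшись|ивши|ывши|ив|ыв))$/u';
    $result = preg_replace($gerund, '', $rv, 1, $count);
    if ($count > 0) {
        return $result; // found: step 1 ends here
    }
    // REFLEXIVE ending, removed if present
    $rv = preg_replace('/(?:ся|сь)$/u', '', $rv);
    $groups = [
        // (1) ADJECTIVAL (adjective endings only)
        '/(?:ими|ыми|его|ого|ему|ому|ее|ие|ые|ое|ей|ий|ый|ой|ем|им|ым'
            . '|ом|их|ых|ую|юю|ая|яя|ою|ею)$/u',
        // (2) VERB (abbreviated sample)
        '/(?:(?<=[ая])(?:ла|ли|ть|л)'
            . '|(?:ила|ыла|или|ыли|ить|ыть|ил|ыл|ит|ыт))$/u',
        // (3) NOUN (abbreviated sample)
        '/(?:ами|ями|ов|ев|ам|ям|ах|ях|а|я|о|у|ы|е|и|ь|ю|й)$/u',
    ];
    foreach ($groups as $pattern) {
        $result = preg_replace($pattern, '', $rv, 1, $count);
        if ($count > 0) {
            return $result; // the first group that matches ends step 1
        }
    }
    return $rv;
}

echo step1Sketch("читавши"), "\n"; // prints "чита" (gerund ending removed)
echo step1Sketch("сивая"), "\n";   // prints "сив" (adjectival ending removed)
```

Note that the а or я preceding a gerund or verb ending is only checked, not removed, which is why a lookbehind is used in this sketch.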
step2()
If the word ends with и (i), remove it.
private
static step2(string $word) : string
Parameters
- $word : string
- word to stem
Return values
string —$word after step
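Step 2 amounts to a single anchored replacement. The sketch below assumes UTF-8 input; `step2Sketch` is a hypothetical name, not the private method itself.

```php
<?php
// Hypothetical sketch of step 2: drop one trailing и, if present.
function step2Sketch(string $rv): string
{
    return preg_replace('/и$/u', '', $rv);
}

echo step2Sketch("ниги"), "\n"; // prints "ниг"
echo step2Sketch("дом"), "\n";  // prints "дом" (unchanged)
```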
step3()
Search for a DERIVATIONAL ending in R2 (i.e. the entire ending must lie in R2), and if one is found, remove it.
private
static step3(string $word) : string
Parameters
- $word : string
- word to stem
Return values
string —$word after step
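To make the R2 condition concrete, the sketch below computes R2 using the usual Snowball definition (R1 is the region after the first non-vowel that follows a vowel, and R2 is the same rule applied inside R1) and removes the DERIVATIONAL endings ост or ость only when they lie in R2. Both function names are hypothetical, and the sketch takes the whole word as input, which may differ from how the private method is called.

```php
<?php
// Hypothetical R2 computation: apply the "vowel then non-vowel" rule twice.
function r2Sketch(string $word): string
{
    $vowel_then_other = '/[аеиоуыэюя][^аеиоуыэюя](.*)$/u';
    $region = $word;
    for ($i = 0; $i < 2; $i++) { // first pass gives R1, second gives R2
        if (!preg_match($vowel_then_other, $region, $matches)) {
            return "";
        }
        $region = $matches[1];
    }
    return $region;
}

// Hypothetical sketch of step 3: strip ост/ость if it lies entirely in R2.
function step3Sketch(string $word): string
{
    $derivational = '/(?:ость|ост)$/u';
    if (preg_match($derivational, r2Sketch($word))) {
        return preg_replace($derivational, '', $word);
    }
    return $word;
}

echo step3Sketch("активность"), "\n"; // prints "активн" (R2 is "ность")
echo step3Sketch("гость"), "\n";      // prints "гость" (R2 is empty)
```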
step4()
(1) Undouble н (n), or (2) if the word ends with a SUPERLATIVE ending, remove it and undouble н (n), or (3) if the word ends with ь (') (soft sign), remove it.
private
static step4(string $word) : string
Parameters
- $word : string
- word to stem
Return values
string —$word after step
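The three alternatives of step 4 can be sketched as an ordered sequence of checks. This is an illustration assuming UTF-8 input and the Snowball SUPERLATIVE endings ейш and ейше; `step4Sketch` is a hypothetical name, not the private method itself.

```php
<?php
// Hypothetical sketch of step 4 applied to the RV part of a word.
function step4Sketch(string $rv): string
{
    // (1) Undouble a trailing нн
    if (preg_match('/нн$/u', $rv)) {
        return preg_replace('/н$/u', '', $rv);
    }
    // (2) Remove a SUPERLATIVE ending, then undouble нн
    $stripped = preg_replace('/(?:ейше|ейш)$/u', '', $rv, 1, $count);
    if ($count > 0) {
        return preg_replace('/нн$/u', 'н', $stripped);
    }
    // (3) Remove a trailing soft sign
    return preg_replace('/ь$/u', '', $rv);
}

echo step4Sketch("длинн"), "\n";   // prints "длин"
echo step4Sketch("ценнейш"), "\n"; // prints "цен"
echo step4Sketch("конь"), "\n";    // prints "кон"
```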