Yioop_V9.5_Source_Code

Tokenizer
in package Application

This class provides methods for Russian-locale-specific tokenization. In particular, it has a stemmer and a stop word remover (used mainly in word cloud creation). The stemmer is a modification (with bug fixes) of Dennis Kreminsky's stemmer (http://snowball.tartarus.org/otherlangs/russian_php5.txt). The stem of a word is the part of the word that is common to all of its inflected variants. For example, tall is common to tall, taller, tallest. A stemmer takes a word and tries to produce its stem.
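To make the definition of a stem concrete, the following sketch (Python is used here for illustration only; the class itself is PHP) computes the part shared by a set of inflected variants. The helper name shared_stem is hypothetical. A real stemmer only ever sees one word and strips suffixes by rule, so this shows the target of stemming, not the method.

```python
import os.path

def shared_stem(variants):
    """Return the longest prefix shared by every inflected variant."""
    return os.path.commonprefix(variants)

print(shared_stem(["tall", "taller", "tallest"]))
print(shared_stem(["книга", "книги", "книгу"]))
```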

CHAR_LENGTH

Number of bytes in a Russian Unicode character (Russian letters occupy 2 bytes each in UTF-8).


    public
        mixed
    CHAR_LENGTH
    = 2
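The value 2 can be checked independently of the class. The sketch below (Python, for illustration only) confirms that each letter of the Russian alphabet is two bytes long in UTF-8 and shows how a fixed character length lets byte-oriented code pick out the n-th letter. The helper char_at is hypothetical and is not part of Tokenizer; how the class actually uses the constant is not stated in this page.

```python
# All 33 lowercase letters of the Russian alphabet.
RUSSIAN_LETTERS = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"

# Collect the distinct UTF-8 byte lengths of those letters.
byte_lengths = {len(ch.encode("utf-8")) for ch in RUSSIAN_LETTERS}
print(byte_lengths)

def char_at(word_bytes: bytes, index: int, char_length: int = 2) -> str:
    """Return the index-th letter of a UTF-8 encoded, all-Cyrillic word."""
    start = index * char_length
    return word_bytes[start:start + char_length].decode("utf-8")

print(char_at("слово".encode("utf-8"), 2))
```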

$no_stem_list

Words we don't want to be stemmed


    public
    static    array<string|int, mixed>
    $no_stem_list
     = []

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries


    public
    static    mixed
    $stop_words
     = ["й", "ч", "чп", "ое", "юфп", "по", "об", "с", "у", "уп", "лбл", "б", "фп", "чуе", "поб", "фбл", "езп", "оп", "дб", "фщ", "л", "х", "це", "чщ", "ъб", "вщ", "рп", "фпмшлп", "ее", "ное", "вщмп", "чпф", "пф", "неос", "еэе", "оеф ", "п", "йъ", "енх", "феретш", "лпздб", "дбце", "ох ", "чдтхз", "мй", "еумй", "хце", "ймй", "ой", "вщфш", "вщм", "оезп", "дп", "чбу", "ойвхдш", "прсфш", "хц", "чбн", "улбъбм", "чедш", "фбн", "рпфпн", "уевс", "ойюезп", "ек", "нпцеф", "пой", "фхф", "зде", "еуфш", "обдп", "оек", "дмс", "нщ", "февс", "йи", "юен", "вщмб", "убн", "юфпв", "веъ ", "вхдфп", "юемпчел", "юезп", "тбъ", "фпце", "уеве", "рпд", "цйъош", "вхдеф", "ц", "фпздб", "лфп", "ьфпф", "зпчптйм", "фпзп", "рпфпнх", "ьфпзп", "лблпк", "упчуен", "ойн", "ъдеуш", "ьфпн", "пдйо", "рпюфй", "нпк", "фен", "юфпвщ", "оее", "лбцефус", "уекюбу", "вщмй", "лхдб", "ъбюен", "улбъбфш", "чуеи", "ойлпздб", "уезпдос", "нпцоп", "ртй", "облпоег", "дчб", "пв", "дтхзпк", "ипфш", "рпуме", "обд", "впмшые", "фпф", "юетеъ", "ьфй", "обу", "ртп", "чуезп", "ойи", "лблбс", "нопзп", "тбъче", "улбъбмб", "фтй", "ьфх", "нпс", "чртпюен", "иптпып", "учпа", "ьфпк", "ретед", "йопздб", "мхюые", "юхфш", "фпн", "оемшъс", "фблпк", "йн", "впмее", "чуездб", "лпоеюоп", "чуа", "нецдх", 'http', 'https']

segment()

Stub function which could be used for a word segmenter.


    public
            static        segment(string $pre_segment) : string

Such a segmenter on input thisisabunchofwords would output this is a bunch of words

Parameters

$pre_segment : string: before segmentation

Return values

string —

Should return the string with words separated by spaces; in this case the function does nothing.
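Since segment() is only a stub, the following is a minimal sketch of what a dictionary-based segmenter could look like, written in Python for illustration. The function segment_sketch, its vocabulary parameter, and the VOCABULARY set are all assumptions; nothing here is taken from Yioop. It uses a standard word-break dynamic program and falls back to returning the input unchanged, like the stub, when no full segmentation exists.

```python
def segment_sketch(pre_segment: str, vocabulary: set) -> str:
    """Split an unspaced string into vocabulary words, or return it unchanged."""
    n = len(pre_segment)
    # best[i] is a list of words that exactly covers pre_segment[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = pre_segment[start:end]
            if best[start] is not None and piece in vocabulary:
                best[end] = best[start] + [piece]
                break
    if best[n] is None:
        return pre_segment  # no full segmentation found
    return " ".join(best[n])

# Hypothetical vocabulary for the example from the description.
VOCABULARY = {"this", "is", "a", "bunch", "of", "words"}
print(segment_sketch("thisisabunchofwords", VOCABULARY))
```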

stem()

Computes the stem of a Russian word


    public
            static        stem(string $word) : string

Parameters

$word : string: the string to stem

Return values

string —

the stem of $word

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation)


    public
            static        stopwordsRemover(mixed $data) : mixed

Parameters

$data : mixed: either a string or an array of strings to remove stop words from

Return values

mixed —

$data with no stop words
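A minimal Python sketch of this behavior follows, for illustration only. It assumes that stop words are matched as whole, whitespace-separated tokens and that an array input is processed string by string; the actual matching rules of stopwordsRemover() are not described here. STOP_WORDS is a small hypothetical list (only http and https are copied from $stop_words).

```python
# Hypothetical stop list; not the contents of Tokenizer::$stop_words.
STOP_WORDS = {"и", "в", "не", "http", "https"}

def stopwords_remover_sketch(data):
    """Remove stop words from a string, or from each string in a list."""
    if isinstance(data, list):
        return [stopwords_remover_sketch(item) for item in data]
    kept = [word for word in data.split() if word.lower() not in STOP_WORDS]
    return " ".join(kept)

print(stopwords_remover_sketch("кот и собака не в доме"))
print(stopwords_remover_sketch(["кот и пес", "https сайт"]))
```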

rv()

Compute the RV region of a word. RV is the region after the first vowel, or the end of the word if it contains no vowel.


    private
            static        rv(string $word) : array<string|int, mixed>

Parameters

$word : string: word to compute rv regions for

Return values

array<string|int, mixed> —

a pair of strings: the part of the word before the RV region and the RV region itself
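The RV definition above can be sketched in a few lines of Python (for illustration only; the class is PHP). The vowel set is the one used by the Snowball description of the Russian stemmer, and the name rv_sketch is hypothetical. The sketch assumes the returned pair is the text up to and including the first vowel, followed by the RV region.

```python
# Vowels as listed in the Snowball Russian stemmer description.
VOWELS = "аеиоуыэюя"

def rv_sketch(word):
    """Split word into (text up to and including the first vowel, RV region)."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[:i + 1], word[i + 1:]
    # No vowel: RV is the empty region at the end of the word.
    return word, ""

print(rv_sketch("книга"))
print(rv_sketch("вдруг"))
```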

step1()

Search for a PERFECTIVE GERUND ending. If one is found, remove it; that ends step 1. Otherwise, try to remove a REFLEXIVE ending, and then search in turn for (1) an ADJECTIVAL, (2) a VERB, or (3) a NOUN ending.


    private
            static        step1(string $word) : string

As soon as one of the endings (1) to (3) is found remove it, and terminate step 1.

Parameters

$word : string: word to stem

Return values

string —

$word after step 1
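The control flow of step 1 can be sketched as follows (Python, for illustration only). The ending lists are deliberately abbreviated subsets of the Snowball Russian lists, and the sketch omits the rule that some endings must be preceded by а or я, so it shows the order of the checks rather than a complete step 1. It also assumes the input is the RV region of the word; all names are hypothetical.

```python
# Abbreviated, illustrative ending lists (the full lists are much longer).
PERFECTIVE_GERUND = ("ившись", "ывшись", "ивши", "ывши", "ив", "ыв")
REFLEXIVE = ("ся", "сь")
ADJECTIVAL = ("ого", "ему", "ыми", "ая", "ое", "ые", "ый", "ий", "ой")
VERB = ("ила", "ить", "ует", "ит")
NOUN = ("ами", "ов", "ам", "а", "ы", "у", "е", "о")

def strip_longest(text, endings):
    """Remove the longest matching ending; return (text, whether one matched)."""
    for ending in sorted(endings, key=len, reverse=True):
        if text.endswith(ending):
            return text[:-len(ending)], True
    return text, False

def step1_sketch(rv):
    rv, found = strip_longest(rv, PERFECTIVE_GERUND)
    if found:
        return rv  # a perfective gerund ending ends step 1
    rv, _ = strip_longest(rv, REFLEXIVE)
    for endings in (ADJECTIVAL, VERB, NOUN):
        rv, found = strip_longest(rv, endings)
        if found:
            break  # stop at the first class of ending that matches
    return rv

print(step1_sketch("чится"))  # RV region of "учится"
```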

step2()

If the word ends with и (i), remove it.


    private
            static        step2(string $word) : string

Parameters

$word : string: word to stem

Return values

string —

$word after step 2
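Step 2 is a single conditional removal. A Python sketch for illustration, with the hypothetical name step2_sketch:

```python
def step2_sketch(word):
    """Drop a final и, if present; otherwise return the word unchanged."""
    return word[:-1] if word.endswith("и") else word

print(step2_sketch("книги"))
print(step2_sketch("книг"))
```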

step3()

Search for a DERIVATIONAL ending in R2 (i.e. the entire ending must lie in R2), and if one is found, remove it.


    private
            static        step3(string $word) : string

Parameters

$word : string: word to stem

Return values

string —

$word after step 3
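This page does not define R2 or list the DERIVATIONAL endings, so the Python sketch below borrows both from the Snowball description of the Russian stemmer: R1 is the region after the first non-vowel that follows a vowel, R2 is the same region computed inside R1, and the derivational endings are ост and ость. It assumes the whole word is passed in, which may differ from what step3() receives. All function names are hypothetical.

```python
VOWELS = "аеиоуыэюя"
DERIVATIONAL = ("ость", "ост")  # longest first

def region_start(text):
    """Offset just past the first non-vowel that follows a vowel."""
    for i in range(1, len(text)):
        if text[i - 1] in VOWELS and text[i] not in VOWELS:
            return i + 1
    return len(text)  # no such position: the region is empty

def r2_start(word):
    r1 = region_start(word)
    return r1 + region_start(word[r1:])

def step3_sketch(word):
    start = r2_start(word)
    for ending in DERIVATIONAL:
        # The entire ending must lie inside R2.
        if word.endswith(ending) and len(word) - len(ending) >= start:
            return word[:-len(ending)]
    return word

print(step3_sketch("возможность"))
print(step3_sketch("радость"))
```

In the second call the ending ость starts before R2, so the word is left alone.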

step4()

(1) Undouble н (n), or (2) if the word ends with a SUPERLATIVE ending, remove it and undouble н (n), or (3) if the word ends with ь (') (soft sign), remove it.


    private
            static        step4(string $word) : string

Parameters

$word : string: word to stem

Return values

string —

$word after step 4
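The three alternatives of step 4 can be sketched in Python as follows (for illustration only). The SUPERLATIVE endings ейш and ейше come from the Snowball description of the Russian stemmer, not from this page, and the function names are hypothetical.

```python
SUPERLATIVE = ("ейше", "ейш")  # longest first

def undouble_n(word):
    """Turn a final нн into a single н."""
    return word[:-1] if word.endswith("нн") else word

def step4_sketch(word):
    if word.endswith("нн"):
        return word[:-1]  # (1) undouble н
    for ending in SUPERLATIVE:
        if word.endswith(ending):
            # (2) remove the superlative ending, then undouble н
            return undouble_n(word[:-len(ending)])
    if word.endswith("ь"):
        return word[:-1]  # (3) remove the soft sign
    return word

print(step4_sketch("радость"))
print(step4_sketch("ценнейш"))
```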
