Yioop_V9.5_Source_Code_Documentation

Tokenizer
in package

This class provides a collection of methods for Russian-locale-specific tokenization. In particular, it has a stemmer and a stop word remover (used mainly in word cloud creation). The stemmer is a modification (with bug fixes) of Dennis Kreminsky's stemmer from: http://snowball.tartarus.org/otherlangs/russian_php5.txt Given a word, its stem is the part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, tallest. A stemmer takes a word and tries to produce its stem.

Tags
author

Chris Pollett

Table of Contents

CHAR_LENGTH  = 2
Number of bytes in a Russian Unicode character.
$no_stem_list  : array<string|int, mixed>
Words we don't want to be stemmed
$stop_words  : mixed
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries
segment()  : string
Stub function which could be used for a word segmenter.
stem()  : string
Computes the stem of a Russian word
stopwordsRemover()  : mixed
Removes the stop words from the page (used for Word Cloud generation)
rv()  : array<string|int, mixed>
Compute the RV region of a word. RV is the region after the first vowel, or the end of the word if it contains no vowel.
step1()  : string
Search for a PERFECTIVE GERUND ending. If one is found remove it, and that is then the end of step 1. Otherwise try and remove a REFLEXIVE ending, and then search in turn for (1) an ADJECTIVAL, (2) a VERB or (3) a NOUN ending.
step2()  : string
If the word ends with и (i), remove it.
step3()  : string
Search for a DERIVATIONAL ending in R2 (i.e. the entire ending must lie in R2), and if one is found, remove it.
step4()  : string
(1) Undouble н (n), or (2) if the word ends with a SUPERLATIVE ending, remove it and undouble н (n), or (3) if the word ends with ь (') (soft sign), remove it.

Constants

CHAR_LENGTH

Number of bytes in a Russian Unicode character.

public mixed CHAR_LENGTH = 2
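
As a rough illustration of why such a constant is useful, the sketch below assumes the word is a UTF-8 encoded PHP string, in which each Russian letter occupies two bytes. The name RU_CHAR_LENGTH is a hypothetical stand-in for CHAR_LENGTH and is not taken from the class.

```php
<?php
// Hypothetical stand-in for the class constant (assumes UTF-8 strings).
const RU_CHAR_LENGTH = 2;

$word = "книги";
// strlen() counts bytes, not letters: 5 letters occupy 10 bytes.
$num_bytes = strlen($word);
$num_letters = intdiv($num_bytes, RU_CHAR_LENGTH);
// Dropping the last letter therefore means dropping the last 2 bytes.
$without_last = substr($word, 0, -RU_CHAR_LENGTH);
```

Byte-oriented functions such as strlen() and substr() can be used on Russian words this way, provided every offset is a multiple of the character length.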

Properties

$no_stem_list

Words we don't want to be stemmed

public static array<string|int, mixed> $no_stem_list = []

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries

public static mixed $stop_words = ["й", "ч", "чп", "ое", "юфп", "по", "об", "с", "у", "уп", "лбл", "б", "фп", "чуе", "поб", "фбл", "езп", "оп", "дб", "фщ", "л", "х", "це", "чщ", "ъб", "вщ", "рп", "фпмшлп", "ее", "ное", "вщмп", "чпф", "пф", "неос", "еэе", "оеф ", "п", "йъ", "енх", "феретш", "лпздб", "дбце", "ох ", "чдтхз", "мй", "еумй", "хце", "ймй", "ой", "вщфш", "вщм", "оезп", "дп", "чбу", "ойвхдш", "прсфш", "хц", "чбн", "улбъбм", "чедш", "фбн", "рпфпн", "уевс", "ойюезп", "ек", "нпцеф", "пой", "фхф", "зде", "еуфш", "обдп", "оек", "дмс", "нщ", "февс", "йи", "юен", "вщмб", "убн", "юфпв", "веъ ", "вхдфп", "юемпчел", "юезп", "тбъ", "фпце", "уеве", "рпд", "цйъош", "вхдеф", "ц", "фпздб", "лфп", "ьфпф", "зпчптйм", "фпзп", "рпфпнх", "ьфпзп", "лблпк", "упчуен", "ойн", "ъдеуш", "ьфпн", "пдйо", "рпюфй", "нпк", "фен", "юфпвщ", "оее", "лбцефус", "уекюбу", "вщмй", "лхдб", "ъбюен", "улбъбфш", "чуеи", "ойлпздб", "уезпдос", "нпцоп", "ртй", "облпоег", "дчб", "пв", "дтхзпк", "ипфш", "рпуме", "обд", "впмшые", "фпф", "юетеъ", "ьфй", "обу", "ртп", "чуезп", "ойи", "лблбс", "нопзп", "тбъче", "улбъбмб", "фтй", "ьфх", "нпс", "чртпюен", "иптпып", "учпа", "ьфпк", "ретед", "йопздб", "мхюые", "юхфш", "фпн", "оемшъс", "фблпк", "йн", "впмее", "чуездб", "лпоеюоп", "чуа", "нецдх", 'http', 'https']
Tags
array

Methods

segment()

Stub function which could be used for a word segmenter.

public static segment(string $pre_segment) : string

Such a segmenter on input thisisabunchofwords would output this is a bunch of words

Parameters
$pre_segment : string

before segmentation

Return values
string

should return the string with its words separated by spaces; this stub does nothing

stem()

Computes the stem of a Russian word

public static stem(string $word) : string
Parameters
$word : string

the string to stem

Return values
string

the stem of $word

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation)

public static stopwordsRemover(mixed $data) : mixed
Parameters
$data : mixed

either a string or an array of strings to remove stop words from

Return values
mixed

$data with no stop words
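
The behavior described above can be sketched roughly as follows. This is not the class's implementation: the function name removeStopWordsSketch, its second parameter, and the short sample list are assumptions made for illustration, and the sketch assumes lowercase input whose words are separated by whitespace.

```php
<?php
/**
 * Hypothetical sketch: drops stop words from a string, or from each
 * string in an array of strings.
 */
function removeStopWordsSketch($data, array $stop_words)
{
    if (is_array($data)) {
        // Apply the same filter to every element of the array.
        return array_map(function ($text) use ($stop_words) {
            return removeStopWordsSketch($text, $stop_words);
        }, $data);
    }
    // Split on whitespace, keep only the words that are not stop words.
    $words = preg_split('/\s+/u', $data, -1, PREG_SPLIT_NO_EMPTY);
    $kept = array_filter($words, function ($word) use ($stop_words) {
        return !in_array($word, $stop_words, true);
    });
    return implode(" ", $kept);
}

// Tiny illustrative list; the class keeps its own list in $stop_words.
$sample_stop_words = ["и", "в", "не", "http", "https"];
$cleaned = removeStopWordsSketch("кот и собака в доме", $sample_stop_words);
$cleaned_list = removeStopWordsSketch(["он не спит", "https и http"],
    $sample_stop_words);
```

Accepting both a string and an array mirrors the mixed parameter and return types in the signature above.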

rv()

Compute the RV region of a word. RV is the region after the first vowel, or the end of the word if it contains no vowel.

private static rv(string $word) : array<string|int, mixed>
Parameters
$word : string

word to compute rv regions for

Return values
array<string|int, mixed>

a pair: the string before RV and the string after RV
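
A minimal sketch of such a split is given below. It assumes a lowercase UTF-8 word and the nine vowels listed in the Snowball description of the Russian stemmer; the function name rvSketch and the exact shape of the returned pair are assumptions, not the class's code.

```php
<?php
/**
 * Hypothetical sketch of an RV split.
 * Returns [part up to and including the first vowel, RV region].
 */
function rvSketch(string $word): array
{
    // Lazily match up to the first vowel; the remainder is RV.
    if (preg_match('/^(.*?[аеиоуыэюя])(.*)$/su', $word, $matches)) {
        return [$matches[1], $matches[2]];
    }
    // No vowel: RV is the empty region at the end of the word.
    return [$word, ""];
}

$with_vowel = rvSketch("бегать");
$no_vowel = rvSketch("встр");
```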

step1()

Search for a PERFECTIVE GERUND ending. If one is found remove it, and that is then the end of step 1. Otherwise try and remove a REFLEXIVE ending, and then search in turn for (1) an ADJECTIVAL, (2) a VERB or (3) a NOUN ending.

private static step1(string $word) : string

As soon as one of the endings (1) to (3) is found remove it, and terminate step 1.

Parameters
$word : string

word to stem

Return values
string

$word after step
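
The control flow of this step can be sketched as below. The ending lists are small illustrative subsets of the classes in the Snowball description of the Russian stemmer, the sketch works on whatever string it is given rather than restricting itself to RV, and it ignores the rule that some gerund endings must follow а or я. The names removeSuffixSketch and step1Sketch are hypothetical.

```php
<?php
/**
 * Hypothetical helper: removes the longest matching ending, or returns
 * null if no ending in the list matches.
 */
function removeSuffixSketch(string $word, array $endings): ?string
{
    // Try longer endings first so the longest match wins.
    usort($endings, function ($a, $b) {
        return strlen($b) <=> strlen($a);
    });
    foreach ($endings as $ending) {
        if (substr($word, -strlen($ending)) === $ending) {
            return substr($word, 0, -strlen($ending));
        }
    }
    return null;
}

function step1Sketch(string $word): string
{
    // Illustrative subsets only, not the full ending classes.
    $gerund = ["ив", "ивши", "ившись", "ыв", "ывши", "ывшись"];
    $reflexive = ["ся", "сь"];
    $adjectival = ["ая", "ый", "ого"];
    $verb = ["ить", "ишь", "ил"];
    $noun = ["ов", "ам", "а"];
    // A perfective gerund ending ends the step immediately.
    $result = removeSuffixSketch($word, $gerund);
    if ($result !== null) {
        return $result;
    }
    // Otherwise drop a reflexive ending, if any, and carry on.
    $result = removeSuffixSketch($word, $reflexive);
    if ($result !== null) {
        $word = $result;
    }
    // Stop at the first class (1) to (3) that yields a match.
    foreach ([$adjectival, $verb, $noun] as $endings) {
        $result = removeSuffixSketch($word, $endings);
        if ($result !== null) {
            return $result;
        }
    }
    return $word;
}
```

For instance, under these reduced lists step1Sketch("учился") first drops the reflexive ся and then the verb ending ил.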

step2()

If the word ends with и (i), remove it.

private static step2(string $word) : string
Parameters
$word : string

word to stem

Return values
string

$word after step
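
A minimal sketch of this step, assuming a UTF-8 string in which и occupies two bytes. The function name step2Sketch is hypothetical, and the sketch does not restrict the test to any region of the word.

```php
<?php
// Hypothetical sketch: strip a final и from the word.
function step2Sketch(string $word): string
{
    $ending = "и";
    // Compare the last strlen($ending) bytes with the ending.
    if (substr($word, -strlen($ending)) === $ending) {
        return substr($word, 0, -strlen($ending));
    }
    return $word;
}
```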

step3()

Search for a DERIVATIONAL ending in R2 (i.e. the entire ending must lie in R2), and if one is found, remove it.

private static step3(string $word) : string
Parameters
$word : string

word to stem

Return values
string

$word after step
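
The R2 test can be sketched as follows. The sketch assumes the Snowball definitions (R1 is the region after the first non-vowel that follows a vowel, and R2 is the same region computed inside R1) and the DERIVATIONAL endings ост and ость from the Snowball description of the Russian stemmer. Both function names are hypothetical.

```php
<?php
// Hypothetical helper: the region after the first vowel + non-vowel pair.
function regionAfterVowelNonVowel(string $text): string
{
    $pattern = '/[аеиоуыэюя][^аеиоуыэюя]/u';
    if (preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE)) {
        // Offsets are in bytes, so add the byte length of the match.
        return substr($text, $m[0][1] + strlen($m[0][0]));
    }
    return "";
}

function step3Sketch(string $word): string
{
    // R2 is R1 of R1; it is always a suffix of the word.
    $r2 = regionAfterVowelNonVowel(regionAfterVowelNonVowel($word));
    foreach (["ость", "ост"] as $ending) {
        $len = strlen($ending);
        $word_ends = substr($word, -$len) === $ending;
        // The whole ending must fit inside R2.
        if ($word_ends && strlen($r2) >= $len) {
            return substr($word, 0, -$len);
        }
    }
    return $word;
}
```

In this sketch внимательность loses its ending because R2 is ельность, while скорость is left alone because its R2 is only ть.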

step4()

(1) Undouble н (n), or (2) if the word ends with a SUPERLATIVE ending, remove it and undouble н (n), or (3) if the word ends with ь (') (soft sign), remove it.

private static step4(string $word) : string
Parameters
$word : string

word to stem

Return values
string

$word after step
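
The three cases can be sketched as below, assuming the SUPERLATIVE endings ейш and ейше from the Snowball description of the Russian stemmer. The function name step4Sketch is hypothetical, and the sketch does not restrict the tests to any region of the word.

```php
<?php
// Hypothetical sketch of the three tidy-up cases.
function step4Sketch(string $word): string
{
    // Case (2), first half: remove a superlative ending if present.
    $stripped = preg_replace('/(ейше|ейш)$/u', '', $word, 1, $count);
    if ($count > 0) {
        $word = $stripped;
    }
    // Cases (1) and (2): undouble a final нн (one н is 2 bytes).
    if (preg_match('/нн$/u', $word)) {
        return substr($word, 0, -2);
    }
    if ($count > 0) {
        return $word;
    }
    // Case (3): remove a final soft sign.
    return preg_replace('/ь$/u', '', $word);
}
```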


        
