Tokenizer
This class provides methods for tokenization specific to the Portuguese locale. In particular, it includes a stemmer that implements the Snowball stemming algorithm described at http://snowball.tartarus.org/algorithms/portuguese/stemmer.html
Table of Contents
- $semantic_rewrites : array<string|int, mixed>
- Phrases we would like Yioop to rewrite before performing a query
- $stop_words : mixed
- A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This list is also used for language detection.
- $buffer : string
- Storage used in computing the stem
- $k : int
- Index of the current end of the word at the current state of computing its stem
- $r1 : string
- R1 is the region in the word after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel
- $r2 : string
- R2 is the region in R1 after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel
- $rv : string
- If the second letter is a consonant, RV is the region after the next following vowel, or if the first two letters are vowels, RV is the region after the next consonant, and otherwise (consonant-vowel case) RV is the region after the third letter.
- segment() : string
- Stub function which could be used for a word segmenter.
- stem() : string
- Computes the stem of a Portuguese word. For example, química, químicas, químico, and químicos all have químic as a stem.
- stopwordsRemover() : mixed
- Removes the stop words from the page (used for Word Cloud generation and language detection)
- findR1() : string
- This method finds the R1 region of $word. R1 is the region after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel.
- findRV() : string
- This method finds the RV region of $word. If the second letter is a consonant, RV is the region after the next following vowel, or if the first two letters are vowels, RV is the region after the next consonant, and otherwise (consonant-vowel case) RV is the region after the third letter.
- mbStringToArray() : array<string|int, mixed>
- This method breaks up a multibyte string into its individual characters and returns them as an array.
- step1() : string
- Standard suffix removal step. It searches for the longest suffix from a given set and removes it if found.
- step2() : string
- Verb suffix removal step. This function is called only if step 1 does not change the word.
- step3() : string
- Deletes the suffix i if it is in RV and preceded by c.
- step4() : string
- Residual suffix step. If the word ends with one of [os a i o á í ó] in RV, that suffix is deleted.
- step5() : string
- Residual suffix step. If the word ends with one of [e é ê] in RV, that suffix is deleted.
Properties
$semantic_rewrites
Phrases we would like Yioop to rewrite before performing a query
public
static array<string|int, mixed>
$semantic_rewrites
= []
$stop_words
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This list is also used for language detection.
public
static mixed
$stop_words
= ['como', 'I', 'seu', 'ele', 'foi', 'para', 'em', 'são', 'com', 'eles', 'ser', 'em', 'uma', 'tem', 'este', 'partir', 'de', 'por', 'quente', 'palavra', 'mas', 'que', 'alguns', 'é', 'ele', 'você', 'ou', 'teve', 'o', 'a', 'e', 'uma', 'em', 'nós', 'lata', 'fora', 'outro', 'foram', 'que', 'fazer', 'seu', 'tempo', 'se', 'vontade', 'como', 'disse', 'uma', 'cada', 'dizer', 'faz', 'conjunto', 'três', 'quer', 'ar', 'bem', 'também', 'jogar', 'pequeno', 'fim', 'colocar', 'casa', 'ler', 'mão', 'port', 'grande', 'soletrar', 'adicionar', 'mesmo', 'terra', 'aqui', 'necessário', 'grande', 'alto', 'tais', 'siga', 'ato', 'perguntar', 'homens', 'mudança', 'fui', 'luz', 'tipo', 'off', 'precisa', 'casa', 'imagem', 'tentar', 'nós', 'novamente', 'animais', 'ponto', 'mãe', 'mundo', 'perto', 'construir', 'auto', 'terra', 'pai']
$buffer
Storage used in computing the stem
private
static string
$buffer
$k
Index of the current end of the word at the current state of computing its stem
private
static int
$k
$r1
R1 is the region in the word after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel
private
static string
$r1
= ""
$r2
R2 is the region in R1 after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel
private
static string
$r2
= ""
$rv
If the second letter is a consonant, RV is the region after the next following vowel, or if the first two letters are vowels, RV is the region after the next consonant, and otherwise (consonant-vowel case) RV is the region after the third letter.
private
static string
$rv
= ""
Methods
segment()
Stub function which could be used for a word segmenter.
public
static segment(string $pre_segment) : string
Given the input thisisabunchofwords, such a segmenter would output: this is a bunch of words
Parameters
- $pre_segment : string
  the string before segmentation
Return values
string —should return a string with words separated by spaces; in this case it does nothing
stem()
Computes the stem of a Portuguese word. For example, química, químicas, químico, and químicos all have químic as a stem.
public
static stem(string $word) : string
Parameters
- $word : string
  the string to stem
Return values
string —the stem of $word
stopwordsRemover()
Removes the stop words from the page (used for Word Cloud generation and language detection)
public
static stopwordsRemover(mixed $data) : mixed
Parameters
- $data : mixed
  either a string or an array of strings to remove stop words from
Return values
mixed —$data with the stop words removed
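As a rough illustration of stop word removal over either a string or an array of strings, the following is a minimal sketch. The function name sketchRemoveStopWords, the explicit $stopWords parameter, and the whitespace-based tokenization are all assumptions made for this example; the actual stopwordsRemover() reads the class's $stop_words and may match words differently.

```php
<?php
// Hypothetical sketch; not the Tokenizer's actual stopwordsRemover().
function sketchRemoveStopWords($data, array $stopWords)
{
    if (is_array($data)) {
        // Apply the same filtering to every string in the array.
        return array_map(
            fn($item) => sketchRemoveStopWords($item, $stopWords),
            $data
        );
    }
    // Split on whitespace; the /u flag treats the input as UTF-8.
    $tokens = preg_split('/\s+/u', trim((string) $data), -1,
        PREG_SPLIT_NO_EMPTY) ?: [];
    // Keep only tokens whose lowercase form is not a stop word.
    $kept = array_filter(
        $tokens,
        fn($token) => !in_array(mb_strtolower($token, 'UTF-8'),
            $stopWords, true)
    );
    return implode(' ', $kept);
}
```

Called with the string 'O livro é para você' and the illustrative list ['o', 'a', 'de', 'para', 'você'], this sketch keeps only livro and é. It relies on the mbstring extension and does not strip punctuation attached to words.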
findR1()
This method finds the R1 region of $word. R1 is the region after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel.
private
static findR1(string $word) : string
Parameters
- $word : string
Return values
string —$r1 region
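The R1 definition above can be sketched as follows. This is a hypothetical stand-alone helper (sketchFindR1 is not a method of the class), and the vowel list is an assumption taken from the Snowball description of Portuguese rather than from this class's source.

```php
<?php
// Hypothetical helper; not the Tokenizer's actual private findR1().
function sketchFindR1(string $word): string
{
    // Assumed vowel set, following the Snowball Portuguese description.
    $vowels = ['a', 'e', 'i', 'o', 'u', 'á', 'é', 'í', 'ó', 'ú',
        'â', 'ê', 'ô'];
    // Split on UTF-8 character boundaries so accented letters stay whole.
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $n = count($chars);
    for ($i = 1; $i < $n; $i++) {
        // Look for the first non-vowel that directly follows a vowel.
        if (in_array($chars[$i - 1], $vowels, true)
            && !in_array($chars[$i], $vowels, true)) {
            return implode('', array_slice($chars, $i + 1));
        }
    }
    return ''; // null region at the end of the word
}
```

Because R2 is defined by the same rule applied inside R1, calling the helper on its own result gives a region matching the R2 definition: for the word química the sketch yields ica, and applying it again to ica yields an empty region.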
findRV()
This method finds the RV region of $word. If the second letter is a consonant, RV is the region after the next following vowel, or if the first two letters are vowels, RV is the region after the next consonant, and otherwise (consonant-vowel case) RV is the region after the third letter.
private
static findRV(string $word) : string
Parameters
- $word : string
Return values
string —$rv region
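The three cases in the RV definition can be sketched like this. The helper name sketchFindRV and the vowel list are assumptions for illustration, and the fallback to an empty region when no suitable position exists follows the Snowball description rather than anything stated in this reference.

```php
<?php
// Hypothetical helper; not the Tokenizer's actual private findRV().
function sketchFindRV(string $word): string
{
    // Assumed vowel set, following the Snowball Portuguese description.
    $vowels = ['a', 'e', 'i', 'o', 'u', 'á', 'é', 'í', 'ó', 'ú',
        'â', 'ê', 'ô'];
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $n = count($chars);
    if ($n < 2) {
        return '';
    }
    $isVowel = fn(string $c): bool => in_array($c, $vowels, true);
    $start = $n; // assumption: RV is empty if no position is found
    if (!$isVowel($chars[1])) {
        // Second letter is a consonant: RV starts after the next vowel.
        for ($i = 2; $i < $n; $i++) {
            if ($isVowel($chars[$i])) {
                $start = $i + 1;
                break;
            }
        }
    } elseif ($isVowel($chars[0])) {
        // First two letters are vowels: RV starts after the next consonant.
        for ($i = 2; $i < $n; $i++) {
            if (!$isVowel($chars[$i])) {
                $start = $i + 1;
                break;
            }
        }
    } else {
        // Consonant-vowel case: RV starts after the third letter.
        $start = min(3, $n);
    }
    return implode('', array_slice($chars, $start));
}
```

For example, under these assumptions menino falls in the consonant-vowel case and has RV ino, while grande has a consonant as its second letter and has RV nde.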
mbStringToArray()
This method breaks up a multibyte string into its individual characters and returns them as an array.
private
static mbStringToArray(string $string) : array<string|int, mixed>
Parameters
- $string : string
  the string of multibyte characters to break up
Return values
array<string|int, mixed> —an array of multibyte characters
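One common way to split a UTF-8 string into characters in PHP is preg_split with an empty pattern and the u modifier. The sketch below uses that approach under the assumption that the input is UTF-8; the name sketchMbStringToArray is hypothetical and the class's own implementation may differ.

```php
<?php
// Hypothetical helper; not the Tokenizer's actual mbStringToArray().
function sketchMbStringToArray(string $string): array
{
    // The empty pattern with /u splits between UTF-8 characters, so a
    // two-byte letter such as "í" stays a single array element.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split returns false on invalid UTF-8; fall back to no characters.
    return $chars === false ? [] : $chars;
}
```

The distinction matters for the stemmer because byte-oriented functions such as strlen count química as 8 bytes, while the array produced here has 7 elements, one per letter.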
step1()
Standard suffix removal step. It searches for the longest suffix from a given set and removes it if found.
private
static step1(string $word) : string
Parameters
- $word : string
  the string to remove the suffix from
Return values
string —the processed string
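The longest-suffix search at the heart of this step can be sketched as below. This is only the matching part: the helper name sketchLongestSuffix and the two-element suffix list in the usage note are illustrative assumptions, and the region conditions and follow-up rules that the Snowball description attaches to each suffix are omitted.

```php
<?php
// Hypothetical helper showing longest-suffix matching only.
function sketchLongestSuffix(string $word, array $suffixes): ?string
{
    // Try longer suffixes first so that "amente" wins over "mente".
    usort($suffixes, fn($a, $b) => strlen($b) <=> strlen($a));
    foreach ($suffixes as $suffix) {
        if ($suffix !== '' && substr($word, -strlen($suffix)) === $suffix) {
            return $suffix;
        }
    }
    return null; // no suffix from the set matches
}
```

With the candidates mente and amente, the word rapidamente matches amente. Comparing bytes is safe here because a UTF-8 string can only end with another UTF-8 string at a character boundary.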
step2()
Verb suffix removal step. This function is called only if step 1 does not change the word.
private
static step2(string $word) : string
It also searches for the longest suffix in its suffix set and removes it if found.
Parameters
- $word : string
  the string to remove the suffix from
Return values
string —the processed string
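The way the steps depend on each other can be sketched with a small driver. Only the rule that step 2 runs when step 1 changes nothing is stated in this reference; the rest of the ordering (step 3 when the word was altered, step 4 otherwise, step 5 always) is an assumption taken from the Snowball description. The function sketchRunSteps and the $steps array of callables are hypothetical.

```php
<?php
// Hypothetical driver; $steps maps 'step1'..'step5' to callables that
// take a word and return the processed word.
function sketchRunSteps(string $word, array $steps): string
{
    $result = $steps['step1']($word);
    if ($result === $word) {
        // Step 1 changed nothing, so try verb suffix removal.
        $result = $steps['step2']($word);
    }
    if ($result !== $word) {
        // Assumed from Snowball: step 3 follows a successful step 1 or 2.
        $result = $steps['step3']($result);
    } else {
        // Assumed from Snowball: step 4 runs when nothing was removed.
        $result = $steps['step4']($result);
    }
    // Assumed from Snowball: step 5 always runs last.
    return $steps['step5']($result);
}
```

Passing the steps in as callables keeps the sketch independent of the class's private methods, which cannot be called from outside.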
step3()
Deletes the suffix i if it is in RV and preceded by c.
private
static step3(string $word) : string
Parameters
- $word : string
  the string to remove the suffix from
Return values
string —the processed string
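A minimal sketch of this rule follows. Unlike the real method, which takes only $word, the hypothetical sketchStep3 receives the RV region as a second argument so that the example is self-contained.

```php
<?php
// Hypothetical helper; $rv is passed in explicitly for illustration.
function sketchStep3(string $word, string $rv): string
{
    // RV is a suffix of the word, so the final "i" lies in RV exactly
    // when RV itself ends with "i". Both letters are single-byte ASCII.
    if (substr($word, -2) === 'ci' && substr($rv, -1) === 'i') {
        return substr($word, 0, -1);
    }
    return $word;
}
```

For instance, an intermediate form provinci with RV vinci becomes provinc, while a word whose RV is empty is returned unchanged.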
step4()
Residual suffix step. If the word ends with one of [os a i o á í ó] in RV, that suffix is deleted.
private
static step4(string $word) : string
Parameters
- $word : string
  the string to remove the suffix from
Return values
string —the processed string
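The residual suffix check can be sketched as follows. As with the other sketches, sketchStep4 is a hypothetical helper that takes the RV region as an explicit argument instead of reading it from the class's $rv property.

```php
<?php
// Hypothetical helper; $rv is passed in explicitly for illustration.
function sketchStep4(string $word, string $rv): string
{
    foreach (['os', 'a', 'i', 'o', 'á', 'í', 'ó'] as $suffix) {
        $len = strlen($suffix); // byte length; accented letters are 2 bytes
        // RV is a suffix of the word, so the suffix lies in RV exactly
        // when RV ends with it.
        if (substr($rv, -$len) === $suffix) {
            return substr($word, 0, strlen($word) - $len);
        }
    }
    return $word;
}
```

Under these assumptions menino with RV ino becomes menin, and sofá with RV á becomes sof.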
step5()
Residual suffix step. If the word ends with one of [e é ê] in RV, that suffix is deleted.
private
static step5(string $word) : string
Parameters
- $word : string
  the string to remove the suffix from
Return values
string —the processed string
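A sketch of this final step is shown below. Beyond the deletion of e, é, or ê in RV, it also applies two follow-on rules that are assumptions taken from the Snowball description and are not stated in this reference: after the deletion, a trailing gu or ci loses its last letter if that letter is in RV, and a word that did not match but ends in ç has the cedilla removed. The helper sketchStep5 and its explicit $rv argument are hypothetical.

```php
<?php
// Hypothetical helper; $rv is passed in explicitly for illustration.
function sketchStep5(string $word, string $rv): string
{
    foreach (['e', 'é', 'ê'] as $suffix) {
        $len = strlen($suffix); // byte length; "é" and "ê" are 2 bytes
        if (substr($rv, -$len) === $suffix) {
            // Drop the suffix from both the word and its RV region.
            $word = (string) substr($word, 0, strlen($word) - $len);
            $rv = (string) substr($rv, 0, strlen($rv) - $len);
            $tail = substr($word, -2);
            // Assumed Snowball rule: drop the u of "gu" or the i of "ci"
            // when that letter still lies in RV (RV is non-empty).
            if (($tail === 'gu' || $tail === 'ci') && $rv !== '') {
                $word = substr($word, 0, -1);
            }
            return $word;
        }
    }
    // Assumed Snowball rule: otherwise replace a final "ç" with "c".
    if (substr($word, -2) === 'ç') {
        return substr($word, 0, -2) . 'c';
    }
    return $word;
}
```

With these assumptions grande with RV nde becomes grand, and segue with RV ue becomes seg because the u left after removing e is still in RV.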