Yioop_V9.5_Source_Code_Documentation

Tokenizer
in package

This class provides a collection of methods for Portuguese locale-specific tokenization. In particular, it has a stemmer implementing the Snowball stemming algorithm for Portuguese (http://snowball.tartarus.org/algorithms/portuguese/stemmer.html).

Tags
author

Niravkumar Patel

Table of Contents

$semantic_rewrites  : array<string|int, mixed>
Phrases we would like Yioop to rewrite before performing a query.
$stop_words  : mixed
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection.
$buffer  : string
Storage used while computing the stem.
$k  : int
Index of the current end of the word at the current stage of computing its stem.
$r1  : string
R1 is the region in the word after the first non-vowel following a vowel, or the null region at the end of the word if there is no such non-vowel.
$r2  : string
R2 is the region within R1 after the first non-vowel following a vowel, or the null region at the end of the word if there is no such non-vowel.
$rv  : string
If the second letter is a consonant, RV is the region after the next following vowel, or if the first two letters are vowels, RV is the region after the next consonant, and otherwise (consonant-vowel case) RV is the region after the third letter.
segment()  : string
Stub function which could be used for a word segmenter.
stem()  : string
Computes the stem of a Portuguese word. For example, química, químicas, químico, and químicos all have químic as a stem.
stopwordsRemover()  : mixed
Removes the stop words from the page (used for Word Cloud generation and language detection)
findR1()  : string
Finds the R1 region of $word. R1 is the region after the first non-vowel following a vowel, or the null region at the end of the word if there is no such non-vowel.
findRV()  : string
Finds the RV region of $word. If the second letter is a consonant, RV is the region after the next following vowel; if the first two letters are vowels, RV is the region after the next consonant; otherwise (the consonant-vowel case), RV is the region after the third letter.
mbStringToArray()  : array<string|int, mixed>
Breaks up a multibyte string into an array of its individual characters.
step1()  : string
Standard suffix removal step. Searches for the longest suffix from a given set and removes it if found.
step2()  : string
Verb suffix removal step. Called only if step 1 does not change the word.
step3()  : string
Deletes the suffix i if it is in RV and preceded by c.
step4()  : string
Residual suffix step. If the word ends with one of [os a i o á í ó] in RV, that suffix is deleted.
step5()  : string
Residual suffix step. If the word ends with one of [e é ê] in RV, that suffix is deleted.

Properties

$semantic_rewrites

Phrases we would like Yioop to rewrite before performing a query.

public static array<string|int, mixed> $semantic_rewrites = []

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection.

public static mixed $stop_words = ['como', 'I', 'seu', 'ele', 'foi', 'para', 'em', 'são', 'com', 'eles', 'ser', 'em', 'uma', 'tem', 'este', 'partir', 'de', 'por', 'quente', 'palavra', 'mas', 'que', 'alguns', 'é', 'ele', 'você', 'ou', 'teve', 'o', 'a', 'e', 'uma', 'em', 'nós', 'lata', 'fora', 'outro', 'foram', 'que', 'fazer', 'seu', 'tempo', 'se', 'vontade', 'como', 'disse', 'uma', 'cada', 'dizer', 'faz', 'conjunto', 'três', 'quer', 'ar', 'bem', 'também', 'jogar', 'pequeno', 'fim', 'colocar', 'casa', 'ler', 'mão', 'port', 'grande', 'soletrar', 'adicionar', 'mesmo', 'terra', 'aqui', 'necessário', 'grande', 'alto', 'tais', 'siga', 'ato', 'perguntar', 'homens', 'mudança', 'fui', 'luz', 'tipo', 'off', 'precisa', 'casa', 'imagem', 'tentar', 'nós', 'novamente', 'animais', 'ponto', 'mãe', 'mundo', 'perto', 'construir', 'auto', 'terra', 'pai']
Tags
array

$buffer

Storage used while computing the stem.

private static string $buffer

$k

Index of the current end of the word at the current stage of computing its stem.

private static int $k

$r1

R1 is the region in the word after the first non-vowel following a vowel, or the null region at the end of the word if there is no such non-vowel.

private static string $r1 = ""

$r2

R2 is the region within R1 after the first non-vowel following a vowel, or the null region at the end of the word if there is no such non-vowel.

private static string $r2 = ""

$rv

If the second letter is a consonant, RV is the region after the next following vowel, or if the first two letters are vowels, RV is the region after the next consonant, and otherwise (consonant-vowel case) RV is the region after the third letter.

private static string $rv = ""

Methods

segment()

Stub function which could be used for a word segmenter.

public static segment(string $pre_segment) : string

Given the input "thisisabunchofwords", such a segmenter would output "this is a bunch of words".

Parameters
$pre_segment : string

the string before segmentation

Return values
string

Should return the string with its words separated by spaces; in this class the stub does nothing.

stem()

Computes the stem of a Portuguese word. For example, química, químicas, químico, and químicos all have químic as a stem.

public static stem(string $word) : string
Parameters
$word : string

the string to stem

Return values
string

the stem of $word
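
The step descriptions in this class suggest an ordering that matches the Snowball description of the Portuguese stemmer: step 2 runs only when step 1 changes nothing, step 3 follows a change made by step 1 or 2, step 4 runs when neither changed the word, and step 5 always runs. The sketch below only illustrates that control flow. The function runSteps() and its callable parameters are hypothetical stand-ins for the private step methods, not Yioop's actual code, and the pre- and post-processing of nasal vowels described by Snowball is omitted.

```php
<?php
/**
 * Hypothetical driver showing one plausible ordering of the five steps.
 * The callables stand in for the private step1()..step5() methods.
 */
function runSteps(string $word, callable $s1, callable $s2, callable $s3,
    callable $s4, callable $s5): string
{
    $original = $word;
    $word = $s1($word);
    if ($word === $original) {
        // Step 2 runs only when step 1 left the word unchanged.
        $word = $s2($word);
    }
    // Step 3 follows a change by step 1 or 2; otherwise step 4 runs.
    $word = ($word !== $original) ? $s3($word) : $s4($word);
    // Step 5 always runs.
    return $s5($word);
}

// Placeholder steps that only record which step ran; not real suffix rules.
$same = fn(string $w): string => $w;
$mark = fn(string $tag) => fn(string $w): string => $w . $tag;

echo runSteps('x', $same, $same, $mark('[3]'), $mark('[4]'), $mark('[5]')), "\n";
echo runSteps('x', $mark('[1]'), $mark('[2]'), $mark('[3]'), $mark('[4]'), $mark('[5]')), "\n";
```

The placeholder closures append a tag instead of removing a suffix, purely so the order of calls is visible in the output.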

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation and language detection)

public static stopwordsRemover(mixed $data) : mixed
Parameters
$data : mixed

either a string or an array of strings to remove stop words from

Return values
mixed

$data with no stop words
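
As a rough illustration of what stop word removal over a string or an array of strings can look like, here is a minimal sketch. The function removeStopWords() is hypothetical and is not Yioop's implementation: it assumes whitespace-separated tokens and exact, case-sensitive matches, and it uses a five-word subset of $stop_words.

```php
<?php
/**
 * Hypothetical stand-in for stopwordsRemover(): drops every token that
 * exactly matches an entry of $stopWords.
 *
 * @param string|array $data a string or an array of strings
 * @param array $stopWords words to remove
 * @return string|array $data without the stop words
 */
function removeStopWords($data, array $stopWords)
{
    if (is_array($data)) {
        // Apply the same filtering to each string of the array.
        return array_map(fn($s) => removeStopWords($s, $stopWords), $data);
    }
    // Split on runs of whitespace; the "u" modifier keeps UTF-8 intact.
    $tokens = preg_split('/\s+/u', $data, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $kept = array_filter($tokens,
        fn($token) => !in_array($token, $stopWords, true));
    return implode(' ', $kept);
}

// A small subset of the $stop_words list documented above.
$stop_words = ['o', 'a', 'foi', 'para', 'com'];
echo removeStopWords('o gato foi para a praia com ela', $stop_words), "\n";
```

A real implementation would probably also normalize case and punctuation before matching; that is left out here to keep the sketch short.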

findR1()

Finds the R1 region of $word. R1 is the region after the first non-vowel following a vowel, or the null region at the end of the word if there is no such non-vowel.

private static findR1(string $word) : string
Parameters
$word : string
Return values
string

$r1 region
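
The R1 definition can be sketched as follows. The function regionAfterVowelConsonant() is a hypothetical stand-in, not the class's private findR1(), and the vowel list is an assumption taken from the Snowball description of Portuguese. Applying the same function to R1 gives R2, as the property descriptions above define it.

```php
<?php
/**
 * Hypothetical sketch of the R1 rule: returns the part of $word after the
 * first non-vowel that follows a vowel, or '' if there is none.
 */
function regionAfterVowelConsonant(string $word): string
{
    // Assumed vowel set (from the Snowball Portuguese description).
    $vowels = ['a', 'e', 'i', 'o', 'u', 'á', 'é', 'í', 'ó', 'ú', 'â', 'ê', 'ô'];
    // Split into UTF-8 characters so accented letters count as one unit.
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $n = count($chars);
    for ($i = 1; $i < $n; $i++) {
        $prev_is_vowel = in_array($chars[$i - 1], $vowels, true);
        $cur_is_vowel = in_array($chars[$i], $vowels, true);
        if ($prev_is_vowel && !$cur_is_vowel) {
            // The region starts right after this non-vowel.
            return implode('', array_slice($chars, $i + 1));
        }
    }
    return '';
}

$r1 = regionAfterVowelConsonant('químicas');
$r2 = regionAfterVowelConsonant($r1);
echo "R1 = $r1, R2 = $r2\n"; // R1 = icas, R2 = as
```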

findRV()

Finds the RV region of $word. If the second letter is a consonant, RV is the region after the next following vowel; if the first two letters are vowels, RV is the region after the next consonant; otherwise (the consonant-vowel case), RV is the region after the third letter.

private static findRV(string $word) : string
Parameters
$word : string
Return values
string

$rv region
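
The three RV cases can be sketched as below. The function rvRegion() is a hypothetical stand-in for the private findRV(), the vowel list is an assumption taken from the Snowball description, and returning an empty region when the required position cannot be found is also assumed.

```php
<?php
/**
 * Hypothetical sketch of the RV rule described above.
 */
function rvRegion(string $word): string
{
    // Assumed vowel set (from the Snowball Portuguese description).
    $vowels = ['a', 'e', 'i', 'o', 'u', 'á', 'é', 'í', 'ó', 'ú', 'â', 'ê', 'ô'];
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $n = count($chars);
    if ($n < 2) {
        return '';
    }
    $isVowel = fn(int $i): bool => in_array($chars[$i], $vowels, true);
    if (!$isVowel(1)) {
        // Second letter is a consonant: region after the next vowel.
        for ($i = 2; $i < $n; $i++) {
            if ($isVowel($i)) {
                return implode('', array_slice($chars, $i + 1));
            }
        }
        return '';
    }
    if ($isVowel(0)) {
        // First two letters are vowels: region after the next consonant.
        for ($i = 2; $i < $n; $i++) {
            if (!$isVowel($i)) {
                return implode('', array_slice($chars, $i + 1));
            }
        }
        return '';
    }
    // Consonant-vowel case: region after the third letter.
    return implode('', array_slice($chars, 3));
}

foreach (['macho', 'oliva', 'trabajo', 'áureo'] as $w) {
    echo $w, ' -> ', rvRegion($w), "\n";
}
```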

mbStringToArray()

Breaks up a multibyte string into an array of its individual characters.

private static mbStringToArray(string $string) : array<string|int, mixed>
Parameters
$string : string

string of multibyte characters to break up

Return values
array<string|int, mixed>

array of the individual multibyte characters
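
One common way to split a UTF-8 string into characters in PHP is preg_split() with an empty pattern and the "u" modifier. The sketch below shows that technique; splitChars() is a hypothetical name, and whether the private mbStringToArray() works this way is not stated here.

```php
<?php
/**
 * Hypothetical stand-in for mbStringToArray(): splits a UTF-8 string
 * into an array of single characters.
 */
function splitChars(string $string): array
{
    // With the "u" modifier the subject is treated as UTF-8, so the empty
    // pattern matches between code points rather than between bytes.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false on invalid UTF-8; fall back to [].
    return $chars === false ? [] : $chars;
}

print_r(splitChars('mãe'));
```

Splitting by bytes instead (for example with str_split()) would cut the two-byte character ã in half, which is why the stemmer needs a multibyte-aware split.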

step1()

Standard suffix removal step. Searches for the longest suffix from a given set and removes it if found.

private static step1(string $word) : string
Parameters
$word : string

the string to remove a suffix from

Return values
string

the processed word

step2()

Verb suffix removal step. Called only if step 1 does not change the word.

private static step2(string $word) : string

It also searches for the longest suffix in its suffix set and removes it if found.

Parameters
$word : string

the string to remove a suffix from

Return values
string

the processed word

step3()

Deletes the suffix i if it is in RV and preceded by c.

private static step3(string $word) : string
Parameters
$word : string

the string to remove a suffix from

Return values
string

the processed word
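
The rule for this step is small enough to sketch directly. The function deleteIAfterC() is hypothetical: unlike the private step3(), which takes only $word, it receives the RV region as an explicit second argument so that the example is self-contained.

```php
<?php
/**
 * Hypothetical sketch of the step 3 rule: drop a final "i" that lies in
 * RV and is preceded by "c".
 */
function deleteIAfterC(string $word, string $rv): string
{
    // "c" and "i" are single-byte ASCII, so byte-wise substr() is safe
    // here even when the rest of the word contains accented letters.
    if (substr($word, -2) === 'ci' && substr($rv, -1) === 'i') {
        return substr($word, 0, -1);
    }
    return $word;
}

// RV of "anunci" is "nci" under the RV definition given above.
echo deleteIAfterC('anunci', 'nci'), "\n"; // anunc
```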

step4()

Residual suffix step. If the word ends with one of [os a i o á í ó] in RV, that suffix is deleted.

private static step4(string $word) : string
Parameters
$word : string

the string to remove a suffix from

Return values
string

the processed word
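
A minimal sketch of this residual suffix rule follows. The function removeResidualSuffix() is hypothetical and takes the RV region as an explicit argument, which the private step4() does not; the suffix list is the one given in the description above.

```php
<?php
/**
 * Hypothetical sketch of the step 4 rule: delete one of the listed
 * suffixes when it lies entirely inside RV.
 */
function removeResidualSuffix(string $word, string $rv): string
{
    // The two-letter suffix is tried first so the longest match wins.
    foreach (['os', 'a', 'i', 'o', 'á', 'í', 'ó'] as $suffix) {
        $len = strlen($suffix); // byte length; accented vowels are 2 bytes
        // RV is a suffix of the word, so a match at the end of RV is
        // also a match at the end of the word.
        if (strlen($rv) >= $len && substr($rv, -$len) === $suffix) {
            return substr($word, 0, strlen($word) - $len);
        }
    }
    return $word;
}

// RV of "meninos" is "inos" (consonant-vowel case: after the third letter).
echo removeResidualSuffix('meninos', 'inos'), "\n"; // menin
echo removeResidualSuffix('menino', 'ino'), "\n";   // menin
```

Comparing bytes at the end of the string is safe for UTF-8 because a multi-byte character can only match at a character boundary.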

step5()

Residual suffix step. If the word ends with one of [e é ê] in RV, that suffix is deleted.

private static step5(string $word) : string
Parameters
$word : string

the string to remove a suffix from

Return values
string

the processed word


        
