Yioop_V9.5_Source_Code

Tokenizer
in package

Application

This class provides methods for Portuguese locale-specific tokenization. In particular, it includes a stemmer that implements the Snowball stemming algorithm described at http://snowball.tartarus.org/algorithms/portuguese/stemmer.html

$semantic_rewrites

Phrases we would like Yioop to rewrite before performing a query.


    public
    static    array<string|int, mixed>
    $semantic_rewrites
     = []

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This list is also used for language detection.


    public
    static    mixed
    $stop_words
     = ['como', 'I', 'seu', 'ele', 'foi', 'para', 'em', 'são', 'com', 'eles', 'ser', 'em', 'uma', 'tem', 'este', 'partir', 'de', 'por', 'quente', 'palavra', 'mas', 'que', 'alguns', 'é', 'ele', 'você', 'ou', 'teve', 'o', 'a', 'e', 'uma', 'em', 'nós', 'lata', 'fora', 'outro', 'foram', 'que', 'fazer', 'seu', 'tempo', 'se', 'vontade', 'como', 'disse', 'uma', 'cada', 'dizer', 'faz', 'conjunto', 'três', 'quer', 'ar', 'bem', 'também', 'jogar', 'pequeno', 'fim', 'colocar', 'casa', 'ler', 'mão', 'port', 'grande', 'soletrar', 'adicionar', 'mesmo', 'terra', 'aqui', 'necessário', 'grande', 'alto', 'tais', 'siga', 'ato', 'perguntar', 'homens', 'mudança', 'fui', 'luz', 'tipo', 'off', 'precisa', 'casa', 'imagem', 'tentar', 'nós', 'novamente', 'animais', 'ponto', 'mãe', 'mundo', 'perto', 'construir', 'auto', 'terra', 'pai']

$buffer

Storage used in computing the stem.


    private
    static    string
    $buffer

$k

Index of the current end of the word at the current state of computing its stem


    private
    static    int
    $k

$r1

R1 is the region in the word after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel


    private
    static    string
    $r1
     = ""

$r2

R2 is the region within R1 after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel


    private
    static    string
    $r2
     = ""

$rv

If the second letter is a consonant, RV is the region after the next following vowel, or if the first two letters are vowels, RV is the region after the next consonant, and otherwise (consonant-vowel case) RV is the region after the third letter.


    private
    static    string
    $rv
     = ""

segment()

Stub function which could be used for a word segmenter.


    public
            static        segment(string $pre_segment) : string

Such a segmenter, given the input thisisabunchofwords, would output: this is a bunch of words

Parameters

$pre_segment : string: before segmentation

Return values

string —

Should return the string with words separated by spaces; in this case it does nothing.

stem()

Computes the stem of a Portuguese word. For example, química, químicas, químico, químicos all have químic as a stem.


    public
            static        stem(string $word) : string

Parameters

$word : string: the string to stem

Return values

string —

the stem of $word

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation and language detection)


    public
            static        stopwordsRemover(mixed $data) : mixed

Parameters

$data : mixed: either a string or an array of strings to remove stop words from

Return values

mixed —

$data with no stop words
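
The whole-word removal described above can be sketched as follows. This is a minimal illustration, not the Yioop implementation: the function name sketchStopwordsRemover, its second parameter, and the choice of a case-sensitive regular expression with letter/digit lookarounds are all assumptions made for the example.

```php
<?php
/**
 * Hypothetical stand-in for stopwordsRemover(); not the Yioop code.
 * Removes whole-word occurrences of $stop_words from a string, or from
 * each string in an array of strings.
 */
function sketchStopwordsRemover($data, array $stop_words)
{
    // Escape each stop word so it is matched literally inside the pattern.
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $stop_words);
    // The lookarounds require that no letter or digit touches the match,
    // so "de" is removed as a word but left alone inside "dedo".
    $pattern = '/(?<![\p{L}\p{N}])(?:' . implode('|', $quoted) .
        ')(?![\p{L}\p{N}])/u';
    // preg_replace accepts a string or an array of strings as its subject.
    return preg_replace($pattern, '', $data);
}

$cleaned = sketchStopwordsRemover('casa de pedra', ['de', 'para']);
```

The sketch leaves the surrounding whitespace in place, so the result above contains two consecutive spaces where the stop word used to be.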

findR1()

Finds the R1 region of $word. R1 is the region after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel.


    private
            static        findR1(string $word) : string

Parameters

$word : string

Return values

string —

$r1 region
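
As a rough illustration of the R1 definition, the sketch below scans for the first vowel that is followed by a non-vowel. The name sketchFindR1 is hypothetical, the vowel list is the one given in the Snowball Portuguese description, and the input is assumed to be lowercase UTF-8; the private Yioop method may differ in detail.

```php
<?php
/**
 * Hypothetical stand-in for findR1(); not the Yioop implementation.
 * Returns the part of $word after the first non-vowel that follows a vowel.
 */
function sketchFindR1(string $word): string
{
    // Vowels as listed in the Snowball Portuguese description.
    $vowels = ['a', 'e', 'i', 'o', 'u', 'á', 'é', 'í', 'ó', 'ú',
        'â', 'ê', 'ô'];
    // Split into UTF-8 characters so an accented letter is one position.
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    for ($i = 1; $i < count($chars); $i++) {
        $prev_is_vowel = in_array($chars[$i - 1], $vowels, true);
        $cur_is_vowel = in_array($chars[$i], $vowels, true);
        if ($prev_is_vowel && !$cur_is_vowel) {
            // Everything after this non-vowel is the region.
            return implode('', array_slice($chars, $i + 1));
        }
    }
    return ''; // null region: no non-vowel follows a vowel
}

$r1 = sketchFindR1('bonito'); // "ito"
$r2 = sketchFindR1($r1);      // "o": the same rule applied inside R1
```

Applying the same function to R1 yields R2, which matches the definition of $r2 given earlier on this page.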

findRV()

Finds the RV region of $word. If the second letter is a consonant, RV is the region after the next following vowel; if the first two letters are vowels, RV is the region after the next consonant; otherwise (the consonant-vowel case) RV is the region after the third letter.


    private
            static        findRV(string $word) : string

Parameters

$word : string

Return values

string —

$rv region
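
The three RV cases can be sketched as below. The name sketchFindRV is hypothetical, the vowel list is the one from the Snowball Portuguese description, and the sketch assumes lowercase UTF-8 input and returns the empty string when the required position cannot be found.

```php
<?php
/**
 * Hypothetical stand-in for findRV(); not the Yioop implementation.
 */
function sketchFindRV(string $word): string
{
    $vowels = ['a', 'e', 'i', 'o', 'u', 'á', 'é', 'í', 'ó', 'ú',
        'â', 'ê', 'ô'];
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $n = count($chars);
    if ($n < 2) {
        return '';
    }
    $is_vowel = fn(int $i): bool => in_array($chars[$i], $vowels, true);
    if (!$is_vowel(1)) {
        // Second letter is a consonant: RV starts after the next vowel.
        for ($i = 2; $i < $n; $i++) {
            if ($is_vowel($i)) {
                return implode('', array_slice($chars, $i + 1));
            }
        }
        return '';
    }
    if ($is_vowel(0)) {
        // First two letters are vowels: RV starts after the next consonant.
        for ($i = 2; $i < $n; $i++) {
            if (!$is_vowel($i)) {
                return implode('', array_slice($chars, $i + 1));
            }
        }
        return '';
    }
    // Consonant-vowel case: RV starts after the third letter.
    return implode('', array_slice($chars, 3));
}

$rv = sketchFindRV('química'); // consonant-vowel case: "mica"
```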

mbStringToArray()

Breaks up a multibyte string into its individual characters and returns them as an array of characters.


    private
            static        mbStringToArray(string $string) : array<string|int, mixed>

Parameters

$string : string: the string of multibyte characters to break up

Return values

array<string|int, mixed> —

of multibyte characters
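
One common way to split a UTF-8 string into characters in PHP is preg_split with the u modifier. The sketch below uses that approach under the hypothetical name sketchMbStringToArray; it is an assumption that the input is UTF-8, and the private Yioop method may be written differently.

```php
<?php
/**
 * Hypothetical stand-in for mbStringToArray(); not the Yioop code.
 * Splits a UTF-8 string into an array with one character per element.
 */
function sketchMbStringToArray(string $string): array
{
    // With the 'u' modifier the empty pattern matches between code points
    // rather than between bytes, so "í" stays a single element.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split returns false on invalid UTF-8; fall back to an empty array.
    return $chars === false ? [] : $chars;
}

$chars = sketchMbStringToArray('química');
```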

step1()

Standard suffix removal step. It searches for the longest suffix from a given set and removes it if found.


    private
            static        step1(string $word) : string

Parameters

$word : string: the string to remove the suffix from

Return values

string —

the processed string
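
The longest-suffix search can be sketched as follows. This is only an illustration: the function name, the explicit $r2 parameter, and the short suffix list in the usage line are assumptions, and the restriction that the suffix must lie inside R2 is taken from the first rule of the Snowball description rather than from this page. Byte-wise string functions are safe here because a UTF-8 suffix match always falls on a character boundary.

```php
<?php
/**
 * Hypothetical sketch of the core of a suffix removal step; not the Yioop
 * code. Removes the longest suffix of $word found in $suffixes, provided
 * that suffix lies entirely inside the region $r2.
 */
function sketchRemoveLongestSuffix(string $word, string $r2,
    array $suffixes): string
{
    $longest = '';
    foreach ($suffixes as $suffix) {
        $len = strlen($suffix);
        // Keep the suffix if it is longer than the best so far and the
        // word really ends with it.
        if ($len > strlen($longest) && $len <= strlen($word)
            && substr_compare($word, $suffix, -$len) === 0) {
            $longest = $suffix;
        }
    }
    // R2 is itself a suffix of the word, so the match lies inside R2
    // exactly when R2 is at least as long as the match.
    if ($longest !== '' && strlen($r2) >= strlen($longest)) {
        return substr($word, 0, strlen($word) - strlen($longest));
    }
    return $word;
}

// For "nacionalismo", R1 is "ionalismo" and R2 is "alismo".
$stem = sketchRemoveLongestSuffix('nacionalismo', 'alismo',
    ['ico', 'ica', 'icos', 'icas', 'ismo', 'ismos']);
```

If the longest matching suffix falls outside R2, as with "icas" in "químicas" (whose R2 is "as"), the sketch returns the word unchanged rather than trying a shorter suffix.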

step2()

Verb suffix removal step. This function is called if step 1 does not change anything.


    private
            static        step2(string $word) : string

It also looks for the longest suffix in its suffix set and removes it if found.

Parameters

$word : string: the string to remove the suffix from

Return values

string —

the processed string

step3()

Deletes the suffix i if it is in RV and preceded by c.


    private
            static        step3(string $word) : string

Parameters

$word : string: the string to remove the suffix from

Return values

string —

the processed string
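
A minimal sketch of this rule is shown below. The name sketchStep3 and the explicit $rv parameter are assumptions for the example; the actual method takes only $word and presumably reads the region from the class's static state.

```php
<?php
/**
 * Hypothetical stand-in for step3(); not the Yioop implementation.
 * $rv is the RV region of $word, passed in explicitly for the sketch.
 */
function sketchStep3(string $word, string $rv): string
{
    // RV is a suffix of the word, so a final "i" is in RV whenever RV is
    // non-empty. The preceding "c" does not have to be inside RV.
    if ($rv !== '' && substr($word, -2) === 'ci') {
        return substr($word, 0, -1);
    }
    return $word;
}

// Illustrative intermediate form "farmaci", whose RV is "maci".
$result = sketchStep3('farmaci', 'maci');
```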

step4()

Residual suffix step. If the word ends with one of [os a i o á í ó] in RV, the suffix is deleted.


    private
            static        step4(string $word) : string

Parameters

$word : string: the string to remove the suffix from

Return values

string —

the processed string
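
The residual suffix rule can be sketched as below, again with a hypothetical function name and an explicit $rv parameter that the real method does not take. Because RV is a suffix of the word, checking that RV ends with the suffix covers both conditions: the word ends with it, and it lies in RV.

```php
<?php
/**
 * Hypothetical stand-in for step4(); not the Yioop implementation.
 * $rv is the RV region of $word, passed in explicitly for the sketch.
 */
function sketchStep4(string $word, string $rv): string
{
    $suffixes = ['os', 'a', 'i', 'o', 'á', 'í', 'ó'];
    foreach ($suffixes as $suffix) {
        $len = strlen($suffix); // byte length; "á" is two bytes in UTF-8
        if ($len <= strlen($rv)
            && substr_compare($rv, $suffix, -$len) === 0) {
            // Drop the same number of bytes from the end of the word.
            return substr($word, 0, strlen($word) - $len);
        }
    }
    return $word;
}

// RV of "química" is "mica", so the final "a" is removed.
$stem = sketchStep4('química', 'mica');
```

With the RV regions "mica" and "micos", both "química" and "químicos" reduce to "químic", in line with the example given for stem().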

step5()

Residual suffix step. If the word ends with one of [e é ê] in RV, the suffix is deleted.


    private
            static        step5(string $word) : string

Parameters

$word : string: the string to remove the suffix from

Return values

string —

the processed string
