Yioop_V9.5_Source_Code

Tokenizer
in package

Application

This class has a collection of methods for French locale specific tokenization. In particular, it has a stemmer and a stop word remover (for use mainly in word cloud creation). The stemmer is my stab at re-implementing the stemmer algorithm given at http://snowball.tartarus.org and was inspired by http://snowball.tartarus.org/otherlangs/french_javascript.txt. Here, given a word, its stem is that part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, and tallest. A stemmer takes a word and tries to produce its stem.

$no_stem_list

Words we don't want to be stemmed


    public
    static    array<string|int, mixed>
    $no_stem_list
     = ["titanic"]

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries


    public
    static    mixed
    $stop_words
     = ['alors', 'au', 'aucuns', 'aussi', 'autre', 'avant', 'avec', 'avoir', 'bon', 'car', 'ce', 'cela', 'ces', 'ceux', 'chaque', 'ci', 'comme', 'comment', 'dans', 'des', 'du', 'dedans', 'dehors', 'depuis', 'deux', 'devrait', 'doit', 'donc', 'dos', 'droite', 'début', 'elle', 'elles', 'en', 'encore', 'essai', 'est', 'et', 'eu', 'fait', 'faites', 'fois', 'font', 'force', 'haut', 'hors', 'http', 'https', 'ici', 'il', 'ils', 'je', 'juste', 'la', 'le', 'les', 'leur', 'là', 'ma', 'maintenant', 'mais', 'mes', 'mine', 'moins', 'mon', 'mot', 'même', 'ni', 'nommés', 'notre', 'nous', 'nouveaux', 'ou', 'où', 'par', 'parce', 'parole', 'pas', 'personnes', 'peut', 'peu', 'pièce', 'plupart', 'pour', 'pourquoi', 'quand', 'que', 'quel', 'quelle', 'quelles', 'quels', 'qui', 'sa', 'sans', 'ses', 'seulement', 'si', 'sien', 'son', 'sont', 'sous', 'soyez', 'sujet', 'sur', 'ta', 'tandis', 'tellement', 'tels', 'tes', 'ton', 'tous', 'tout', 'trop', 'très', 'tu', 'valeur', 'voie', 'voient', 'vont', 'votre', 'vous', 'vu', 'ça', 'étaient', 'état', 'étions', 'été', 'être']

$buffer

Storage used in computing the stem


    private
    static    string
    $buffer

$r1

$r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.


    private
    static    string
    $r1

$r1_index

Position of $r1 within the $word being stemmed


    private
    static    int
    $r1_index

$r2

$r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel.


    private
    static    string
    $r2

$r2_index

Position of $r2 within the $word being stemmed


    private
    static    int
    $r2_index

$rv

$rv is approximately the string after the first vowel in the $word we want to stem


    private
    static    string
    $rv

$rv_index

Position of $rv within the $word being stemmed


    private
    static    int
    $rv_index

$vowel

French vowels


    private
    static    string
    $vowel
     = 'aeiouyàâëéèêïîôûù'

segment()

Stub function which could be used for a word segmenter.


    public
            static        segment(string $pre_segment) : string

Such a segmenter, on input thisisabunchofwords, would output: this is a bunch of words.

Parameters

$pre_segment : string: the string before segmentation

Return values

string —

should return the string with words separated by spaces; in this case, it does nothing
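To illustrate what a non-stub segmenter could look like, here is a minimal sketch of greedy longest-match segmentation. The function name `segmentSketch` and the `$dictionary` parameter are hypothetical and are not part of Yioop; the actual segment() method does nothing.

```php
<?php
// Hypothetical greedy longest-match segmenter (not Yioop's code).
// At each position it takes the longest dictionary word that fits,
// falling back to a single character when nothing matches.
function segmentSketch(string $pre_segment, array $dictionary): string
{
    // Split into UTF-8 characters so accented letters stay intact
    $chars = preg_split('//u', $pre_segment, -1, PREG_SPLIT_NO_EMPTY);
    $len = count($chars);
    $words = [];
    $pos = 0;
    while ($pos < $len) {
        $match = $chars[$pos];
        // Try candidates from longest to shortest (at least 2 characters)
        for ($end = $len; $end > $pos + 1; $end--) {
            $candidate = implode('', array_slice($chars, $pos, $end - $pos));
            if (in_array($candidate, $dictionary, true)) {
                $match = $candidate;
                break;
            }
        }
        $words[] = $match;
        $pos += mb_strlen($match, 'UTF-8');
    }
    return implode(' ', $words);
}

// Assumed toy dictionary, for illustration only
$dictionary = ['this', 'is', 'a', 'bunch', 'of', 'words'];
echo segmentSketch('thisisabunchofwords', $dictionary), "\n";
// prints "this is a bunch of words"
```

Greedy matching is simple but can pick a wrong split when a long word hides two shorter ones, so a real segmenter would likely need something more careful.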

stem()

Computes the stem of a French word


    public
            static        stem(string $word) : string

Parameters

$word : string: the string to stem

Return values

string —

the stem of $word

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation)


    public
            static        stopwordsRemover(mixed $data) : mixed

Parameters

$data : mixed: either a string or an array of strings to remove stop words from

Return values

mixed —

$data with no stop words
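The behaviour described above can be sketched as follows. This is a rough illustration that assumes whitespace tokenization and case-insensitive matching; the helper name `removeStopWordsSketch` is hypothetical, the stop list is passed in explicitly rather than read from $stop_words, and Yioop's own implementation may differ. It relies on the mbstring extension.

```php
<?php
// Hypothetical helper (not Yioop's code): removes stop words from a
// string, or from each string of an array.
function removeStopWordsSketch($data, array $stop_words)
{
    if (is_array($data)) {
        // Apply the same filter to every element of the array
        return array_map(function ($item) use ($stop_words) {
            return removeStopWordsSketch($item, $stop_words);
        }, $data);
    }
    // Split on runs of whitespace; the /u flag keeps UTF-8 intact
    $terms = preg_split('/\s+/u', trim($data), -1, PREG_SPLIT_NO_EMPTY);
    $kept = [];
    foreach ($terms as $term) {
        // Compare the lowercased term against the stop list
        if (!in_array(mb_strtolower($term, 'UTF-8'), $stop_words, true)) {
            $kept[] = $term;
        }
    }
    return implode(' ', $kept);
}

// A few entries borrowed from $stop_words, for illustration
$sample_stop_words = ['le', 'la', 'les', 'et', 'est', 'très'];
echo removeStopWordsSketch('Le chat est très noir', $sample_stop_words), "\n";
// prints "chat noir"
```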

computeNonVowelRegions()

$r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.


    private
            static        computeNonVowelRegions() : mixed

$r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel.

Return values

mixed —
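As a sketch of the two definitions above, the following computes the start index of each region. The helper name `regionStartSketch` is hypothetical, and unlike the private method it takes the word and vowel list as arguments instead of working on the class's static buffer.

```php
<?php
// Hypothetical helper (not Yioop's code): returns the index just past
// the first non-vowel that follows a vowel, searching from $start.
// Returns the word length when there is no such non-vowel.
function regionStartSketch(string $word, int $start, string $vowels): int
{
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    $len = count($chars);
    for ($i = $start + 1; $i < $len; $i++) {
        $prev_is_vowel =
            mb_strpos($vowels, $chars[$i - 1], 0, 'UTF-8') !== false;
        $is_vowel = mb_strpos($vowels, $chars[$i], 0, 'UTF-8') !== false;
        if ($prev_is_vowel && !$is_vowel) {
            return $i + 1;
        }
    }
    return $len;
}

$vowels = 'aeiouyàâëéèêïîôûù';
$word = 'fameusement';
// R1 is searched from the start of the word, R2 from the start of R1
$r1_index = regionStartSketch($word, 0, $vowels);
$r2_index = regionStartSketch($word, $r1_index, $vowels);
echo mb_substr($word, $r1_index, null, 'UTF-8'), "\n"; // eusement
echo mb_substr($word, $r2_index, null, 'UTF-8'), "\n"; // ement
```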

computeNonVowels()

If a vowel shouldn't be treated as a vowel, it is capitalized by this method. (Operations are done on the buffer.)


    private
            static        computeNonVowels() : mixed

Return values

mixed —
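A sketch of this marking step, assuming the rule as stated in the Snowball description of the French stemmer: u or i between two vowels, y next to a vowel, and u after q are put into upper case. The function name `markNonVowelsSketch` is hypothetical, and Yioop's method may differ in detail.

```php
<?php
// Hypothetical helper (not Yioop's code): upper-cases letters that
// should not count as vowels during stemming.
function markNonVowelsSketch(string $word, string $vowels): string
{
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    $is_vowel = function ($ch) use ($vowels) {
        return $ch !== null && mb_strpos($vowels, $ch, 0, 'UTF-8') !== false;
    };
    $out = $chars;
    foreach ($chars as $i => $ch) {
        $prev = $chars[$i - 1] ?? null;
        $next = $chars[$i + 1] ?? null;
        if (($ch === 'u' || $ch === 'i')
            && $is_vowel($prev) && $is_vowel($next)) {
            // u or i with a vowel on both sides
            $out[$i] = strtoupper($ch);
        } elseif ($ch === 'y' && ($is_vowel($prev) || $is_vowel($next))) {
            // y with a vowel on either side
            $out[$i] = 'Y';
        } elseif ($ch === 'u' && $prev === 'q') {
            // u directly after q
            $out[$i] = 'U';
        }
    }
    return implode('', $out);
}

$vowels = 'aeiouyàâëéèêïîôûù';
echo markNonVowelsSketch('jouer', $vowels), "\n"; // joUer
echo markNonVowelsSketch('quand', $vowels), "\n"; // qUand
```

Because the vowel list contains only lower-case letters, the capitalized letters are then treated as non-vowels by later steps.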

step1()

Standard suffix removal


    private
            static        step1() : mixed

Return values

mixed —

step2a()

Stem verb suffixes beginning with i


    private
            static        step2a(string $ori_word) : mixed

Parameters

$ori_word : string: original word before stemming

Return values

mixed —

step2b()

Stem other verb suffixes


    private
            static        step2b() : mixed

Return values

mixed —

step3()

Replaces ç (c with a cedilla) with c, and replaces a Y at the end of a word with i


    private
            static        step3() : mixed

Return values

mixed —
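A minimal sketch of this step, assuming (as in the Snowball description) that both replacements apply only to the last letter of the word. The function name `step3Sketch` is hypothetical, and it takes the word as an argument rather than using the class's buffer.

```php
<?php
// Hypothetical helper (not Yioop's code)
function step3Sketch(string $word): string
{
    // A final Y (a y earlier marked as a non-vowel) becomes i
    $word = preg_replace('/Y\z/u', 'i', $word);
    // A final ç becomes c
    return preg_replace('/ç\z/u', 'c', $word);
}

echo step3Sketch('essaY'), "\n"; // essai
echo step3Sketch('reç'), "\n";   // rec
```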

step4()

If the word ends in an s, not preceded by a, i, o, u, è or s, delete it.


    private
            static        step4() : mixed

Return values

mixed —
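The rule stated above can be sketched with a single regular expression. The function name `step4Sketch` is hypothetical; the sketch covers only the final-s rule and takes the word as an argument rather than using the class's buffer.

```php
<?php
// Hypothetical helper (not Yioop's code): deletes a final s unless the
// letter before it is a, i, o, u, è or s.
function step4Sketch(string $word): string
{
    // Capture the preceding letter so it can be kept in the result
    return preg_replace('/([^aiouès])s\z/u', '$1', $word);
}

echo step4Sketch('chats'), "\n"; // chat
echo step4Sketch('pas'), "\n";   // pas
```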

step5()

Un-doubles a doubled letter at the end of the word


    private
            static        step5() : mixed

Return values

mixed —
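A sketch of un-doubling, assuming the endings listed in the Snowball description of the French stemmer (enn, onn, ett, ell, eill); Yioop's code may handle a different set. The function name `step5Sketch` is hypothetical.

```php
<?php
// Hypothetical helper (not Yioop's code): if the word ends in one of
// the listed doubled endings, drop its last letter.
function step5Sketch(string $word): string
{
    if (preg_match('/(enn|onn|ett|ell|eill)\z/u', $word)) {
        return mb_substr($word, 0, -1, 'UTF-8');
    }
    return $word;
}

echo step5Sketch('personn'), "\n"; // person
echo step5Sketch('appell'), "\n";  // appel
```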

step6()

Removes the accent at the end of the word


    private
            static        step6() : mixed

Return values

mixed —
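A sketch of un-accenting, assuming the rule from the Snowball description: if the word ends in é or è followed by at least one non-vowel, the accent is removed from the e. The function name `step6Sketch` is hypothetical, and the vowel list is passed in rather than read from $vowel.

```php
<?php
// Hypothetical helper (not Yioop's code): turns é or è into e when it
// is followed only by non-vowels up to the end of the word.
function step6Sketch(string $word, string $vowels): string
{
    // Character class matching anything that is not a listed vowel
    $non_vowel = '[^' . $vowels . ']';
    return preg_replace(
        '/[éè](' . $non_vowel . '+)\z/u', 'e$1', $word);
}

$vowels = 'aeiouyàâëéèêïîôûù';
echo step6Sketch('collèg', $vowels), "\n"; // colleg
echo step6Sketch('été', $vowels), "\n";    // été
```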
