Yioop_V9.5_Source_Code_Documentation

Tokenizer
in package

This class has a collection of methods for French locale-specific tokenization. In particular, it has a stemmer and a stop word remover (for use mainly in word cloud creation). The stemmer is my attempt at re-implementing the stemmer algorithm given at http://snowball.tartarus.org and was inspired by http://snowball.tartarus.org/otherlangs/french_javascript.txt. Given a word, its stem is the part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, and tallest. A stemmer takes a word and tries to produce its stem.

Tags
author

Chris Pollett

Table of Contents

$no_stem_list  : array<string|int, mixed>
Words we don't want to be stemmed
$stop_words  : mixed
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries
$buffer  : string
Storage used in computing the stem
$r1  : string
$r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.
$r1_index  : int
Position in the $word being stemmed at which $r1 begins
$r2  : string
$r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel
$r2_index  : int
Position in the $word being stemmed at which $r2 begins
$rv  : string
$rv is approximately the string after the first vowel in the $word we want to stem
$rv_index  : int
Position in the $word being stemmed at which $rv begins
$vowel  : string
French vowels
segment()  : string
Stub function which could be used for a word segmenter.
stem()  : string
Computes the stem of a French word
stopwordsRemover()  : mixed
Removes the stop words from the page (used for Word Cloud generation)
computeNonVowelRegions()  : mixed
Computes the regions $r1 and $r2. $r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.
computeNonVowels()  : mixed
If a vowel shouldn't be treated as a vowel, it is capitalized by this method. (Operations are done on $buffer.)
step1()  : mixed
Standard suffix removal
step2a()  : mixed
Stem verb suffixes beginning with i
step2b()  : mixed
Stem other verb suffixes
step3()  : mixed
Gets rid of cedillas (ç becomes c) and replaces a final Y with i
step4()  : mixed
If the word ends in an s, not preceded by a, i, o, u, è or s, delete it.
step5()  : mixed
Un-double letter end
step6()  : mixed
Un-accent end

Properties

$no_stem_list

Words we don't want to be stemmed

public static array<string|int, mixed> $no_stem_list = ["titanic"]

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries

public static mixed $stop_words = ['alors', 'au', 'aucuns', 'aussi', 'autre', 'avant', 'avec', 'avoir', 'bon', 'car', 'ce', 'cela', 'ces', 'ceux', 'chaque', 'ci', 'comme', 'comment', 'dans', 'des', 'du', 'dedans', 'dehors', 'depuis', 'deux', 'devrait', 'doit', 'donc', 'dos', 'droite', 'début', 'elle', 'elles', 'en', 'encore', 'essai', 'est', 'et', 'eu', 'fait', 'faites', 'fois', 'font', 'force', 'haut', 'hors', 'http', 'https', 'ici', 'il', 'ils', 'je', 'juste', 'la', 'le', 'les', 'leur', 'là', 'ma', 'maintenant', 'mais', 'mes', 'mine', 'moins', 'mon', 'mot', 'même', 'ni', 'nommés', 'notre', 'nous', 'nouveaux', 'ou', 'où', 'par', 'parce', 'parole', 'pas', 'personnes', 'peut', 'peu', 'pièce', 'plupart', 'pour', 'pourquoi', 'quand', 'que', 'quel', 'quelle', 'quelles', 'quels', 'qui', 'sa', 'sans', 'ses', 'seulement', 'si', 'sien', 'son', 'sont', 'sous', 'soyez', 'sujet', 'sur', 'ta', 'tandis', 'tellement', 'tels', 'tes', 'ton', 'tous', 'tout', 'trop', 'très', 'tu', 'valeur', 'voie', 'voient', 'vont', 'votre', 'vous', 'vu', 'ça', 'étaient', 'état', 'étions', 'été', 'être']
Tags
array

$buffer

Storage used in computing the stem

private static string $buffer

$r1

$r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.

private static string $r1

$r1_index

Position in the $word being stemmed at which $r1 begins

private static int $r1_index

$r2

$r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel

private static string $r2

$r2_index

Position in the $word being stemmed at which $r2 begins

private static int $r2_index

$rv

$rv is approximately the string after the first vowel in the $word we want to stem

private static string $rv

$rv_index

Position in the $word being stemmed at which $rv begins

private static int $rv_index

$vowel

French vowels

private static string $vowel = 'aeiouyàâëéèêïîôûù'

Methods

segment()

Stub function which could be used for a word segmenter.

public static segment(string $pre_segment) : string

Such a segmenter, on input "thisisabunchofwords", would output "this is a bunch of words".

Parameters
$pre_segment : string

before segmentation

Return values
string

should return a string with words separated by spaces; in this case, the stub does nothing
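
To illustrate what a non-stub segmenter might do, here is a minimal sketch of greedy longest-match segmentation against a word list. The function name greedySegment and its $dictionary parameter are hypothetical and are not part of Yioop; greedy matching is only one possible approach and can mis-split ambiguous input.

```php
<?php
// Hypothetical helper, not part of Yioop: splits $pre_segment by repeatedly
// taking the longest dictionary word that starts at the current position.
function greedySegment(string $pre_segment, array $dictionary): string
{
    $chars = mb_str_split($pre_segment, 1, 'UTF-8');
    $n = count($chars);
    $words = [];
    $pos = 0;
    while ($pos < $n) {
        $best = 1; // fall back to a single character when nothing longer matches
        for ($len = $n - $pos; $len > 1; $len--) {
            $candidate = implode('', array_slice($chars, $pos, $len));
            if (in_array($candidate, $dictionary, true)) {
                $best = $len;
                break;
            }
        }
        $words[] = implode('', array_slice($chars, $pos, $best));
        $pos += $best;
    }
    return implode(' ', $words);
}

$dictionary = ['this', 'is', 'a', 'bunch', 'of', 'words'];
echo greedySegment('thisisabunchofwords', $dictionary), "\n";
```

With the small dictionary above, the call reproduces the example from the description.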

stem()

Computes the stem of a French word

public static stem(string $word) : string
Parameters
$word : string

the string to stem

Return values
string

the stem of $word

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation)

public static stopwordsRemover(mixed $data) : mixed
Parameters
$data : mixed

either a string or an array of strings to remove stop words from

Return values
mixed

$data with no stop words
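
A minimal sketch of stop word removal that accepts either a string or an array of strings, as the parameter description says. The function removeStopWords, the shortened word list, and the choices of whole-word, case-insensitive matching and whitespace tidying are assumptions for illustration; the actual method may behave differently.

```php
<?php
// Hypothetical helper, not part of Yioop.
function removeStopWords($data, array $stop_words)
{
    $quoted = array_map(fn($w) => preg_quote($w, '/'), $stop_words);
    // (?<!\pL) and (?!\pL) act as Unicode-aware word boundaries
    $pattern = '/(?<!\pL)(' . implode('|', $quoted) . ')(?!\pL)/iu';
    // preg_replace accepts a string or an array and returns the same shape
    $stripped = preg_replace($pattern, '', $data);
    // collapse the whitespace left behind by removed words
    $tidy = fn($s) => trim(preg_replace('/\s+/u', ' ', $s));
    return is_array($stripped) ? array_map($tidy, $stripped) : $tidy($stripped);
}

// A small subset of the $stop_words list, for illustration only
$stop_words = ['le', 'la', 'et', 'dans', 'très', 'bon'];
echo removeStopWords('Le chat et la souris', $stop_words), "\n";
print_r(removeStopWords(['dans la maison', 'très bon vin'], $stop_words));
```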

computeNonVowelRegions()

Computes the regions $r1 and $r2. $r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.

private static computeNonVowelRegions() : mixed

$r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel

Return values
mixed
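
The two region definitions can be sketched as follows. The helper regionStart is hypothetical and works on a plain lowercase word; the actual method operates on the class's static $buffer and may differ in detail.

```php
<?php
// Hypothetical helper, not part of Yioop: returns the index just past the
// first non-vowel that follows a vowel, searching from $start. Returns the
// word length if there is no such non-vowel.
function regionStart(string $word, int $start = 0): int
{
    $vowels = 'aeiouyàâëéèêïîôûù'; // same letters as the $vowel property
    $chars = mb_str_split($word, 1, 'UTF-8');
    $length = count($chars);
    for ($i = $start + 1; $i < $length; $i++) {
        $prev_is_vowel = mb_strpos($vowels, $chars[$i - 1], 0, 'UTF-8') !== false;
        $is_vowel = mb_strpos($vowels, $chars[$i], 0, 'UTF-8') !== false;
        if ($prev_is_vowel && !$is_vowel) {
            return $i + 1;
        }
    }
    return $length;
}

$word = 'fameusement';
$r1_index = regionStart($word);             // search the whole word
$r2_index = regionStart($word, $r1_index);  // search again inside $r1
$r1 = mb_substr($word, $r1_index, null, 'UTF-8');
$r2 = mb_substr($word, $r2_index, null, 'UTF-8');
echo "$r1 $r2\n"; // eusement ement
```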

computeNonVowels()

If a vowel shouldn't be treated as a vowel, it is capitalized by this method. (Operations are done on $buffer.)

private static computeNonVowels() : mixed
Return values
mixed
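
As a sketch of the marking rule, the Snowball description of the French stemmer puts into upper case every u or i that is both preceded and followed by a vowel, every y that is preceded or followed by a vowel, and every u after q. The function markNonVowels below is hypothetical; it takes and returns a string rather than working on $buffer, and it tests neighbours against the original letters, which is an assumption.

```php
<?php
// Hypothetical helper, not part of Yioop.
function markNonVowels(string $word): string
{
    $vowels = 'aeiouyàâëéèêïîôûù';
    $chars = mb_str_split($word, 1, 'UTF-8');
    $is_vowel = fn($c) => $c !== null
        && mb_strpos($vowels, $c, 0, 'UTF-8') !== false;
    $out = $chars;
    $last = count($chars) - 1;
    foreach ($chars as $i => $c) {
        $prev = $i > 0 ? $chars[$i - 1] : null;
        $next = $i < $last ? $chars[$i + 1] : null;
        if (($c === 'u' || $c === 'i') && $is_vowel($prev) && $is_vowel($next)) {
            $out[$i] = strtoupper($c);   // u or i between two vowels
        } elseif ($c === 'y' && ($is_vowel($prev) || $is_vowel($next))) {
            $out[$i] = 'Y';              // y next to a vowel
        } elseif ($c === 'u' && $prev === 'q') {
            $out[$i] = 'U';              // u after q
        }
    }
    return implode('', $out);
}

echo markNonVowels('jouer'), ' ', markNonVowels('quand'), "\n"; // joUer qUand
```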

step1()

Standard suffix removal

private static step1() : mixed
Return values
mixed

step2a()

Stem verb suffixes beginning with i

private static step2a(string $ori_word) : mixed
Parameters
$ori_word : string

original word before stemming

Return values
mixed

step2b()

Stem other verb suffixes

private static step2b() : mixed
Return values
mixed

step3()

Gets rid of cedillas (ç becomes c) and replaces a final Y with i

private static step3() : mixed
Return values
mixed
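
A minimal sketch of this replacement, assuming it applies only to the last character of the word. The function replaceFinalYAndCedilla is hypothetical and takes the word as an argument instead of using $buffer.

```php
<?php
// Hypothetical helper, not part of Yioop.
function replaceFinalYAndCedilla(string $word): string
{
    $last = mb_substr($word, -1, null, 'UTF-8');
    $rest = mb_substr($word, 0, -1, 'UTF-8');
    if ($last === 'Y') {
        return $rest . 'i'; // Y is a y that was capitalized as a non-vowel
    }
    if ($last === 'ç') {
        return $rest . 'c';
    }
    return $word;
}

echo replaceFinalYAndCedilla('appuY'), ' ', replaceFinalYAndCedilla('reç'), "\n";
```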

step4()

If the word ends in an s, not preceded by a, i, o, u, è or s, delete it.

private static step4() : mixed
Return values
mixed
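
The rule stated above can be sketched with a single regular expression. The function dropFinalS is hypothetical, and the sketch covers only the final-s rule from this description, not anything else the actual method may do.

```php
<?php
// Hypothetical helper, not part of Yioop: deletes a final s unless the
// letter before it is a, i, o, u, è or s.
function dropFinalS(string $word): string
{
    // the lookbehind requires a preceding character outside the listed set
    return preg_replace('/(?<=[^aiouès])s\z/u', '', $word) ?? $word;
}

echo dropFinalS('chats'), ' ', dropFinalS('tapis'), "\n"; // chat tapis
```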

step5()

Un-double letter end

private static step5() : mixed
Return values
mixed
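
In the Snowball description of the French stemmer, un-doubling means deleting the last letter of a word that ends in enn, onn, ett, ell or eill. A minimal sketch under that assumption follows; the function undouble is hypothetical, and the actual method may handle the endings differently.

```php
<?php
// Hypothetical helper, not part of Yioop.
function undouble(string $word): string
{
    if (preg_match('/(enn|onn|ett|ell|eill)\z/u', $word) === 1) {
        return mb_substr($word, 0, -1, 'UTF-8'); // drop the last letter
    }
    return $word;
}

echo undouble('abandonn'), ' ', undouble('nouvell'), "\n"; // abandon nouvel
```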

step6()

Un-accent end

private static step6() : mixed
Return values
mixed
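
In the Snowball description, un-accenting applies when the word ends in é or è followed by at least one non-vowel, in which case the accent is removed from the e. A minimal sketch under that assumption follows; the function unaccent is hypothetical, and it reuses the letters of the $vowel property to decide what counts as a non-vowel.

```php
<?php
// Hypothetical helper, not part of Yioop.
function unaccent(string $word): string
{
    // é or è, then one or more non-vowels running to the end of the word
    $pattern = '/[éè]([^aeiouyàâëéèêïîôûù]+)\z/u';
    return preg_replace($pattern, 'e$1', $word) ?? $word;
}

echo unaccent('célèbr'), "\n"; // célebr
```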
