Tokenizer
This class provides a collection of methods for French-locale-specific tokenization. In particular, it has a stemmer and a stop word remover (used mainly in word cloud creation). The stemmer is my stab at re-implementing the stemmer algorithm given at http://snowball.tartarus.org and was inspired by http://snowball.tartarus.org/otherlangs/french_javascript.txt. Here, given a word, its stem is the part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, and tallest. A stemmer takes a word and tries to produce its stem.
Table of Contents
- $no_stem_list : array<string|int, mixed>
- Words we don't want to be stemmed
- $stop_words : mixed
- A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries
- $buffer : string
- Storage used in computing the stem
- $r1 : string
- $r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.
- $r1_index : int
- Position of $r1 within the $word being stemmed
- $r2 : string
- $r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel
- $r2_index : int
- Position of $r2 within the $word being stemmed
- $rv : string
- $rv is approximately the string after the first vowel in the $word we want to stem
- $rv_index : int
- Position of $rv within the $word being stemmed
- $vowel : string
- French vowels
- segment() : string
- Stub function which could be used for a word segmenter.
- stem() : string
- Computes the stem of a French word
- stopwordsRemover() : mixed
- Removes the stop words from the page (used for Word Cloud generation)
- computeNonVowelRegions() : mixed
- $r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.
- computeNonVowels() : mixed
- If a vowel shouldn't be treated as a vowel, it is capitalized by this method. (Operations are done on the buffer.)
- step1() : mixed
- Standard suffix removal
- step2a() : mixed
- Stems verb suffixes beginning with i
- step2b() : mixed
- Stem other verb suffixes
- step3() : mixed
- Gets rid of cedillas (ç becomes c) and of a Y at the end of a word (Y becomes i)
- step4() : mixed
- If the word ends in an s, not preceded by a, i, o, u, è or s, delete it.
- step5() : mixed
- Un-doubles a doubled letter at the end of the word
- step6() : mixed
- Un-accents the end of the word
Properties
$no_stem_list
Words we don't want to be stemmed
public
static array<string|int, mixed>
$no_stem_list
= ["titanic"]
$stop_words
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries
public
static mixed
$stop_words
= ['alors', 'au', 'aucuns', 'aussi', 'autre', 'avant', 'avec', 'avoir', 'bon', 'car', 'ce', 'cela', 'ces', 'ceux', 'chaque', 'ci', 'comme', 'comment', 'dans', 'des', 'du', 'dedans', 'dehors', 'depuis', 'deux', 'devrait', 'doit', 'donc', 'dos', 'droite', 'début', 'elle', 'elles', 'en', 'encore', 'essai', 'est', 'et', 'eu', 'fait', 'faites', 'fois', 'font', 'force', 'haut', 'hors', 'http', 'https', 'ici', 'il', 'ils', 'je', 'juste', 'la', 'le', 'les', 'leur', 'là', 'ma', 'maintenant', 'mais', 'mes', 'mine', 'moins', 'mon', 'mot', 'même', 'ni', 'nommés', 'notre', 'nous', 'nouveaux', 'ou', 'où', 'par', 'parce', 'parole', 'pas', 'personnes', 'peut', 'peu', 'pièce', 'plupart', 'pour', 'pourquoi', 'quand', 'que', 'quel', 'quelle', 'quelles', 'quels', 'qui', 'sa', 'sans', 'ses', 'seulement', 'si', 'sien', 'son', 'sont', 'sous', 'soyez', 'sujet', 'sur', 'ta', 'tandis', 'tellement', 'tels', 'tes', 'ton', 'tous', 'tout', 'trop', 'très', 'tu', 'valeur', 'voie', 'voient', 'vont', 'votre', 'vous', 'vu', 'ça', 'étaient', 'état', 'étions', 'été', 'être']
$buffer
Storage used in computing the stem
private
static string
$buffer
$r1
$r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.
private
static string
$r1
$r1_index
Position of $r1 within the $word being stemmed
private
static int
$r1_index
$r2
$r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel
private
static string
$r2
$r2_index
Position of $r2 within the $word being stemmed
private
static int
$r2_index
$rv
$rv is approximately the string after the first vowel in the $word we want to stem
private
static string
$rv
$rv_index
Position of $rv within the $word being stemmed
private
static int
$rv_index
$vowel
French vowels
private
static string
$vowel
= 'aeiouyàâëéèêïîôûù'
Methods
segment()
Stub function which could be used for a word segmenter.
public
static segment(string $pre_segment) : string
On input thisisabunchofwords, such a segmenter would output: this is a bunch of words
Parameters
- $pre_segment : string
- the string before segmentation
Return values
string —should return the string with its words separated by spaces; in this case it does nothing
stem()
Computes the stem of a French word
public
static stem(string $word) : string
Parameters
- $word : string
- the string to stem
Return values
string —the stem of $word
stopwordsRemover()
Removes the stop words from the page (used for Word Cloud generation)
public
static stopwordsRemover(mixed $data) : mixed
Parameters
- $data : mixed
- either a string or an array of strings to remove stop words from
Return values
mixed —$data with no stop words
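The behavior described above can be illustrated with a minimal sketch. It is not the class's actual implementation: the function name removeStopWords, the shortened stop word list, and the whole-word, case-insensitive matching are all assumptions made for the example.

```php
<?php
// Hypothetical stand-in for stopwordsRemover(); the list below is a short
// excerpt of $stop_words, and the matching rules are assumptions.
$stop_words = ['alors', 'au', 'le', 'la', 'les', 'et', 'est', 'très'];

function removeStopWords($data, array $stop_words)
{
    // Accept either a string or an array of strings, as the docs describe.
    if (is_array($data)) {
        return array_map(fn($s) => removeStopWords($s, $stop_words), $data);
    }
    // Build one alternation of the (escaped) stop words.
    $alternatives = implode('|',
        array_map(fn($w) => preg_quote($w, '/'), $stop_words));
    // Match only whole words: no letter or digit directly before or after.
    $pattern = '/(?<![\p{L}\p{N}])(?:' . $alternatives .
        ')(?![\p{L}\p{N}])/ui';
    $without = preg_replace($pattern, '', $data);
    // Collapse the whitespace left behind by the removed words.
    return trim(preg_replace('/\s+/u', ' ', $without));
}

echo removeStopWords('Le chat est très noir', $stop_words), "\n";
```

Passing an array applies the same removal to each element, so an element made up only of stop words becomes an empty string.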
computeNonVowelRegions()
$r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.
private
static computeNonVowelRegions() : mixed
$r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel
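The two definitions above can be sketched with a single helper applied twice. This is only an illustration: the helper name regionAfterVowelNonVowel is hypothetical, the vowel list is copied from $vowel, and the real method works on the class's private buffer and index properties rather than returning strings.

```php
<?php
// Hypothetical helper, not part of the class. Returns the region after the
// first non-vowel that follows a vowel, or '' (the end of the word) if
// there is no such non-vowel.
function regionAfterVowelNonVowel(string $word): string
{
    $v = 'aeiouyàâëéèêïîôûù'; // same letters as $vowel
    // Lazily skip to the first vowel + non-vowel pair, capture the rest.
    $pattern = '/^.*?[' . $v . '][^' . $v . '](.*)$/us';
    if (preg_match($pattern, $word, $m)) {
        return $m[1];
    }
    return '';
}

$r1 = regionAfterVowelNonVowel('fameusement'); // 'eusement'
$r2 = regionAfterVowelNonVowel($r1);           // 'ement'
echo $r1, ' ', $r2, "\n";
```

Because $r2 is defined by the same rule applied inside $r1, the second call simply takes $r1 as its input.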
Return values
mixed
computeNonVowels()
If a vowel shouldn't be treated as a vowel, it is capitalized by this method. (Operations are done on the buffer.)
private
static computeNonVowels() : mixed
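As a rough sketch of what such capitalization looks like, the following follows the rules given in the Snowball description of the French stemmer (u or i between two vowels, y next to a vowel, u after q). Whether this class applies exactly these rules is an assumption, the function name markNonVowels is hypothetical, and the sketch applies the rules in three passes rather than one left-to-right scan, so rare edge cases may differ.

```php
<?php
// Hypothetical sketch; assumes a lower-case input word.
function markNonVowels(string $word): string
{
    $v = 'aeiouyàâëéèêïîôûù'; // same letters as $vowel
    // u or i with a vowel on both sides becomes U or I.
    $word = preg_replace_callback(
        '/(?<=[' . $v . '])[ui](?=[' . $v . '])/u',
        fn($m) => strtoupper($m[0]),
        $word
    );
    // y with a vowel directly before or after it becomes Y.
    $word = preg_replace(
        '/(?<=[' . $v . '])y|y(?=[' . $v . '])/u', 'Y', $word);
    // u directly after q becomes U.
    return preg_replace('/(?<=q)u/u', 'U', $word);
}

foreach (['jouer', 'ennuie', 'yeux', 'quand'] as $w) {
    echo $w, ' -> ', markNonVowels($w), "\n";
}
```

Since the upper-case letters are not in the vowel list, later steps that test for vowels automatically treat them as non-vowels.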
Return values
mixed
step1()
Standard suffix removal
private
static step1() : mixed
Return values
mixed
step2a()
Stems verb suffixes beginning with i
private
static step2a(string $ori_word) : mixed
Parameters
- $ori_word : string
- original word before stemming
Return values
mixed
step2b()
Stem other verb suffixes
private
static step2b() : mixed
Return values
mixed
step3()
Gets rid of cedillas (ç becomes c) and of a Y at the end of a word (Y becomes i)
private
static step3() : mixed
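A minimal sketch of this replacement, assuming (as in the Snowball description) that only a final ç and a final Y are affected. The function name replaceFinalYAndCedilla is hypothetical, and the real method operates on the private buffer instead of taking an argument.

```php
<?php
// Hypothetical sketch: a final Y becomes i, a final ç becomes c.
function replaceFinalYAndCedilla(string $word): string
{
    return preg_replace(['/Y$/u', '/ç$/u'], ['i', 'c'], $word);
}

echo replaceFinalYAndCedilla('essaY'), "\n";
echo replaceFinalYAndCedilla('commenç'), "\n";
```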
Return values
mixed
step4()
If the word ends in an s, not preceded by a, i, o, u, è or s, delete it.
private
static step4() : mixed
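The rule in the summary can be sketched as a single regular expression. The function name dropFinalS is hypothetical, and the sketch covers only the rule stated above; the real method may do more and works on the private buffer.

```php
<?php
// Hypothetical sketch: delete a final s unless the letter before it
// is a, i, o, u, è or s.
function dropFinalS(string $word): string
{
    // Capture the permitted preceding letter and keep it.
    return preg_replace('/([^aiouès])s$/u', '$1', $word);
}

foreach (['chats', 'livres', 'pas', 'succès', 'express'] as $w) {
    echo $w, ' -> ', dropFinalS($w), "\n";
}
```

Requiring a captured preceding letter means a word consisting of the single letter s is left alone.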
Return values
mixed
step5()
Un-doubles a doubled letter at the end of the word
private
static step5() : mixed
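As an illustration, the Snowball description of this step deletes the last letter of a word ending in enn, onn, ett, ell or eill. The sketch below assumes this class uses the same list of endings; the function name undoubleEnd is hypothetical.

```php
<?php
// Hypothetical sketch: drop the last letter of a word that ends in
// enn, onn, ett, ell or eill (endings taken from the Snowball description).
function undoubleEnd(string $word): string
{
    if (preg_match('/(?:enn|onn|ett|ell|eill)$/u', $word)) {
        // The final letter is always a one-byte n, t or l here.
        return substr($word, 0, -1);
    }
    return $word;
}

echo undoubleEnd('chienn'), ' ', undoubleEnd('chat'), "\n";
```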
Return values
mixed
step6()
Un-accents the end of the word
private
static step6() : mixed
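In the Snowball description, this step removes the accent from an é or è that is followed by at least one non-vowel at the end of the word. The sketch below assumes this class follows that rule; the function name unaccentEnd is hypothetical and the vowel list is copied from $vowel.

```php
<?php
// Hypothetical sketch: é or è followed only by non-vowels up to the end
// of the word loses its accent.
function unaccentEnd(string $word): string
{
    $v = 'aeiouyàâëéèêïîôûù'; // same letters as $vowel
    return preg_replace('/[éè]([^' . $v . ']+)$/u', 'e$1', $word);
}

echo unaccentEnd('célèbr'), ' ', unaccentEnd('été'), "\n";
```

A word that ends in the accented letter itself, such as été, is left unchanged because no non-vowel follows it.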