Tokenizer
in package
This class provides a collection of methods for Dutch locale-specific tokenization. In particular, it has a stemmer.
Table of Contents
- $no_stem_list : array<string|int, mixed>
- Words we don't want to be stemmed
- $removed_e_suffix : array<string|int, mixed>
- Boolean flag recording whether step2 removed the e suffix
- $stop_words : mixed
- A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection
- segment() : string
- Stub function which could be used for a word segmenter.
- stem() : string
- Computes the stem of a Dutch word
- step3b() : string
- Search for the longest among the following suffixes, and perform the action indicated.
- stopwordsRemover() : mixed
- Removes the stop words from the page (used for Word Cloud generation and language detection)
- endsWith() : bool
- Checks to see if a string ends with a certain string
- getRIndex() : int
- Get the R index. The R index is the index of the first consonant that follows a vowel after the $start index
- isVowel() : bool
- Check that the letter is a vowel
- removeAllUmlautAndAcuteAccents() : string
- Remove all umlaut and acute accents that need to be removed.
- replace() : string
- Replace part of a string based on a regular expression
- step1() : string
- Remove a valid en-ending, defined as en preceded by a non-vowel and not preceded by gem
- step2() : string
- Delete the suffix e if in R1 and preceded by a non-vowel, and then undouble the ending
- step3a() : string
- Delete the letters heid if in R2 and not preceded by a c, and treat a preceding en as in step 1
- step4() : string
- If the word ends CVD, where C is a non-vowel, D is a non-vowel other than I, and V is double a, e, o or u, remove one of the vowels from V (for example, moon -> mon, weed -> wed).
- substituteIAndY() : string
- Put initial y, y after a vowel, and i between vowels into upper case.
- undouble() : string
- Undoubles the end of a string. If the string ends in kk, tt, or dd, one of the two characters is removed
Properties
$no_stem_list
Words we don't want to be stemmed
public
static array<string|int, mixed>
$no_stem_list
= ["abs", "ahs", "aken", "àlle", "als", "are", "allèen", "ate", "aten", "azen", "bse", "cfce", "curaçao", "dègelijk", "dme", "ede", "eden", "eds", "ehs", "ems", "ene", "epe", "eps", "ers", "eten", "ets", "even", "fme", "gedaçht", "ghe", "gve", "hdpe", "hôte", "hpe", "hse", "ibs", "ics", "ile", "ims", "jònge", "kwe", "ldpe", "lldpe", "lme", "lze", "maitres", "mwe", "nme", "ode", "ogen", "oke", "ole", "ons", "ònze", "open", "ops", "oren", "ors", "oss", "oven", "ows", "pre", "pve", "rhône", "ròme", "rwe", "ske", "sme", "spe", "ste", "the", "tje", "uce", "uden", "uien", "uren", "use", "uwe", "vse", "ype"]
$removed_e_suffix
Boolean flag recording whether step2 removed the e suffix. The declared type below is an array, but the description and the default value false indicate a boolean.
public
static array<string|int, mixed>
$removed_e_suffix
= false
$stop_words
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection
public
static mixed
$stop_words
= ['als', 'I', 'zijn', 'dat', 'hij', 'was', 'voor', 'op', 'zijn', 'met', 'ze', 'zijn', 'bij', 'een', 'hebben', 'deze', 'van', 'door', 'heet', 'woord', 'maar', 'wat', 'sommige', 'is', 'het', 'u', 'of', 'had', 'de', 'van', 'aan', 'en', 'een', 'in', 'we', 'kan', 'uit', 'andere', 'waren', 'die', 'doen', 'hun', 'tijd', 'indien', 'zal', 'hoe', 'zei', 'een', 'elk', 'vertellen', 'doet', 'set', 'drie', 'willen', 'lucht', 'goed', 'ook', 'spelen', 'klein', 'end', 'zetten', 'thuis', 'lezen', 'de hand', 'poort', 'grote', 'spell', 'toevoegen', 'zelfs', 'land', 'hier', 'moet', 'grote', 'hoog', 'dergelijke', 'volgen', 'act', 'waarom', 'vragen', 'mannen', 'verandering', 'ging', 'licht', 'soort', 'uitgeschakeld', 'nodig', 'huis', 'afbeelding', 'proberen', 'ons', 'weer', 'dier', 'punt', 'moeder', 'wereld', 'dichtbij', 'bouwen', 'zelf', 'aarde', 'vader']
Methods
segment()
Stub function which could be used for a word segmenter.
public
static segment(string $pre_segment) : string
Such a segmenter on input thisisabunchofwords would output this is a bunch of words
Parameters
- $pre_segment : string
-
before segmentation
Return values
string —ideally a string with the words separated by spaces; the current stub performs no segmentation
stem()
Computes the stem of a Dutch word
public
static stem(string $word) : string
For example, lichamelijk, lichamelijke, lichamelijkheden, and lichamen all have licham as a stem.
Parameters
- $word : string
-
the string to stem
Return values
string —the stem of $word
step3b()
Search for the longest among the following suffixes, and perform the action indicated.
public
static step3b(string $word, int $R2) : string
If in R2 and ends with eigend, eiging, igend or iging, remove it.
If in R2 and ends with ig preceded by an e, remove it.
If in R2 and ends with lijk, baar or bar, remove it.
Parameters
- $word : string
-
the string to stem
- $R2 : int
-
the R index
Return values
string —the string with the various endings removed if they exist
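The "longest suffix inside R2" rule can be illustrated with a small stand-alone sketch. The function name `sketchStripLongestSuffix`, the explicit suffix list parameter, and the convention that `$r2` is the byte offset where R2 begins are assumptions made for illustration; they are not taken from the class. The ig rule is left out because the description above does not pin down exactly what is removed.

```php
<?php
// Hypothetical sketch, not the class's step3b(): remove the longest
// listed suffix, but only when that suffix starts at or after $r2.
function sketchStripLongestSuffix(string $word, int $r2, array $suffixes): string
{
    // Try longer suffixes first so that "eigend" wins over "igend".
    usort($suffixes, fn($a, $b) => strlen($b) <=> strlen($a));
    foreach ($suffixes as $suffix) {
        $pos = strlen($word) - strlen($suffix);
        if ($pos >= 0 && substr($word, $pos) === $suffix) {
            // The longest match decides: strip it only if it lies in R2.
            return ($r2 >= 0 && $pos >= $r2) ? substr($word, 0, $pos) : $word;
        }
    }
    return $word;
}

$suffixes = ['eigend', 'eiging', 'igend', 'iging', 'lijk', 'baar', 'bar'];
echo sketchStripLongestSuffix('lichamelijk', 5, $suffixes), "\n";
```

A suffix that matches but begins before `$r2` leaves the word untouched, which is why a short word such as baar is returned as is when `$r2` is 2.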
stopwordsRemover()
Removes the stop words from the page (used for Word Cloud generation and language detection)
public
static stopwordsRemover(mixed $data) : mixed
Parameters
- $data : mixed
-
either a string or an array of strings to remove stop words from
Return values
mixed —$data with no stop words
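A minimal sketch of stop word removal for both input shapes follows. It is an assumption about how such a remover could work, not the class's code: `sketchRemoveStopWords` is a hypothetical name, the stop list is passed in explicitly instead of being read from `$stop_words`, and collapsing the leftover whitespace is an added convenience.

```php
<?php
// Hypothetical sketch: strip whole-word, case-insensitive matches of
// the given stop words from a string or from each string in an array.
function sketchRemoveStopWords($data, array $stop_words)
{
    $quoted = array_map(fn($w) => preg_quote($w, '/'), $stop_words);
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/iu';
    $clean = function (string $text) use ($pattern): string {
        $text = preg_replace($pattern, '', $text) ?? $text;
        // Collapse the gaps left behind by removed words.
        return trim(preg_replace('/\s+/u', ' ', $text) ?? $text);
    };
    return is_array($data) ? array_map($clean, $data) : $clean((string) $data);
}

echo sketchRemoveStopWords('De kat en het huis', ['de', 'en', 'het']), "\n";
```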
endsWith()
Checks to see if a string ends with a certain string
private
static endsWith(string $haystack, string $needle[, bool $case = true ]) : bool
Parameters
- $haystack : string
-
the string to check
- $needle : string
-
the string to match at the end
- $case : bool = true
-
whether the check should be case insensitive or not
Return values
bool —true if it ends with $needle, otherwise false
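Because the method is private, a stand-alone sketch may help show the intended check. The name `sketchEndsWith` is hypothetical, and the reading that a true third argument means "ignore case" is an assumption based on the parameter description. Case folding here is ASCII-only.

```php
<?php
// Hypothetical sketch of a suffix test with optional ASCII case folding.
function sketchEndsWith(string $haystack, string $needle, bool $ignore_case = true): bool
{
    $len = strlen($needle);
    if ($len === 0) {
        return true; // every string ends with the empty string
    }
    if ($len > strlen($haystack)) {
        return false;
    }
    $tail = substr($haystack, -$len);
    return $ignore_case ? strcasecmp($tail, $needle) === 0 : $tail === $needle;
}

var_dump(sketchEndsWith('lichamelijk', 'lijk'));
```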
getRIndex()
Get the R index. The R index is the index of the first consonant that follows a vowel after the $start index
private
static getRIndex(string $word, int $start) : int
Parameters
- $word : string
-
the string to search for the R index
- $start : int
-
the index to start searching for the R index in the string
Return values
int —the R index if found, otherwise -1
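The search described above can be sketched as follows. This is an illustration under stated assumptions: `sketchGetRIndex` is a hypothetical name, the vowel set (a, e, i, o, u, y, è) is the one used by the Snowball Dutch stemmer, and the function returns the position of the consonant itself, following the wording of the description. The actual method may use a different index convention.

```php
<?php
// Hypothetical sketch: index of the first consonant that directly
// follows a vowel located at or after $start, or -1 if there is none.
function sketchGetRIndex(string $word, int $start): int
{
    $vowels = ['a', 'e', 'i', 'o', 'u', 'y', 'è'];
    // Split into UTF-8 characters so that è counts as one letter.
    $letters = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    for ($i = max($start, 0) + 1; $i < count($letters); $i++) {
        $is_consonant = !in_array($letters[$i], $vowels, true);
        $follows_vowel = in_array($letters[$i - 1], $vowels, true);
        if ($is_consonant && $follows_vowel) {
            return $i;
        }
    }
    return -1;
}

echo sketchGetRIndex('lichamen', 0), "\n";
```

Calling the sketch a second time with the first result as `$start` gives a second, later index, which is how two nested regions such as R1 and R2 can be derived.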
isVowel()
Check that the letter is a vowel
private
static isVowel(string $letter) : bool
Parameters
- $letter : string
-
the character to check
Return values
bool —true if it is a vowel, otherwise false
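A vowel test of this kind is typically a membership check. The sketch below assumes the vowel set of the Snowball Dutch stemmer (a, e, i, o, u, y, è); whether the class uses exactly this set is not stated here, and `sketchIsVowel` is a hypothetical name.

```php
<?php
// Hypothetical sketch: lower-case Dutch vowels only. Upper-case Y and I
// are deliberately not vowels, since the stemmer uses them as
// consonant markers.
function sketchIsVowel(string $letter): bool
{
    return in_array($letter, ['a', 'e', 'i', 'o', 'u', 'y', 'è'], true);
}

var_dump(sketchIsVowel('e'));
```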
removeAllUmlautAndAcuteAccents()
Remove all umlaut and acute accents that need to be removed.
private
static removeAllUmlautAndAcuteAccents(string $word) : string
Parameters
- $word : string
-
the string to remove the umlauts and accents from
Return values
string —the string with the umlauts and accents removed
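As an illustration, accent removal can be done with a fixed translation table. The table below (ä ë ï ö ü and á é í ó ú) is an assumption taken from the Snowball Dutch stemmer description; the description above does not list which characters the class actually handles. `sketchRemoveUmlautAndAcute` is a hypothetical name.

```php
<?php
// Hypothetical sketch: map umlauted and acute-accented vowels to their
// plain forms. Grave accents such as è are left untouched.
function sketchRemoveUmlautAndAcute(string $word): string
{
    return strtr($word, [
        'ä' => 'a', 'ë' => 'e', 'ï' => 'i', 'ö' => 'o', 'ü' => 'u',
        'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u',
    ]);
}

echo sketchRemoveUmlautAndAcute('geëerd'), "\n";
```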
replace()
Replace part of a string based on a regular expression
private
static replace(string $word, string $regex, string $replace, int $offset) : string
Parameters
- $word : string
-
the string to search for regex replacement
- $regex : string
-
the regular expression used to find the text to replace
- $replace : string
-
the replacement string used if the pattern matches
- $offset : int
-
the offset in $word at which to start looking for a match
Return values
string —the string with the characters replaced if the regex matches, otherwise the original string
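One plausible reading of the offset parameter is that only the part of the word from `$offset` onward is eligible for replacement. The sketch below implements that reading; it is an assumption, and `sketchReplace` is a hypothetical name. Note that with this approach anchors such as `^` refer to the start of the tail, not of the whole word.

```php
<?php
// Hypothetical sketch: apply the regex only to the tail that starts at
// $offset and leave the head of the word untouched.
function sketchReplace(string $word, string $regex, string $replace, int $offset): string
{
    $offset = max(0, min($offset, strlen($word)));
    $head = substr($word, 0, $offset);
    $tail = substr($word, $offset);
    $result = preg_replace($regex, $replace, $tail, 1);
    // preg_replace returns null on error; fall back to the original.
    return $result === null ? $word : $head . $result;
}

echo sketchReplace('goedheid', '/heid$/', '', 4), "\n";
```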
step1()
Remove a valid en-ending, defined as en preceded by a non-vowel and not preceded by gem
private
static step1(string $word, int $R1) : string
Parameters
- $word : string
-
the string to stem
- $R1 : int
-
the R1 index
Return values
string —the string with a valid en-ending removed, if one is present
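The en rule can be sketched as below. Several points are assumptions: `sketchRemoveEnEnding` is a hypothetical name, `$r1` is treated as the byte offset where R1 begins, the vowel set is the Snowball one, and the final undoubling of kk, tt and dd follows the Snowball description of this step. Any other suffixes the real step1 may handle are not covered.

```php
<?php
// Hypothetical sketch: drop a final "en" when it lies in R1 and what
// precedes it ends in a non-vowel and does not end in "gem".
function sketchRemoveEnEnding(string $word, int $r1): string
{
    if ($r1 < 0 || substr($word, -2) !== 'en') {
        return $word;
    }
    $stem = substr($word, 0, -2);
    if ($stem === '' || strlen($stem) < $r1) {
        return $word; // the suffix does not lie inside R1
    }
    $ends_in_vowel = preg_match('/[aeiouyè]$/u', $stem) === 1;
    if ($ends_in_vowel || substr($stem, -3) === 'gem') {
        return $word; // not a valid en-ending
    }
    // Undouble a trailing kk, tt or dd left behind by the removal.
    return preg_replace('/([ktd])\1$/', '$1', $stem) ?? $stem;
}

echo sketchRemoveEnEnding('lichamen', 3), "\n";
```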
step2()
Delete the suffix e if in R1 and preceded by a non-vowel, and then undouble the ending
private
static step2(string $word) : string
Parameters
- $word : string
-
the string to process
Return values
string —the string with the suffix e deleted if it was in R1 and preceded by a non-vowel, and with the ending then undoubled
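The following sketch illustrates the rule. The real method takes only `$word`, so how it obtains R1 is not visible here; the explicit `$r1` parameter (the first character position inside R1), the Snowball vowel set, and the name `sketchStep2` are all assumptions.

```php
<?php
// Hypothetical sketch: delete a final e that lies in R1 and follows a
// non-vowel, then undouble a trailing kk, tt or dd.
function sketchStep2(string $word, int $r1): string
{
    $vowels = ['a', 'e', 'i', 'o', 'u', 'y', 'è'];
    $letters = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $n = count($letters);
    if ($r1 < 0 || $n < 2 || $letters[$n - 1] !== 'e' || $n - 1 < $r1
        || in_array($letters[$n - 2], $vowels, true)) {
        return $word;
    }
    array_pop($letters); // remove the e suffix
    $n--;
    if ($n >= 2 && $letters[$n - 1] === $letters[$n - 2]
        && in_array($letters[$n - 1], ['k', 't', 'd'], true)) {
        array_pop($letters); // undouble the ending
    }
    return implode('', $letters);
}

echo sketchStep2('zette', 2), "\n";
```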
step3a()
Delete the letters heid if in R2 and not preceded by a c, and treat a preceding en as in step 1
private
static step3a(string $word, int $R2) : string
Parameters
- $word : string
-
the string to process
- $R2 : int
-
the R index
Return values
string —the string with the letters heid deleted if they were in R2 and not preceded by a c, and with a preceding en treated as in step 1
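The heid part of this step can be sketched on its own. `sketchStep3a` is a hypothetical name and `$r2` is assumed to be the byte offset where R2 begins. The follow-up handling of a preceding en is omitted to keep the example short.

```php
<?php
// Hypothetical sketch: delete a final "heid" that lies in R2 and is
// not preceded by c.
function sketchStep3a(string $word, int $r2): string
{
    $pos = strlen($word) - 4;
    if ($r2 < 0 || $pos < $r2 || substr($word, -4) !== 'heid') {
        return $word;
    }
    if ($pos > 0 && $word[$pos - 1] === 'c') {
        return $word; // "cheid" is left alone
    }
    return substr($word, 0, $pos);
}

echo sketchStep3a('goedheid', 4), "\n";
```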
step4()
If the word ends CVD, where C is a non-vowel, D is a non-vowel other than I, and V is double a, e, o or u, remove one of the vowels from V (for example, moon -> mon, weed -> wed).
private
static step4(string $word) : string
Parameters
- $word : string
-
the string to check for the CVD combination
Return values
string —the string with one vowel of V removed if it ends in the CVD combination, otherwise the original string
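The CVD rule maps naturally onto a single regular expression with a backreference. The sketch below assumes the Snowball vowel set (a, e, i, o, u, y, è) for "non-vowel"; `sketchStep4` is a hypothetical name and not the class's method.

```php
<?php
// Hypothetical sketch: C = non-vowel, V = doubled a/e/o/u (matched via
// the \2 backreference), D = non-vowel other than I, anchored at the end.
function sketchStep4(string $word): string
{
    $pattern = '/([^aeiouyè])(a|e|o|u)\2([^aeiouyèI])$/u';
    // Keep C, one copy of the vowel, and D.
    return preg_replace($pattern, '$1$2$3', $word) ?? $word;
}

echo sketchStep4('weed'), "\n";
```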
substituteIAndY()
Put initial y, y after a vowel, and i between vowels into upper case.
private
static substituteIAndY(string $word) : string
Parameters
- $word : string
-
the string to put initial y, y after a vowel, and i between vowels into upper case.
Return values
string —the string with initial y, y after a vowel, and i between vowels converted to upper case
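Marking these letters in upper case lets later steps treat them as consonants. A minimal sketch with three regular expressions follows; the vowel set is the Snowball one and `sketchSubstituteIAndY` is a hypothetical name, so the class's own handling of edge cases may differ.

```php
<?php
// Hypothetical sketch: upper-case an initial y, a y after a vowel, and
// an i between two vowels.
function sketchSubstituteIAndY(string $word): string
{
    $rules = [
        '/^y/u' => 'Y',
        '/(?<=[aeiouyè])y/u' => 'Y',
        '/(?<=[aeiouyè])i(?=[aeiouyè])/u' => 'I',
    ];
    foreach ($rules as $pattern => $replacement) {
        $word = preg_replace($pattern, $replacement, $word) ?? $word;
    }
    return $word;
}

echo sketchSubstituteIAndY('ooievaar'), "\n";
```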
undouble()
Undoubles the end of a string. If the string ends in kk, tt, or dd, one of the two characters is removed
private
static undouble(string $word) : string
Parameters
- $word : string
-
the string to undouble
Return values
string —the undoubled string, otherwise the original string
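Undoubling as described is a one-character trim guarded by a suffix check. The sketch below uses the hypothetical name `sketchUndouble` and is meant only to illustrate the rule.

```php
<?php
// Hypothetical sketch: drop the last character when the word ends in
// kk, tt or dd; otherwise return the word unchanged.
function sketchUndouble(string $word): string
{
    if (in_array(substr($word, -2), ['kk', 'tt', 'dd'], true)) {
        return substr($word, 0, -1);
    }
    return $word;
}

echo sketchUndouble('bakk'), "\n";
```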