
in package

This class has a collection of methods for Dutch locale specific tokenization. In particular, it has a stemmer.


Charles Bocage

Table of Contents

$no_stem_list  : array<string|int, mixed>
Words we don't want to be stemmed
$removed_e_suffix  : bool
Boolean that tells the code whether the e suffix was removed in step2
$stop_words  : mixed
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection.
segment()  : string
Stub function which could be used for a word segmenter.
stem()  : string
Computes the stem of a Dutch word
step3b()  : string
Search for the longest among the following suffixes, and perform the action indicated.
stopwordsRemover()  : mixed
Removes the stop words from the page (used for Word Cloud generation and language detection)
endsWith()  : bool
Checks to see if a string ends with a certain string
getRIndex()  : int
Get the R index. The R index is the first consonant that follows a vowel after the $start index
isVowel()  : bool
Check that the letter is a vowel
removeAllUmlautAndAcuteAccents()  : string
Remove all umlaut and acute accents that need to be removed.
replace()  : string
Replace a string based on a regular expression
step1()  : string
Remove a valid en-ending, defined as an en suffix preceded by a non-vowel and not by gem
step2()  : string
Delete the suffix e if in R1 and preceded by a non-vowel, and then undouble the ending
step3a()  : string
Delete the letters heid if in R2 and not preceded by a c, and treat a preceding en as in step 1
step4()  : string
If the word ends in CVD, where C is a non-vowel, D is a non-vowel other than I, and V is a doubled a, e, o or u, remove one of the vowels from V (for example, moon -> mon, weed -> wed).
substituteIAndY()  : string
Put initial y, y after a vowel, and i between vowels into upper case.
undouble()  : string
Undoubles the end of a string. If the string ends in kk, tt, or dd, remove one of the characters



Words we don't want to be stemmed

public static array<string|int, mixed> $no_stem_list = ["abs", "ahs", "aken", "àlle", "als", "are", "allèen", "ate", "aten", "azen", "bse", "cfce", "curaçao", "dègelijk", "dme", "ede", "eden", "eds", "ehs", "ems", "ene", "epe", "eps", "ers", "eten", "ets", "even", "fme", "gedaçht", "ghe", "gve", "hdpe", "hôte", "hpe", "hse", "ibs", "ics", "ile", "ims", "jònge", "kwe", "ldpe", "lldpe", "lme", "lze", "maitres", "mwe", "nme", "ode", "ogen", "oke", "ole", "ons", "ònze", "open", "ops", "oren", "ors", "oss", "oven", "ows", "pre", "pve", "rhône", "ròme", "rwe", "ske", "sme", "spe", "ste", "the", "tje", "uce", "uden", "uien", "uren", "use", "uwe", "vse", "ype"]


Boolean that tells the code whether the e suffix was removed in step2

public static bool $removed_e_suffix = false


A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection.

public static mixed $stop_words = ['als', 'I', 'zijn', 'dat', 'hij', 'was', 'voor', 'op', 'zijn', 'met', 'ze', 'zijn', 'bij', 'een', 'hebben', 'deze', 'van', 'door', 'heet', 'woord', 'maar', 'wat', 'sommige', 'is', 'het', 'u', 'of', 'had', 'de', 'van', 'aan', 'en', 'een', 'in', 'we', 'kan', 'uit', 'andere', 'waren', 'die', 'doen', 'hun', 'tijd', 'indien', 'zal', 'hoe', 'zei', 'een', 'elk', 'vertellen', 'doet', 'set', 'drie', 'willen', 'lucht', 'goed', 'ook', 'spelen', 'klein', 'end', 'zetten', 'thuis', 'lezen', 'de hand', 'poort', 'grote', 'spell', 'toevoegen', 'zelfs', 'land', 'hier', 'moet', 'grote', 'hoog', 'dergelijke', 'volgen', 'act', 'waarom', 'vragen', 'mannen', 'verandering', 'ging', 'licht', 'soort', 'uitgeschakeld', 'nodig', 'huis', 'afbeelding', 'proberen', 'ons', 'weer', 'dier', 'punt', 'moeder', 'wereld', 'dichtbij', 'bouwen', 'zelf', 'aarde', 'vader']



Stub function which could be used for a word segmenter.

public static segment(string $pre_segment) : string

Such a segmenter on input thisisabunchofwords would output this is a bunch of words

$pre_segment : string

before segmentation

Return values

should return a string with words separated by spaces; this stub does nothing


Computes the stem of a Dutch word

public static stem(string $word) : string

For example, lichamelijk, lichamelijke, lichamelijkheden and lichamen, all have licham as a stem

$word : string

the string to stem

Return values

the stem of $word


Search for the longest among the following suffixes, and perform the action indicated.

public static step3b(string $word, int $R2) : string

If the word ends with eigend, eiging, igend or iging in R2, remove it. If it ends with ig preceded by an e in R2, remove it. If it ends with lijk, baar or bar in R2, remove it.

$word : string

the string to stem

$R2 : int

the R index

Return values

the string with the various endings removed if they exist


Removes the stop words from the page (used for Word Cloud generation and language detection)

public static stopwordsRemover(mixed $data) : mixed
$data : mixed

either a string or an array of string to remove stop words from

Return values

$data with no stop words
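
As a rough illustration of the behavior described above, here is a minimal sketch that accepts either a string or an array of strings. The function name `removeStopWords`, the whole-word case-insensitive matching, and the shortened stop word list are assumptions made for this example, not the class's actual implementation.

```php
<?php
/**
 * Hypothetical helper: strips stop words from a string or an array of strings.
 */
function removeStopWords($data, array $stop_words)
{
    // Quote each word so regex metacharacters are treated literally.
    $quoted = array_map(function ($w) {
        return preg_quote($w, '/');
    }, $stop_words);
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/iu';
    // preg_replace accepts a string or an array as its subject.
    $stripped = preg_replace($pattern, '', $data);
    // Collapse the whitespace left behind by the removed words.
    $collapse = function ($s) {
        return trim(preg_replace('/\s+/u', ' ', $s));
    };
    return is_array($stripped)
        ? array_map($collapse, $stripped)
        : $collapse($stripped);
}

echo removeStopWords("de kat zit op het dak", ['de', 'het', 'op', 'een']);
// prints "kat zit dak"
```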


Checks to see if a string ends with a certain string

private static endsWith(string $haystack, string $needle[, bool $case = true ]) : bool
$haystack : string

the string to check

$needle : string

the string to match at the end

$case : bool = true

whether the check should be case insensitive or not

Return values

true if it ends with $needle, otherwise false
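
A suffix check like this can be sketched with PHP's built-in `substr_compare`. The name `endsWithSketch` is hypothetical, and reading `$case = true` as "compare case-insensitively" follows the parameter description literally; the real method may differ.

```php
<?php
/**
 * Hypothetical suffix check; $case = true is assumed to mean case-insensitive.
 */
function endsWithSketch(string $haystack, string $needle, bool $case = true): bool
{
    $length = strlen($needle);
    if ($length === 0) {
        return true; // every string ends with the empty string
    }
    if ($length > strlen($haystack)) {
        return false; // the needle cannot fit
    }
    // A negative offset makes substr_compare look at the tail of $haystack.
    return substr_compare($haystack, $needle, -$length, $length, $case) === 0;
}

var_dump(endsWithSketch("lichamelijk", "lijk"));        // bool(true)
var_dump(endsWithSketch("lichamelijk", "LIJK", false)); // bool(false)
```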


Get the R index. The R index is the first consonant that follows a vowel after the $start index

private static getRIndex(string $word, int $start) : int
$word : string

the string to search for the R index

$start : int

the index to start searching for the R index in the string

Return values

the R index if found, otherwise -1
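
Read literally, the description above amounts to a single left-to-right scan. The sketch below assumes lowercase ASCII input, a `$start` of 0 or more, and the vowel set a, e, i, o, u, y; the name `rIndexSketch` is hypothetical.

```php
<?php
/**
 * Hypothetical scan: index of the first non-vowel that follows a vowel,
 * searching from $start; -1 when there is none.
 */
function rIndexSketch(string $word, int $start): int
{
    $vowels = 'aeiouy'; // assumed vowel set for this sketch
    $seen_vowel = false;
    for ($i = $start, $n = strlen($word); $i < $n; $i++) {
        if (strpos($vowels, $word[$i]) !== false) {
            $seen_vowel = true;
        } elseif ($seen_vowel) {
            return $i; // first consonant after a vowel
        }
    }
    return -1;
}

echo rIndexSketch("lichamelijk", 0), "\n"; // prints 2
echo rIndexSketch("lichamelijk", 3), "\n"; // prints 5
echo rIndexSketch("zee", 0), "\n";         // prints -1
```

Calling the function a second time with a larger `$start` is one way a second region index such as `$R2` could be derived, though how the class does this is not stated here.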


Check that the letter is a vowel

private static isVowel(string $letter) : bool
$letter : string

the character to check

Return values

true if it is a vowel, otherwise false
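
A vowel test can be as small as a membership check. The vowel set used below (a, e, i, o, u, y, è) is the one the Snowball Dutch stemmer uses and is an assumption about this class; `isVowelSketch` is a hypothetical name.

```php
<?php
/**
 * Hypothetical vowel check against an assumed Dutch vowel set.
 */
function isVowelSketch(string $letter): bool
{
    // Upper case I and Y are deliberately absent: the stemmer steps in this
    // class use upper case to mark letters that should not count as vowels.
    return in_array($letter, ['a', 'e', 'i', 'o', 'u', 'y', 'è'], true);
}

var_dump(isVowelSketch('e')); // bool(true)
var_dump(isVowelSketch('k')); // bool(false)
var_dump(isVowelSketch('I')); // bool(false)
```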


Remove all umlaut and acute accents that need to be removed.

private static removeAllUmlautAndAcuteAccents(string $word) : string
$word : string

the string to remove the umlauts and accents from

Return values

the string with the umlauts and accents removed
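
One way to sketch this normalization is a lookup table passed to `strtr`, which replaces multi-byte UTF-8 sequences safely when given an array. Exactly which accented letters the real method handles is an assumption here, and `removeAccentsSketch` is a hypothetical name.

```php
<?php
/**
 * Hypothetical normalization: maps umlauted and acute-accented vowels to
 * their plain forms and leaves every other character untouched.
 */
function removeAccentsSketch(string $word): string
{
    $map = [
        'ä' => 'a', 'ë' => 'e', 'ï' => 'i', 'ö' => 'o', 'ü' => 'u',
        'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u',
    ];
    return strtr($word, $map);
}

echo removeAccentsSketch("geïnd"), "\n"; // prints "geind"
echo removeAccentsSketch("hôte"), "\n";  // prints "hôte" (circumflex kept)
```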


Replace a string based on a regular expression

private static replace(string $word, string $regex, string $replace, int $offset) : string
$word : string

the string to search for regex replacement

$regex : string

the regex to use for the find and replace

$replace : string

the replacement string to use if the pattern is matched

$offset : int

the offset at which to start looking for the regex match

Return values

the string with the characters replaced if the regex matches, otherwise the original string
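
One plausible reading of the `$offset` parameter is that the part of the word before the offset is protected from replacement. The sketch below implements that reading only; the name `replaceFromOffset` is hypothetical and the real method may treat the offset differently.

```php
<?php
/**
 * Hypothetical offset-limited replacement: the regex is applied only to the
 * part of $word that starts at $offset.
 */
function replaceFromOffset(string $word, string $regex, string $replace, int $offset): string
{
    if ($offset <= 0) {
        return preg_replace($regex, $replace, $word);
    }
    $head = substr($word, 0, $offset);       // left untouched
    $tail = (string) substr($word, $offset); // searched by the regex
    return $head . preg_replace($regex, $replace, $tail);
}

echo replaceFromOffset("heidheid", '/heid$/', '', 4), "\n"; // prints "heid"
echo replaceFromOffset("gemen", '/en$/', '', 4), "\n";      // prints "gemen"
```

In the second call the suffix en straddles the offset, so nothing is removed; this is how a region index can stop a suffix from being stripped too early in a word.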


Remove a valid en-ending, defined as an en suffix preceded by a non-vowel and not by gem

private static step1(string $word, int $R1) : string
$word : string

the string to stem

$R1 : int

the int that represents the R index

Return values

the string with any valid en-ending (preceded by a non-vowel and not by gem) removed


Delete the suffix e if in R1 and preceded by a non-vowel, and then undouble the ending

private static step2(string $word) : string
$word : string

the string from which to delete the suffix e (if in R1 and preceded by a non-vowel) before undoubling the ending

Return values

the string with the suffix e deleted, if it was in R1 and preceded by a non-vowel, and with the ending then undoubled


Delete the letters heid if in R2 and not preceded by a c, and treat a preceding en as in step 1

private static step3a(string $word, int $R2) : string
$word : string

the string from which to delete heid if in R2 and not preceded by a c, treating a preceding en as in step 1

$R2 : int

the R index

Return values

the string with heid deleted, if it was in R2 and not preceded by a c, and with a preceding en treated as in step 1


If the word ends in CVD, where C is a non-vowel, D is a non-vowel other than I, and V is a doubled a, e, o or u, remove one of the vowels from V (for example, moon -> mon, weed -> wed).

private static step4(string $word) : string
$word : string

the string to check for the CVD combination

Return values

the string with one vowel removed from the CVD ending if present, otherwise the original string
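
The CVD rule maps naturally onto a single regular expression with a backreference. This is a sketch under the assumption that the vowels are a, e, i, o, u, y and è; the name `step4Sketch` is hypothetical.

```php
<?php
/**
 * Hypothetical CVD rule: C = non-vowel, V = doubled a, e, o or u,
 * D = non-vowel other than I. One of the doubled vowels is dropped.
 */
function step4Sketch(string $word): string
{
    // ([aeou])\2 matches the same vowel twice; $ anchors the match at the end.
    $pattern = '/([^aeiouyè])([aeou])\2([^aeiouyèI])$/u';
    return preg_replace($pattern, '$1$2$3', $word);
}

echo step4Sketch("weed"), "\n";  // prints "wed"
echo step4Sketch("brood"), "\n"; // prints "brod"
echo step4Sketch("een"), "\n";   // prints "een" (no leading non-vowel)
```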


Put initial y, y after a vowel, and i between vowels into upper case.

private static substituteIAndY(string $word) : string
$word : string

the string in which to put initial y, y after a vowel, and i between vowels into upper case

Return values

the string with initial y, y after a vowel, and i between vowels put into upper case
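
The three cases can be sketched as three regular expression passes using lookarounds, so that only the y or i itself is rewritten. The vowel set and the name `markIAndY` are assumptions made for this example.

```php
<?php
/**
 * Hypothetical marking pass: upper-cases initial y, y after a vowel,
 * and i between two vowels.
 */
function markIAndY(string $word): string
{
    $v = 'aeiouyè'; // assumed vowel set
    $word = preg_replace('/^y/u', 'Y', $word);                  // initial y
    $word = preg_replace('/(?<=[' . $v . '])y/u', 'Y', $word);  // y after a vowel
    // i with a vowel on both sides
    return preg_replace('/(?<=[' . $v . '])i(?=[' . $v . '])/u', 'I', $word);
}

echo markIAndY("yoghurt"), "\n"; // prints "Yoghurt"
echo markIAndY("loyaal"), "\n";  // prints "loYaal"
echo markIAndY("mooie"), "\n";   // prints "mooIe"
```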


Undoubles the end of a string. If the string ends in kk, tt, or dd, remove one of the characters

private static undouble(string $word) : string
$word : string

the string to undouble

Return values

the undoubled string, otherwise the original string
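
A minimal sketch of the undoubling rule, using a hypothetical function name `undoubleSketch`:

```php
<?php
/**
 * Hypothetical undoubling: drops the last letter when the word ends in
 * kk, tt or dd, and returns the word unchanged otherwise.
 */
function undoubleSketch(string $word): string
{
    $tail = substr($word, -2); // last two characters, or fewer for short words
    if (in_array($tail, ['kk', 'tt', 'dd'], true)) {
        return substr($word, 0, -1);
    }
    return $word;
}

echo undoubleSketch("zett"), "\n"; // prints "zet"
echo undoubleSketch("ball"), "\n"; // prints "ball" (ll is not undoubled)
```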

