Yioop_V9.5_Source_Code_Documentation

Tokenizer
in package

This class has a collection of methods for Dutch locale specific tokenization. In particular, it has a stemmer and a stop word remover.

Tags
author

Charles Bocage

Table of Contents

$no_stem_list  : array<string|int, mixed>
Words we don't want to be stemmed
$removed_e_suffix  : array<string|int, mixed>
Boolean flag that records whether the e suffix was removed in step2
$stop_words  : mixed
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection
segment()  : string
Stub function which could be used for a word segmenter.
stem()  : string
Computes the stem of a Dutch word
step3b()  : string
Search for the longest among the following suffixes, and perform the action indicated.
stopwordsRemover()  : mixed
Removes the stop words from the page (used for Word Cloud generation and language detection)
endsWith()  : bool
Checks to see if a string ends with a certain string
getRIndex()  : int
Get the R index. The R index is the first consonant that follows a vowel after the $start index
isVowel()  : bool
Check that the letter is a vowel
removeAllUmlautAndAcuteAccents()  : string
Remove all umlaut and acute accents that need to be removed.
replace()  : string
Replace part of a string based on a regular expression
step1()  : string
Removes a valid en-ending, where a valid en-ending is one preceded by a non-vowel and not by gem
step2()  : string
Delete the suffix e if in R1 and preceded by a non-vowel, and then undouble the ending
step3a()  : string
Delete the letters heid if in R2 and not preceded by a c, and treat a preceding en as in step 1
step4()  : string
If the word ends in CVD, where C is a non-vowel, D is a non-vowel other than I, and V is a doubled a, e, o or u, remove one of the vowels from V (for example, moom -> mom, weed -> wed).
substituteIAndY()  : string
Put initial y, y after a vowel, and i between vowels into upper case.
undouble()  : string
Undoubles the end of a string: if the string ends in kk, tt or dd, removes one of the characters

Properties

$no_stem_list

Words we don't want to be stemmed

public static array<string|int, mixed> $no_stem_list = ["abs", "ahs", "aken", "àlle", "als", "are", "allèen", "ate", "aten", "azen", "bse", "cfce", "curaçao", "dègelijk", "dme", "ede", "eden", "eds", "ehs", "ems", "ene", "epe", "eps", "ers", "eten", "ets", "even", "fme", "gedaçht", "ghe", "gve", "hdpe", "hôte", "hpe", "hse", "ibs", "ics", "ile", "ims", "jònge", "kwe", "ldpe", "lldpe", "lme", "lze", "maitres", "mwe", "nme", "ode", "ogen", "oke", "ole", "ons", "ònze", "open", "ops", "oren", "ors", "oss", "oven", "ows", "pre", "pve", "rhône", "ròme", "rwe", "ske", "sme", "spe", "ste", "the", "tje", "uce", "uden", "uien", "uren", "use", "uwe", "vse", "ype"]

$removed_e_suffix

Boolean flag that records whether the e suffix was removed in step2

public static array<string|int, mixed> $removed_e_suffix = false

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection

public static mixed $stop_words = ['als', 'I', 'zijn', 'dat', 'hij', 'was', 'voor', 'op', 'zijn', 'met', 'ze', 'zijn', 'bij', 'een', 'hebben', 'deze', 'van', 'door', 'heet', 'woord', 'maar', 'wat', 'sommige', 'is', 'het', 'u', 'of', 'had', 'de', 'van', 'aan', 'en', 'een', 'in', 'we', 'kan', 'uit', 'andere', 'waren', 'die', 'doen', 'hun', 'tijd', 'indien', 'zal', 'hoe', 'zei', 'een', 'elk', 'vertellen', 'doet', 'set', 'drie', 'willen', 'lucht', 'goed', 'ook', 'spelen', 'klein', 'end', 'zetten', 'thuis', 'lezen', 'de hand', 'poort', 'grote', 'spell', 'toevoegen', 'zelfs', 'land', 'hier', 'moet', 'grote', 'hoog', 'dergelijke', 'volgen', 'act', 'waarom', 'vragen', 'mannen', 'verandering', 'ging', 'licht', 'soort', 'uitgeschakeld', 'nodig', 'huis', 'afbeelding', 'proberen', 'ons', 'weer', 'dier', 'punt', 'moeder', 'wereld', 'dichtbij', 'bouwen', 'zelf', 'aarde', 'vader']
Tags
array

Methods

segment()

Stub function which could be used for a word segmenter.

public static segment(string $pre_segment) : string

Such a segmenter, given the input thisisabunchofwords, would output "this is a bunch of words".

Parameters
$pre_segment : string

before segmentation

Return values
string

should return a string with the words separated by spaces; in this case it does nothing

stem()

Computes the stem of a Dutch word

public static stem(string $word) : string

For example, lichamelijk, lichamelijke, lichamelijkheden and lichamen all have licham as a stem.

Parameters
$word : string

the string to stem

Return values
string

the stem of $word

step3b()

Search for the longest among the following suffixes, and perform the action indicated.

public static step3b(string $word, int $R2) : string

If in R2 and ends with eigend, eiging, igend or iging, remove it. If in R2 and ends with ig preceded by an e, remove it. If in R2 and ends with lijk, baar or bar, then remove it.

Parameters
$word : string

the string to stem

$R2 : int

the R index

Return values
string

the string with the various endings removed if they exist

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation and language detection)

public static stopwordsRemover(mixed $data) : mixed
Parameters
$data : mixed

either a string or an array of string to remove stop words from

Return values
mixed

$data with no stop words
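
As an illustration of the string-or-array contract described above, here is a minimal standalone sketch. It is not the class's own code: the function name stopwordsRemoverSketch, the explicit $stop_words parameter, the case-insensitive whole-word matching, and the whitespace clean-up are all assumptions made for the example.

```php
<?php
// Hypothetical sketch, not Yioop's implementation.
// Assumes stop words are removed as whole words, ignoring case,
// and that leftover runs of whitespace are collapsed.
function stopwordsRemoverSketch($data, array $stop_words)
{
    // Escape each stop word so it is safe inside the pattern
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $stop_words);
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/ui';
    // Remove stop words from one string, then tidy the spacing
    $clean = function (string $text) use ($pattern): string {
        $text = preg_replace($pattern, '', $text);
        return trim(preg_replace('/\s+/', ' ', $text));
    };
    // Accept either a single string or an array of strings
    return is_array($data) ? array_map($clean, $data) : $clean($data);
}
```

Under these assumptions, stopwordsRemoverSketch('de kat zit op het dak', ['de', 'op', 'het']) gives "kat zit dak", and an array input is processed element by element.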

endsWith()

Checks to see if a string ends with a certain string

private static endsWith(string $haystack, string $needle[, bool $case = true ]) : bool
Parameters
$haystack : string

the string to check

$needle : string

the string to match at the end

$case : bool = true

whether the check should be case insensitive or not

Return values
bool

true if it ends with $needle, otherwise false
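
A suffix check of this kind might look like the sketch below. Since the method is private and the meaning of $case is not spelled out above, the sketch assumes that $case = true means a case-sensitive comparison and $case = false a case-insensitive one; the function name endsWithSketch is hypothetical.

```php
<?php
// Hypothetical sketch, not Yioop's implementation.
// Assumption: $case === true requests a case-sensitive comparison.
function endsWithSketch(string $haystack, string $needle, bool $case = true): bool
{
    $length = strlen($needle);
    if ($length === 0) {
        return true; // every string ends with the empty string
    }
    if ($length > strlen($haystack)) {
        return false;
    }
    // Compare only the tail of the haystack with the needle
    $tail = substr($haystack, -$length);
    return $case
        ? strcmp($tail, $needle) === 0
        : strcasecmp($tail, $needle) === 0; // ASCII-only case folding
}
```

With this reading, endsWithSketch('lichamelijk', 'lijk') is true, while endsWithSketch('LICHAMELIJK', 'lijk') is true only when the third argument is false.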

getRIndex()

Get the R index. The R index is the first consonant that follows a vowel after the $start index

private static getRIndex(string $word, int $start) : int
Parameters
$word : string

the string to search for the R index

$start : int

the index to start searching for the R index in the string

Return values
int

the R index if found, otherwise -1
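
The search described above can be sketched as follows. This is an illustration only: the names getRIndexSketch and isDutchVowelSketch are hypothetical, the vowel set (a, e, i, o, u, y, è) is assumed from the usual Dutch stemming convention, and the sketch assumes the returned value is the position of the consonant itself, counting from 0. The real method may use a different convention, for example the position just after that consonant.

```php
<?php
// Hypothetical sketch, not Yioop's implementation.
// Assumed vowel set for Dutch: a, e, i, o, u, y, è
function isDutchVowelSketch(string $letter): bool
{
    return in_array($letter, ['a', 'e', 'i', 'o', 'u', 'y', 'è'], true);
}

function getRIndexSketch(string $word, int $start): int
{
    // Split into UTF-8 characters so that è counts as one letter
    $letters = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    $count = count($letters);
    // Begin at 1 at the earliest, because each test looks one letter back
    for ($i = max($start, 1); $i < $count; $i++) {
        if (!isDutchVowelSketch($letters[$i])
            && isDutchVowelSketch($letters[$i - 1])) {
            return $i; // first non-vowel directly preceded by a vowel
        }
    }
    return -1; // no such position exists
}
```

For lichamelijk, a search from 0 finds the c at position 2, and a second search starting at 3 finds the m at position 5, which is how two successive calls could yield the R1 and R2 markers under this convention.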

isVowel()

Check that the letter is a vowel

private static isVowel(string $letter) : bool
Parameters
$letter : string

the character to check

Return values
bool

true if it is a vowel, otherwise false

removeAllUmlautAndAcuteAccents()

Remove all umlaut and acute accents that need to be removed.

private static removeAllUmlautAndAcuteAccents(string $word) : string
Parameters
$word : string

the string to remove the umlauts and accents from

Return values
string

the string with the umlauts and accents removed
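
One simple way to strip these marks is a fixed character map, sketched below. The exact set of characters the real method handles is not listed above, so the map (umlaut and acute forms of a, e, i, o, u, lower case only) is an assumption, as is the name removeUmlautAndAcuteSketch.

```php
<?php
// Hypothetical sketch, not Yioop's implementation.
// Assumption: only lower-case umlaut and acute vowels are mapped;
// other accents (for example the grave in è) are left untouched.
function removeUmlautAndAcuteSketch(string $word): string
{
    return strtr($word, [
        'ä' => 'a', 'ë' => 'e', 'ï' => 'i', 'ö' => 'o', 'ü' => 'u',
        'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u',
    ]);
}
```

With this map, coördinatie becomes coordinatie and één becomes een, while hè is returned unchanged.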

replace()

Replace part of a string based on a regular expression

private static replace(string $word, string $regex, string $replace, int $offset) : string
Parameters
$word : string

the string to search for regex replacement

$regex : string

the regex used to find the text to replace

$replace : string

the replacement string to use if the pattern is matched

$offset : int

the offset in $word at which to start looking for a match

Return values
string

the string with the characters replaced if the regex matches, otherwise the original string

step1()

Removes a valid en-ending, where a valid en-ending is one preceded by a non-vowel and not by gem

private static step1(string $word, int $R1) : string
Parameters
$word : string

the string to stem

$R1 : int

the int that represents the R index

Return values
string

the string with any valid en-ending (one preceded by a non-vowel and not by gem) removed

step2()

Delete the suffix e if in R1 and preceded by a non-vowel, and then undouble the ending

private static step2(string $word) : string
Parameters
$word : string

the string to process

Return values
string

the string with the suffix e deleted if it is in R1 and preceded by a non-vowel, and with the ending then undoubled

step3a()

Delete the letters heid if in R2 and not preceded by a c, and treat a preceding en as in step 1

private static step3a(string $word, int $R2) : string
Parameters
$word : string

the string to process

$R2 : int

the R index

Return values
string

the string with the letters heid deleted if they are in R2 and not preceded by a c, and with any preceding en treated as in step 1

step4()

If the word ends in CVD, where C is a non-vowel, D is a non-vowel other than I, and V is a doubled a, e, o or u, remove one of the vowels from V (for example, moom -> mom, weed -> wed).

private static step4(string $word) : string
Parameters
$word : string

the string to check for the CVD combination

Return values
string

the string with one vowel of the doubled vowel removed if it ends in CVD, otherwise the original string
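
The CVD rule lends itself to a single regular expression, sketched below. The function name step4Sketch and the vowel set (a, e, i, o, u, y, è) are assumptions for the example, and the real method may be written quite differently.

```php
<?php
// Hypothetical sketch, not Yioop's implementation.
// C = non-vowel, V = doubled a, e, o or u, D = non-vowel other than I.
function step4Sketch(string $word): string
{
    // \2 forces the second vowel to equal the first, so only aa, ee,
    // oo and uu match; the replacement keeps just one copy of it
    $pattern = '/([^aeiouyè])(a|e|o|u)\2([^aeiouyèI])$/u';
    return preg_replace($pattern, '$1$2$3', $word) ?? $word;
}
```

For example, step4Sketch('maan') gives man and step4Sketch('brood') gives brod, while zee (no final D) and aan (no leading C) are returned unchanged.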

substituteIAndY()

Put initial y, y after a vowel, and i between vowels into upper case.

private static substituteIAndY(string $word) : string
Parameters
$word : string

the string to transform

Return values
string

the string with initial y, y after a vowel, and i between vowels converted to upper case
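
The three substitutions can be sketched with lookaround patterns, as below. The name substituteIAndYSketch and the vowel set are assumptions. The point of the upper-casing is that the marked Y and I no longer count as vowels in later steps.

```php
<?php
// Hypothetical sketch, not Yioop's implementation.
function substituteIAndYSketch(string $word): string
{
    $vowels = 'aeiouyè'; // assumed Dutch vowel set
    // Initial y
    $word = preg_replace('/^y/u', 'Y', $word);
    // y directly after a vowel
    $word = preg_replace("/(?<=[{$vowels}])y/u", 'Y', $word);
    // i with a vowel on both sides
    $word = preg_replace("/(?<=[{$vowels}])i(?=[{$vowels}])/u", 'I', $word);
    return $word;
}
```

Under these assumptions, yoghurt becomes Yoghurt, ooievaar becomes ooIevaar, and hooi is unchanged because its i is not followed by a vowel.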

undouble()

Undoubles the end of a string: if the string ends in kk, tt or dd, removes one of the characters

private static undouble(string $word) : string
Parameters
$word : string

the string to undouble

Return values
string

the undoubled string, otherwise the original string
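
A minimal standalone sketch of the undoubling rule follows; the name undoubleSketch is hypothetical and the code is only an illustration of the behavior described above.

```php
<?php
// Hypothetical sketch, not Yioop's implementation.
function undoubleSketch(string $word): string
{
    foreach (['kk', 'tt', 'dd'] as $pair) {
        if (substr($word, -2) === $pair) {
            // Drop the final character of the doubled pair
            return substr($word, 0, -1);
        }
    }
    return $word; // no doubled ending, return the input unchanged
}
```

For example, undoubleSketch('bakk') gives bak, while undoubleSketch('huis') returns huis unchanged.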


        
