Yioop_V9.5_Source_Code

Tokenizer
in package

Application

This class has a collection of methods for Dutch locale specific tokenization. In particular, it has a stemmer, a stop word remover, and a stub word segmenter.

$no_stem_list

Words we don't want to be stemmed


    public
    static    array<string|int, mixed>
    $no_stem_list
     = ["abs", "ahs", "aken", "àlle", "als", "are", "allèen", "ate", "aten", "azen", "bse", "cfce", "curaçao", "dègelijk", "dme", "ede", "eden", "eds", "ehs", "ems", "ene", "epe", "eps", "ers", "eten", "ets", "even", "fme", "gedaçht", "ghe", "gve", "hdpe", "hôte", "hpe", "hse", "ibs", "ics", "ile", "ims", "jònge", "kwe", "ldpe", "lldpe", "lme", "lze", "maitres", "mwe", "nme", "ode", "ogen", "oke", "ole", "ons", "ònze", "open", "ops", "oren", "ors", "oss", "oven", "ows", "pre", "pve", "rhône", "ròme", "rwe", "ske", "sme", "spe", "ste", "the", "tje", "uce", "uden", "uien", "uren", "use", "uwe", "vse", "ype"]

$removed_e_suffix

Boolean that tells the code whether the e suffix was removed in step2.


    public
    static    bool
    $removed_e_suffix
     = false

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This list is also used for language detection.


    public
    static    mixed
    $stop_words
     = ['als', 'I', 'zijn', 'dat', 'hij', 'was', 'voor', 'op', 'zijn', 'met', 'ze', 'zijn', 'bij', 'een', 'hebben', 'deze', 'van', 'door', 'heet', 'woord', 'maar', 'wat', 'sommige', 'is', 'het', 'u', 'of', 'had', 'de', 'van', 'aan', 'en', 'een', 'in', 'we', 'kan', 'uit', 'andere', 'waren', 'die', 'doen', 'hun', 'tijd', 'indien', 'zal', 'hoe', 'zei', 'een', 'elk', 'vertellen', 'doet', 'set', 'drie', 'willen', 'lucht', 'goed', 'ook', 'spelen', 'klein', 'end', 'zetten', 'thuis', 'lezen', 'de hand', 'poort', 'grote', 'spell', 'toevoegen', 'zelfs', 'land', 'hier', 'moet', 'grote', 'hoog', 'dergelijke', 'volgen', 'act', 'waarom', 'vragen', 'mannen', 'verandering', 'ging', 'licht', 'soort', 'uitgeschakeld', 'nodig', 'huis', 'afbeelding', 'proberen', 'ons', 'weer', 'dier', 'punt', 'moeder', 'wereld', 'dichtbij', 'bouwen', 'zelf', 'aarde', 'vader']

segment()

Stub function which could be used for a word segmenter.


    public
            static        segment(string $pre_segment) : string

Such a segmenter, given the input "thisisabunchofwords", would output "this is a bunch of words".

Parameters

$pre_segment : string: the string before segmentation

Return values

string —

should return a string with words separated by spaces; in this case, the stub does nothing

stem()

Computes the stem of a Dutch word


    public
            static        stem(string $word) : string

For example, lichamelijk, lichamelijke, lichamelijkheden, and lichamen all have licham as a stem.

Parameters

$word : string: the string to stem

Return values

string —

the stem of $word

step3b()

Search for the longest among the following suffixes, and perform the action indicated.


    public
            static        step3b(string $word, int $R2) : string

If in R2 and ends with eigend, eiging, igend or iging, remove it.
If in R2 and ends with ig preceded by an e, remove it.
If in R2 and ends with lijk, baar or bar, remove it.

Parameters

$word : string: the string to stem
$R2 : int: the R index

Return values

string —

the string with the various endings removed if they exist

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation and language detection)


    public
            static        stopwordsRemover(mixed $data) : mixed

Parameters

$data : mixed: either a string or an array of strings to remove stop words from

Return values

mixed —

$data with no stop words
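A minimal sketch of how stop word removal over a string or an array of strings could work. The function name removeStopWordsSketch and its second parameter are hypothetical; this is not the Yioop implementation, which uses the class's own $stop_words list.

```php
<?php
// Hypothetical helper (not part of Yioop): strips the given stop words
// from a string, or from every string in an array.
function removeStopWordsSketch($data, array $stop_words)
{
    if (is_array($data)) {
        // Recurse so that each element is cleaned the same way
        return array_map(
            fn($item) => removeStopWordsSketch($item, $stop_words), $data);
    }
    // One alternation pattern; \b keeps matches on whole words only
    $quoted = array_map(fn($w) => preg_quote($w, '/'), $stop_words);
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/iu';
    $stripped = preg_replace($pattern, '', $data);
    // Collapse the whitespace left behind by the removed words
    return trim(preg_replace('/\s+/u', ' ', $stripped));
}

echo removeStopWordsSketch('het huis is groot', ['het', 'is']), PHP_EOL;
// prints "huis groot"
```

The word-boundary anchors matter here: without them, the stop word "is" would also be cut out of "huis".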

endsWith()

Checks to see if a string ends with a certain string


    private
            static        endsWith(string $haystack, string $needle[, bool $case = true ]) : bool

Parameters

$haystack : string: the string to check
$needle : string: the string to match at the end
$case : bool = true: whether the check should be case insensitive or not

Return values

bool —

true if it ends with $needle, otherwise false

getRIndex()

Get the R index. The R index is the index of the first consonant that follows a vowel, searching from the $start index.


    private
            static        getRIndex(string $word, int $start) : int

Parameters

$word : string: the string to search for the R index
$start : int: the index to start searching for the R index in the string

Return values

int —

the R index if found, otherwise -1
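The search described above can be sketched as follows. The names isVowelSketch and rIndexSketch are hypothetical, and the vowel set (a, e, i, o, u, y, è) is the one used by the Snowball Dutch stemmer, assumed here rather than taken from the Yioop source.

```php
<?php
// Hypothetical helper: vowel test using the Snowball Dutch vowel set
// (an assumption; the class's private isVowel() may differ).
function isVowelSketch(string $letter): bool
{
    return in_array($letter, ['a', 'e', 'i', 'o', 'u', 'y', 'è'], true);
}

// Hypothetical helper: index of the first non-vowel that follows a
// vowel, scanning from $start; -1 if there is none.
function rIndexSketch(string $word, int $start): int
{
    // Split into UTF-8 characters so that è counts as one letter
    $letters = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    $seen_vowel = false;
    for ($i = max($start, 0); $i < count($letters); $i++) {
        if (isVowelSketch($letters[$i])) {
            $seen_vowel = true;
        } elseif ($seen_vowel) {
            return $i;
        }
    }
    return -1;
}

echo rIndexSketch('lichamelijk', 0), PHP_EOL; // prints 2 (the c after i)
echo rIndexSketch('lichamelijk', 3), PHP_EOL; // prints 5 (the m after a)
```

Calling the function a second time with a later $start is one way the two regions R1 and R2 used by the stemming steps could be located.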

isVowel()

Check that the letter is a vowel


    private
            static        isVowel(string $letter) : bool

Parameters

$letter : string: the character to check

Return values

bool —

true if it is a vowel, otherwise false

removeAllUmlautAndAcuteAccents()

Remove all umlaut and acute accents that need to be removed.


    private
            static        removeAllUmlautAndAcuteAccents(string $word) : string

Parameters

$word : string: the string to remove the umlauts and accents from

Return values

string —

the string with the umlauts and accents removed
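A minimal sketch of this kind of accent stripping, assuming the affected letters are the ten umlauted and acute-accented vowels listed by the Snowball Dutch stemmer. The function name is hypothetical and the actual set handled by the class may differ.

```php
<?php
// Hypothetical helper: maps umlauted and acute-accented vowels to
// their plain forms. Other accents (for example è) are left alone.
function stripUmlautAndAcuteSketch(string $word): string
{
    return strtr($word, [
        'ä' => 'a', 'ë' => 'e', 'ï' => 'i', 'ö' => 'o', 'ü' => 'u',
        'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u',
    ]);
}

echo stripUmlautAndAcuteSketch('coördinatie'), PHP_EOL; // prints "coordinatie"
echo stripUmlautAndAcuteSketch('één'), PHP_EOL;         // prints "een"
```

The array form of strtr() matches whole keys, so it is safe for multi-byte UTF-8 letters.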

replace()

Replace part of a string based on a regular expression


    private
            static        replace(string $word, string $regex, string $replace, int $offset) : string

Parameters

$word : string: the string to search for regex replacement
$regex : string: the regex used to find the text to replace
$replace : string: the replacement string to use if the pattern is matched
$offset : int: the index at which to start looking for a regex match

Return values

string —

the string with the characters replaced if the regex matches, otherwise the original string
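One way an offset-limited regex replacement could be written is sketched below. The function name is hypothetical, and the choice to leave everything before $offset untouched (using a byte offset) is an assumption about the intended behavior, not a description of the Yioop code.

```php
<?php
// Hypothetical helper: applies the regex only to the part of $word
// that starts at byte offset $offset and keeps the prefix unchanged.
function replaceFromOffsetSketch(string $word, string $regex,
    string $replace, int $offset): string
{
    if ($offset < 0 || $offset >= strlen($word)) {
        return $word; // nothing to search
    }
    $head = substr($word, 0, $offset);
    $tail = substr($word, $offset);
    $result = preg_replace($regex, $replace, $tail);
    // preg_replace() returns null on a regex error; fall back to input
    return $result === null ? $word : $head . $result;
}

echo replaceFromOffsetSketch('lichamelijke', '/e$/', '', 3), PHP_EOL;
// prints "lichamelijk"
echo replaceFromOffsetSketch('eten', '/e/', 'X', 1), PHP_EOL;
// prints "etXn" (the first e lies before the offset)
```

Because the pattern only sees the tail, anchors such as ^ and lookbehinds apply relative to the offset rather than to the start of the word.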

step1()

Define a valid en-ending as one preceded by a non-vowel and not by gem, and remove the en ending if it is valid


    private
            static        step1(string $word, int $R1) : string

Parameters

$word : string: the string to stem
$R1 : int: the R1 index

Return values

string —

the string with a valid en-ending (preceded by a non-vowel and not by gem) removed

step2()

Delete the suffix e if in R1 and preceded by a non-vowel, and then undouble the ending


    private
            static        step2(string $word) : string

Parameters

$word : string: the string to stem

Return values

string —

the string with the suffix e deleted if it was in R1 and preceded by a non-vowel, and with the ending then undoubled

step3a()

Delete the letters heid if in R2 and not preceded by a c, and treat a preceding en as in step 1


    private
            static        step3a(string $word, int $R2) : string

Parameters

$word : string: the string to stem
$R2 : int: the R index

Return values

string —

the string with the letters heid deleted if in R2 and not preceded by a c, and with a preceding en treated as in step 1

step4()

If the word ends in CVD, where C is a non-vowel, D is a non-vowel other than I, and V is a double a, e, o or u, remove one of the vowels from V (for example, moom -> mom, weed -> wed).


    private
            static        step4(string $word) : string

Parameters

$word : string: the string to check for the CVD combination

Return values

string —

the string with one vowel of the doubled vowel removed if it ends in the CVD combination, otherwise the original string
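The CVD rule lends itself to a single regular expression. The sketch below is an illustration under stated assumptions: the function name is hypothetical, and the vowel set (a, e, i, o, u, y, è) is taken from the Snowball Dutch stemmer rather than from the Yioop source.

```php
<?php
// Hypothetical helper: if the word ends in non-vowel + doubled
// a/e/o/u + non-vowel (other than I), drop one of the two vowels.
function undoubleVowelSketch(string $word): string
{
    // \2 forces the second vowel to equal the first (aa, ee, oo, uu)
    $pattern = '/([^aeiouyè])([aeou])\2([^aeiouyèI])$/u';
    return preg_replace($pattern, '$1$2$3', $word) ?? $word;
}

echo undoubleVowelSketch('brood'), PHP_EOL; // prints "brod"
echo undoubleVowelSketch('weed'), PHP_EOL;  // prints "wed"
echo undoubleVowelSketch('zee'), PHP_EOL;   // prints "zee" (no final D)
```

A word such as "aap" is also left unchanged, because no non-vowel C precedes the doubled vowel.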

substituteIAndY()

Put initial y, y after a vowel, and i between vowels into upper case.


    private
            static        substituteIAndY(string $word) : string

Parameters

$word : string: the string to put initial y, y after a vowel, and i between vowels into upper case.

Return values

string —

the string with initial y, y after a vowel, and i between vowels put into upper case
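A sketch of the three substitutions using lookaround assertions. The function name and the vowel set (a, e, i, o, u, y, è, as in the Snowball Dutch stemmer) are assumptions, and the replacements are applied one after another, which may differ from the class's behavior on unusual inputs such as consecutive y's.

```php
<?php
// Hypothetical helper: upper-cases y and i where they act as
// consonants, so that later steps do not treat them as vowels.
function markIAndYSketch(string $word): string
{
    $v = '[aeiouyè]';
    // 1. initial y
    $word = preg_replace('/^y/u', 'Y', $word) ?? $word;
    // 2. y directly after a vowel
    $word = preg_replace('/(?<=' . $v . ')y/u', 'Y', $word) ?? $word;
    // 3. i with a vowel on both sides
    $word = preg_replace('/(?<=' . $v . ')i(?=' . $v . ')/u', 'I', $word)
        ?? $word;
    return $word;
}

echo markIAndYSketch('yoghurt'), PHP_EOL;  // prints "Yoghurt"
echo markIAndYSketch('royaal'), PHP_EOL;   // prints "roYaal"
echo markIAndYSketch('ooievaar'), PHP_EOL; // prints "ooIevaar"
```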

undouble()

Undoubles the end of a string. If the string ends in kk, tt, or dd, removes one of the characters.


    private
            static        undouble(string $word) : string

Parameters

$word : string: the string to undouble

Return values

string —

the undoubled string, otherwise the original string
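The undoubling rule is small enough to sketch directly. The function name is hypothetical; only the three endings named above (kk, tt, dd) are handled.

```php
<?php
// Hypothetical helper: drops the last letter when the word ends in
// kk, tt, or dd; otherwise returns the word unchanged.
function undoubleSketch(string $word): string
{
    if (in_array(substr($word, -2), ['kk', 'tt', 'dd'], true)) {
        return substr($word, 0, -1);
    }
    return $word;
}

echo undoubleSketch('bakk'), PHP_EOL; // prints "bak"
echo undoubleSketch('bed'), PHP_EOL;  // prints "bed"
```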
