Yioop_V9.5_Source_Code_Documentation

Tokenizer

German-specific tokenization code. Typically, tokenizer.php either contains a stemmer for the language in question or specifies how many characters are in a char gram.

This class has a collection of methods for German locale-specific tokenization. In particular, it has a stemmer and a stop word remover (for use mainly in word cloud creation). The stemmer is my stab at re-implementing the stemmer algorithm given at http://snowball.tartarus.org/algorithms/german/stemmer.html. Here, given a word, its stem is the part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, and tallest. A stemmer takes a word and tries to produce its stem.

Tags
author

Chris Pollett

Table of Contents

$no_stem_list  : array<string|int, mixed>
Words we don't want to be stemmed
$stop_words  : mixed
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection
$buffer  : string
Storage used in computing the stem
$r1  : string
$r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.
$r1_index  : int
Position in $word to stem of $r1
$r2  : string
$r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel
$r2_index  : int
Position in $word to stem of $r2
$s_ending  : string
Letters that may be followed by an s ending
$st_ending  : string
Letters that may be followed by an st ending
$vowel  : string
German vowels
segment()  : string
Stub function which could be used for a word segmenter.
stem()  : string
Computes the stem of a German word
stopwordsRemover()  : mixed
Removes the stop words from the page (used for Word Cloud generation and language detection)
backwardSuffix()  : mixed
Used to strip suffixes off word
markRegions()  : mixed
Computes the locations of RV, R1, and R2. RV is the region after the third letter, otherwise the region after the first vowel not at the beginning of the word, or the end of the word if these positions cannot be found. R1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel. R2 is the region after the first non-vowel following a vowel in R1, or the end of the word if there is no such non-vowel.
postlude()  : mixed
Converts capitalized U and Y back to lower case and removes any dots (umlauts) above vowels.
prelude()  : mixed
Upper-cases u and y between vowels so that they won't be treated as vowels for the purpose of this algorithm. Maps ß to ss.

Properties

$no_stem_list

Words we don't want to be stemmed

public static array<string|int, mixed> $no_stem_list = ["titanic"]
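
One plausible way an exemption list like this is consulted is as a guard in front of the stemmer. The following is a minimal sketch under that assumption, not the class's actual code: stemUnlessExempt() and the toy stemmer are hypothetical names invented for the illustration.

```php
<?php
// Local copy of the exemption list shown above.
$no_stem_list = ["titanic"];

/**
 * Hypothetical helper: returns $word unchanged when it is on the
 * exemption list; otherwise hands it to $stemmer, which may be any
 * callable mapping a word to its stem.
 */
function stemUnlessExempt(string $word, array $no_stem_list,
    callable $stemmer): string
{
    if (in_array($word, $no_stem_list, true)) {
        return $word;
    }
    return $stemmer($word);
}

// Toy stemmer used only for this illustration: drops one trailing "s".
$toy_stemmer = fn(string $w): string => preg_replace('/s$/', '', $w);

echo stemUnlessExempt("titanic", $no_stem_list, $toy_stemmer), "\n"; // titanic
echo stemUnlessExempt("autos", $no_stem_list, $toy_stemmer), "\n";   // auto
```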

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection

public static mixed $stop_words = ['aber', 'alle', 'allem', 'allen', 'aller', 'alles', 'als', 'as', 'also', 'am', 'an', 'ander', 'andere', 'anderem', 'anderen', 'anderer', 'anderes', 'anderm', 'andern', 'anderr', 'anders', 'auch', 'auf', 'aus', 'bei', 'bin', 'bis', 'bist', 'da', 'damit', 'dann', 'der', 'den', 'des', 'dem', 'die', 'das', 'daß', 'derselbe', 'derselben', 'denselben', 'desselben', 'demselben', 'dieselbe', 'dieselben', 'dasselbe', 'dazu', 'dein', 'deine', 'deinem', 'deinen', 'deiner', 'deines', 'denn', 'derer', 'dessen', 'dich', 'dir', 'du', 'dies', 'diese', 'diesem', 'diesen', 'dieser', 'dieses', 'doch', 'dort', 'durch', 'ein', 'eine', 'einem', 'einen', 'einer', 'eines', 'einig', 'einige', 'einigem', 'einigen', 'einiger', 'einiges', 'einmal', 'er', 'ihn', 'ihm', 'es', 'etwas', 'euer', 'eure', 'eurem', 'euren', 'eurer', 'eures', 'für', 'gegen', 'gewesen', 'hab', 'habe', 'haben', 'hat', 'hatte', 'hatten', 'hier', 'hin', 'hinter', 'http', 'https', 'ich', 'mich', 'mir', 'ihr', 'ihre', 'ihrem', 'ihren', 'ihrer', 'ihres', 'euch', 'im', 'in', 'indem', 'ins', 'ist', 'jede', 'jedem', 'jeden', 'jeder', 'jedes', 'jene', 'jenem', 'jenen', 'jener', 'jenes', 'jetzt', 'kann', 'kein', 'keine', 'keinem', 'keinen', 'keiner', 'keines', 'können', 'könnte', 'machen', 'man', 'manche', 'manchem', 'manchen', 'mancher', 'manches', 'mein', 'meine', 'meinem', 'meinen', 'meiner', 'meines', 'mit', 'muss', 'musste', 'nach', 'nicht', 'nichts', 'noch', 'nun', 'nur', 'ob', 'oder', 'ohne', 'sehr', 'sein', 'seine', 'seinem', 'seinen', 'seiner', 'seines', 'selbst', 'sich', 'sie', 'ihnen', 'sind', 'so', 'solche', 'solchem', 'solchen', 'solcher', 'solches', 'soll', 'sollte', 'sondern', 'sonst', 'über', 'um', 'und', 'uns', 'unse', 'unsem', 'unsen', 'unser', 'unses', 'unter', 'viel', 'vom', 'von', 'vor', 'während', 'war', 'waren', 'warst', 'was', 'weg', 'weil', 'weiter', 'welche', 'welchem', 'welchen', 'welcher', 'welches', 'wenn', 'werde', 'werden', 'wie', 'wieder', 'will', 'wir', 'wird', 
'wirst', 'wo', 'wollen', 'wollte', 'würde', 'würden', 'zu', 'zum', 'zur', 'zwar', 'zwischen']
Tags
array

$buffer

Storage used in computing the stem

private static string $buffer

$r1

$r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.

private static string $r1

$r1_index

Position in $word to stem of $r1

private static int $r1_index

$r2

$r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel

private static string $r2

$r2_index

Position in $word to stem of $r2

private static int $r2_index

$s_ending

Letters that may be followed by an s ending

private static string $s_ending = 'bdfghklmnrt'

$st_ending

Letters that may be followed by an st ending

private static string $st_ending = 'bdfghklmnt'
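
The two letter lists above differ only in that $st_ending omits r. As a rough illustration of how such lists can gate suffix removal, the sketch below checks the letter immediately before a suffix. The helper suffixHasValidPredecessor() is hypothetical, and the check is simplified: a full stemmer would also test that the suffix lies in the proper region.

```php
<?php
// Letter lists copied from the $s_ending and $st_ending properties.
const S_ENDING = 'bdfghklmnrt';
const ST_ENDING = 'bdfghklmnt';

/**
 * Hypothetical helper: true when $word ends in $suffix and the letter
 * immediately before the suffix appears in $allowed.
 */
function suffixHasValidPredecessor(string $word, string $suffix,
    string $allowed): bool
{
    $word_len = mb_strlen($word);
    $suffix_len = mb_strlen($suffix);
    if ($word_len <= $suffix_len ||
        mb_substr($word, -$suffix_len) !== $suffix) {
        return false;
    }
    $before = mb_substr($word, $word_len - $suffix_len - 1, 1);
    return mb_strpos($allowed, $before) !== false;
}

var_dump(suffixHasValidPredecessor('hunds', 's', S_ENDING));    // bool(true): "d" is listed
var_dump(suffixHasValidPredecessor('autos', 's', S_ENDING));    // bool(false): "o" is not listed
var_dump(suffixHasValidPredecessor('fährst', 'st', ST_ENDING)); // bool(false): "r" is not in $st_ending
```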

$vowel

German vowels

private static string $vowel = 'aeiouyäöü'

Methods

segment()

Stub function which could be used for a word segmenter.

public static segment(string $pre_segment) : string

Such a segmenter, given the input thisisabunchofwords, would output: this is a bunch of words.

Parameters
$pre_segment : string

before segmentation

Return values
string

should return a string with words separated by spaces; in this case, it does nothing
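
Since segment() is only a stub here, the following sketch shows what a working segmenter might look like. It uses greedy longest-match against a caller-supplied word list, which is an assumption made for the illustration and not something this class provides; greedySegment() and its dictionary parameter are hypothetical. Greedy matching can pick the wrong split on ambiguous input.

```php
<?php
/**
 * Hypothetical greedy longest-match segmenter. Characters that start
 * no dictionary word are emitted on their own.
 */
function greedySegment(string $pre_segment, array $dictionary): string
{
    $lookup = array_flip($dictionary);
    $max_len = 0;
    foreach ($dictionary as $entry) {
        $max_len = max($max_len, mb_strlen($entry));
    }
    $length = mb_strlen($pre_segment);
    $pieces = [];
    $pos = 0;
    while ($pos < $length) {
        // Fallback when nothing longer matches: a single character.
        $match = mb_substr($pre_segment, $pos, 1);
        for ($len = min($max_len, $length - $pos); $len > 1; $len--) {
            $candidate = mb_substr($pre_segment, $pos, $len);
            if (isset($lookup[$candidate])) {
                $match = $candidate;
                break;
            }
        }
        $pieces[] = $match;
        $pos += mb_strlen($match);
    }
    return implode(' ', $pieces);
}

$words = ['this', 'is', 'a', 'bunch', 'of', 'words'];
echo greedySegment('thisisabunchofwords', $words), "\n";
// prints: this is a bunch of words
```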

stem()

Computes the stem of a German word

public static stem(string $word) : string
Parameters
$word : string

the string to stem

Return values
string

the stem of $word

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation and language detection)

public static stopwordsRemover(mixed $data) : mixed
Parameters
$data : mixed

either a string or an array of strings to remove stop words from

Return values
mixed

$data with no stop words
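
To make the string-or-array contract concrete, here is a minimal sketch of a stop word filter. It is not the class's implementation: removeStopWords() is a hypothetical function, it splits on whitespace only (so attached punctuation is not handled), and it uses a handful of entries from $stop_words.

```php
<?php
// A few entries taken from $stop_words, enough for the illustration.
$stop_words = ['der', 'die', 'das', 'und', 'ist', 'für'];

/**
 * Hypothetical filter: accepts a string or an array of strings and
 * returns the same shape with stop words dropped (case-insensitive).
 */
function removeStopWords($data, array $stop_words)
{
    if (is_array($data)) {
        return array_map(
            fn($item) => removeStopWords($item, $stop_words), $data);
    }
    $kept = [];
    $tokens = preg_split('/\s+/u', $data, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        if (!in_array(mb_strtolower($token), $stop_words, true)) {
            $kept[] = $token;
        }
    }
    return implode(' ', $kept);
}

echo removeStopWords('Der Hund und die Katze', $stop_words), "\n";
// prints: Hund Katze
```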

backwardSuffix()

Used to strip suffixes off word

private static backwardSuffix() : mixed
Return values
mixed

markRegions()

Computes the locations of RV, R1, and R2. RV is the region after the third letter, otherwise the region after the first vowel not at the beginning of the word, or the end of the word if these positions cannot be found. R1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel. R2 is the region after the first non-vowel following a vowel in R1, or the end of the word if there is no such non-vowel.

private static markRegions() : mixed
Return values
mixed
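
The R1 and R2 definitions can be sketched as a single scan applied twice. This is an illustrative stand-in, not the body of markRegions(): regionStart() is a hypothetical helper, RV is not computed, and the Snowball description's extra adjustment (moving R1 so that at least three letters precede it) is left out.

```php
<?php
const VOWELS = 'aeiouyäöü'; // copied from the $vowel property

/**
 * Hypothetical helper: returns the index just past the first non-vowel
 * that follows a vowel, scanning $chars from $start, or count($chars)
 * when there is no such non-vowel.
 */
function regionStart(array $chars, int $start): int
{
    $count = count($chars);
    for ($i = $start + 1; $i < $count; $i++) {
        $is_vowel = mb_strpos(VOWELS, $chars[$i]) !== false;
        $prev_is_vowel = mb_strpos(VOWELS, $chars[$i - 1]) !== false;
        if (!$is_vowel && $prev_is_vowel) {
            return $i + 1;
        }
    }
    return $count;
}

$word = 'beautiful';
$chars = mb_str_split($word); // needs PHP 7.4 or later
$r1_index = regionStart($chars, 0);
$r2_index = regionStart($chars, $r1_index);
echo $r1_index, ' ', mb_substr($word, $r1_index), "\n"; // 5 iful
echo $r2_index, ' ', mb_substr($word, $r2_index), "\n"; // 7 ul
```

R2 is found by running the same scan again, starting at the R1 index, which mirrors the wording "in R1" in the description above.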

postlude()

Converts capitalized U and Y back to lower case and removes any dots (umlauts) above vowels.

private static postlude() : mixed
Return values
mixed

prelude()

Upper-cases u and y between vowels so that they won't be treated as vowels for the purpose of this algorithm. Maps ß to ss.

private static prelude() : mixed
Return values
mixed
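
The prelude and postlude steps are mirror images, which the sketch below tries to show. It is a simplified illustration: the documented methods take no arguments (presumably working on $buffer), whereas the hypothetical preludeSketch() and postludeSketch() take and return the word so the example is self-contained.

```php
<?php
/** Hypothetical stand-in for prelude(). */
function preludeSketch(string $word): string
{
    $word = str_replace('ß', 'ss', $word);
    $vowels = 'aeiouyäöü';
    $before = '(?<=[' . $vowels . '])'; // a vowel just before
    $after = '(?=[' . $vowels . '])';   // a vowel just after
    // Upper-case u, then y, when a vowel sits on both sides.
    $word = preg_replace('/' . $before . 'u' . $after . '/u', 'U', $word);
    return preg_replace('/' . $before . 'y' . $after . '/u', 'Y', $word);
}

/** Hypothetical stand-in for postlude(). */
function postludeSketch(string $word): string
{
    return strtr($word,
        ['U' => 'u', 'Y' => 'y', 'ä' => 'a', 'ö' => 'o', 'ü' => 'u']);
}

echo preludeSketch('bauen'), "\n";   // baUen
echo preludeSketch('straße'), "\n";  // strasse
echo postludeSketch('baUen'), "\n";  // bauen
echo postludeSketch('häus'), "\n";   // haus
```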

        
