Tokenizer
German-specific tokenization code. Typically, tokenizer.php either contains a stemmer for the language in question or specifies how many characters are in a char gram.
This class has a collection of methods for German locale-specific tokenization. In particular, it has a stemmer and a stop word remover (used mainly in word cloud creation). The stemmer is my stab at re-implementing the stemmer algorithm given at http://snowball.tartarus.org/algorithms/german/stemmer.html. Here, given a word, its stem is the part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, and tallest. A stemmer takes a word and tries to produce its stem.
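To make the definition of a stem concrete, here is a minimal sketch that computes the part common to a list of inflected variants as their longest common prefix. The function name commonPrefix is hypothetical and not part of this class, and a real stemmer (including the one documented here) does not work this way: it sees only one word and strips suffixes by rule.

```php
<?php
// Illustration of the definition only; hypothetical helper, not part of Tokenizer.
// Returns the longest prefix shared by every string in $variants.
function commonPrefix(array $variants): string
{
    if ($variants === []) {
        return "";
    }
    $prefix = (string) array_shift($variants);
    foreach ($variants as $variant) {
        // Shrink the candidate until it is a prefix of the current variant.
        while ($prefix !== ""
            && strncmp($variant, $prefix, strlen($prefix)) !== 0) {
            $prefix = (string) substr($prefix, 0, -1);
        }
    }
    return $prefix;
}

echo commonPrefix(["tall", "taller", "tallest"]), "\n"; // prints "tall"
```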
Table of Contents
- $no_stem_list : array<string|int, mixed>
- Words we don't want to be stemmed
- $stop_words : mixed
- A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection
- $buffer : string
- Storage used in computing the stem
- $r1 : string
- $r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.
- $r1_index : int
- Position of $r1 within the $word being stemmed
- $r2 : string
- $r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel
- $r2_index : int
- Position of $r2 within the $word being stemmed
- $s_ending : string
- Things that might have an s following them
- $st_ending : string
- Things that might have an st following them
- $vowel : string
- German vowels
- segment() : string
- Stub function which could be used for a word segmenter.
- stem() : string
- Computes the stem of a German word
- stopwordsRemover() : mixed
- Removes the stop words from the page (used for Word Cloud generation and language detection)
- backwardSuffix() : mixed
- Used to strip suffixes off word
- markRegions() : mixed
- Computes the locations of RV, R1, and R2. RV is the region after the third letter, otherwise the region after the first vowel not at the beginning of the word, or the end of the word if these positions cannot be found. R1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel. R2 is the region after the first non-vowel following a vowel in R1, or the end of the word if there is no such non-vowel.
- postlude() : mixed
- Converts capitalized U and Y back to lower case and removes any dots above vowels
- prelude() : mixed
- Upper-cases u and y between vowels so that they won't be treated as vowels for the purpose of this algorithm. Maps ß to ss.
Properties
$no_stem_list
Words we don't want to be stemmed
public
static array<string|int, mixed>
$no_stem_list
= ["titanic"]
$stop_words
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection
public
static mixed
$stop_words
= ['aber', 'alle', 'allem', 'allen', 'aller', 'alles', 'als', 'as', 'also', 'am', 'an', 'ander', 'andere', 'anderem', 'anderen', 'anderer', 'anderes', 'anderm', 'andern', 'anderr', 'anders', 'auch', 'auf', 'aus', 'bei', 'bin', 'bis', 'bist', 'da', 'damit', 'dann', 'der', 'den', 'des', 'dem', 'die', 'das', 'daß', 'derselbe', 'derselben', 'denselben', 'desselben', 'demselben', 'dieselbe', 'dieselben', 'dasselbe', 'dazu', 'dein', 'deine', 'deinem', 'deinen', 'deiner', 'deines', 'denn', 'derer', 'dessen', 'dich', 'dir', 'du', 'dies', 'diese', 'diesem', 'diesen', 'dieser', 'dieses', 'doch', 'dort', 'durch', 'ein', 'eine', 'einem', 'einen', 'einer', 'eines', 'einig', 'einige', 'einigem', 'einigen', 'einiger', 'einiges', 'einmal', 'er', 'ihn', 'ihm', 'es', 'etwas', 'euer', 'eure', 'eurem', 'euren', 'eurer', 'eures', 'für', 'gegen', 'gewesen', 'hab', 'habe', 'haben', 'hat', 'hatte', 'hatten', 'hier', 'hin', 'hinter', 'http', 'https', 'ich', 'mich', 'mir', 'ihr', 'ihre', 'ihrem', 'ihren', 'ihrer', 'ihres', 'euch', 'im', 'in', 'indem', 'ins', 'ist', 'jede', 'jedem', 'jeden', 'jeder', 'jedes', 'jene', 'jenem', 'jenen', 'jener', 'jenes', 'jetzt', 'kann', 'kein', 'keine', 'keinem', 'keinen', 'keiner', 'keines', 'können', 'könnte', 'machen', 'man', 'manche', 'manchem', 'manchen', 'mancher', 'manches', 'mein', 'meine', 'meinem', 'meinen', 'meiner', 'meines', 'mit', 'muss', 'musste', 'nach', 'nicht', 'nichts', 'noch', 'nun', 'nur', 'ob', 'oder', 'ohne', 'sehr', 'sein', 'seine', 'seinem', 'seinen', 'seiner', 'seines', 'selbst', 'sich', 'sie', 'ihnen', 'sind', 'so', 'solche', 'solchem', 'solchen', 'solcher', 'solches', 'soll', 'sollte', 'sondern', 'sonst', 'über', 'um', 'und', 'uns', 'unse', 'unsem', 'unsen', 'unser', 'unses', 'unter', 'viel', 'vom', 'von', 'vor', 'während', 'war', 'waren', 'warst', 'was', 'weg', 'weil', 'weiter', 'welche', 'welchem', 'welchen', 'welcher', 'welches', 'wenn', 'werde', 'werden', 'wie', 'wieder', 'will', 'wir', 'wird', 'wirst', 'wo', 'wollen', 
'wollte', 'würde', 'würden', 'zu', 'zum', 'zur', 'zwar', 'zwischen']
$buffer
Storage used in computing the stem
private
static string
$buffer
$r1
$r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.
private
static string
$r1
$r1_index
Position of $r1 within the $word being stemmed
private
static int
$r1_index
$r2
$r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel
private
static string
$r2
$r2_index
Position of $r2 within the $word being stemmed
private
static int
$r2_index
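The R1 and R2 definitions above can be sketched as follows. This is a minimal illustration under stated assumptions, not the class's own code: the names regionStart and markRegionsSketch are hypothetical, the input is assumed to be UTF-8, and the vowel string is the value documented for $vowel. The Snowball description of the German stemmer additionally moves R1 so that at least three letters precede it; that adjustment is left out here to keep the sketch close to the definitions as worded on this page.

```php
<?php
// Hypothetical helper, not part of Tokenizer.
// Returns the index just past the first non-vowel that follows a vowel,
// looking only at characters from $start onward; returns count($chars)
// (the end of the word) if there is no such non-vowel.
function regionStart(array $chars, int $start, array $vowels): int
{
    $n = count($chars);
    for ($i = $start + 1; $i < $n; $i++) {
        if (in_array($chars[$i - 1], $vowels, true)
            && !in_array($chars[$i], $vowels, true)) {
            return $i + 1;
        }
    }
    return $n;
}

// Hypothetical sketch of the R1/R2 computation described above.
function markRegionsSketch(string $word): array
{
    // Split into UTF-8 characters so umlauts count as one letter each.
    $vowels = preg_split('//u', 'aeiouyäöü', -1, PREG_SPLIT_NO_EMPTY);
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    $r1_index = regionStart($chars, 0, $vowels);
    $r2_index = regionStart($chars, $r1_index, $vowels);
    return [
        'r1_index' => $r1_index,
        'r1' => implode('', array_slice($chars, $r1_index)),
        'r2_index' => $r2_index,
        'r2' => implode('', array_slice($chars, $r2_index)),
    ];
}

$regions = markRegionsSketch('beautiful');
echo $regions['r1'], ' ', $regions['r2'], "\n"; // prints "iful ul"
```

The word beautiful is the example used in the Snowball documentation for R1 and R2; a word with no non-vowel after a vowel yields empty regions whose indices equal the word length.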
$s_ending
Things that might have an s following them
private
static string
$s_ending
= 'bdfghklmnrt'
$st_ending
Things that might have an st following them
private
static string
$st_ending
= 'bdfghklmnt'
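As a rough sketch of how these two letter lists are used in the Snowball German algorithm: a trailing s is a removal candidate only when preceded by a valid s-ending letter, and a trailing st only when preceded by a valid st-ending letter that itself has at least three letters before it. The function names below are hypothetical, and the further requirement that the suffix lie in R1 is omitted, so this shows only the letter checks, not the class's actual suffix stripping.

```php
<?php
// Values copied from the documented $s_ending and $st_ending properties.
const S_ENDING = 'bdfghklmnrt';
const ST_ENDING = 'bdfghklmnt';

// Hypothetical check: does $word end in "s" preceded by a valid s-ending?
function hasValidSEnding(string $word): bool
{
    return preg_match('/[' . S_ENDING . ']s\z/u', $word) === 1;
}

// Hypothetical check: does $word end in "st" preceded by a valid st-ending
// that itself has at least three letters before it?
function hasValidStEnding(string $word): bool
{
    return preg_match('/.{3}[' . ST_ENDING . ']st\z/us', $word) === 1;
}

var_dump(hasValidSEnding('bergs'));    // bool(true): g is a valid s-ending
var_dump(hasValidSEnding('haus'));     // bool(false): u is not in the list
var_dump(hasValidStEnding('kleinst')); // bool(true): n preceded by "klei"
var_dump(hasValidStEnding('ernst'));   // bool(false): only "er" precedes n
```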
$vowel
German vowels
private
static string
$vowel
= 'aeiouyäöü'
Methods
segment()
Stub function which could be used for a word segmenter.
public
static segment(string $pre_segment) : string
Such a segmenter on input thisisabunchofwords would output this is a bunch of words
Parameters
- $pre_segment : string
-
before segmentation
Return values
string —should return a string with the words separated by spaces; in this case it does nothing
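Since segment() is only a stub, the following is a sketch of what a dictionary-based segmenter of the kind described could look like. Everything here is an assumption for illustration: the function name segmentSketch, the extra $dictionary parameter, and the dynamic-programming approach are not taken from this class.

```php
<?php
// Hypothetical dictionary-based word segmenter; not part of Tokenizer.
// Returns $pre_segment split into dictionary words separated by spaces,
// or $pre_segment unchanged if no complete segmentation exists.
function segmentSketch(string $pre_segment, array $dictionary): string
{
    $n = strlen($pre_segment);
    $known = array_flip($dictionary);
    // $back[$end] is set when the first $end bytes can be segmented;
    // its value is the start offset of the last word in that segmentation.
    $back = [0 => 0];
    for ($end = 1; $end <= $n; $end++) {
        for ($start = 0; $start < $end; $start++) {
            $piece = substr($pre_segment, $start, $end - $start);
            if (isset($back[$start]) && isset($known[$piece])) {
                $back[$end] = $start;
                break;
            }
        }
    }
    if (!isset($back[$n])) {
        return $pre_segment;
    }
    // Walk the back-pointers from the end to recover the words in order.
    $words = [];
    for ($end = $n; $end > 0; $end = $back[$end]) {
        array_unshift(
            $words,
            substr($pre_segment, $back[$end], $end - $back[$end])
        );
    }
    return implode(' ', $words);
}

$dictionary = ['this', 'is', 'a', 'bunch', 'of', 'words'];
echo segmentSketch('thisisabunchofwords', $dictionary), "\n";
// prints "this is a bunch of words"
```

Returning the input unchanged when no segmentation is found mirrors the stub's current do-nothing behavior.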
stem()
Computes the stem of a German word
public
static stem(string $word) : string
Parameters
- $word : string
-
the string to stem
Return values
string —the stem of $word
stopwordsRemover()
Removes the stop words from the page (used for Word Cloud generation and language detection)
public
static stopwordsRemover(mixed $data) : mixed
Parameters
- $data : mixed
-
either a string or an array of strings to remove stop words from
Return values
mixed —$data with no stop words
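One plausible way to remove stop words from either a string or an array of strings is a single regular-expression replacement, since PHP's preg_replace accepts both as its subject. The sketch below is an assumption about the general approach, not the class's code: the function name removeStopWordsSketch is hypothetical, the stop list is passed in rather than read from $stop_words, and matching is case-insensitive by choice.

```php
<?php
// Hypothetical sketch; not the implementation of stopwordsRemover().
// Removes whole-word occurrences of $stop_words from $data, which may be
// a string or an array of strings. Surrounding whitespace is left as-is.
function removeStopWordsSketch($data, array $stop_words)
{
    $alternatives = implode('|', array_map(
        function ($word) {
            return preg_quote($word, '/');
        },
        $stop_words
    ));
    // Lookarounds require that no letter or digit touches the match, so
    // "und" is removed as a word but not from inside "Hund".
    $pattern = '/(?<![\p{L}\p{N}])(?:' . $alternatives
        . ')(?![\p{L}\p{N}])/iu';
    return preg_replace($pattern, '', $data);
}

$out = removeStopWordsSketch('der Hund und die Katze', ['der', 'die', 'und']);
echo trim(preg_replace('/\s+/', ' ', $out)), "\n"; // prints "Hund Katze"
```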
backwardSuffix()
Used to strip suffixes off word
private
static backwardSuffix() : mixed
Return values
mixed
markRegions()
Computes the locations of RV, R1, and R2. RV is the region after the third letter, otherwise the region after the first vowel not at the beginning of the word, or the end of the word if these positions cannot be found. R1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel. R2 is the region after the first non-vowel following a vowel in R1, or the end of the word if there is no such non-vowel.
private
static markRegions() : mixed
Return values
mixed
postlude()
Converts capitalized U and Y back to lower case and removes any dots above vowels
private
static postlude() : mixed
Return values
mixed
prelude()
Upper-cases u and y between vowels so that they won't be treated as vowels for the purpose of this algorithm. Maps ß to ss.
private
static prelude() : mixed
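The prelude and postlude transformations described on this page can be sketched as a pair of string functions. This is an illustration under assumptions: the names preludeSketch and postludeSketch are hypothetical, and they take and return a word, whereas the documented methods take no parameters and presumably operate on the class's $buffer. Input is assumed to be lower-case UTF-8.

```php
<?php
// Hypothetical sketch of the prelude step; not the class's own code.
// Maps ß to ss, then upper-cases any u or y that has a vowel on both
// sides so later steps do not treat it as a vowel.
function preludeSketch(string $word): string
{
    $word = str_replace('ß', 'ss', $word);
    return preg_replace_callback(
        '/(?<=[aeiouyäöü])[uy](?=[aeiouyäöü])/u',
        function ($match) {
            return strtoupper($match[0]);
        },
        $word
    );
}

// Hypothetical sketch of the postlude step: lower-cases U and Y again
// and strips the umlaut dots from ä, ö, and ü.
function postludeSketch(string $word): string
{
    return strtr($word, [
        'U' => 'u', 'Y' => 'y',
        'ä' => 'a', 'ö' => 'o', 'ü' => 'u',
    ]);
}

echo preludeSketch('bauen'), "\n";   // prints "baUen"
echo preludeSketch('straße'), "\n";  // prints "strasse"
echo postludeSketch('baUen'), "\n";  // prints "bauen"
echo postludeSketch('häus'), "\n";   // prints "haus"
```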