Yioop_V9.5_Source_Code

Tokenizer
in package

Application

German-specific tokenization code. Typically, tokenizer.php either contains a stemmer for the language in question or specifies how many characters make up a char gram.

This class has a collection of methods for German locale-specific tokenization. In particular, it has a stemmer and a stop word remover (used mainly in word cloud creation). The stemmer is my stab at re-implementing the stemmer algorithm given at http://snowball.tartarus.org/algorithms/german/stemmer.html. Here, given a word, its stem is the part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, and tallest. A stemmer takes a word and tries to produce its stem.

$no_stem_list

Words we don't want to be stemmed


    public static array<string|int, mixed> $no_stem_list = ["titanic"]

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. The list is also used for language detection.


    public static mixed $stop_words
     = ['aber', 'alle', 'allem', 'allen', 'aller', 'alles', 'als', 'as', 'also', 'am', 'an', 'ander', 'andere', 'anderem', 'anderen', 'anderer', 'anderes', 'anderm', 'andern', 'anderr', 'anders', 'auch', 'auf', 'aus', 'bei', 'bin', 'bis', 'bist', 'da', 'damit', 'dann', 'der', 'den', 'des', 'dem', 'die', 'das', 'daß', 'derselbe', 'derselben', 'denselben', 'desselben', 'demselben', 'dieselbe', 'dieselben', 'dasselbe', 'dazu', 'dein', 'deine', 'deinem', 'deinen', 'deiner', 'deines', 'denn', 'derer', 'dessen', 'dich', 'dir', 'du', 'dies', 'diese', 'diesem', 'diesen', 'dieser', 'dieses', 'doch', 'dort', 'durch', 'ein', 'eine', 'einem', 'einen', 'einer', 'eines', 'einig', 'einige', 'einigem', 'einigen', 'einiger', 'einiges', 'einmal', 'er', 'ihn', 'ihm', 'es', 'etwas', 'euer', 'eure', 'eurem', 'euren', 'eurer', 'eures', 'für', 'gegen', 'gewesen', 'hab', 'habe', 'haben', 'hat', 'hatte', 'hatten', 'hier', 'hin', 'hinter', 'http', 'https', 'ich', 'mich', 'mir', 'ihr', 'ihre', 'ihrem', 'ihren', 'ihrer', 'ihres', 'euch', 'im', 'in', 'indem', 'ins', 'ist', 'jede', 'jedem', 'jeden', 'jeder', 'jedes', 'jene', 'jenem', 'jenen', 'jener', 'jenes', 'jetzt', 'kann', 'kein', 'keine', 'keinem', 'keinen', 'keiner', 'keines', 'können', 'könnte', 'machen', 'man', 'manche', 'manchem', 'manchen', 'mancher', 'manches', 'mein', 'meine', 'meinem', 'meinen', 'meiner', 'meines', 'mit', 'muss', 'musste', 'nach', 'nicht', 'nichts', 'noch', 'nun', 'nur', 'ob', 'oder', 'ohne', 'sehr', 'sein', 'seine', 'seinem', 'seinen', 'seiner', 'seines', 'selbst', 'sich', 'sie', 'ihnen', 'sind', 'so', 'solche', 'solchem', 'solchen', 'solcher', 'solches', 'soll', 'sollte', 'sondern', 'sonst', 'über', 'um', 'und', 'uns', 'unse', 'unsem', 'unsen', 'unser', 'unses', 'unter', 'viel', 'vom', 'von', 'vor', 'während', 'war', 'waren', 'warst', 'was', 'weg', 'weil', 'weiter', 'welche', 'welchem', 'welchen', 'welcher', 'welches', 'wenn', 'werde', 'werden', 'wie', 'wieder', 'will', 'wir', 'wird', 'wirst', 'wo', 'wollen', 
'wollte', 'würde', 'würden', 'zu', 'zum', 'zur', 'zwar', 'zwischen']

$buffer

Storage used in computing the stem


    private static string $buffer

$r1

$r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.


    private static string $r1

$r1_index

Position in the word being stemmed at which $r1 begins


    private static int $r1_index

$r2

$r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel


    private static string $r2

$r2_index

Position in the word being stemmed at which $r2 begins


    private static int $r2_index

$s_ending

Letters that might have an s following them (the valid s-endings)


    private static string $s_ending = 'bdfghklmnrt'

$st_ending

Letters that might have an st following them (the valid st-endings)


    private static string $st_ending = 'bdfghklmnt'
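
To illustrate how these two character lists can be used, here is a minimal sketch that tests whether a trailing s or st is preceded by one of the listed letters. The function names hasRemovableS() and hasRemovableSt() are hypothetical and not part of this class. The sketch also leaves out the further conditions of the Snowball description (the suffix must lie in R1, and a valid st-ending must itself be preceded by at least three letters).

```php
<?php
// Character lists copied from the $s_ending and $st_ending properties.
const S_ENDING = 'bdfghklmnrt';
const ST_ENDING = 'bdfghklmnt';

/**
 * Hypothetical helper: true if $word ends in "s" and the letter before
 * that "s" is a valid s-ending.
 */
function hasRemovableS(string $word): bool
{
    return preg_match('/[' . S_ENDING . ']s$/u', $word) === 1;
}

/**
 * Hypothetical helper: true if $word ends in "st" and the letter before
 * that "st" is a valid st-ending.
 */
function hasRemovableSt(string $word): bool
{
    return preg_match('/[' . ST_ENDING . ']st$/u', $word) === 1;
}
```

Under these assumptions, hasRemovableS('bergs') is true because g is in the s-ending list, while hasRemovableSt('erst') is false because r is not in the st-ending list.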

$vowel

German vowels


    private static string $vowel = 'aeiouyäöü'

segment()

Stub function which could be used for a word segmenter.


    public static segment(string $pre_segment) : string

Such a segmenter, given the input thisisabunchofwords, would output: this is a bunch of words

Parameters

$pre_segment : string: before segmentation

Return values

string —

should return the string with words separated by spaces; in this case it does nothing
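
Since segment() is only a stub here, the following sketch shows what a simple segmenter could look like. It uses greedy longest-match against a word list; this technique, the function name segmentSketch(), and the $dictionary parameter are all assumptions made for illustration and do not come from the Yioop source.

```php
<?php
/**
 * Hypothetical greedy longest-match segmenter (illustration only).
 * At each position it takes the longest dictionary word that matches,
 * falling back to a single character when nothing matches.
 */
function segmentSketch(string $pre_segment, array $dictionary): string
{
    // Work on UTF-8 characters rather than bytes.
    $chars = preg_split('//u', $pre_segment, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    $lookup = array_flip($dictionary); // word => index, for fast isset()
    $words = [];
    $i = 0;
    while ($i < $n) {
        $match_len = 1; // fallback: emit one character
        for ($len = $n - $i; $len > 1; $len--) {
            $candidate = implode('', array_slice($chars, $i, $len));
            if (isset($lookup[$candidate])) {
                $match_len = $len;
                break;
            }
        }
        $words[] = implode('', array_slice($chars, $i, $match_len));
        $i += $match_len;
    }
    return implode(' ', $words);
}
```

With the word list ['this', 'is', 'a', 'bunch', 'of', 'words'], the input thisisabunchofwords yields "this is a bunch of words". Greedy matching is easy to follow but can pick the wrong split when dictionary words overlap, so a real segmenter would likely need something more careful.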

stem()

Computes the stem of a German word


    public static stem(string $word) : string

Parameters

$word : string: the string to stem

Return values

string —

the stem of $word

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation and language detection)


    public static stopwordsRemover(mixed $data) : mixed

Parameters

$data : mixed: either a string or an array of strings to remove stop words from

Return values

mixed —

$data with no stop words
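
One plausible way to remove stop words from either a string or an array of strings is a single word-boundary regular expression, since preg_replace() accepts both kinds of subject. The sketch below is written under that assumption; removeStopWordsSketch() is a hypothetical name, and the stop word list is passed in as a parameter rather than read from $stop_words.

```php
<?php
/**
 * Hypothetical sketch: deletes whole-word occurrences of the given stop
 * words from a string, or from each string of an array.
 */
function removeStopWordsSketch($data, array $stop_words)
{
    // Escape each word so it is matched literally inside the pattern.
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $stop_words);
    // \b anchors the match to whole words; the u flag enables UTF-8 mode.
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/u';
    return preg_replace($pattern, '', $data);
}
```

Note that this sketch is case sensitive and leaves the surrounding whitespace in place: removing der, die, and und from "der hund und die katze" gives " hund   katze". The word hund is untouched because und does not start at a word boundary there.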

backwardSuffix()

Used to strip suffixes off the word being stemmed


    private static backwardSuffix() : mixed

Return values

mixed —

markRegions()

Computes the locations of RV, R1, and R2. RV is the region after the third letter, otherwise the region after the first vowel not at the beginning of the word, or the end of the word if these positions cannot be found. R1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel. R2 is the region after the first non-vowel following a vowel in R1, or the end of the word if there is no such non-vowel.


    private static markRegions() : mixed

Return values

mixed —
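
The R1 and R2 definitions above can be sketched as follows. This is an illustration under stated assumptions, not the class's actual code: markRegionsSketch() is a hypothetical name, it returns its results in an array instead of setting the static properties, and the indices it reports count characters rather than bytes. RV is not computed.

```php
<?php
/**
 * Hypothetical sketch: computes R1 and R2 for a lower-case word using
 * the vowel list from the $vowel property.
 */
function markRegionsSketch(string $word): array
{
    // Split into UTF-8 characters so that ä, ö, ü count as one letter each.
    $vowels = preg_split('//u', 'aeiouyäöü', -1, PREG_SPLIT_NO_EMPTY);
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    // Index just past the first non-vowel that follows a vowel, looking
    // only at letters from $start onward; $n if there is no such letter.
    $region_start = function (int $start) use ($chars, $vowels, $n): int {
        for ($i = $start + 1; $i < $n; $i++) {
            $after_vowel = in_array($chars[$i - 1], $vowels, true);
            $is_vowel = in_array($chars[$i], $vowels, true);
            if ($after_vowel && !$is_vowel) {
                return $i + 1;
            }
        }
        return $n;
    };
    $r1_index = $region_start(0);
    $r2_index = $region_start($r1_index); // same rule, applied inside R1
    return [
        'r1_index' => $r1_index,
        'r1' => implode('', array_slice($chars, $r1_index)),
        'r2_index' => $r2_index,
        'r2' => implode('', array_slice($chars, $r2_index)),
    ];
}
```

For häuser, the first non-vowel following a vowel is the s, so R1 is "er" (index 4); inside R1 the r follows the e and is the last letter, so R2 is empty (index 6).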

postlude()

Converts capitalized U and Y back to lower case and gets rid of any dots above vowels (ä, ö, ü become a, o, u).


    private static postlude() : mixed

Return values

mixed —
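
A minimal sketch of this clean-up step, assuming it operates on a plain string argument (the real method takes no parameters and presumably works on the class's internal buffer). The name postludeSketch() is hypothetical.

```php
<?php
/**
 * Hypothetical sketch: lower-cases the marker letters U and Y and
 * replaces umlauted vowels with their plain counterparts.
 */
function postludeSketch(string $word): string
{
    // strtr() with an array replaces each key with its value in one pass.
    return strtr($word, [
        'U' => 'u',
        'Y' => 'y',
        'ä' => 'a',
        'ö' => 'o',
        'ü' => 'u',
    ]);
}
```

For example, postludeSketch('baUen') returns "bauen" and postludeSketch('häus') returns "haus".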

prelude()

Upper-cases u and y when they occur between vowels so that they won't be treated as vowels for the purposes of this algorithm. Also maps ß to ss.


    private static prelude() : mixed

Return values

mixed —
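
The preparation step described above can be approximated with one string replacement and one regular expression. As before, this is a sketch under assumptions: preludeSketch() is a hypothetical name, it takes the word as an argument instead of using the class's internal buffer, and it expects lower-case input. The lookaround approach tests all positions against the original letters, which may differ in edge cases from a strictly left-to-right pass.

```php
<?php
/**
 * Hypothetical sketch: maps ß to ss, then upper-cases every u or y that
 * sits directly between two vowels.
 */
function preludeSketch(string $word): string
{
    $word = str_replace('ß', 'ss', $word);
    $v = 'aeiouyäöü'; // vowel list from the $vowel property
    // Lookbehind and lookahead check the neighbours without consuming
    // them, so adjacent candidates are each examined.
    $pattern = '/(?<=[' . $v . '])[uy](?=[' . $v . '])/u';
    return preg_replace_callback($pattern, function ($match) {
        return strtoupper($match[0]);
    }, $word);
}
```

For example, preludeSketch('bauen') returns "baUen", preludeSketch('straße') returns "strasse", and preludeSketch('haus') is unchanged because the u is followed by a non-vowel.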
