Yioop_V9.5_Source_Code

Tokenizer
in package

Application

Greek-specific tokenization code. Contains a list of Greek stop words used in making word clouds, as well as a Greek stemmer.

This stemmer is based on the algorithms described in Ntais, Georgios. Development of a Stemmer for the Greek Language. Diss. Royal Institute of Technology, 2006, and Saroukos, Spyridon. Enhancing a Greek language stemmer. University of Tampere, 2008.

From there I looked at the implementation given at: https://snowballstem.org/algorithms/greek/stemmer.html

In particular, I looked at the Snowball code, the JavaScript demo code, and the PHP code (GPLv3) in: https://git.drupalcode.org/project/greekstemmer

The PHP code is Copyright (c) 2009 Vassilis Spiliopoulos (http://www.psychfamily.gr), updated by Yannis Karampelas (info@netstudio.gr) in 2011 and 2017, based on earlier work by Spyros Saroukos, for the Drupal CMS.

The code below is largely a complete rewrite to make this work with UTF-8 lower case Greek rather than upper case ISO-8859-7 as the file encoding. Most of the repetitive code has been refactored into a method, regexStem, which is called repeatedly with different regular expressions.

$dictionary_stems

This is a list of hard-coded stems. I got the test file (90000 plus terms) on the Snowball site to work except for this list, so I brute forced it. My suspicion is that the remaining cases failed because of something in my diacritic mark handling.


    public
    static    array<string|int, mixed>
    $dictionary_stems
     = ["αιθυλεστέρας" => "αιθυλ", "αμόρφωτος" => "αμορφω", "ανανεώθηκαν" => "αν", "ανανεώθηκε" => "αν", "αντιστρόφως" => "αντιστροφω", "ανωτέρας" => "αν", "αριστεράς" => "αριστερ", "ασαφώς" => "ασαφω", "αστέρας" => "αστερ", "βαλίτσα" => "βαλιτσ", "βαλίτσας" => "βαλιτσ", "βαλίτσες" => "βαλιτσ", "γεγονος" => "γεγον", "γεγονός" => "γεγον", "γεγονότος" => "γεγον", "γιαγιάδες" => "γιαγ", "δευτέρας" => "δε", "διαιωνίζει" => "διαι", "διαιωνίζουν" => "διαι", "εγγράφως" => "εγγραφω", "επτάφωτος" => "επταφω", "εσπέρας" => "εσπερ", "εσωτερισμού" => "εσ", "θυγατέρας" => "θυγατερ", "ισοπροπυλεστέρας" => "ισοπροπυλ", "καθεστώς" => "καθεστ", "καθεστώτος" => "καθεστ", "καλντέρας" => "καλντερ", "κεντροαριστεράς" => "κεντροαριστερ", "κρέας" => "κρε", "κρέατος" => "κρε", "κυράδες" => "κυρ", "λυκόφως" => "λυκοφω", "λυκόφωτος" => "λυκοφω", "μαμάδες" => "μαμ", "μεθυλεστέρας" => "μεθυλ", "μητέρας" => "μητερ", "νεωτέρας" => "νε", "νεωτερισμοί" => "νε", "νεωτερισμούς" => "νε", "νεωτερισμό" => "νε", "νεωτερισμός" => "νε", "νεωτεριστές" => "νε", "νεωτεριστής" => "νε", "νταντάδες" => "νταντ", "νυφίτσα" => "νυφιτσ", "νυφίτσες" => "νυφιτσ", "οκάδες" => "οκ", "ολογράφως" => "ολογραφω", "πάγκρεας" => "παγκρε", "πέρας" => "περ", "πίτσα" => "πιτσ", "πίτσας" => "πιτσ", "πίτσες" => "πιτσ", "παγκρέατος" => "παγκρε", "πατέρας" => "πατερ", "πατεράδες" => "πατερ", "πατερίτσες" => "πατεριτσ", "πατερας" => "πατερ", "πολυεστέρας" => "πολυ", "προπυλεστέρας" => "προπυλ", "σαπουνόπερας" => "σαπουνοπερ", "σαράκι" => "σαρακ", "σαφώς" => "σαφω", "σιωνιστές" => "σ", "σφαγίων" => "σφα", "τέρας" => "τερ", "τέρατος" => "τερ", "φαινυλεστέρας" => "φαινυλ", "φως" => "φω", "φωτός" => "φω", "φώς" => "φω", "όπερας" => "οπερ"]
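A table like this is typically consulted as an override before any rule-based stemming runs. The following is a minimal sketch of that idea, not the Yioop implementation: the helper name lookupStemFirst, the three-entry excerpt, and the identity fallback are all assumptions made for illustration.

```php
<?php
// Illustrative excerpt only; the full table is the $dictionary_stems value.
$dictionary_excerpt = [
    "πατέρας" => "πατερ",
    "φως" => "φω",
    "κρέας" => "κρε",
];

/**
 * Hypothetical helper (not part of Yioop): returns the hard-coded stem for
 * $word when the table has one, otherwise defers to $fallback.
 */
function lookupStemFirst(string $word, array $table, callable $fallback): string
{
    return $table[$word] ?? $fallback($word);
}

// Stand-in for a rule-based stemmer: returns the word unchanged.
$identity = fn(string $w): string => $w;

echo lookupStemFirst("πατέρας", $dictionary_excerpt, $identity), "\n";
echo lookupStemFirst("νερό", $dictionary_excerpt, $identity), "\n";
```

The first call finds a table entry; the second has none, so the fallback decides the result. The keys keep their diacritic marks, so a lookup of this kind has to happen before any marks are stripped from the word.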

$letter_map

A map from lower case Greek letters, with or without diacritic marks, to lower case Greek letters, some of which keep their marks and some of which don't


    public
    static    array<string|int, mixed>
    $letter_map
     = ["α" => "α", "β" => "β", "γ" => "γ", "δ" => "δ", "ε" => "ε", "ζ" => "ζ", "η" => "η", "θ" => "θ", "ι" => "ι", "κ" => "κ", "λ" => "λ", "μ" => "μ", "ν" => "ν", "ξ" => "ξ", "ο" => "ο", "π" => "π", "ρ" => "ρ", "σ" => "σ", "τ" => "τ", "υ" => "υ", "φ" => "φ", "χ" => "χ", "ψ" => "ψ", "ω" => "ω", "ά" => "α", "ὰ" => "ὰ", "ᾶ" => "ᾶ", "ἀ" => "ἀ", "ἂ" => "ἂ", "ἄ" => "ἄ", "ἃ" => "ἃ", "έ" => "ε", "ὲ" => "ὲ", "ἑ" => "ἑ", "ἐ" => "ἐ", "ἕ" => "ἕ", "ἓ" => "ἓ", "ἔ" => "ἔ", "ή" => "η", "ὴ" => "ὴ", "ῆ" => "ῆ", "ῇ" => "ῇ", "ἡ" => "ἡ", "ἣ" => "ἣ", "ἧ" => "ἧ", "ἦ" => "ἦ", "ἢ" => "ἢ", "ἤ" => "ἤ", "ό" => "ο", "ὸ" => "ὸ", "ὁ" => "ὁ", "ὅ" => "ὅ", "ὃ" => "ὃ", "ὄ" => "ὄ", "ύ" => "υ", "ὺ" => "ὺ", "ϋ" => "υ", "ῦ" => "ῦ", "ὔ" => "ὔ", "ΰ" => "υ", "ὑ" => "ὑ", "ὐ" => "ὐ", "ὖ" => "ὖ", "ῡ" => "ῡ", "ὕ" => "ὕ", "ὗ" => "ὗ", "ς" => "σ", "ώ" => "ω", "ὡ" => "ὡ", "ῶ" => "ῶ", "ὥ" => "ὥ", "ὼ" => "ὼ", "ῳ" => "ῳ", "ὧ" => "ὧ", "ῷ" => "ῷ", "ᾧ" => "ᾧ", "ὦ" => "ὦ", "ί" => "ι", "ὶ" => "ὶ", "ϊ" => "η", "ῖ" => "ῖ", "ΐ" => "η", "ἱ" => "ἱ", "ἰ" => "ἰ", "ἶ" => "ἶ", "ἷ" => "ἷ", "ἴ" => "ἴ", "ἵ" => "ἵ", "΄" => "΄"]

$no_stem_list

Words we don't want to be stemmed


    public
    static    array<string|int, mixed>
    $no_stem_list
     = []

$stem_step

Used to track which step in the stemming process resulted in the stem which is eventually output (typically only used by the unit tester)


    public
    static    array<string|int, mixed>
    $stem_step

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. For Greek, the top 250 words were taken from https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/Greek_wordlist#1-250


    public
    static    array<string|int, mixed>
    $stop_words
     = ['http', 'https', "να", "το", "δεν", "είναι", "θα", "και", "μου", "με", "για", "την", "σου", "τον", "τα", "που", "σε", "τι", "του", "αυτό", "ότι", "στο", "από", "της", "τη", "όχι", "ναι", "αν", "ένα", "τους", "εδώ", "μια", "αλλά", "μας", "είσαι", "σας", "ήταν", "πρέπει", "είμαι", "κι", "οι", "στην", "πολύ", "γιατί", "δε", "εγώ", "πως", "τώρα", "εντάξει", "ξέρω", "κάτι", "τις", "έχει", "έχω", "εσύ", "μην", "θέλω", "καλά", "έτσι", "στη", "στον", "αυτή", "ξέρεις", "κάνεις", "έχεις", "όταν", "μπορώ", "μόνο", "εκεί", "σαν", "μαζί", "πώς", "τίποτα", "κάνω", "όλα", "ευχαριστώ", "μπορεί", "κάνει", "ποτέ", "απ", "τόσο", "στα", "αυτά", "πού", "πάμε", "μέσα", "των", "μπορείς", "πιο", "υπάρχει", "ακόμα", "απλά", "έλα", "έχουμε", "αυτός", "σπίτι", "λοιπόν", "είμαστε", "τότε", "πίσω", "παρακαλώ", "μετά", "πριν", "ίσως", "λίγο", "νομίζω", "κύριε", "γεια", "ένας", "πάντα", "πω", "ποιος", "δουλειά", "μη", "δω", "λες", "αλήθεια", "όπως", "παιδιά", "όλοι", "είπε", "γι", "θέλεις", "άλλο", "δύο", "ας", "ζωή", "είχε", "έναν", "κάνουμε", "πάω", "οχι", "ωραία", "καλό", "είπα", "θες", "πες", "στις", "κοίτα", "πάνω", "έξω", "σένα", "χρόνια", "ώρα", "έχουν", "ούτε", "μία", "μα", "κάτω", "μένα", "φορά", "μέρα", "ήμουν", "κάποιος", "έπρεπε", "κάθε", "μέχρι", "κανείς", "καλή", "όμως", "επειδή", "γυναίκα", "πράγματα", "είστε", "είχα", "χωρίς", "ήθελα", "σωστά", "θέλει", "μαμά", "μπορούμε", "μόλις", "δυο", "πάει", "λέει", "θεέ", "πας", "καλύτερα", "ειναι", "σήμερα", "έγινε", "έκανε", "ακριβώς", "πόσο", "συγγνώμη", "πεις", "αρέσει", "έκανα", "συμβαίνει", "λυπάμαι", "πολλά", "φαίνεται", "www", "πρόβλημα", "εμένα", "είπες", "κάποιον", "στιγμή", "αυτόν", "λάθος", "μέρος", "γίνει", "όσο", "λένε", "λεφτά", "περίμενε", "χρόνο", "παιδί", "άλλη", "βλέπω", "πράγμα", "απο", "εσένα", "έκανες", "φυσικά", "δικό", "ήσουν", "γρήγορα", "πάλι", "στους", "πιστεύω", "κάποια", "ως", "φίλε", "οπότε", "μάλλον", "πάρω", "μπαμπά", "γίνεται", "λέω", "έχετε", "υπάρχουν", "ξέρει", "ιδέα", "χρειάζεται", "όλο", "ίδιο", 
"πήγαινε", "νομίζεις", "σίγουρα", "οτι", "συγνώμη", "πάρει", "μωρό", "εσείς", "νέα", "όλη", "μητέρα", "σημαίνει", "φορές", "εμείς", "είδα"]

$suffix_patterns

Associative array of suffixes to replace with simplified suffixes.


    public
    static    array<string|int, mixed>
    $suffix_patterns
     = [
    "φαγια" => "φα",
    "φαγιου" => "φα",
    "φαγιων" => "φα",
    "σκαγια" => "σκα",
    "σκαγιου" => "σκα",
    "σκαγιων" => "σκα",
    "ολογιου" => "ολο",
    "ολογια" => "ολο",
    "ολογιων" => "ολο",
    "σογιου" => "σο",
    "σογια" => "σο",
    "σογιων" => "σο",
    "τατογια" => "τατο",
    "τατογιου" => "τατο",
    "τατογιων" => "τατο",
    "κρεας" => "κρε",
    "κρεατος" => "κρε",
    "κρεατα" => "κρε",
    "κρεατων" => "κρε",
    "περας" => "περ",
    "περατος" => "περ",
    "περατη" => "περ",
    //added by spyros . also in step1 regex
    "περατα" => "περ",
    "περατων" => "περ",
    "τερας" => "τερ",
    "τερατος" => "τερ",
    "τερατα" => "τερ",
    "τερατων" => "τερ",
    "φως" => "φω",
    "φωτος" => "φω",
    "φωτα" => "φω",
    "φωτων" => "φω",
    "καθεστως" => "καθεστ",
    "καθεστωτος" => "καθεστ",
    "καθεστωτα" => "καθεστ",
    "καθεστωτων" => "καθεστ",
    "γεγονος" => "γεγον",
    "γεγονοτος" => "γεγον",
    "γεγονοτα" => "γεγον",
    "γεγονοτων" => "γεγον",
]

Used in regexStem().

regexStem()

Checks if $word matches the regex $capture_stem_regex. If so, changes $word to the stem capture group of that regex. It then checks the regexes in $exception_regexes, either all in sequence or until the first match. If a match is found, the corresponding exception stem is appended to the end of $word.


    public
            static        regexStem(string &$word, string $capture_stem_regex, mixed $exception_regexes[, string $exception_stem = "dummy" ][, bool $with_break = true ][, bool $use_suffix = false ]) : bool

Parameters

$word : string: term to be stemmed
$capture_stem_regex : string: a regex of the format /^(stem_pattern)(suffix_pattern)$/ to check against $word. The modifiers ui are added to the pattern before it is used, to enable Unicode handling.
$exception_regexes : mixed: either a string giving a single exception suffix to look for, an array of suffixes to look for, or an associative array of items append_stem => exception_regex
$exception_stem : string = "dummy": if $exception_regexes is not an associative array, this should be the suffix to append to $word if an exception regex matches
$with_break : bool = true: if true, then $exception_regexes is only checked until the first match is found. If false, all of the regexes are checked.
$use_suffix : bool = false: if true and $word matches $capture_stem_regex, then suffix_pattern is looked up as a key in the map self::$suffix_patterns. If found, the corresponding value is appended to $word.

Return values

bool —

whether word matched $capture_stem_regex
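The behaviour described above can be sketched as a standalone function. This is an illustration written from the parameter descriptions only, not the Yioop source: the name sketchRegexStem, the extra $suffix_patterns argument (standing in for self::$suffix_patterns), the treatment of exceptions as full delimited regexes tested against the captured stem, and the sample patterns are all assumptions.

```php
<?php
/**
 * Hypothetical re-creation of the documented contract of regexStem().
 * Returns whether $word matched $capture_stem_regex; on a match, $word is
 * rewritten in place.
 */
function sketchRegexStem(string &$word, string $capture_stem_regex,
    $exception_regexes, string $exception_stem = "dummy",
    bool $with_break = true, bool $use_suffix = false,
    array $suffix_patterns = []): bool
{
    // "u" turns on UTF-8 mode, "i" makes the match case-insensitive
    if (!preg_match($capture_stem_regex . "ui", $word, $matches)) {
        return false;
    }
    $word = $matches[1];
    $suffix = $matches[2] ?? "";
    if ($use_suffix && isset($suffix_patterns[$suffix])) {
        // replace the matched suffix by its simplified form
        $word .= $suffix_patterns[$suffix];
    }
    if (is_string($exception_regexes)) {
        $exception_regexes = [$exception_regexes];
    }
    foreach ($exception_regexes as $append => $regex) {
        // string keys carry their own stem; list entries use $exception_stem
        $to_append = is_string($append) ? $append : $exception_stem;
        if (preg_match($regex . "ui", $word)) {
            $word .= $to_append;
            if ($with_break) {
                break;
            }
        }
    }
    return true;
}

// Made-up patterns, chosen only to exercise both code paths.
$word = "σαρακι";
sketchRegexStem($word, '/^(.+?)(ακι|ακια)$/', ['/^(σαρ|παρ)$/'], "ακ");
echo $word, "\n";

$word = "ψαροφαγια";
sketchRegexStem($word, '/^(.*?)(φαγια|φαγιου|φαγιων)$/', [], "dummy",
    true, true, ["φαγια" => "φα"]);
echo $word, "\n";
```

In the first call the suffix is stripped and the exception regex matches the remaining stem, so "ακ" is appended. In the second call there are no exceptions and the suffix map supplies the simplified ending.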

segment()

This method currently does nothing. For some locales it could be used to split strings of the form "thisisastring" into a string with the words separated: "this is a string"


    public
            static        segment(string $pre_segment) : string

Parameters

$pre_segment : string: string to be segmented

Return values

string —

after segmentation done (same string in this case)

stem()

Computes the stem of a Greek word. The document-level comments for this class have references to the particular algorithm used.


    public
            static        stem(string $word) : string

Parameters

$word : string: is the word to be stemmed

Return values

string —

stem of $word

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation)


    public
            static        stopwordsRemover(mixed $data) : mixed

Parameters

$data : mixed: either a string or an array of strings to remove stop words from

Return values

mixed —

$data with no stop words
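One way a method with this signature could behave is sketched below. It is an assumption-laden illustration rather than the Yioop code: the function name sketchRemoveStopWords, the explicit $stop_words argument (standing in for self::$stop_words), whitespace-only term splitting, and the expectation that the input is already lower case are all choices made here.

```php
<?php
/**
 * Hypothetical stop word filter accepting either a string or an array of
 * strings, mirroring the mixed parameter and return type documented above.
 */
function sketchRemoveStopWords($data, array $stop_words)
{
    if (is_array($data)) {
        // filter each string of the array on its own
        return array_map(
            fn($item) => sketchRemoveStopWords($item, $stop_words), $data);
    }
    // flip the list so membership tests are hash lookups
    $lookup = array_flip($stop_words);
    $terms = preg_split('/\s+/u', trim($data), -1, PREG_SPLIT_NO_EMPTY);
    $kept = array_filter($terms, fn($term) => !isset($lookup[$term]));
    return implode(" ", $kept);
}

// Small excerpt of the stop word list, for illustration only.
$stop_excerpt = ["το", "είναι", "πολύ", "και"];
echo sketchRemoveStopWords("το βιβλίο είναι πολύ παλιό", $stop_excerpt), "\n";
```

Flipping the list into keys keeps each membership test constant time, which matters when a whole page of text is filtered against a list of a few hundred words.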

unmarkLetters()

Used to remove some diacritic marks from Greek characters in a term


    public
            static        unmarkLetters(string $word) : string

Parameters

$word : string: term to remove diacritic marks from

Return values

string —

with marks removed
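A character map such as $letter_map can be applied in a single pass with PHP's strtr(). The sketch below assumes that approach; the function name sketchUnmark and the eight-entry excerpt are illustrative, and the actual method may be implemented differently.

```php
<?php
// Excerpt of $letter_map: accented vowels lose their tonos and the final
// sigma becomes a regular sigma, as in the full map.
$letter_map_excerpt = [
    "ά" => "α", "έ" => "ε", "ή" => "η", "ί" => "ι",
    "ό" => "ο", "ύ" => "υ", "ώ" => "ω", "ς" => "σ",
];

/**
 * Hypothetical helper: rewrites every character of $word that has an entry
 * in $map and leaves all other characters untouched.
 */
function sketchUnmark(string $word, array $map): string
{
    // strtr() with an array replaces each key occurrence in one pass
    return strtr($word, $map);
}

echo sketchUnmark("πατέρας", $letter_map_excerpt), "\n";
```

Because strtr() compares bytes, this only works when the input uses the same precomposed UTF-8 characters as the map keys; text in a decomposed Unicode form would need to be normalized first.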
