Tokenizer
in package
Greek-specific tokenization code. Contains a list of Greek stop words used in making word clouds, and a Greek stemmer.
This stemmer is based on the algorithms described in: Ntais, Georgios. Development of a Stemmer for the Greek Language. Diss. Royal Institute of Technology, 2006; and Saroukos, Spyridon. Enhancing a Greek language stemmer. University of Tampere, 2008. From there I looked at the implementation given at https://snowballstem.org/algorithms/greek/stemmer.html. In particular, I looked at the Snowball code, the JavaScript demo code, and the PHP code (GPLv3) at https://git.drupalcode.org/project/greekstemmer, Copyright (c) 2009 Vassilis Spiliopoulos (http://www.psychfamily.gr), updated by Yannis Karampelas (info@netstudio.gr) in 2011 and 2017, based on earlier work by Spyros Saroukos, for the Drupal CMS.
The code below is largely a complete rewrite so that it works on UTF-8 lower case Greek rather than using upper case ISO-8859-7 as the file encoding. Most of the repetitive code has been refactored into a method, regexStem, which is called repeatedly with different regular expressions.
Table of Contents
- $dictionary_stems : array<string|int, mixed>
- A list of hard-coded stems. I got the test file on the Snowball site (more than 90,000 terms) to work except for the words in this list, so I brute-forced them. I suspect the remaining failures have something to do with my diacritic mark handling.
- $letter_map : array<string|int, mixed>
- A map from lower case Greek letters, with or without diacritic marks, to lower case Greek letters. Some letters keep their marks and some do not.
- $no_stem_list : array<string|int, mixed>
- Words we don't want to be stemmed
- $stem_step : array<string|int, mixed>
- Used to track which step in the stemming process produced the stem that is eventually output (typically only used by the unit tester)
- $stop_words : array<string|int, mixed>
- A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. For Greek, the top 250 words were taken from https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/Greek_wordlist#1-250
- $suffix_patterns : array<string|int, mixed>
- Associative array of suffixes to replace with simplified suffixes.
- regexStem() : bool
- Checks whether $word matches the regex $capture_stem_regex. If so, changes $word to the stem capture group of that regex. It then checks the regexes in $exception_regexes, either all in sequence or only until the first match. If a match is found, the corresponding exception stem is appended to $word.
- segment() : string
- This method currently does nothing. For some locales it could be used to split strings of the form "thisisastring" into a string with the words separated: "this is a string"
- stem() : string
- Computes the stem of a Greek word. The document-level comments for this class have references to the particular algorithm used.
- stopwordsRemover() : mixed
- Removes the stop words from the page (used for Word Cloud generation)
- unmarkLetters() : string
- Used to remove some diacritic marks from Greek characters in a term
Properties
$dictionary_stems
A list of hard-coded stems. I got the test file on the Snowball site (more than 90,000 terms) to work except for the words in this list, so I brute-forced them. I suspect the remaining failures have something to do with my diacritic mark handling.
public
static array<string|int, mixed>
$dictionary_stems
= ["αιθυλεστέρας" => "αιθυλ", "αμόρφωτος" => "αμορφω", "ανανεώθηκαν" => "αν", "ανανεώθηκε" => "αν", "αντιστρόφως" => "αντιστροφω", "ανωτέρας" => "αν", "αριστεράς" => "αριστερ", "ασαφώς" => "ασαφω", "αστέρας" => "αστερ", "βαλίτσα" => "βαλιτσ", "βαλίτσας" => "βαλιτσ", "βαλίτσες" => "βαλιτσ", "γεγονος" => "γεγον", "γεγονός" => "γεγον", "γεγονότος" => "γεγον", "γιαγιάδες" => "γιαγ", "δευτέρας" => "δε", "διαιωνίζει" => "διαι", "διαιωνίζουν" => "διαι", "εγγράφως" => "εγγραφω", "επτάφωτος" => "επταφω", "εσπέρας" => "εσπερ", "εσωτερισμού" => "εσ", "θυγατέρας" => "θυγατερ", "ισοπροπυλεστέρας" => "ισοπροπυλ", "καθεστώς" => "καθεστ", "καθεστώτος" => "καθεστ", "καλντέρας" => "καλντερ", "κεντροαριστεράς" => "κεντροαριστερ", "κρέας" => "κρε", "κρέατος" => "κρε", "κυράδες" => "κυρ", "λυκόφως" => "λυκοφω", "λυκόφωτος" => "λυκοφω", "μαμάδες" => "μαμ", "μεθυλεστέρας" => "μεθυλ", "μητέρας" => "μητερ", "νεωτέρας" => "νε", "νεωτερισμοί" => "νε", "νεωτερισμούς" => "νε", "νεωτερισμό" => "νε", "νεωτερισμός" => "νε", "νεωτεριστές" => "νε", "νεωτεριστής" => "νε", "νταντάδες" => "νταντ", "νυφίτσα" => "νυφιτσ", "νυφίτσες" => "νυφιτσ", "οκάδες" => "οκ", "ολογράφως" => "ολογραφω", "πάγκρεας" => "παγκρε", "πέρας" => "περ", "πίτσα" => "πιτσ", "πίτσας" => "πιτσ", "πίτσες" => "πιτσ", "παγκρέατος" => "παγκρε", "πατέρας" => "πατερ", "πατεράδες" => "πατερ", "πατερίτσες" => "πατεριτσ", "πατερας" => "πατερ", "πολυεστέρας" => "πολυ", "προπυλεστέρας" => "προπυλ", "σαπουνόπερας" => "σαπουνοπερ", "σαράκι" => "σαρακ", "σαφώς" => "σαφω", "σιωνιστές" => "σ", "σφαγίων" => "σφα", "τέρας" => "τερ", "τέρατος" => "τερ", "φαινυλεστέρας" => "φαινυλ", "φως" => "φω", "φωτός" => "φω", "φώς" => "φω", "όπερας" => "οπερ"]
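A hard-coded table like this is typically consulted before any rule-based stemming runs. The following is a minimal sketch of such a lookup, not the class's actual code: the helper name lookupDictionaryStem is hypothetical, the table is a three-entry excerpt of the list above, and the word is assumed to be lower case UTF-8 with its diacritics intact, since the keys above keep theirs.

```php
<?php
// Excerpt of the hard-coded stem table (three entries copied from the
// $dictionary_stems default value above).
$dictionary_stems = [
    "πατέρας" => "πατερ",
    "φως" => "φω",
    "κρέας" => "κρε",
];

/**
 * Hypothetical helper (not part of the documented class): returns the
 * hard-coded stem for $word if one exists, otherwise null so that the
 * caller can fall through to rule-based stemming.
 */
function lookupDictionaryStem(string $word, array $dictionary_stems): ?string
{
    // Keys keep their diacritic marks, so the lookup uses the word as given.
    return $dictionary_stems[$word] ?? null;
}
```

Returning null for unknown words keeps "no entry" distinct from any real stem, so the caller can decide whether to continue with the regex-based steps.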
$letter_map
A map from lower case Greek letters, with or without diacritic marks, to lower case Greek letters. Some letters keep their marks and some do not.
public
static array<string|int, mixed>
$letter_map
= ["α" => "α", "β" => "β", "γ" => "γ", "δ" => "δ", "ε" => "ε", "ζ" => "ζ", "η" => "η", "θ" => "θ", "ι" => "ι", "κ" => "κ", "λ" => "λ", "μ" => "μ", "ν" => "ν", "ξ" => "ξ", "ο" => "ο", "π" => "π", "ρ" => "ρ", "σ" => "σ", "τ" => "τ", "υ" => "υ", "φ" => "φ", "χ" => "χ", "ψ" => "ψ", "ω" => "ω", "ά" => "α", "ὰ" => "ὰ", "ᾶ" => "ᾶ", "ἀ" => "ἀ", "ἂ" => "ἂ", "ἄ" => "ἄ", "ἃ" => "ἃ", "έ" => "ε", "ὲ" => "ὲ", "ἑ" => "ἑ", "ἐ" => "ἐ", "ἕ" => "ἕ", "ἓ" => "ἓ", "ἔ" => "ἔ", "ή" => "η", "ὴ" => "ὴ", "ῆ" => "ῆ", "ῇ" => "ῇ", "ἡ" => "ἡ", "ἣ" => "ἣ", "ἧ" => "ἧ", "ἦ" => "ἦ", "ἢ" => "ἢ", "ἤ" => "ἤ", "ό" => "ο", "ὸ" => "ὸ", "ὁ" => "ὁ", "ὅ" => "ὅ", "ὃ" => "ὃ", "ὄ" => "ὄ", "ύ" => "υ", "ὺ" => "ὺ", "ϋ" => "υ", "ῦ" => "ῦ", "ὔ" => "ὔ", "ΰ" => "υ", "ὑ" => "ὑ", "ὐ" => "ὐ", "ὖ" => "ὖ", "ῡ" => "ῡ", "ὕ" => "ὕ", "ὗ" => "ὗ", "ς" => "σ", "ώ" => "ω", "ὡ" => "ὡ", "ῶ" => "ῶ", "ὥ" => "ὥ", "ὼ" => "ὼ", "ῳ" => "ῳ", "ὧ" => "ὧ", "ῷ" => "ῷ", "ᾧ" => "ᾧ", "ὦ" => "ὦ", "ί" => "ι", "ὶ" => "ὶ", "ϊ" => "η", "ῖ" => "ῖ", "ΐ" => "η", "ἱ" => "ἱ", "ἰ" => "ἰ", "ἶ" => "ἶ", "ἷ" => "ἷ", "ἴ" => "ἴ", "ἵ" => "ἵ", "΄" => "΄"]
$no_stem_list
Words we don't want to be stemmed
public
static array<string|int, mixed>
$no_stem_list
= []
$stem_step
Used to track which step in the stemming process produced the stem that is eventually output (typically only used by the unit tester)
public
static array<string|int, mixed>
$stem_step
$stop_words
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. For Greek, the top 250 words were taken from https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/Greek_wordlist#1-250
public
static array<string|int, mixed>
$stop_words
= ['http', 'https', "να", "το", "δεν", "είναι", "θα", "και", "μου", "με", "για", "την", "σου", "τον", "τα", "που", "σε", "τι", "του", "αυτό", "ότι", "στο", "από", "της", "τη", "όχι", "ναι", "αν", "ένα", "τους", "εδώ", "μια", "αλλά", "μας", "είσαι", "σας", "ήταν", "πρέπει", "είμαι", "κι", "οι", "στην", "πολύ", "γιατί", "δε", "εγώ", "πως", "τώρα", "εντάξει", "ξέρω", "κάτι", "τις", "έχει", "έχω", "εσύ", "μην", "θέλω", "καλά", "έτσι", "στη", "στον", "αυτή", "ξέρεις", "κάνεις", "έχεις", "όταν", "μπορώ", "μόνο", "εκεί", "σαν", "μαζί", "πώς", "τίποτα", "κάνω", "όλα", "ευχαριστώ", "μπορεί", "κάνει", "ποτέ", "απ", "τόσο", "στα", "αυτά", "πού", "πάμε", "μέσα", "των", "μπορείς", "πιο", "υπάρχει", "ακόμα", "απλά", "έλα", "έχουμε", "αυτός", "σπίτι", "λοιπόν", "είμαστε", "τότε", "πίσω", "παρακαλώ", "μετά", "πριν", "ίσως", "λίγο", "νομίζω", "κύριε", "γεια", "ένας", "πάντα", "πω", "ποιος", "δουλειά", "μη", "δω", "λες", "αλήθεια", "όπως", "παιδιά", "όλοι", "είπε", "γι", "θέλεις", "άλλο", "δύο", "ας", "ζωή", "είχε", "έναν", "κάνουμε", "πάω", "οχι", "ωραία", "καλό", "είπα", "θες", "πες", "στις", "κοίτα", "πάνω", "έξω", "σένα", "χρόνια", "ώρα", "έχουν", "ούτε", "μία", "μα", "κάτω", "μένα", "φορά", "μέρα", "ήμουν", "κάποιος", "έπρεπε", "κάθε", "μέχρι", "κανείς", "καλή", "όμως", "επειδή", "γυναίκα", "πράγματα", "είστε", "είχα", "χωρίς", "ήθελα", "σωστά", "θέλει", "μαμά", "μπορούμε", "μόλις", "δυο", "πάει", "λέει", "θεέ", "πας", "καλύτερα", "ειναι", "σήμερα", "έγινε", "έκανε", "ακριβώς", "πόσο", "συγγνώμη", "πεις", "αρέσει", "έκανα", "συμβαίνει", "λυπάμαι", "πολλά", "φαίνεται", "www", "πρόβλημα", "εμένα", "είπες", "κάποιον", "στιγμή", "αυτόν", "λάθος", "μέρος", "γίνει", "όσο", "λένε", "λεφτά", "περίμενε", "χρόνο", "παιδί", "άλλη", "βλέπω", "πράγμα", "απο", "εσένα", "έκανες", "φυσικά", "δικό", "ήσουν", "γρήγορα", "πάλι", "στους", "πιστεύω", "κάποια", "ως", "φίλε", "οπότε", "μάλλον", "πάρω", "μπαμπά", "γίνεται", "λέω", "έχετε", "υπάρχουν", "ξέρει", "ιδέα", "χρειάζεται", "όλο", "ίδιο", 
"πήγαινε", "νομίζεις", "σίγουρα", "οτι", "συγνώμη", "πάρει", "μωρό", "εσείς", "νέα", "όλη", "μητέρα", "σημαίνει", "φορές", "εμείς", "είδα"]
$suffix_patterns
Associative array of suffixes to replace with simplified suffixes.
public
static array<string|int, mixed>
$suffix_patterns
= [
"φαγια" => "φα",
"φαγιου" => "φα",
"φαγιων" => "φα",
"σκαγια" => "σκα",
"σκαγιου" => "σκα",
"σκαγιων" => "σκα",
"ολογιου" => "ολο",
"ολογια" => "ολο",
"ολογιων" => "ολο",
"σογιου" => "σο",
"σογια" => "σο",
"σογιων" => "σο",
"τατογια" => "τατο",
"τατογιου" => "τατο",
"τατογιων" => "τατο",
"κρεας" => "κρε",
"κρεατος" => "κρε",
"κρεατα" => "κρε",
"κρεατων" => "κρε",
"περας" => "περ",
"περατος" => "περ",
"περατη" => "περ",
// added by Spyros; also in the step1 regex
"περατα" => "περ",
"περατων" => "περ",
"τερας" => "τερ",
"τερατος" => "τερ",
"τερατα" => "τερ",
"τερατων" => "τερ",
"φως" => "φω",
"φωτος" => "φω",
"φωτα" => "φω",
"φωτων" => "φω",
"καθεστως" => "καθεστ",
"καθεστωτος" => "καθεστ",
"καθεστωτα" => "καθεστ",
"καθεστωτων" => "καθεστ",
"γεγονος" => "γεγον",
"γεγονοτος" => "γεγον",
"γεγονοτα" => "γεγον",
"γεγονοτων" => "γεγον",
]
Used in regexStem().
Methods
regexStem()
Checks whether $word matches the regex $capture_stem_regex. If so, changes $word to the stem capture group of that regex. It then checks the regexes in $exception_regexes, either all in sequence or only until the first match. If a match is found, the corresponding exception stem is appended to $word.
public
static regexStem(string &$word, string $capture_stem_regex, mixed $exception_regexes[, string $exception_stem = "dummy" ][, bool $with_break = true ][, bool $use_suffix = false ]) : bool
Parameters
- $word : string
-
term to be stemmed
- $capture_stem_regex : string
-
a regex of the format /^(stem_pattern)(suffix_pattern)$/ to check against $word. The modifiers ui are appended to the pattern before it is used, to enable Unicode (and case-insensitive) matching.
- $exception_regexes : mixed
-
either a string holding a single exception suffix to look for, an array of suffixes to look for, or an associative array of items append_stem => exception_regex
- $exception_stem : string = "dummy"
-
if $exception_regexes is not an associative array, this should be the suffix to append to $word if an exception regex matches
- $with_break : bool = true
-
if true, the regexes in $exception_regexes are checked only until the first match is found. If false, all regexes are checked.
- $use_suffix : bool = false
-
if true and $word matches $capture_stem_regex, then suffix_pattern is looked up as a key in the map self::$suffix_patterns. If found, the corresponding value is appended to $word.
Return values
bool —whether word matched $capture_stem_regex
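The behaviour described above can be sketched as a standalone function. This is an illustration under stated assumptions, not the class's implementation: the name regexStemSketch is hypothetical, the suffix map is passed as an extra argument instead of being read from self::$suffix_patterns, the exception regexes are assumed to be delimited patterns without modifiers that are tested against the captured stem, and the suffix lookup is assumed to happen before the exception checks.

```php
<?php
/**
 * Sketch of the documented regexStem() behaviour (hypothetical name and
 * extra $suffix_patterns parameter; not the class's actual code).
 */
function regexStemSketch(string &$word, string $capture_stem_regex,
    $exception_regexes, string $exception_stem = "dummy",
    bool $with_break = true, bool $use_suffix = false,
    array $suffix_patterns = []): bool
{
    // Append the modifiers: u = treat pattern and subject as UTF-8,
    // i = case-insensitive matching.
    if (!preg_match($capture_stem_regex . "ui", $word, $matches)) {
        return false;
    }
    // Keep only the stem capture group.
    $word = $matches[1];
    // Optionally replace the removed suffix with its simplified form.
    if ($use_suffix && isset($matches[2], $suffix_patterns[$matches[2]])) {
        $word .= $suffix_patterns[$matches[2]];
    }
    // A string key means "append_stem => exception_regex"; otherwise the
    // shared $exception_stem is appended. (array) wraps a single string.
    foreach ((array) $exception_regexes as $key => $regex) {
        $append = is_string($key) ? $key : $exception_stem;
        if (preg_match($regex . "ui", $word)) {
            $word .= $append;
            if ($with_break) {
                break;
            }
        }
    }
    return true;
}
```

For example, calling the sketch on "ψαροφαγια" with the pattern '/^(.*?)(φαγια|φαγιου|φαγιων)$/', no exceptions, $use_suffix set to true, and a suffix map containing "φαγια" => "φα" leaves $word as "ψαροφα" and returns true.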
segment()
This method currently does nothing. For some locales it could be used to split strings of the form "thisisastring" into a string with the words separated: "this is a string"
public
static segment(string $pre_segment) : string
Parameters
- $pre_segment : string
-
string to be segmented
Return values
string —after segmentation done (same string in this case)
stem()
Computes the stem of a Greek word. The document-level comments for this class have references to the particular algorithm used.
public
static stem(string $word) : string
Parameters
- $word : string
-
is the word to be stemmed
Return values
string —stem of $word
stopwordsRemover()
Removes the stop words from the page (used for Word Cloud generation)
public
static stopwordsRemover(mixed $data) : mixed
Parameters
- $data : mixed
-
either a string or an array of strings to remove stop words from
Return values
mixed —$data with no stop words
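One way to remove stop words from either a string or an array of strings is a single whole-word regex replacement. The sketch below is an assumption about how this could be done, not the class's actual code: the function name removeStopWordsSketch is hypothetical, the stop word list is a five-word excerpt of $stop_words, and surrounding whitespace is left as it is.

```php
<?php
// Excerpt of the stop word list (five entries copied from $stop_words above).
$stop_words = ["να", "το", "δεν", "είναι", "και"];

/**
 * Hypothetical sketch: deletes every whole-word occurrence of a stop word
 * from $data, which may be a string or an array of strings.
 */
function removeStopWordsSketch($data, array $stop_words)
{
    $alternatives = array_map(
        fn($stop_word) => preg_quote($stop_word, '/'), $stop_words);
    // The lookarounds require that no letter or digit touches the match,
    // so "το" is removed but the "το" inside a longer word is kept.
    $pattern = '/(?<![\p{L}\p{N}])(?:' . implode('|', $alternatives) .
        ')(?![\p{L}\p{N}])/ui';
    // preg_replace accepts a string or an array subject and returns the
    // same kind of value.
    return preg_replace($pattern, '', $data);
}
```

Because only the matched words are deleted, a caller building a word cloud would still need to split the result on whitespace and drop empty pieces.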
unmarkLetters()
Used to remove some diacritic marks from Greek characters in a term
public
static unmarkLetters(string $word) : string
Parameters
- $word : string
-
term to remove diacritic marks from
Return values
string —with marks removed
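Given a character map such as $letter_map, the unmarking can be sketched with PHP's strtr(), which replaces each key found in the string with its value. This is an illustrative sketch rather than the class's code: the name unmarkLettersSketch is hypothetical, and the map is an eight-entry excerpt of $letter_map.

```php
<?php
// Excerpt of the letter map (entries copied from $letter_map above):
// tonos-marked vowels lose their mark and final sigma becomes sigma.
$letter_map = [
    "ά" => "α", "έ" => "ε", "ή" => "η", "ί" => "ι",
    "ό" => "ο", "ύ" => "υ", "ώ" => "ω", "ς" => "σ",
];

/**
 * Hypothetical sketch: maps each character of $word through $letter_map,
 * leaving characters without an entry unchanged.
 */
function unmarkLettersSketch(string $word, array $letter_map): string
{
    // strtr() works on bytes, but every key is a complete UTF-8 character,
    // so a key cannot match in the middle of another character.
    return strtr($word, $letter_map);
}
```

With this excerpt, "πατέρας" becomes "πατερασ": the accent on έ is dropped and the final ς is normalised to σ.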