Yioop_V9.5_Source_Code_Documentation

Tokenizer
in package

Greek specific tokenization code. Contains a list of greek stop words used in making word clouds. It also has a greek stemmer.

This stemmer is based on the algorithms described in Ntais, Georgios. Development of a Stemmer for the Greek Language. Diss. Royal Institute of Technology, 2006. and Saroukos, Spyridon. Enhancing a Greek language stemmer. University of Tampere, 2008. From here I looked at the implementation given at: https://snowballstem.org/algorithms/greek/stemmer.html In particular, I looked at the Snowball code, the Javascript Demo code, and the PHP code (GPLv3) in: https://git.drupalcode.org/project/greekstemmer Copyright (c) 2009 Vassilis Spiliopoulos (http://www.psychfamily.gr) Updated by Yannis Karampelas (info@netstudio.gr) in 2011 and 2017 respectively based on earlier work Spyros Saroukos into Drupal CMS.

The code below is largely a complete rewrite to make this work in UTF-8 lower case Greek rather than use upper case iso-8859-7 as the file encoding. Most of the repetitive code has been refactored into a method regexStem which is repeatedly called with different regex expressions.

Tags
author

Chris Pollett

Table of Contents

$dictionary_stems  : array<string|int, mixed>
This is a list of hard-coded stems. I got the test file (90000 plus terms) on the snowball site to work except for this list, so I brute forced it. My suspicion why all cases didn't work is something to do with my diacritic mark handling.
$letter_map  : array<string|int, mixed>
A map from lower case Greek letters with or without diacritic marks to to lower case Greek Letters some that keep their marks, some that don't
$no_stem_list  : array<string|int, mixed>
Words we don't want to be stemmed
$stem_step  : array<string|int, mixed>
Used to track which step in the stemming process resulted in th stem which is eventually output (typically, only used by unit tester)
$stop_words  : array<string|int, mixed>
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries For greek, took the top 250 words from https://en.wiktionary.org/ wiki/Wiktionary:Frequency_lists/Greek_wordlist#1-250
$suffix_patterns  : array<string|int, mixed>
Associative array of suffixes to replace with simplified suffixes.
regexStem()  : bool
Check is $word matches the regex $capture_stem_regex. If so chenages $word to the capture group of that regex. It then checks the regexes in $exception_regexes either in sequences or until first match. If a match is found $word has a corresponding exception stem added back to its end.
segment()  : string
This method currently does nothing. For some locales it could used to split strings of the form "thisisastring" into a string with the words separated: "this is a string"
stem()  : string
Computes the stem of a Greek word. The document level comments for this class has references to the particular algorithm used.
stopwordsRemover()  : mixed
Removes the stop words from the page (used for Word Cloud generation)
unmarkLetters()  : string
Used to remove some diacritic marks from greek characters in a term

Properties

$dictionary_stems

This is a list of hard-coded stems. I got the test file (90000 plus terms) on the snowball site to work except for this list, so I brute forced it. My suspicion why all cases didn't work is something to do with my diacritic mark handling.

public static array<string|int, mixed> $dictionary_stems = ["αιθυλεστέρας" => "αιθυλ", "αμόρφωτος" => "αμορφω", "ανανεώθηκαν" => "αν", "ανανεώθηκε" => "αν", "αντιστρόφως" => "αντιστροφω", "ανωτέρας" => "αν", "αριστεράς" => "αριστερ", "ασαφώς" => "ασαφω", "αστέρας" => "αστερ", "βαλίτσα" => "βαλιτσ", "βαλίτσας" => "βαλιτσ", "βαλίτσες" => "βαλιτσ", "γεγονος" => "γεγον", "γεγονός" => "γεγον", "γεγονότος" => "γεγον", "γιαγιάδες" => "γιαγ", "δευτέρας" => "δε", "διαιωνίζει" => "διαι", "διαιωνίζουν" => "διαι", "εγγράφως" => "εγγραφω", "επτάφωτος" => "επταφω", "εσπέρας" => "εσπερ", "εσωτερισμού" => "εσ", "θυγατέρας" => "θυγατερ", "ισοπροπυλεστέρας" => "ισοπροπυλ", "καθεστώς" => "καθεστ", "καθεστώτος" => "καθεστ", "καλντέρας" => "καλντερ", "κεντροαριστεράς" => "κεντροαριστερ", "κρέας" => "κρε", "κρέατος" => "κρε", "κυράδες" => "κυρ", "λυκόφως" => "λυκοφω", "λυκόφωτος" => "λυκοφω", "μαμάδες" => "μαμ", "μεθυλεστέρας" => "μεθυλ", "μητέρας" => "μητερ", "νεωτέρας" => "νε", "νεωτερισμοί" => "νε", "νεωτερισμούς" => "νε", "νεωτερισμό" => "νε", "νεωτερισμός" => "νε", "νεωτεριστές" => "νε", "νεωτεριστής" => "νε", "νταντάδες" => "νταντ", "νυφίτσα" => "νυφιτσ", "νυφίτσες" => "νυφιτσ", "οκάδες" => "οκ", "ολογράφως" => "ολογραφω", "πάγκρεας" => "παγκρε", "πέρας" => "περ", "πίτσα" => "πιτσ", "πίτσας" => "πιτσ", "πίτσες" => "πιτσ", "παγκρέατος" => "παγκρε", "πατέρας" => "πατερ", "πατεράδες" => "πατερ", "πατερίτσες" => "πατεριτσ", "πατερας" => "πατερ", "πολυεστέρας" => "πολυ", "προπυλεστέρας" => "προπυλ", "σαπουνόπερας" => "σαπουνοπερ", "σαράκι" => "σαρακ", "σαφώς" => "σαφω", "σιωνιστές" => "σ", "σφαγίων" => "σφα", "τέρας" => "τερ", "τέρατος" => "τερ", "φαινυλεστέρας" => "φαινυλ", "φως" => "φω", "φωτός" => "φω", "φώς" => "φω", "όπερας" => "οπερ"]

$letter_map

A map from lower case Greek letters with or without diacritic marks to to lower case Greek Letters some that keep their marks, some that don't

public static array<string|int, mixed> $letter_map = ["α" => "α", "β" => "β", "γ" => "γ", "δ" => "δ", "ε" => "ε", "ζ" => "ζ", "η" => "η", "θ" => "θ", "ι" => "ι", "κ" => "κ", "λ" => "λ", "μ" => "μ", "ν" => "ν", "ξ" => "ξ", "ο" => "ο", "π" => "π", "ρ" => "ρ", "σ" => "σ", "τ" => "τ", "υ" => "υ", "φ" => "φ", "χ" => "χ", "ψ" => "ψ", "ω" => "ω", "ά" => "α", "ὰ" => "ὰ", "ᾶ" => "ᾶ", "ἀ" => "ἀ", "ἂ" => "ἂ", "ἄ" => "ἄ", "ἃ" => "ἃ", "έ" => "ε", "ὲ" => "ὲ", "ἑ" => "ἑ", "ἐ" => "ἐ", "ἕ" => "ἕ", "ἓ" => "ἓ", "ἔ" => "ἔ", "ή" => "η", "ὴ" => "ὴ", "ῆ" => "ῆ", "ῇ" => "ῇ", "ἡ" => "ἡ", "ἣ" => "ἣ", "ἧ" => "ἧ", "ἦ" => "ἦ", "ἢ" => "ἢ", "ἤ" => "ἤ", "ό" => "ο", "ὸ" => "ὸ", "ὁ" => "ὁ", "ὅ" => "ὅ", "ὃ" => "ὃ", "ὄ" => "ὄ", "ύ" => "υ", "ὺ" => "ὺ", "ϋ" => "υ", "ῦ" => "ῦ", "ὔ" => "ὔ", "ΰ" => "υ", "ὑ" => "ὑ", "ὐ" => "ὐ", "ὖ" => "ὖ", "ῡ" => "ῡ", "ὕ" => "ὕ", "ὗ" => "ὗ", "ς" => "σ", "ώ" => "ω", "ὡ" => "ὡ", "ῶ" => "ῶ", "ὥ" => "ὥ", "ὼ" => "ὼ", "ῳ" => "ῳ", "ὧ" => "ὧ", "ῷ" => "ῷ", "ᾧ" => "ᾧ", "ὦ" => "ὦ", "ί" => "ι", "ὶ" => "ὶ", "ϊ" => "η", "ῖ" => "ῖ", "ΐ" => "η", "ἱ" => "ἱ", "ἰ" => "ἰ", "ἶ" => "ἶ", "ἷ" => "ἷ", "ἴ" => "ἴ", "ἵ" => "ἵ", "΄" => "΄"]

$no_stem_list

Words we don't want to be stemmed

public static array<string|int, mixed> $no_stem_list = []

$stem_step

Used to track which step in the stemming process resulted in th stem which is eventually output (typically, only used by unit tester)

public static array<string|int, mixed> $stem_step

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries For greek, took the top 250 words from https://en.wiktionary.org/ wiki/Wiktionary:Frequency_lists/Greek_wordlist#1-250

public static array<string|int, mixed> $stop_words = ['http', 'https', "να", "το", "δεν", "είναι", "θα", "και", "μου", "με", "για", "την", "σου", "τον", "τα", "που", "σε", "τι", "του", "αυτό", "ότι", "στο", "από", "της", "τη", "όχι", "ναι", "αν", "ένα", "τους", "εδώ", "μια", "αλλά", "μας", "είσαι", "σας", "ήταν", "πρέπει", "είμαι", "κι", "οι", "στην", "πολύ", "γιατί", "δε", "εγώ", "πως", "τώρα", "εντάξει", "ξέρω", "κάτι", "τις", "έχει", "έχω", "εσύ", "μην", "θέλω", "καλά", "έτσι", "στη", "στον", "αυτή", "ξέρεις", "κάνεις", "έχεις", "όταν", "μπορώ", "μόνο", "εκεί", "σαν", "μαζί", "πώς", "τίποτα", "κάνω", "όλα", "ευχαριστώ", "μπορεί", "κάνει", "ποτέ", "απ", "τόσο", "στα", "αυτά", "πού", "πάμε", "μέσα", "των", "μπορείς", "πιο", "υπάρχει", "ακόμα", "απλά", "έλα", "έχουμε", "αυτός", "σπίτι", "λοιπόν", "είμαστε", "τότε", "πίσω", "παρακαλώ", "μετά", "πριν", "ίσως", "λίγο", "νομίζω", "κύριε", "γεια", "ένας", "πάντα", "πω", "ποιος", "δουλειά", "μη", "δω", "λες", "αλήθεια", "όπως", "παιδιά", "όλοι", "είπε", "γι", "θέλεις", "άλλο", "δύο", "ας", "ζωή", "είχε", "έναν", "κάνουμε", "πάω", "οχι", "ωραία", "καλό", "είπα", "θες", "πες", "στις", "κοίτα", "πάνω", "έξω", "σένα", "χρόνια", "ώρα", "έχουν", "ούτε", "μία", "μα", "κάτω", "μένα", "φορά", "μέρα", "ήμουν", "κάποιος", "έπρεπε", "κάθε", "μέχρι", "κανείς", "καλή", "όμως", "επειδή", "γυναίκα", "πράγματα", "είστε", "είχα", "χωρίς", "ήθελα", "σωστά", "θέλει", "μαμά", "μπορούμε", "μόλις", "δυο", "πάει", "λέει", "θεέ", "πας", "καλύτερα", "ειναι", "σήμερα", "έγινε", "έκανε", "ακριβώς", "πόσο", "συγγνώμη", "πεις", "αρέσει", "έκανα", "συμβαίνει", "λυπάμαι", "πολλά", "φαίνεται", "www", "πρόβλημα", "εμένα", "είπες", "κάποιον", "στιγμή", "αυτόν", "λάθος", "μέρος", "γίνει", "όσο", "λένε", "λεφτά", "περίμενε", "χρόνο", "παιδί", "άλλη", "βλέπω", "πράγμα", "απο", "εσένα", "έκανες", "φυσικά", "δικό", "ήσουν", "γρήγορα", "πάλι", "στους", "πιστεύω", "κάποια", "ως", "φίλε", "οπότε", "μάλλον", "πάρω", "μπαμπά", "γίνεται", "λέω", "έχετε", "υπάρχουν", "ξέρει", "ιδέα", "χρειάζεται", "όλο", "ίδιο", "πήγαινε", "νομίζεις", "σίγουρα", "οτι", "συγνώμη", "πάρει", "μωρό", "εσείς", "νέα", "όλη", "μητέρα", "σημαίνει", "φορές", "εμείς", "είδα"]

$suffix_patterns

Associative array of suffixes to replace with simplified suffixes.

public static array<string|int, mixed> $suffix_patterns = [ "φαγια" => "φα", "φαγιου" => "φα", "φαγιου" => "φα", "σκαγια" => "σκα", "σκαγιου" => "σκα", "σκαγιων" => "σκα", "ολογιου" => "ολο", "ολογια" => "ολο", "ολογιων" => "ολο", "σογιου" => "σο", "σογια" => "σο", "σογιων" => "σο", "τατογια" => "τατο", "τατογιου" => "τατο", "τατογιων" => "τατο", "κρεας" => "κρε", "κρεατος" => "κρε", "κρεατα" => "κρε", "κρεατων" => "κρε", "περας" => "περ", "περατος" => "περ", "περατη" => "περ", //added by spyros . also in step1 regex "περατα" => "περ", "περατων" => "περ", "τερας" => "τερ", "τερατος" => "τερ", "τερατα" => "τερ", "τερατων" => "τερ", "φως" => "φω", "φωτος" => "φω", "φωτα" => "φω", "φωτων" => "φω", "καθεστως" => "καθεστ", "καθεστωτος" => "καθεστ", "καθεστωτα" => "καθεστ", "καθεστωτων" => "καθεστ", "γεγονος" => "γεγον", "γεγονοτος" => "γεγον", "γεγονοτα" => "γεγον", "γεγονοτων" => "γεγον", ]

Used in @see regexStem

Methods

regexStem()

Check is $word matches the regex $capture_stem_regex. If so chenages $word to the capture group of that regex. It then checks the regexes in $exception_regexes either in sequences or until first match. If a match is found $word has a corresponding exception stem added back to its end.

public static regexStem(string &$word, string $capture_stem_regex, mixed $exception_regexes[, string $exception_stem = "dummy" ][, bool $with_break = true ][, bool $use_suffix = false ]) : bool
Parameters
$word : string

term to be stemmed

$capture_stem_regex : string

a regex of format: /^(stem_pattern)(suffix_pattern)$/ to check against $word. ui is added to the pattern before used to enable unicode.

$exception_regexes : mixed

either a string single exception suffix to look for or an array of suffixes to look for, or an associative array of items append_stem => exception_regex

$exception_stem : string = "dummy"

if $exception_regexes is not an associative array this should be the suffix to append to word if an exception_regex matches

$with_break : bool = true

if true, then the checking of $exception_regexes is only done till the first match is found. If false, all regexes are checked against

$use_suffix : bool = false

if true and $word watches $capture_stem_regex. then suffix_pattern is looked up as a key in the map self::$suffix_patterns. If found, the corresponding value is appendded $word.

Return values
bool

whether word matched $capture_stem_regex

segment()

This method currently does nothing. For some locales it could used to split strings of the form "thisisastring" into a string with the words separated: "this is a string"

public static segment(string $pre_segment) : string
Parameters
$pre_segment : string

string to be segmented

Return values
string

after segmentation done (same string in this case)

stem()

Computes the stem of a Greek word. The document level comments for this class has references to the particular algorithm used.

public static stem(string $word) : string
Parameters
$word : string

is the word to be stemmed

Return values
string

stem of $word

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation)

public static stopwordsRemover(mixed $data) : mixed
Parameters
$data : mixed

either a string or an array of string to remove stop words from

Return values
mixed

$data with no stop words

unmarkLetters()

Used to remove some diacritic marks from greek characters in a term

public static unmarkLetters(string $word) : string
Parameters
$word : string

term to remove diacritic marks from

Return values
string

with marks removed


        

Search results