Tokenizer
Arabic-specific tokenization code. In particular, it has a stemmer. The stemmer is my stab at porting the C stemming algorithm of Ljiljana Dolamic (University of Neuchatel, www.unine.ch/info/clef/), available at http://members.unine.ch/jacques.savoy/clef . That algorithm maps all stems to ASCII. Instead, I tried to leave everything in Arabic characters.
Table of Contents
- $no_stem_list : array<string|int, mixed>
- Words we don't want to be stemmed
- $stop_words : mixed
- A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries
- segment() : string
- Stub function which could be used for a word segmenter.
- stem() : string
- Computes the stem of an Arabic word
- stopwordsRemover() : mixed
- Removes the stop words from the page (used for Word Cloud generation)
- removeModifiersAndArchaic() : string
- Removes common letter modifiers as well as some archaic characters
- removePrefix() : string
- Removes Arabic prefixes to get root
- removeSuffix() : string
- Removes Arabic suffixes to get root
Properties
$no_stem_list
Words we don't want to be stemmed
public static array<string|int, mixed> $no_stem_list
= []
$stop_words
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries
public static mixed $stop_words
= ["ا", "أ", "،", "عشر", "عدد", "عدة", "عشرة", "عدم", "عام", "عاما", "عن", "عند", "عندما", "على", "عليه", "عليها", "زيارة", "سنة", "سنوات", "تم", "ضد", "بعد", "بعض", "اعادة", "اعلنت", "بسبب", "حتى", "اذا", "احد", "اثر", "برس", "باسم", "غدا", "شخصا", "صباح", "اطار", "اربعة", "اخرى", "بان", "اجل", "غير", "بشكل", "حاليا", "بن", "به", "ثم", "اف", "ان", "او", "اي", "بها", "صفر", "حيث", "اكد", "الا", "اما", "امس", "السابق", "التى", "التي", "اكثر", "ايار", "ايضا", "ثلاثة", "الذاتي", "الاخيرة", "الثاني", "الثانية", "الذى", "الذي", "الان", "امام", "ايام", "خلال", "حوالى", "الذين", "الاول", "الاولى", "بين", "ذلك", "دون", "حول", "حين", "الف", "الى", "انه", "اول", "ضمن", "انها", "جميع", "الماضي", "الوقت", "المقبل", "اليوم", "ـ", "ف", "و", "و6", "قد", "لا", "ما", "مع", "مساء", "هذا", "واحد", "واضاف", "واضافت", "فان", "قبل", "قال", "كان", "لدى", "نحو", "هذه", "وان", "واكد", "كانت", "واوضح", "مايو", "فى", "في", "كل", "لم", "لن", "له", "من", "هو", "هي", "قوة", "كما", "لها", "منذ", "وقد", "ولا", "نفسه", "لقاء", "مقابل", "هناك", "وقال", "وكان", "نهاية", "وقالت", "وكانت", "للامم", "فيه", "كلم", "لكن", "وفي", "وقف", "ولم", "ومن", "وهو", "وهي", "يوم", "فيها", "منها", "مليار", "لوكالة", "يكون", "يمكن", "مليون"]
Methods
segment()
Stub function which could be used for a word segmenter.
public static segment(string $pre_segment) : string
Given the input thisisabunchofwords, such a segmenter would output: this is a bunch of words
Parameters
- $pre_segment : string
- the string before segmentation
Return values
string —should return a string with words separated by spaces; in this case it does nothing
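To make the intended behavior concrete, here is a minimal sketch of one way such a segmenter could work: greedy longest-match against a word list. This is not the Tokenizer's code (the method above is a stub). The function name `segmentSketch` and the `$dictionary` parameter are hypothetical and exist only for this illustration.

```php
<?php
/**
 * Illustrative greedy longest-match segmenter (hypothetical helper).
 * $dictionary is an assumed list of known words.
 */
function segmentSketch(string $pre_segment, array $dictionary): string
{
    $lookup = array_flip($dictionary);
    // Length, in characters, of the longest dictionary entry.
    $max_len = 0;
    foreach ($dictionary as $entry) {
        $entry_chars = preg_split('//u', $entry, -1, PREG_SPLIT_NO_EMPTY);
        $max_len = max($max_len, count($entry_chars));
    }
    // Work on an array of UTF-8 characters rather than on bytes.
    $chars = preg_split('//u', $pre_segment, -1, PREG_SPLIT_NO_EMPTY);
    $total = count($chars);
    $words = [];
    $pos = 0;
    while ($pos < $total) {
        $match_len = 1; // fall back to one character if nothing matches
        for ($len = min($max_len, $total - $pos); $len >= 1; $len--) {
            $candidate = implode('', array_slice($chars, $pos, $len));
            if (isset($lookup[$candidate])) {
                $match_len = $len;
                break;
            }
        }
        $words[] = implode('', array_slice($chars, $pos, $match_len));
        $pos += $match_len;
    }
    return implode(' ', $words);
}

$dictionary = ['this', 'is', 'a', 'bunch', 'of', 'words'];
echo segmentSketch('thisisabunchofwords', $dictionary), "\n";
// prints: this is a bunch of words
```

Greedy matching is simple but can pick the wrong split when one dictionary word is a prefix of another, so a production segmenter would likely need something stronger.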
stem()
Computes the stem of an Arabic word
public static stem(string $word) : string
Parameters
- $word : string
- the string to stem
Return values
string —the stem of $word
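The sketch below shows how a list like $no_stem_list might guard a stemmer: words on the list come back unchanged, everything else is passed on. Whether stem() consults the list in exactly this way is an assumption. `stemWithGuardSketch` and the toy stemmer (which only strips a leading "ال") are hypothetical and stand in for the real logic.

```php
<?php
/**
 * Illustrative guard around a stemmer (hypothetical helper).
 * Words found in $no_stem_list are returned untouched.
 */
function stemWithGuardSketch(string $word, array $no_stem_list, callable $stemmer): string
{
    if (in_array($word, $no_stem_list, true)) {
        return $word;
    }
    return $stemmer($word);
}

// Toy stemmer for the demo only: drops a leading alef + lam (U+0627 U+0644).
$toy_stemmer = function (string $word): string {
    return preg_replace('/^\x{0627}\x{0644}/u', '', $word);
};

echo stemWithGuardSketch("الكتاب", [], $toy_stemmer), "\n";        // article stripped
echo stemWithGuardSketch("الله", ["الله"], $toy_stemmer), "\n";    // protected word kept
```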
stopwordsRemover()
Removes the stop words from the page (used for Word Cloud generation)
public static stopwordsRemover(mixed $data) : mixed
Parameters
- $data : mixed
- either a string or an array of strings to remove stop words from
Return values
mixed —$data with no stop words
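As an illustration of the string-or-array contract described above, the following sketch filters whitespace-separated tokens against a stop list. It is one plausible approach, not necessarily how stopwordsRemover() is implemented. `removeStopWordsSketch` is a hypothetical name, and the three stop words are a small subset of $stop_words.

```php
<?php
/**
 * Illustrative stop-word filter (hypothetical helper).
 * Accepts a string or an array of strings, mirroring the mixed $data parameter.
 */
function removeStopWordsSketch($data, array $stop_words)
{
    if (is_array($data)) {
        $cleaned = [];
        foreach ($data as $key => $item) {
            $cleaned[$key] = removeStopWordsSketch($item, $stop_words);
        }
        return $cleaned;
    }
    // Split on runs of whitespace and keep tokens not in the stop list.
    $tokens = preg_split('/\s+/u', trim($data), -1, PREG_SPLIT_NO_EMPTY);
    $kept = [];
    foreach ($tokens as $token) {
        if (!in_array($token, $stop_words, true)) {
            $kept[] = $token;
        }
    }
    return implode(' ', $kept);
}

$sample_stop_words = ["في", "من", "على"]; // subset of $stop_words
echo removeStopWordsSketch("الولد في البيت", $sample_stop_words), "\n";
```

Matching whole tokens avoids deleting a stop word that merely appears inside a longer word.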
removeModifiersAndArchaic()
Removes common letter modifiers as well as some archaic characters
private static removeModifiersAndArchaic(string $word) : string
Parameters
- $word : string
Return values
string —the $word after letter modifiers are removed
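The exact set of characters this method removes is not listed on this page. As a rough sketch of the idea, the code below strips the common Arabic vowel and related marks (U+064B through U+0652) and the tatweel elongation character (U+0640). That character set is an assumption, archaic letters are not handled, and `removeModifiersSketch` is a hypothetical name.

```php
<?php
/**
 * Illustrative removal of Arabic letter modifiers (hypothetical helper).
 * Assumed set: marks U+064B..U+0652 and tatweel U+0640.
 */
function removeModifiersSketch(string $word): string
{
    return preg_replace('/[\x{064B}-\x{0652}\x{0640}]/u', '', $word);
}

// "Muhammad" written with vowel marks, built from code points for clarity.
$vocalized = "\u{0645}\u{064F}\u{062D}\u{064E}\u{0645}\u{0651}\u{064E}\u{062F}";
echo removeModifiersSketch($vocalized), "\n";
// leaves only the base letters U+0645 U+062D U+0645 U+062F
```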
removePrefix()
Removes Arabic prefixes to get root
private static removePrefix(string $word) : string
Parameters
- $word : string
- word to remove prefixes from
Return values
string —the $word after prefix removal
removeSuffix()
Removes Arabic suffixes to get root
private static removeSuffix(string $word) : string
Parameters
- $word : string
- word to remove suffixes from
Return values
string —the $word after suffix removal
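The prefix and suffix lists used by removePrefix() and removeSuffix() are not shown on this page. The sketch below illustrates the general light-stemming technique with small, hypothetical lists and an assumed minimum stem length of three characters. All function names ending in `Sketch`, the lists, and the length threshold are assumptions for illustration.

```php
<?php
/** Counts UTF-8 characters using PCRE (hypothetical helper). */
function utf8LengthSketch(string $text): int
{
    return (int) preg_match_all('/./us', $text);
}

/** Strips the first matching prefix if enough of the word remains. */
function removePrefixSketch(string $word, array $prefixes, int $min_stem_len = 3): string
{
    foreach ($prefixes as $prefix) {
        if (strncmp($word, $prefix, strlen($prefix)) === 0) {
            $rest = substr($word, strlen($prefix));
            if (utf8LengthSketch($rest) >= $min_stem_len) {
                return $rest;
            }
        }
    }
    return $word;
}

/** Strips the first matching suffix if enough of the word remains. */
function removeSuffixSketch(string $word, array $suffixes, int $min_stem_len = 3): string
{
    foreach ($suffixes as $suffix) {
        $bytes = strlen($suffix);
        if (strlen($word) > $bytes && substr($word, -$bytes) === $suffix) {
            $rest = substr($word, 0, -$bytes);
            if (utf8LengthSketch($rest) >= $min_stem_len) {
                return $rest;
            }
        }
    }
    return $word;
}

// Hypothetical affix lists, longest entries first so they win over shorter ones.
$prefixes = ["وال", "ال", "و"];
$suffixes = ["ات", "ون", "ة"];

$word = removePrefixSketch("والمعلمون", $prefixes);
$word = removeSuffixSketch($word, $suffixes);
echo $word, "\n";
```

The minimum-length check keeps short words intact: with these lists, "ولد" is returned unchanged because stripping "و" would leave only two letters.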