Yioop_V9.5_Source_Code_Documentation

Tokenizer

Arabic-specific tokenization code. In particular, it has a stemmer, which is my attempt at porting Ljiljana Dolamic's (University of Neuchatel, www.unine.ch/info/clef/) C stemming algorithm: http://members.unine.ch/jacques.savoy/clef. That algorithm maps all stems to ASCII; instead, I tried to leave everything in Arabic characters.

Tags
author

Chris Pollett

Table of Contents

$no_stem_list  : array<string|int, mixed>
Words we don't want to be stemmed
$stop_words  : mixed
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries
segment()  : string
Stub function which could be used for a word segmenter.
stem()  : string
Computes the stem of an Arabic word
stopwordsRemover()  : mixed
Removes the stop words from the page (used for Word Cloud generation)
removeModifiersAndArchaic()  : string
Removes common letter modifiers as well as some archaic characters
removePrefix()  : string
Removes Arabic prefixes to get root
removeSuffix()  : string
Removes Arabic suffixes to get root

Properties

$no_stem_list

Words we don't want to be stemmed

public static array<string|int, mixed> $no_stem_list = []

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries

public static mixed $stop_words = ["ا", "أ", "،", "عشر", "عدد", "عدة", "عشرة", "عدم", "عام", "عاما", "عن", "عند", "عندما", "على", "عليه", "عليها", "زيارة", "سنة", "سنوات", "تم", "ضد", "بعد", "بعض", "اعادة", "اعلنت", "بسبب", "حتى", "اذا", "احد", "اثر", "برس", "باسم", "غدا", "شخصا", "صباح", "اطار", "اربعة", "اخرى", "بان", "اجل", "غير", "بشكل", "حاليا", "بن", "به", "ثم", "اف", "ان", "او", "اي", "بها", "صفر", "حيث", "اكد", "الا", "اما", "امس", "السابق", "التى", "التي", "اكثر", "ايار", "ايضا", "ثلاثة", "الذاتي", "الاخيرة", "الثاني", "الثانية", "الذى", "الذي", "الان", "امام", "ايام", "خلال", "حوالى", "الذين", "الاول", "الاولى", "بين", "ذلك", "دون", "حول", "حين", "الف", "الى", "انه", "اول", "ضمن", "انها", "جميع", "الماضي", "الوقت", "المقبل", "اليوم", "ـ", "ف", "و", "و6", "قد", "لا", "ما", "مع", "مساء", "هذا", "واحد", "واضاف", "واضافت", "فان", "قبل", "قال", "كان", "لدى", "نحو", "هذه", "وان", "واكد", "كانت", "واوضح", "مايو", "فى", "في", "كل", "لم", "لن", "له", "من", "هو", "هي", "قوة", "كما", "لها", "منذ", "وقد", "ولا", "نفسه", "لقاء", "مقابل", "هناك", "وقال", "وكان", "نهاية", "وقالت", "وكانت", "للامم", "فيه", "كلم", "لكن", "وفي", "وقف", "ولم", "ومن", "وهو", "وهي", "يوم", "فيها", "منها", "مليار", "لوكالة", "يكون", "يمكن", "مليون"]
Tags
array

Methods

segment()

Stub function which could be used for a word segmenter.

public static segment(string $pre_segment) : string

Such a segmenter, on input "thisisabunchofwords", would output "this is a bunch of words".

Parameters
$pre_segment : string

before segmentation

Return values
string

should return a string with words separated by spaces; in this case the stub does nothing
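As a rough illustration of the stub's contract, the sketch below assumes that "does nothing" means the input is handed back unchanged. The function name segmentStub is hypothetical and is not part of the Yioop source.

```php
<?php
// Hypothetical stand-in for the documented stub; not the Yioop source.
function segmentStub(string $pre_segment): string
{
    // A real segmenter would insert spaces at word boundaries here.
    // Assumption: the stub simply returns its input untouched.
    return $pre_segment;
}

// The input comes back without any spaces inserted.
echo segmentStub("thisisabunchofwords"), "\n";
```

A locale whose script already separates words with spaces has little need for more than such a stub.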

stem()

Computes the stem of an Arabic word

public static stem(string $word) : string
Parameters
$word : string

the string to stem

Return values
string

the stem of $word

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation)

public static stopwordsRemover(mixed $data) : mixed
Parameters
$data : mixed

either a string or an array of strings to remove stop words from

Return values
mixed

$data with no stop words
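One plausible way to handle both input shapes is sketched below. It is an assumption about the general approach, not the method's actual body: the helper name removeStopWordsSketch, the whitespace tokenization, and the three-word list (a small subset of $stop_words) are all chosen for illustration.

```php
<?php
// Hypothetical helper, not the Yioop implementation.
// Accepts a string or an array of strings, as the documented method does.
function removeStopWordsSketch($data, array $stop_words)
{
    if (is_array($data)) {
        // Process each string of the array independently
        return array_map(
            fn($s) => removeStopWordsSketch($s, $stop_words),
            $data
        );
    }
    // Assumption: tokens are separated by whitespace
    $tokens = preg_split('/\s+/u', trim($data), -1, PREG_SPLIT_NO_EMPTY);
    $kept = array_filter(
        $tokens,
        fn($t) => !in_array($t, $stop_words, true)
    );
    return implode(" ", $kept);
}

// Small subset of the documented $stop_words list
$stop_words = ["في", "من", "على"];
// The stop word "على" is dropped; the other two words are kept
echo removeStopWordsSketch("الكتاب على الطاولة", $stop_words), "\n";
```

Exact token matching is used here so that a stop word appearing inside a longer word is left alone.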

removeModifiersAndArchaic()

Removes common letter modifiers as well as some archaic characters

private static removeModifiersAndArchaic(string $word) : string
Parameters
$word : string

word to remove letter modifiers and archaic characters from

Return values
string

the $word after letter modifiers are removed
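The documentation does not list which characters are removed. The sketch below assumes the modifiers include the Arabic diacritic marks U+064B through U+0652 and the tatweel (U+0640); the character set actually used by the method may differ. The name stripDiacriticsSketch is hypothetical.

```php
<?php
// Hypothetical helper; the character ranges are an assumption,
// not taken from the Yioop source.
function stripDiacriticsSketch(string $word): string
{
    // U+064B..U+0652: short-vowel and related marks; U+0640: tatweel
    $stripped = preg_replace('/[\x{064B}-\x{0652}\x{0640}]/u', '', $word);
    // preg_replace returns null on error (e.g. invalid UTF-8)
    return $stripped ?? $word;
}

// A vocalized word is reduced to its bare letters
echo stripDiacriticsSketch("كِتَاب"), "\n";
```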

removePrefix()

Removes Arabic prefixes to get root

private static removePrefix(string $word) : string
Parameters
$word : string

word to remove prefixes from

Return values
string

the $word after prefix removal
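Prefix removal in a light stemmer can be sketched as follows. The prefix list, the minimum remaining length, and the name stripPrefixSketch are illustrative assumptions; the prefixes the method actually strips are not given in this documentation.

```php
<?php
// Hypothetical helper; prefix list and length threshold are assumptions.
function stripPrefixSketch(
    string $word,
    array $prefixes,
    int $min_stem_len = 2
): string {
    foreach ($prefixes as $prefix) {
        $prefix_len = mb_strlen($prefix, 'UTF-8');
        $rest_len = mb_strlen($word, 'UTF-8') - $prefix_len;
        // Strip only if enough characters remain to form a stem
        if ($rest_len >= $min_stem_len &&
            mb_substr($word, 0, $prefix_len, 'UTF-8') === $prefix) {
            return mb_substr($word, $prefix_len, null, 'UTF-8');
        }
    }
    return $word;
}

// Longer prefixes are listed first so they win over their own suffixes
$prefixes = ["وال", "بال", "ال"];
echo stripPrefixSketch("الكتاب", $prefixes), "\n"; // prints كتاب
```

The mb_* functions are used because each Arabic letter occupies more than one byte in UTF-8, so byte-based substr() would cut characters in half.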

removeSuffix()

Removes Arabic suffixes to get root

private static removeSuffix(string $word) : string
Parameters
$word : string

word to remove suffixes from

Return values
string

the $word after suffix removal
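Suffix removal can be sketched in the same style. As with the prefixes, the suffix list, the minimum remaining length, and the name stripSuffixSketch are assumptions made for illustration only.

```php
<?php
// Hypothetical helper; suffix list and length threshold are assumptions.
function stripSuffixSketch(
    string $word,
    array $suffixes,
    int $min_stem_len = 2
): string {
    $word_len = mb_strlen($word, 'UTF-8');
    foreach ($suffixes as $suffix) {
        $suffix_len = mb_strlen($suffix, 'UTF-8');
        // Strip only if enough characters remain to form a stem
        if ($word_len - $suffix_len >= $min_stem_len &&
            mb_substr($word, -$suffix_len, null, 'UTF-8') === $suffix) {
            return mb_substr($word, 0, $word_len - $suffix_len, 'UTF-8');
        }
    }
    return $word;
}

$suffixes = ["ات", "ون", "ين"];
echo stripSuffixSketch("معلمون", $suffixes), "\n"; // prints معلم
```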


        
