Yioop_V9.5_Source_Code

Tokenizer
in package

Application

Arabic-specific tokenization code. In particular, it has a stemmer. The stemmer is my attempt at porting Ljiljana Dolamic's (University of Neuchatel, www.unine.ch/info/clef/) C stemming algorithm: http://members.unine.ch/jacques.savoy/clef. That algorithm maps all stems to ASCII. Instead, I tried to keep everything in Arabic characters.

$no_stem_list

Words we don't want to be stemmed


    public static array<string|int, mixed> $no_stem_list
     = []

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries


    public static mixed $stop_words
     = ["ا", "أ", "،", "عشر", "عدد", "عدة", "عشرة", "عدم", "عام", "عاما", "عن", "عند", "عندما", "على", "عليه", "عليها", "زيارة", "سنة", "سنوات", "تم", "ضد", "بعد", "بعض", "اعادة", "اعلنت", "بسبب", "حتى", "اذا", "احد", "اثر", "برس", "باسم", "غدا", "شخصا", "صباح", "اطار", "اربعة", "اخرى", "بان", "اجل", "غير", "بشكل", "حاليا", "بن", "به", "ثم", "اف", "ان", "او", "اي", "بها", "صفر", "حيث", "اكد", "الا", "اما", "امس", "السابق", "التى", "التي", "اكثر", "ايار", "ايضا", "ثلاثة", "الذاتي", "الاخيرة", "الثاني", "الثانية", "الذى", "الذي", "الان", "امام", "ايام", "خلال", "حوالى", "الذين", "الاول", "الاولى", "بين", "ذلك", "دون", "حول", "حين", "الف", "الى", "انه", "اول", "ضمن", "انها", "جميع", "الماضي", "الوقت", "المقبل", "اليوم", "ـ", "ف", "و", "و6", "قد", "لا", "ما", "مع", "مساء", "هذا", "واحد", "واضاف", "واضافت", "فان", "قبل", "قال", "كان", "لدى", "نحو", "هذه", "وان", "واكد", "كانت", "واوضح", "مايو", "فى", "في", "كل", "لم", "لن", "له", "من", "هو", "هي", "قوة", "كما", "لها", "منذ", "وقد", "ولا", "نفسه", "لقاء", "مقابل", "هناك", "وقال", "وكان", "نهاية", "وقالت", "وكانت", "للامم", "فيه", "كلم", "لكن", "وفي", "وقف", "ولم", "ومن", "وهو", "وهي", "يوم", "فيها", "منها", "مليار", "لوكالة", "يكون", "يمكن", "مليون"]

segment()

Stub function which could be used for a word segmenter.


    public static segment(string $pre_segment) : string

Such a segmenter, given the input thisisabunchofwords, would output: this is a bunch of words

Parameters

$pre_segment : string: before segmentation

Return values

string —

should return the string with words separated by spaces; in this stub implementation it does nothing
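To make the intended behaviour concrete, here is a minimal sketch of a greedy longest-match segmenter. It is not part of Yioop: the function name `greedySegment` and the `$dictionary` parameter are hypothetical, and a real segmenter would need a proper lexicon and better handling of unknown text.

```php
<?php
/**
 * Hypothetical greedy longest-match segmenter (illustration only,
 * not the Yioop implementation, which is a stub).
 *
 * @param string $pre_segment text without spaces (assumed valid UTF-8)
 * @param array $dictionary known words (an assumption of this sketch)
 * @return string words separated by single spaces
 */
function greedySegment(string $pre_segment, array $dictionary): string
{
    $lookup = array_flip($dictionary);
    // Split into UTF-8 characters without relying on the mbstring extension.
    $chars = preg_split('//u', $pre_segment, -1, PREG_SPLIT_NO_EMPTY);
    $max_len = 0;
    foreach ($dictionary as $entry) {
        $entry_len = count(preg_split('//u', $entry, -1, PREG_SPLIT_NO_EMPTY));
        $max_len = max($max_len, $entry_len);
    }
    $words = [];
    $pos = 0;
    $total = count($chars);
    while ($pos < $total) {
        // Fall back to a single character when no dictionary word matches.
        $match_len = 1;
        for ($len = min($max_len, $total - $pos); $len > 1; $len--) {
            $candidate = implode('', array_slice($chars, $pos, $len));
            if (isset($lookup[$candidate])) {
                $match_len = $len;
                break;
            }
        }
        $words[] = implode('', array_slice($chars, $pos, $match_len));
        $pos += $match_len;
    }
    return implode(' ', $words);
}

$dictionary = ['this', 'is', 'a', 'bunch', 'of', 'words'];
echo greedySegment('thisisabunchofwords', $dictionary), "\n";
```

With the six-word dictionary above, the call reproduces the example from the description. Greedy matching is only one possible strategy and can mis-segment ambiguous input.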

stem()

Computes the stem of an Arabic word


    public static stem(string $word) : string

Parameters

$word : string: the string to stem

Return values

string —

the stem of $word
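The class also documents a $no_stem_list property and three private helpers (removeModifiersAndArchaic(), removePrefix(), removeSuffix()). One plausible way for a stemmer to combine them is sketched below. The order of the steps and the exact no-stem check are assumptions, not taken from the Yioop source; `stemSketch` and its parameters are hypothetical, and the two closures are ASCII placeholders standing in for the real helpers.

```php
<?php
/**
 * Hypothetical stemming pipeline (illustration only).
 *
 * @param string $word word to stem
 * @param array $no_stem_list words returned unchanged
 * @param array $steps callables applied in order, each string -> string
 * @return string the word after all steps have been applied
 */
function stemSketch(string $word, array $no_stem_list, array $steps): string
{
    // Protected words bypass every normalisation step.
    if (in_array($word, $no_stem_list, true)) {
        return $word;
    }
    foreach ($steps as $step) {
        $word = $step($word);
    }
    return $word;
}

// Placeholder steps; a real stemmer would pass its modifier, prefix and
// suffix removal routines here.
$strip_bang = function (string $w): string {
    return rtrim($w, '!');
};
$lower = function (string $w): string {
    return strtolower($w);
};

echo stemSketch('Word!', [], [$strip_bang, $lower]), "\n";        // word
echo stemSketch('Word!', ['Word!'], [$strip_bang, $lower]), "\n"; // Word!
```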

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation)


    public static stopwordsRemover(mixed $data) : mixed

Parameters

$data : mixed: either a string or an array of strings to remove stop words from

Return values

mixed —

$data with no stop words
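A stop word filter of this kind can be sketched with a single regular expression. The following is an illustration under stated assumptions, not the Yioop implementation: `removeStopWordsSketch` is a hypothetical name, the stop list is passed in explicitly, and "word boundary" is taken to mean "not adjacent to a Unicode letter or digit". Matched words are replaced by the empty string, so the surrounding spaces remain.

```php
<?php
/**
 * Hypothetical stop word remover (illustration only).
 *
 * @param mixed $data a string or an array of strings
 * @param array $stop_words words to delete when they occur as whole words
 * @return mixed $data with the stop words deleted
 */
function removeStopWordsSketch($data, array $stop_words)
{
    if ($stop_words === []) {
        return $data;
    }
    $alternatives = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $stop_words);
    // Lookarounds stand in for \b so that boundaries are decided by
    // Unicode letters and digits rather than ASCII word characters.
    $pattern = '/(?<![\p{L}\p{N}])(?:' . implode('|', $alternatives) .
        ')(?![\p{L}\p{N}])/u';
    // preg_replace accepts a string or an array of strings as subject.
    return preg_replace($pattern, '', $data);
}

// "\u{0641}\u{064A}" is the preposition "fi"; "\u{0645}\u{0646}" is "min".
$stop_words = ["\u{0641}\u{064A}", "\u{0645}\u{0646}"];
// "kitab fi albayt" (a book in the house)
$text = "\u{0643}\u{062A}\u{0627}\u{0628} \u{0641}\u{064A} " .
    "\u{0627}\u{0644}\u{0628}\u{064A}\u{062A}";
echo removeStopWordsSketch($text, $stop_words), "\n";
```

The example strings use PHP's `\u{...}` escapes (PHP 7.0 or later) only to keep the listing unambiguous in editors that reorder right-to-left text.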

removeModifiersAndArchaic()

Removes common letter modifiers as well as some archaic characters


    private static removeModifiersAndArchaic(string $word) : string

Parameters

$word : string

Return values

string —

the $word after letter modifiers are removed
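This page does not say which characters the method strips. As a minimal sketch of the general idea, the function below removes the Arabic short-vowel and related marks (U+064B to U+0652) and the tatweel elongation character (U+0640). The name `stripModifiersSketch` and the chosen character range are assumptions, and archaic letters are not handled here.

```php
<?php
/**
 * Hypothetical modifier stripper (illustration only): removes the
 * harakat U+064B..U+0652 and the tatweel U+0640 from a word.
 *
 * @param string $word word to normalise (assumed valid UTF-8)
 * @return string the word without those characters
 */
function stripModifiersSketch(string $word): string
{
    $stripped = preg_replace('/[\x{0640}\x{064B}-\x{0652}]/u', '', $word);
    // preg_replace returns null on error (e.g. invalid UTF-8); in that
    // case this sketch returns the input unchanged.
    return $stripped ?? $word;
}

// "kataba" written with a fatha (U+064E) after the first two letters.
$vocalised = "\u{0643}\u{064E}\u{062A}\u{064E}\u{0628}";
echo stripModifiersSketch($vocalised), "\n";
```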

removePrefix()

Removes Arabic prefixes to get root


    private static removePrefix(string $word) : string

Parameters

$word : string: word to remove prefixes from

Return values

string —

the $word after prefix removal
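Light stemmers commonly strip a prefix only when enough of the word is left afterwards. The sketch below illustrates that pattern. The prefix list, the minimum remaining length and the name `removePrefixSketch` are all hypothetical; the actual rules used by removePrefix() are not given on this page.

```php
<?php
/**
 * Hypothetical prefix stripper (illustration only).
 *
 * @param string $word word to process (assumed valid UTF-8)
 * @param array $prefixes candidate prefixes, longest first
 * @param int $min_remaining minimum characters that must remain
 * @return string the word with at most one prefix removed
 */
function removePrefixSketch(string $word, array $prefixes,
    int $min_remaining = 2): string
{
    foreach ($prefixes as $prefix) {
        if ($prefix === '') {
            continue;
        }
        $bytes = strlen($prefix);
        // A byte-wise comparison is safe for prefixes in valid UTF-8.
        if (strncmp($word, $prefix, $bytes) !== 0) {
            continue;
        }
        $remainder = substr($word, $bytes);
        // Count characters, not bytes, in what would be left.
        if (preg_match_all('/./us', $remainder) >= $min_remaining) {
            return $remainder;
        }
    }
    return $word;
}

// Candidate prefixes: "wal" (and the) and "al" (the).
$prefixes = ["\u{0648}\u{0627}\u{0644}", "\u{0627}\u{0644}"];
// "albayt" (the house) becomes "bayt".
$word = "\u{0627}\u{0644}\u{0628}\u{064A}\u{062A}";
echo removePrefixSketch($word, $prefixes), "\n";
```

The length guard keeps short words such as "ila" (U+0627 U+0644 U+0649) intact, since removing "al" would leave a single character.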

removeSuffix()

Removes Arabic suffixes to get root


    private static removeSuffix(string $word) : string

Parameters

$word : string: word to remove suffixes from

Return values

string —

the $word after suffix removal
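Suffix stripping can be sketched in the same spirit as prefix stripping, working from the end of the word. As before, this is an illustration under assumptions: the suffix list, the minimum remaining length and the name `removeSuffixSketch` are hypothetical and do not come from the Yioop source.

```php
<?php
/**
 * Hypothetical suffix stripper (illustration only).
 *
 * @param string $word word to process (assumed valid UTF-8)
 * @param array $suffixes candidate suffixes, longest first
 * @param int $min_remaining minimum characters that must remain
 * @return string the word with at most one suffix removed
 */
function removeSuffixSketch(string $word, array $suffixes,
    int $min_remaining = 2): string
{
    foreach ($suffixes as $suffix) {
        $bytes = strlen($suffix);
        if ($bytes === 0 || $bytes > strlen($word)) {
            continue;
        }
        // A matching byte suffix that starts on a UTF-8 lead byte also
        // ends the word on a character boundary.
        if (substr($word, -$bytes) !== $suffix) {
            continue;
        }
        $remainder = substr($word, 0, strlen($word) - $bytes);
        if (preg_match_all('/./us', $remainder) >= $min_remaining) {
            return $remainder;
        }
    }
    return $word;
}

// Candidate suffixes: feminine plural "at" and the pronoun "ha".
$suffixes = ["\u{0627}\u{062A}", "\u{0647}\u{0627}"];
// "kalimat" (words) becomes "klm".
$plural = "\u{0643}\u{0644}\u{0645}\u{0627}\u{062A}";
echo removeSuffixSketch($plural, $suffixes), "\n";
```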
