Yioop_V9.5_Source_Code_Documentation

Tokenizer

Persian-specific tokenization code. In particular, it has a stemmer. The stemmer is a modified variant (it handles prefixes slightly differently) of my port of Nick Patch's Perl port, https://metacpan.org/pod/Lingua::Stem::UniNE::FA, of the stemming algorithm by Ljiljana Dolamic and Jacques Savoy of the University of Neuchâtel. The Java version of this algorithm is at http://members.unine.ch/jacques.savoy/clef/persianStemmerUnicode.txt (beware of Java's handling of Unicode).

Given a word, its stem is the part of the word that is common to all of its inflected variants. For example, tall is common to tall, taller, and tallest. A stemmer takes a word and tries to produce its stem.

Tags
author

Chris Pollett

Table of Contents

$no_stem_list  : array<string|int, mixed>
Words we don't want to be stemmed
$stop_words  : mixed
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries
segment()  : string
Stub function which could be used for a word segmenter.
stem()  : string
Computes the stem of a Persian word
stopwordsRemover()  : mixed
Removes the stop words from the page (used for Word Cloud generation)
normalize()  : string
Performs additional end-of-word stripping
removeKasra()  : string
Removes a Kasra diacritic mark if it appears at the end of a word.
removeSuffix()  : string
Removes common Persian suffixes
simplifyPrefix()  : string
Simplifies prefixes beginning with آ to ا

Properties

$no_stem_list

Words we don't want to be stemmed

public static array<string|int, mixed> $no_stem_list = []

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries

public static mixed $stop_words = ["در", "به", "از", "كه", "مي", "اين", "است", "را", "با", "هاي", "براي", "آن", "يك", "شود", "شده", "خود", "ها", "كرد", "شد", "اي", "تا", "كند", "بر", "بود", "گفت", "نيز", "وي", "هم", "كنند", "دارد", "ما", "كرده", "يا", "اما", "بايد", "دو", "اند", "هر", "خواهد", "او", "مورد", "آنها", "باشد", "ديگر", "مردم", "نمي", "بين", "پيش", "پس", "اگر", "همه", "صورت", "يكي", "هستند", "بي", "من", "دهد", "هزار", "نيست", "استفاده", "داد", "داشته", "راه", "داشت", "چه", "همچنين", "كردند", "داده", "بوده", "دارند", "همين", "ميليون", "سوي", "شوند", "بيشتر", "بسيار", "روي", "گرفته", "هايي", "تواند", "اول", "نام", "هيچ", "چند", "جديد", "بيش", "شدن", "كردن", "كنيم", "نشان", "حتي", "اينكه", "ولی", "توسط", "چنين", "برخي", "نه", "ديروز", "دوم", "درباره", "بعد", "مختلف", "گيرد", "شما", "گفته", "آنان", "بار", "طور", "گرفت", "دهند", "گذاري", "بسياري", "طي", "بودند", "ميليارد", "بدون", "تمام", "كل", "تر", "براساس", "شدند", "ترين", "امروز", "باشند", "ندارد", "چون", "قابل", "گويد", "ديگري", "همان", "خواهند", "قبل", "آمده", "اكنون", "تحت", "طريق", "گيري", "جاي", "هنوز", "چرا", "البته", "كنيد", "سازي", "سوم", "كنم", "بلكه", "زير", "توانند", "ضمن", "فقط", "بودن", "حق", "آيد", "وقتي", "اش", "يابد", "نخستين", "مقابل", "خدمات", "امسال", "تاكنون", "مانند", "تازه", "آورد", "فكر", "آنچه", "نخست", "نشده", "شايد", "چهار", "جريان", "پنج", "ساخته", "زيرا", "نزديك", "برداري", "كسي", "ريزي", "رفت", "گردد", "مثل", "آمد", "ام", "بهترين", "دانست", "كمتر", "دادن", "تمامي", "جلوگيري", "بيشتري", "ايم", "ناشي", "چيزي", "آنكه", "بالا", "بنابراين", "ايشان", "بعضي", "دادند", "داشتند", "برخوردار", "نخواهد", "هنگام", "نبايد", "غير", "نبود", "ديده", "وگو", "داريم", "چگونه", "بندي", "خواست", "فوق", "ده", "نوعي", "هستيم", "ديگران", "همچنان", "سراسر", "ندارند", "گروهي", "سعي", "روزهاي", "آنجا", "يكديگر", "كردم", "بيست", "بروز", "سپس", "رفته", "آورده", "نمايد", "باشيم", "گويند", "زياد", "خويش", "همواره", "گذاشته", "شش", "نداشته", "شناسي", "خواهيم", "آباد", "داشتن", "نظير", "همچون", "باره", 
"نكرده", "شان", "سابق", "هفت", "دانند", "جايي", "بی", "جز", "زیرِ", "رویِ", "سریِ", "تویِ", "جلویِ", "پیشِ", "عقبِ", "بالایِ", "خارجِ", "وسطِ", "بیرونِ", "سویِ", "کنارِ", "پاعینِ", "نزدِ", "نزدیکِ", "دنبالِ", "حدودِ", "برابرِ", "طبقِ", "مانندِ", "ضدِّ", "هنگامِ", "برایِ", "مثلِ", "بارة", "اثرِ", "تولِ", "علّتِ", "سمتِ", "عنوانِ", "قصدِ", "روب", "جدا", "کی", "که", "چیست", "هست", "کجا", "کجاست", "کَی", "چطور", "کدام", "آیا", "مگر", "چندین", "یک", "چیزی", "دیگر", "کسی", "بعری", "هیچ", "چیز", "جا", "کس", "هرگز", "یا", "تنها", "بلکه", "خیاه", "بله", "بلی", "آره", "آری", "مرسی", "البتّه", "لطفاً", "ّه", "انکه", "وقتیکه", "همین", "پیش", "مدّتی", "هنگامی", "مان", "تان"]
Tags
array

Methods

segment()

Stub function which could be used for a word segmenter.

public static segment(string $pre_segment) : string

Such a segmenter, given the input "thisisabunchofwords", would output "this is a bunch of words".

Parameters
$pre_segment : string

before segmentation

Return values
string

A segmenter should return the string with its words separated by spaces; this stub does nothing and returns its input unchanged.
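The stub behavior described above can be sketched in a few lines. The function name segmentStub is hypothetical and only stands in for the documented method; the sketch assumes nothing beyond "return the input as-is".

```php
<?php
// Hypothetical stand-in for the documented stub: no segmentation is
// attempted, so the input comes back unchanged.
function segmentStub(string $pre_segment): string
{
    return $pre_segment;
}

echo segmentStub("thisisabunchofwords"), "\n"; // prints the input unchanged
```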

stem()

Computes the stem of a Persian word

public static stem(string $word) : string
Parameters
$word : string

the string to stem

Return values
string

the stem of $word
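As a rough illustration of how a stemmer like this might combine an exception list such as $no_stem_list with a series of string-to-string steps, consider the sketch below. The function stemSketch, its parameters, and the order of the steps are all assumptions made for illustration; the actual order in which stem() calls its helper methods is not stated on this page.

```php
<?php
// Sketch only: the guard-then-pipeline shape and the order of the steps
// are assumptions; the step bodies below are deliberately minimal.
function stemSketch(string $word, array $no_stem_list, array $steps): string
{
    // Words on the exception list are returned untouched.
    if (in_array($word, $no_stem_list, true)) {
        return $word;
    }
    // Apply each string-to-string step in turn.
    foreach ($steps as $step) {
        $word = $step($word);
    }
    return $word;
}

$steps = [
    // Drop a trailing kasra (U+0650).
    fn(string $w): string => preg_replace('/\x{0650}$/u', '', $w) ?? $w,
    // Rewrite a leading alef with madda (U+0622) as plain alef (U+0627).
    fn(string $w): string => preg_replace('/^\x{0622}/u', "\u{0627}", $w) ?? $w,
];

// U+0622 U+0628 U+0650: alef with madda, beh, kasra.
echo stemSketch("\u{0622}\u{0628}\u{0650}", [], $steps), "\n";
```

Passing the steps in as callables keeps the exception-list check separate from the individual transformations, which makes each step easy to test on its own.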

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation)

public static stopwordsRemover(mixed $data) : mixed
Parameters
$data : mixed

either a string or an array of strings to remove stop words from

Return values
mixed

$data with no stop words
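One way to picture stop word removal over "a string or an array of strings" is the sketch below. It is not the Yioop implementation: the function name removeStopWordsSketch, the explicit $stop_words parameter, and whitespace tokenization are assumptions, and a three-word list (taken from the start of $stop_words) stands in for the full array.

```php
<?php
// Sketch only: whitespace tokenization and the explicit $stop_words
// parameter are assumptions made for illustration.
function removeStopWordsSketch($data, array $stop_words)
{
    // Arrays are handled element by element.
    if (is_array($data)) {
        return array_map(
            fn($item) => removeStopWordsSketch($item, $stop_words),
            $data
        );
    }
    // Split on runs of whitespace, dropping empty pieces.
    $tokens = preg_split('/\s+/u', $data, -1, PREG_SPLIT_NO_EMPTY);
    // Keep only tokens that are not in the stop list.
    $kept = array_filter(
        $tokens,
        fn($token) => !in_array($token, $stop_words, true)
    );
    return implode(' ', $kept);
}

// The first three entries of the documented $stop_words list.
$stop_words = ["در", "به", "از"];
echo removeStopWordsSketch("کتاب در خانه", $stop_words), "\n";
```

Filtering whole tokens avoids deleting a stop word that merely occurs inside a longer word, at the cost of collapsing the original spacing to single spaces.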

normalize()

Performs additional end-of-word stripping

private static normalize(string $word) : string
Parameters
$word : string

word to remove suffixes from

Return values
string

result of suffix removal

removeKasra()

Removes a Kasra diacritic mark if it appears at the end of a word.

private static removeKasra(string $word) : string
Parameters
$word : string

word to remove mark from

Return values
string

result of removal
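A minimal sketch of trailing-kasra removal follows. The function name removeKasraSketch is hypothetical, and any minimum word length the real private method may require is not documented here and is left out.

```php
<?php
// Sketch only: strips one trailing kasra (U+0650, ARABIC KASRA).
// Any minimum-length check the real method may apply is omitted.
function removeKasraSketch(string $word): string
{
    $kasra = "\u{0650}";      // two bytes in UTF-8
    $length = strlen($kasra);
    // A byte-wise suffix comparison is safe because UTF-8 is
    // self-synchronizing: these bytes can only be a whole kasra.
    if (substr($word, -$length) === $kasra) {
        return substr($word, 0, -$length);
    }
    return $word;
}

// "کتابِ": the word for "book" followed by a kasra.
echo removeKasraSketch("\u{06A9}\u{062A}\u{0627}\u{0628}\u{0650}"), "\n";
```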

removeSuffix()

Removes common Persian suffixes

private static removeSuffix(string $word) : string
Parameters
$word : string

word to remove suffixes from

Return values
string

result of suffix removal
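Suffix removal of this kind is usually a longest-match-first lookup against a fixed list. The sketch below shows the idea with three well-known Persian suffixes; the list, the function name removeSuffixSketch, and the rule that something must remain after stripping are illustrative assumptions, not the list or rules used by the Yioop stemmer.

```php
<?php
// Sketch only: this short suffix list is illustrative. Suffixes are
// ordered longest first so "ترین" is tried before "تر".
function removeSuffixSketch(string $word): string
{
    $suffixes = [
        "\u{062A}\u{0631}\u{06CC}\u{0646}", // "ترین" (superlative)
        "\u{0647}\u{0627}",                 // "ها" (plural)
        "\u{062A}\u{0631}",                 // "تر" (comparative)
    ];
    foreach ($suffixes as $suffix) {
        $length = strlen($suffix);
        // Require something to remain after stripping.
        if (strlen($word) > $length && substr($word, -$length) === $suffix) {
            return substr($word, 0, -$length);
        }
    }
    return $word;
}

// "بزرگتر" (bigger) and "بزرگترین" (biggest) both reduce to "بزرگ" (big).
echo removeSuffixSketch("\u{0628}\u{0632}\u{0631}\u{06AF}\u{062A}\u{0631}"), "\n";
echo removeSuffixSketch("\u{0628}\u{0632}\u{0631}\u{06AF}\u{062A}\u{0631}\u{06CC}\u{0646}"), "\n";
```

This mirrors the tall, taller, tallest example from the class description: the comparative and superlative forms collapse to the same stem.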

simplifyPrefix()

Simplifies prefixes beginning with آ to ا

private static simplifyPrefix(string $word) : string
Parameters
$word : string

word whose prefix is to be simplified

Return values
string

result of the prefix simplification
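The prefix rule can be sketched as a single replacement at the start of the word. The function name simplifyPrefixSketch is hypothetical, and the sketch assumes the rule amounts to rewriting a leading آ (U+0622) as ا (U+0627); any further prefix handling in the real method is not covered.

```php
<?php
// Sketch only: rewrites a leading "آ" (U+0622, alef with madda above)
// as plain "ا" (U+0627, alef); other positions are left alone.
function simplifyPrefixSketch(string $word): string
{
    $alef_madda = "\u{0622}";
    $alef = "\u{0627}";
    $length = strlen($alef_madda);
    // Byte-wise prefix comparison; safe for UTF-8 input.
    if (strncmp($word, $alef_madda, $length) === 0) {
        return $alef . substr($word, $length);
    }
    return $word;
}

// "آب" (water) becomes "اب".
echo simplifyPrefixSketch("\u{0622}\u{0628}"), "\n";
```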


        
