Tokenizer
in package
Tagalog (spoken in Philipines) specific tokenization code.
Typically, tokenizer.php either contains a stemmer for the language in question or it specifies how many characters in a char gram
Tags
Table of Contents
- $char_gram_len : int
- How many characters in a char gram for this locale
- $stop_words : mixed
- A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection
- stopwordsRemover() : mixed
- Removes the stop words from the page (used for Word Cloud generation and language detection)
Properties
$char_gram_len
How many characters in a char gram for this locale
public
static int
$char_gram_len
= 5
$stop_words
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection
public
static mixed
$stop_words
= ['akin', 'aking', 'ako', 'alin', 'am', 'amin', 'aming', 'ang', 'ano', 'anumang', 'apat', 'at', 'atin', 'ating', 'ay', 'bababa', 'bago', 'bakit', 'bawat', 'bilang', 'dahil', 'dalawa', 'dapat', 'din', 'dito', 'doon', 'gagawin', 'gayunman', 'ginagawa', 'ginawa', 'ginawang', 'gumawa', 'gusto', 'habang', 'hanggang', 'hindi', 'huwag', 'iba', 'ibaba', 'ibabaw', 'ibig', 'ikaw', 'ilagay', 'ilalim', 'ilan', 'inyong', 'isa', 'isang', 'itaas', 'ito', 'iyo', 'iyon', 'iyong', 'ka', 'kahit', 'kailangan', 'kailanman', 'kami', 'kanila', 'kanilang', 'kanino', 'kanya', 'kanyang', 'kapag', 'kapwa', 'karamihan', 'katiyakan', 'katulad', 'kaya', 'kaysa', 'ko', 'kong', 'kulang', 'kumuha', 'kung', 'laban', 'lahat', 'lamang', 'likod', 'lima', 'maaari', 'maaaring', 'maging', 'mahusay', 'makita', 'marami', 'marapat', 'masyado', 'may', 'mayroon', 'mga', 'minsan', 'mismo', 'mula', 'muli', 'na', 'nabanggit', 'naging', 'nagkaroon', 'nais', 'nakita', 'namin', 'napaka', 'narito', 'nasaan', 'ng', 'ngayon', 'ni', 'nila', 'nilang', 'nito', 'niya', 'niyang', 'noon', 'o', 'pa', 'paano', 'pababa', 'paggawa', 'pagitan', 'pagkakaroon', 'pagkatapos', 'palabas', 'pamamagitan', 'panahon', 'pangalawa', 'para', 'paraan', 'pareho', 'pataas', 'pero', 'pumunta', 'pumupunta', 'sa', 'saan', 'sabi', 'sabihin', 'sarili', 'sila', 'sino', 'siya', 'tatlo', 'tayo', 'tulad', 'tungkol', 'una', 'walang']
Tags
Methods
stopwordsRemover()
Removes the stop words from the page (used for Word Cloud generation and language detection)
public
static stopwordsRemover(mixed $data) : mixed
Parameters
- $data : mixed
-
either a string or an array of string to remove stop words from
Return values
mixed —$data with no stop words