Tokenizer
Italian-specific tokenization code. Typically, tokenizer.php either contains a stemmer for the language in question or specifies how many characters are in a char gram.
Table of Contents
- $no_stem_list : array<string|int, mixed>
- Words we don't want to be stemmed
- $stop_words : array<string|int, mixed>
- A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries
- $buffer : string
- Storage used in computing the stem
- $max_suffix_pos : int
- Storage for computing the starting position for the longest suffix
- $r1_start : int
- Storage used in computing the starting index of region R1
- $r1_string : string
- Storage used in computing region R1
- $r2_start : int
- Storage used in computing the starting index of region R2
- $r2_string : string
- Storage used in computing region R2
- $rv_start : int
- Storage used in computing the starting index of region RV
- $rv_string : string
- Storage used in computing region RV
- $step1_changes : bool
- Storage used in determining if step1 removed any endings from the word
- segment() : string
- This method currently does nothing. For some locales it could be used to split strings of the form "thisisastring" into a string with the words separated: "this is a string"
- stem() : string
- Computes the stem of an Italian word. Example: guardando, guardandogli, guardandola, and guardano all stem to guard
- stopwordsRemover() : mixed
- Removes the stop words from the page (used for Word Cloud generation)
- acuteByGrave() : string
- Replaces all acute accents in a string by grave accents and also handles accented characters
- checkForSuffix() : int
- Checks if a string is a suffix for another string
- getRegions() : mixed
- Computes regions R1, R2 and RV in the form of strings: $r1_string, $r2_string, and $rv_string for R1, R2, and RV respectively
- in() : bool
- Checks if a string occurs in another string
- isVowel() : bool
- Checks if a character is a vowel or not
- maxSuffix() : int
- Computes the longest suffix for a given string from a given set of suffixes
- postlude() : mixed
- Converts U and/or I back to lowercase
- prelude() : mixed
- Performs the following functions: replaces acute accents with grave accents; marks u after q, and u or i preceded and followed by a vowel, as non-vowels by converting them to upper case
- r1() : int
- Computes the starting index for region R1
- r2() : int
- Computes the starting index for region R2
- rv() : int
- Computes the starting index for region RV
- step0() : mixed
- Handles attached pronouns
- step1() : mixed
- Handles standard suffixes
- step2() : mixed
- Handles verb suffixes
- step3a() : mixed
- Deletes a final a, e, i, o, à, è, ì, ò and a preceding i if in RV
- step3b() : mixed
- Replaces a final ch/gh by c/g if in RV
Properties
$no_stem_list
Words we don't want to be stemmed
public
static array<string|int, mixed>
$no_stem_list
= []
$stop_words
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries
public
static array<string|int, mixed>
$stop_words
= ['http', 'https', "ad", "al", "allo", "ai", "agli", "all", "agl", "alla", "alle", "con", "col", "coi", "da", "dal", "dallo", "dai", "dagli", "dall", "dagl", "dalla", "dalle", "di", "del", "dello", "dei", "degli", "dell", "degl", "della", "delle", "in", "nel", "nello", "nei", "negli", "nell", "negl", "nella", "nelle", "su", "sul", "sullo", "sui", "sugli", "sull", "sugl", "sulla", "sulle", "per", "tra", "contro", "io", "tu", "lui", "lei", "noi", "voi", "loro", "mio", "mia", "miei", "mie", "tuo", "tua", "tuoi", "tue", "suo", "sua", "suoi", "sue", "nostro", "nostra", "nostri", "nostre", "vostro", "vostra", "vostri", "vostre", "mi", "ti", "ci", "vi", "lo", "la", "li", "le", "gli", "ne", "il", "un", "uno", "una", "ma", "ed", "se", "perché", "anche", "come", "dov", "dove", "che", "chi", "cui", "non", "più", "quale", "quanto", "quanti", "quanta", "quante", "quello", "quelli", "quella", "quelle", "questo", "questi", "questa", "queste", "si", "tutto", "tutti", "a", "c", "e", "i", "l", "o", "ho", "hai", "ha", "abbiamo", "avete", "hanno", "abbia", "abbiate", "abbiano", "avrò", "avrai", "avrà", "avremo", "avrete", "avranno", "avrei", "avresti", "avrebbe", "avremmo", "avreste", "avrebbero", "avevo", "avevi", "aveva", "avevamo", "avevate", "avevano", "ebbi", "avesti", "ebbe", "avemmo", "aveste", "ebbero", "avessi", "avesse", "avessimo", "avessero", "avendo", "avuto", "avuta", "avuti", "avute", "sono", "sei", "è", "siamo", "siete", "sia", "siate", "siano", "sarò", "sarai", "sarà", "saremo", "sarete", "saranno", "sarei", "saresti", "sarebbe", "saremmo", "sareste", "sarebbero", "ero", "eri", "era", "eravamo", "eravate", "erano", "fui", "fosti", "fu", "fummo", "foste", "furono", "fossi", "fosse", "fossimo", "fossero", "essendo", "faccio", "fai", "facciamo", "fanno", "faccia", "facciate", "facciano", "farò", "farai", "farà", "faremo", "farete", "faranno", "farei", "faresti", "farebbe", "faremmo", "fareste", "farebbero", "facevo", "facevi", "faceva", "facevamo", "facevate", 
"facevano", "feci", "facesti", "fece", "facemmo", "faceste", "fecero", "facessi", "facesse", "facessimo", "facessero", "facendo", "sto", "stai", "sta", "stiamo", "stanno", "stia", "stiate", "stiano", "starò", "starai", "starà", "staremo", "starete", "staranno", "starei", "staresti", "starebbe", "staremmo", "stareste", "starebbero", "stavo", "stavi", "stava", "stavamo", "stavate", "stavano", "stetti", "stesti", "stette", "stemmo", "steste", "stettero", "stessi", "stesse", "stessimo", "stessero", "stando"]
$buffer
Storage used in computing the stem
private
static string
$buffer
$max_suffix_pos
Storage for computing the starting position for the longest suffix
private
static int
$max_suffix_pos
$r1_start
Storage used in computing the starting index of region R1
private
static int
$r1_start
$r1_string
Storage used in computing region R1
private
static string
$r1_string
$r2_start
Storage used in computing the starting index of region R2
private
static int
$r2_start
$r2_string
Storage used in computing region R2
private
static string
$r2_string
$rv_start
Storage used in computing the starting index of region RV
private
static int
$rv_start
$rv_string
Storage used in computing region RV
private
static string
$rv_string
$step1_changes
Storage used in determining if step1 removed any endings from the word
private
static bool
$step1_changes
Methods
segment()
This method currently does nothing. For some locales it could be used to split strings of the form "thisisastring" into a string with the words separated: "this is a string"
public
static segment(string $pre_segment) : string
Parameters
- $pre_segment : string
-
string to be segmented
Return values
string —after segmentation done (same string in this case)
stem()
Computes the stem of an Italian word. Example: guardando, guardandogli, guardandola, and guardano all stem to guard
public
static stem(string $word) : string
Parameters
- $word : string
-
is the word to be stemmed
Return values
string —stem of $word
stopwordsRemover()
Removes the stop words from the page (used for Word Cloud generation)
public
static stopwordsRemover(mixed $data) : mixed
Parameters
- $data : mixed
-
either a string or an array of strings to remove stop words from
Return values
mixed —$data with no stop words
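To illustrate the behavior described for stopwordsRemover(), here is a minimal Python sketch. It is not the PHP implementation: the function name `remove_stop_words`, the whitespace tokenization, and the case-insensitive matching are all assumptions, and only a small subset of $stop_words is used.

```python
# Small illustrative subset of the $stop_words list documented above.
STOP_WORDS = {"il", "la", "di", "che", "non", "per", "un", "una", "e"}

def remove_stop_words(data):
    """Remove stop words from a string, or from each string in a list."""
    if isinstance(data, list):
        # Handle the "array of strings" case by processing each item.
        return [remove_stop_words(item) for item in data]
    # Keep only whitespace-separated tokens that are not stop words.
    kept = [w for w in data.split() if w.lower() not in STOP_WORDS]
    return " ".join(kept)
```

For example, `remove_stop_words("il gatto non mangia la pasta")` yields `"gatto mangia pasta"`.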
acuteByGrave()
Replaces all acute accents in a string by grave accents and also handles accented characters
private
static acuteByGrave(string $string) : string
Parameters
- $string : string
-
in which the acute accents are to be replaced
Return values
string —with changes
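The accent replacement can be sketched in Python as a character translation. This is only an illustration under assumptions: it covers the five lowercase acute-accented vowels, and whatever additional handling of accented characters the PHP method performs is not documented here and is not reproduced. The name `acute_by_grave` is hypothetical.

```python
# Map each acute-accented vowel to its grave-accented counterpart.
ACUTE_TO_GRAVE = str.maketrans("áéíóú", "àèìòù")

def acute_by_grave(text):
    """Return `text` with acute-accented vowels replaced by grave ones."""
    return text.translate(ACUTE_TO_GRAVE)
```

Characters that already carry a grave accent, or no accent at all, pass through unchanged.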
checkForSuffix()
Checks if a string is a suffix for another string
private
static checkForSuffix($parent_string, $substring) : int
Parameters
- $parent_string :
-
is the string in which we wish to find the suffix
- $substring :
-
is the suffix we wish to check
Return values
int —$pos as the starting position of the suffix $substring in $parent_string if it exists, else false
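A minimal Python sketch of the documented contract (starting position if the suffix is present, otherwise false). The name `check_for_suffix` and the treatment of an empty suffix as "not found" are assumptions, not taken from the PHP source.

```python
def check_for_suffix(parent_string, substring):
    """Return the index where `substring` starts if it ends
    `parent_string`; otherwise return False."""
    if substring and parent_string.endswith(substring):
        return len(parent_string) - len(substring)
    return False
```

Because a valid position of 0 is also falsy, callers of a function with this contract should compare the result against false explicitly (`is False` in Python, `=== false` in PHP) rather than relying on truthiness.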
getRegions()
Computes regions R1, R2 and RV in the form of strings: $r1_string, $r2_string, and $rv_string for R1, R2, and RV respectively
private
static getRegions() : mixed
Return values
mixed
in()
Checks if a string occurs in another string
private
static in(string $string, string $substring) : bool
Parameters
- $string : string
-
is the parent string
- $substring : string
-
is the string checked to be a sub-string of $string
Return values
bool —if $substring is a substring of $string
isVowel()
Checks if a character is a vowel or not
private
static isVowel(string $char) : bool
Parameters
- $char : string
-
is the character to be checked
Return values
bool —if $char is a vowel
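A Python sketch of the vowel test, assuming the vowel set used by the Snowball Italian stemmer (a, e, i, o, u and their grave-accented forms). That the PHP class uses exactly this set is an assumption; the name `is_vowel` is hypothetical.

```python
# Assumed vowel set: plain vowels plus their grave-accented forms.
VOWELS = set("aeiouàèìòù")

def is_vowel(char):
    """Return True if `char` is one of the assumed Italian vowels."""
    return char in VOWELS
```

Under this assumption an upper-case U or I, as produced by prelude(), is not treated as a vowel, which is what marks those letters as non-vowels in later steps.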
maxSuffix()
Computes the longest suffix for a given string from a given set of suffixes
private
static maxSuffix(string $string, array<string|int, mixed> $suffixes) : int
Parameters
- $string : string
-
for which the maximum suffix is to be found
- $suffixes : array<string|int, mixed>
-
an array of suffixes
Return values
int —$max_suffix is the longest suffix for $string
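The longest-suffix selection can be sketched in Python as follows. The documented return type is int while the description speaks of the suffix itself, so this sketch makes an assumption and returns the matching suffix, or None when nothing matches; the name `max_suffix` is hypothetical.

```python
def max_suffix(string, suffixes):
    """Return the longest entry of `suffixes` that ends `string`,
    or None if no entry matches."""
    matches = [s for s in suffixes if s and string.endswith(s)]
    return max(matches, key=len) if matches else None
```

The starting position of the chosen suffix, which $max_suffix_pos appears to store, would then be `len(string) - len(suffix)`.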
postlude()
Converts U and/or I back to lowercase
private
static postlude() : mixed
Return values
mixed
prelude()
Performs the following functions: replaces acute accents with grave accents; marks u after q, and u or i preceded and followed by a vowel, as non-vowels by converting them to upper case
private
static prelude() : mixed
Return values
mixed
r1()
Computes the starting index for region R1
private
static r1(string $string) : int
Parameters
- $string : string
-
for which we wish to find the index
Return values
int —$r1_start as the starting index for R1 for $string
r2()
Computes the starting index for region R2
private
static r2(string $string) : int
Parameters
- $string : string
-
for which we wish to find the index
Return values
int —$r2_start as the starting index for R2 for $string
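The following Python sketch shows one way to compute the R1 and R2 starting indices, assuming the Snowball definitions: R1 begins after the first non-vowel that follows a vowel (or at the end of the word if there is none), and R2 is the same rule applied inside R1. The function names, the optional `start` parameter, and the vowel set are assumptions, and the sketch ignores the upper-case marking done by prelude().

```python
VOWELS = set("aeiouàèìòù")  # assumed vowel set

def r1_start(word, start=0):
    """Index just past the first non-vowel that follows a vowel,
    searching from `start`; len(word) if there is none."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)

def r2_start(word):
    """Apply the R1 rule again, starting inside R1."""
    return r1_start(word, r1_start(word))
```

For "guardando" this gives an R1 start of 4 (R1 is "dando") and an R2 start of 7 (R2 is "do").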
rv()
Computes the starting index for region RV
private
static rv(string $string) : int
Parameters
- $string : string
-
for which we wish to find the index
Return values
int —$rv_start as the starting index for RV for $string
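A Python sketch of the RV starting index, assuming the Snowball definition of RV: if the second letter is a consonant, RV starts after the next vowel; if the first two letters are vowels, RV starts after the next consonant; otherwise RV starts after the third letter; and RV is the end of the word when no such position exists. The name `rv_start` and the vowel set are assumptions.

```python
VOWELS = set("aeiouàèìòù")  # assumed vowel set

def rv_start(word):
    """Return the index at which region RV begins in `word`."""
    n = len(word)
    if n < 2:
        return n
    if word[1] not in VOWELS:
        # Second letter is a consonant: RV starts after the next vowel.
        for i in range(2, n):
            if word[i] in VOWELS:
                return i + 1
        return n
    if word[0] in VOWELS:
        # First two letters are vowels: RV starts after the next consonant.
        for i in range(2, n):
            if word[i] not in VOWELS:
                return i + 1
        return n
    # Consonant followed by a vowel: RV starts after the third letter.
    return min(3, n)
```

With these rules, "oliva" has RV "va" and "guardando" has RV "rdando".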
step0()
Handles attached pronouns
private
static step0() : mixed
Return values
mixed
step1()
Handles standard suffixes
private
static step1() : mixed
Return values
mixed
step2()
Handles verb suffixes
private
static step2() : mixed
Return values
mixed
step3a()
Deletes a final a, e, i, o, à, è, ì, ò and a preceding i if in RV
private
static step3a() : mixed
Return values
mixed
step3b()
Replaces a final ch/gh by c/g if in RV
private
static step3b() : mixed
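As an illustration of the rule described for step3b(), here is a short Python sketch. Unlike the documented method, which takes no arguments and presumably works on the class's static storage, this hypothetical function receives the word and the RV starting index as parameters.

```python
def step3b(word, rv_start):
    """Replace a final 'ch' or 'gh' by 'c' or 'g' when the two-letter
    ending lies entirely inside RV (i.e. starts at or after rv_start)."""
    if word.endswith(("ch", "gh")) and len(word) - 2 >= rv_start:
        return word[:-1]  # drop the trailing 'h'
    return word
```

For instance, "crocch" with an RV start of 3 becomes "crocc", while a word whose ending falls outside RV is returned unchanged.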