Yioop_V9.5_Source_Code_Documentation

Tokenizer
in package

Italian-specific tokenization code. Typically, tokenizer.php either contains a stemmer for the language in question or specifies how many characters are in a char gram

Tags
author

Chris Pollett

Table of Contents

$no_stem_list  : array<string|int, mixed>
Words we don't want to be stemmed
$stop_words  : array<string|int, mixed>
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries
$buffer  : string
Storage used in computing the stem
$max_suffix_pos  : int
Storage for computing the starting position for the longest suffix
$r1_start  : int
Storage used in computing the starting index of region R1
$r1_string  : string
Storage used in computing region R1
$r2_start  : int
Storage used in computing the starting index of region R2
$r2_string  : string
Storage used in computing region R2
$rv_start  : int
Storage used in computing the starting index of region RV
$rv_string  : string
Storage used in computing Region RV
$step1_changes  : bool
Storage used in determining whether step1 removed any endings from the word
segment()  : string
This method currently does nothing. For some locales it could be used to split strings of the form "thisisastring" into a string with the words separated: "this is a string"
stem()  : string
Computes the stem of an Italian word. Example: guardando, guardandogli, guardandola, and guardano all stem to guard
stopwordsRemover()  : mixed
Removes the stop words from the page (used for Word Cloud generation)
acuteByGrave()  : string
Replaces all acute accents in a string with grave accents and also handles accented characters
checkForSuffix()  : int
Checks if a string is a suffix for another string
getRegions()  : mixed
Computes regions R1, R2 and RV in the form of strings: $r1_string, $r2_string, and $rv_string for R1, R2 and RV respectively
in()  : bool
Checks if a string occurs in another string
isVowel()  : bool
Checks if a character is a vowel or not
maxSuffix()  : int
Computes the longest suffix for a given string from a given set of suffixes
postlude()  : mixed
Converts U and/or I back to lowercase
prelude()  : mixed
Performs the following functions: replaces acute accents with grave accents; marks u after q, and u or i preceded and followed by a vowel, as non-vowels by converting them to upper case
r1()  : int
Computes the starting index for region R1
r2()  : int
Computes the starting index for region R2
rv()  : int
Computes the starting index for region RV
step0()  : mixed
Handles attached pronouns
step1()  : mixed
Handles standard suffixes
step2()  : mixed
Handles verb suffixes
step3a()  : mixed
Deletes a final a, e, i, o, à, è, ì, ò and a preceding i if in RV
step3b()  : mixed
Replaces a final ch/gh by c/g if in RV

Properties

$no_stem_list

Words we don't want to be stemmed

public static array<string|int, mixed> $no_stem_list = []

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries

public static array<string|int, mixed> $stop_words = ['http', 'https', "ad", "al", "allo", "ai", "agli", "all", "agl", "alla", "alle", "con", "col", "coi", "da", "dal", "dallo", "dai", "dagli", "dall", "dagl", "dalla", "dalle", "di", "del", "dello", "dei", "degli", "dell", "degl", "della", "delle", "in", "nel", "nello", "nei", "negli", "nell", "negl", "nella", "nelle", "su", "sul", "sullo", "sui", "sugli", "sull", "sugl", "sulla", "sulle", "per", "tra", "contro", "io", "tu", "lui", "lei", "noi", "voi", "loro", "mio", "mia", "miei", "mie", "tuo", "tua", "tuoi", "tue", "suo", "sua", "suoi", "sue", "nostro", "nostra", "nostri", "nostre", "vostro", "vostra", "vostri", "vostre", "mi", "ti", "ci", "vi", "lo", "la", "li", "le", "gli", "ne", "il", "un", "uno", "una", "ma", "ed", "se", "perché", "anche", "come", "dov", "dove", "che", "chi", "cui", "non", "più", "quale", "quanto", "quanti", "quanta", "quante", "quello", "quelli", "quella", "quelle", "questo", "questi", "questa", "queste", "si", "tutto", "tutti", "a", "c", "e", "i", "l", "o", "ho", "hai", "ha", "abbiamo", "avete", "hanno", "abbia", "abbiate", "abbiano", "avrò", "avrai", "avrà", "avremo", "avrete", "avranno", "avrei", "avresti", "avrebbe", "avremmo", "avreste", "avrebbero", "avevo", "avevi", "aveva", "avevamo", "avevate", "avevano", "ebbi", "avesti", "ebbe", "avemmo", "aveste", "ebbero", "avessi", "avesse", "avessimo", "avessero", "avendo", "avuto", "avuta", "avuti", "avute", "sono", "sei", "è", "siamo", "siete", "sia", "siate", "siano", "sarò", "sarai", "sarà", "saremo", "sarete", "saranno", "sarei", "saresti", "sarebbe", "saremmo", "sareste", "sarebbero", "ero", "eri", "era", "eravamo", "eravate", "erano", "fui", "fosti", "fu", "fummo", "foste", "furono", "fossi", "fosse", "fossimo", "fossero", "essendo", "faccio", "fai", "facciamo", "fanno", "faccia", "facciate", "facciano", "farò", "farai", "farà", "faremo", "farete", "faranno", "farei", "faresti", "farebbe", "faremmo", "fareste", "farebbero", "facevo", 
"facevi", "faceva", "facevamo", "facevate", "facevano", "feci", "facesti", "fece", "facemmo", "faceste", "fecero", "facessi", "facesse", "facessimo", "facessero", "facendo", "sto", "stai", "sta", "stiamo", "stanno", "stia", "stiate", "stiano", "starò", "starai", "starà", "staremo", "starete", "staranno", "starei", "staresti", "starebbe", "staremmo", "stareste", "starebbero", "stavo", "stavi", "stava", "stavamo", "stavate", "stavano", "stetti", "stesti", "stette", "stemmo", "steste", "stettero", "stessi", "stesse", "stessimo", "stessero", "stando"]

$buffer

Storage used in computing the stem

private static string $buffer

$max_suffix_pos

Storage for computing the starting position for the longest suffix

private static int $max_suffix_pos

$r1_start

Storage used in computing the starting index of region R1

private static int $r1_start

$r1_string

Storage used in computing region R1

private static string $r1_string

$r2_start

Storage used in computing the starting index of region R2

private static int $r2_start

$r2_string

Storage used in computing region R2

private static string $r2_string

$rv_start

Storage used in computing the starting index of region RV

private static int $rv_start

$rv_string

Storage used in computing Region RV

private static string $rv_string

$step1_changes

Storage used in determining whether step1 removed any endings from the word

private static bool $step1_changes

Methods

segment()

This method currently does nothing. For some locales it could be used to split strings of the form "thisisastring" into a string with the words separated: "this is a string"

public static segment(string $pre_segment) : string
Parameters
$pre_segment : string

string to be segmented

Return values
string

after segmentation done (same string in this case)

stem()

Computes the stem of an Italian word. Example: guardando, guardandogli, guardandola, and guardano all stem to guard

public static stem(string $word) : string
Parameters
$word : string

is the word to be stemmed

Return values
string

stem of $word

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation)

public static stopwordsRemover(mixed $data) : mixed
Parameters
$data : mixed

either a string or an array of strings to remove stop words from

Return values
mixed

$data with no stop words
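
As a rough illustration of the filtering described above, the following Python sketch removes stop words from either a string or a list of strings. It is not the Yioop PHP implementation: the name remove_stop_words, the short STOP_WORDS subset, and the regular-expression approach are all assumptions made for this example.

```python
import re

# Small illustrative subset of the Italian stop-word list (assumption:
# the real list is the much longer $stop_words property).
STOP_WORDS = ["il", "la", "di", "che", "e", "non", "per", "un", "una"]

# One alternation wrapped in word boundaries, so that "di" is not
# removed from inside a longer word such as "dire".
_STOP_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in STOP_WORDS) + r")\b"
)


def remove_stop_words(data):
    """Return data (a string or a list of strings) without stop words."""
    if isinstance(data, str):
        cleaned = _STOP_PATTERN.sub("", data)
        # Collapse the whitespace left behind by removed words.
        return " ".join(cleaned.split())
    return [remove_stop_words(item) for item in data]
```

Passing a list applies the same filtering to each element, mirroring the "string or array of strings" parameter documented for stopwordsRemover().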

acuteByGrave()

Replaces all acute accents in a string with grave accents and also handles accented characters

private static acuteByGrave(string $string) : string
Parameters
$string : string

in which the acute accents are to be replaced

Return values
string

with changes

checkForSuffix()

Checks if a string is a suffix for another string

private static checkForSuffix( $parent_string,  $substring) : int
Parameters
$parent_string :

is the string in which we wish to find the suffix

$substring :

is the suffix we wish to check

Return values
int

$pos as the starting position of the suffix $substring in $parent_string if it exists, else false
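
The suffix check can be sketched in a few lines of Python. This is an illustrative reading of the description above, not the Yioop code; the name check_for_suffix is hypothetical, and None stands in for the documented false return value.

```python
def check_for_suffix(parent_string, substring):
    """Return the index where substring starts if it ends parent_string,
    otherwise None (standing in for the documented false)."""
    if substring and parent_string.endswith(substring):
        return len(parent_string) - len(substring)
    return None
```

Because a suffix equal to the whole word starts at position 0, callers should compare the result against None (or, in PHP, use a strict comparison against false) rather than testing it for truthiness.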

getRegions()

Computes regions R1, R2 and RV in the form of strings: $r1_string, $r2_string, and $rv_string for R1, R2 and RV respectively

private static getRegions() : mixed
Return values
mixed

in()

Checks if a string occurs in another string

private static in(string $string, string $substring) : bool
Parameters
$string : string

is the parent string

$substring : string

is the string checked to be a sub-string of $string

Return values
bool

if $substring is a substring of $string

isVowel()

Checks if a character is a vowel or not

private static isVowel(string $char) : bool
Parameters
$char : string

is the character to be checked

Return values
bool

if $char is a vowel

maxSuffix()

Computes the longest suffix for a given string from a given set of suffixes

private static maxSuffix(string $string, array<string|int, mixed> $suffixes) : int
Parameters
$string : string

for which the maximum suffix is to be found

$suffixes : array<string|int, mixed>

an array of suffixes

Return values
int

$max_suffix is the longest suffix for $string
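
A minimal Python sketch of choosing the longest matching suffix from a candidate set follows. The name longest_suffix and the choice to return both the suffix and its starting position are assumptions for illustration; the documented method returns a single value.

```python
def longest_suffix(word, suffixes):
    """Return (suffix, start_index) for the longest candidate that ends
    word, or None when no candidate matches."""
    best = None
    for suffix in suffixes:
        if suffix and word.endswith(suffix):
            if best is None or len(suffix) > len(best):
                best = suffix
    if best is None:
        return None
    return best, len(word) - len(best)
```

Preferring the longest match matters when candidates nest, for example when both "gli" and "li" end the same word.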

postlude()

Converts U and/or I back to lowercase

private static postlude() : mixed
Return values
mixed

prelude()

Performs the following functions: replaces acute accents with grave accents; marks u after q, and u or i preceded and followed by a vowel, as non-vowels by converting them to upper case

private static prelude() : mixed
Return values
mixed
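
The prelude and postlude steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Yioop PHP code: the function names, the VOWELS set, and the choice to take and return a string (the documented methods work on the class's $buffer property) are all hypothetical, and the input is assumed to be lower case already.

```python
# Assumed Italian vowel set, including grave-accented forms.
VOWELS = "aeiouàèìòù"
_ACUTE_TO_GRAVE = str.maketrans("áéíóú", "àèìòù")


def prelude(word):
    """Normalize accents, then upper-case u/i that act as consonants."""
    chars = list(word.translate(_ACUTE_TO_GRAVE))
    for i, ch in enumerate(chars):
        if ch == "u" and i > 0 and chars[i - 1] == "q":
            chars[i] = "U"  # u after q
        elif (ch in "ui" and 0 < i < len(chars) - 1
              and chars[i - 1] in VOWELS and chars[i + 1] in VOWELS):
            chars[i] = ch.upper()  # u or i between two vowels
    return "".join(chars)


def postlude(word):
    """Undo the marking done in prelude."""
    return word.replace("U", "u").replace("I", "i")
```

Since the upper-case U and I are not in VOWELS, later region and suffix computations treat the marked letters as consonants until postlude restores them.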

r1()

Computes the starting index for region R1

private static r1(string $string) : int
Parameters
$string : string

for which we wish to find the index

Return values
int

$r1_start as the starting index for R1 for $string

r2()

Computes the starting index for region R2

private static r2(string $string) : int
Parameters
$string : string

for which we wish to find the index

Return values
int

$r2_start as the starting index for R2 for $string
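
The R1 and R2 naming suggests the usual Snowball-style regions, where R1 begins after the first non-vowel that follows a vowel and R2 is the same rule applied inside R1. Assuming that definition (the page itself does not spell it out), a Python sketch with hypothetical function names looks like this:

```python
# Assumed Italian vowel set, including grave-accented forms.
VOWELS = "aeiouàèìòù"


def r1_start(word):
    """Index just past the first non-vowel that follows a vowel,
    or len(word) when there is no such position."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def r2_start(word):
    """Apply the R1 rule again inside R1 and shift back to word indices."""
    start = r1_start(word)
    return start + r1_start(word[start:])
```

For "guardando" this gives R1 = "dando" and R2 = "do"; a word with no vowel followed by a non-vowel has empty R1 and R2 regions.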

rv()

Computes the starting index for region RV

private static rv(string $string) : int
Parameters
$string : string

for which we wish to find the index

Return values
int

$rv_start as the starting index for RV for $string
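
Assuming RV follows the published Snowball definition for Italian (an assumption, since this page does not state the rule), its starting index can be sketched in Python as below. The function name and VOWELS set are hypothetical.

```python
# Assumed Italian vowel set, including grave-accented forms.
VOWELS = "aeiouàèìòù"


def rv_start(word):
    """Start index of RV, or len(word) when the position cannot be found."""
    n = len(word)
    if n < 2:
        return n
    if word[1] not in VOWELS:
        # Second letter is a consonant: RV starts after the next vowel.
        for i in range(2, n):
            if word[i] in VOWELS:
                return i + 1
        return n
    if word[0] in VOWELS:
        # First two letters are vowels: RV starts after the next consonant.
        for i in range(2, n):
            if word[i] not in VOWELS:
                return i + 1
        return n
    # Consonant followed by vowel: RV starts after the third letter.
    return 3 if n >= 3 else n
```

For example, "abbiamo" has RV "amo", while "guardando" falls in the consonant-vowel case and has RV "rdando".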

step0()

Handles attached pronouns

private static step0() : mixed
Return values
mixed

step1()

Handles standard suffixes

private static step1() : mixed
Return values
mixed

step2()

Handles verb suffixes

private static step2() : mixed
Return values
mixed

step3a()

Deletes a final a, e, i, o, à, è, ì, ò and a preceding i if in RV

private static step3a() : mixed
Return values
mixed
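
A Python sketch of this final-vowel removal is shown below. It is illustrative only: the function takes the word and the RV start index as explicit arguments (the documented method takes none and presumably works on class state), and the name step3a_sketch is hypothetical.

```python
FINAL_VOWELS = "aeioàèìò"


def step3a_sketch(word, rv_start):
    """Drop a final vowel that lies in RV, then a preceding i if it
    also lies in RV. A letter is in RV when its index >= rv_start."""
    if len(word) > rv_start and word[-1] in FINAL_VOWELS:
        word = word[:-1]
        if len(word) > rv_start and word[-1] == "i":
            word = word[:-1]
    return word
```

With an RV start of 3, "guardia" loses both the final a and the preceding i, leaving "guard".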

step3b()

Replaces a final ch/gh by c/g if in RV

private static step3b() : mixed
Return values
mixed
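
The ch/gh replacement can be sketched in Python in the same spirit. As before, the explicit rv_start argument and the name step3b_sketch are assumptions for illustration, and the sketch requires the whole two-letter ending to lie inside RV.

```python
def step3b_sketch(word, rv_start):
    """Replace a final ch/gh with c/g when the ending lies inside RV."""
    for ending, replacement in (("ch", "c"), ("gh", "g")):
        if word.endswith(ending) and len(word) - len(ending) >= rv_start:
            return word[: -len(ending)] + replacement
    return word
```

A word such as "lagh" with an RV start of 3 is left unchanged, because its ending begins before RV.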

        
