Yioop_V9.5_Source_Code_Documentation

Tokenizer
in package

Spanish-specific tokenization code. Typically, tokenizer.php either contains a stemmer for the language in question or specifies how many characters are in a char gram.

This class has a collection of methods for Spanish locale-specific tokenization. In particular, it has a stemmer and a stop word remover (for use mainly in word cloud creation). The stemmer is my stab at re-implementing the stemmer algorithm given at http://snowball.tartarus.org. Here, given a word, its stem is that part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, tallest. A stemmer takes a word and tries to produce its stem.
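
The definition of a stem above can be illustrated with a small standalone sketch. The function name commonPrefix() is hypothetical and is not part of this class. It only shows what "common to all inflected variants" means when the variants are already known; a real stemmer such as stem() sees a single word and has to strip suffixes by rule instead.

```php
<?php
// Hypothetical helper for illustration only; not part of Yioop's Tokenizer.

/**
 * Returns the longest prefix shared by every word in $words.
 */
function commonPrefix(array $words): string
{
    if ($words === []) {
        return '';
    }
    $prefix = array_shift($words);
    foreach ($words as $word) {
        // Split on UTF-8 characters so accented letters stay intact.
        $a = preg_split('//u', $prefix, -1, PREG_SPLIT_NO_EMPTY);
        $b = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
        $max = min(count($a), count($b));
        $i = 0;
        while ($i < $max && $a[$i] === $b[$i]) {
            $i++;
        }
        $prefix = implode('', array_slice($a, 0, $i));
    }
    return $prefix;
}

echo commonPrefix(['tall', 'taller', 'tallest']), "\n"; // prints "tall"
```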

Tags
author

Chris Pollett

Table of Contents

$no_stem_list  : array<string|int, mixed>
Words we don't want to be stemmed
$stop_words  : mixed
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries
$buffer  : string
Storage used in computing the stem
$r1  : string
$r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.
$r1_index  : int
Position in $word to stem of $r1
$r2  : string
$r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel
$r2_index  : int
Position in $word to stem of $r2
$rv  : string
$rv is approximately the string after the first vowel in the $word we want to stem
$rv_index  : int
Position in $word to stem of $rv
$vowel  : string
Spanish vowels
segment()  : string
Stub function which could be used for a word segmenter.
stem()  : string
Computes the stem of a Spanish word
stopwordsRemover()  : mixed
Removes the stop words from the page (used for Word Cloud generation)
computeRegions()  : mixed
Computes the three regions of the word, $rv, $r1, and $r2, used in the rest of the stemmer. $rv is defined as follows: if the second letter is a consonant, $rv is the region after the next following vowel; if the first two letters are vowels, $rv is the region after the next consonant; otherwise (the consonant-vowel case), $rv is the region after the third letter. If these positions cannot be found, $rv is the end of the word.
removeAccents()  : mixed
Removes accents as the final stemming step
step0()  : mixed
Remove attached pronouns
step1()  : mixed
Standard suffix removal
step2a()  : mixed
Stem verb suffixes beginning with y
step2b()  : mixed
Stem other verb suffixes
step3()  : mixed
Delete residual suffixes

Properties

$no_stem_list

Words we don't want to be stemmed

public static array<string|int, mixed> $no_stem_list = []

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries

public static mixed $stop_words = ["de", "la", "que", "el", "en", "y", "a", "los", "del", "se", "las", "por", "un", "para", "con", "no", "una", "su", "al", "lo", "como", "más", "pero", "sus", "le", "ya", "o", "este", "sí", "porque", "esta", "entre", "cuando", "muy", "sin", "sobre", "también", "me", "hasta", "hay", "donde", "quien", "desde", "todo", "nos", "durante", "todos", "uno", "les", "ni", "contra", "otros", "ese", "eso", "ante", "ellos", "e", "esto", "mí", "antes", "algunos", "qué", "unos", "yo", "otro", "otras", "otra", "él", "tanto", "esa", "estos", "mucho", "quienes", "nada", "muchos", "cual", "poco", "ella", "estar", "estas", "algunas", "algo", "nosotros", "mi", "mis", "tú", "te", "ti", "tu", "tus", "ellas", "nosotras", "vosotros", "vosotras", "os", "mío", "mía", "míos", "mías", "tuyo", "tuya", "tuyos", "tuyas", "suyo", "suya", "suyos", "suyas", "nuestro", "nuestra", "nuestros", "nuestras", "vuestro", "vuestra", "vuestros", "vuestras", "esos", "esas", "estoy", "estás", "está", "estamos", "estáis", "están", "esté", "estés", "estemos", "estéis", "estén", "estaré", "estarás", "estará", "estaremos", "estaréis", "estarán", "estaría", "estarías", "estaríamos", "estaríais", "estarían", "estaba", "estabas", "estábamos", "estabais", "estaban", "estuve", "estuviste", "estuvo", "estuvimos", "estuvisteis", "estuvieron", "estuviera", "estuvieras", "estuviéramos", "estuvierais", "estuvieran", "estuviese", "estuvieses", "estuviésemos", "estuvieseis", "estuviesen", "estando", "estado", "estada", "estados", "estadas", "estad", "he", "has", "ha", "hemos", "habéis", "han", "haya", "hayas", "hayamos", "hayáis", "hayan", "habré", "habrás", "habrá", "habremos", "habréis", "habrán", "habría", "habrías", "habríamos", "habríais", "habrían", "había", "habías", "habíamos", "habíais", 'http', 'https', "habían", "hube", "hubiste", "hubo", "hubimos", "hubisteis", "hubieron", "hubiera", "hubieras", "hubiéramos", "hubierais", "hubieran", "hubiese", "hubieses", "hubiésemos", "hubieseis", 
"hubiesen", "habiendo", "habido", "habida", "habidos", "habidas", "soy", "eres", "es", "somos", "sois", "son", "sea", "seas", "seamos", "seáis", "sean", "seré", "serás", "será", "seremos", "seréis", "serán", "sería", "serías", "seríamos", "seríais", "serían", "era", "eras", "éramos", "erais", "eran", "fui", "fuiste", "fue", "fuimos", "fuisteis", "fueron", "fuera", "fueras", "fuéramos", "fuerais", "fueran", "fuese", "fueses", "fuésemos", "fueseis", "fuesen", "siendo", "sido", "sed", "tengo", "tienes", "tiene", "tenemos", "tenéis", "tienen", "tenga", "tengas", "tengamos", "tengáis", "tengan", "tendré", "tendrás", "tendrá", "tendremos", "tendréis", "tendrán", "tendría", "tendrías", "tendríamos", "tendríais", "tendrían", "tenía", "tenías", "teníamos", "teníais", "tenían", "tuve", "tuviste", "tuvo", "tuvimos", "tuvisteis", "tuvieron", "tuviera", "tuvieras", "tuviéramos", "tuvierais", "tuvieran", "tuviese", "tuvieses", "tuviésemos", "tuvieseis", "tuviesen", "teniendo", "tenido", "tenida", "tenidos", "tenidas", "tened"]
Tags
array

$buffer

Storage used in computing the stem

private static string $buffer = ""

$r1

$r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.

private static string $r1

$r1_index

Position in $word to stem of $r1

private static int $r1_index

$r2

$r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel

private static string $r2

$r2_index

Position in $word to stem of $r2

private static int $r2_index

$rv

$rv is approximately the string after the first vowel in the $word we want to stem

private static string $rv

$rv_index

Position in $word to stem of $rv

private static int $rv_index

$vowel

Spanish vowels

private static string $vowel = 'aeiouáéíóúü'

Methods

segment()

Stub function which could be used for a word segmenter.

public static segment(string $pre_segment) : string

Such a segmenter on input thisisabunchofwords would output this is a bunch of words

Parameters
$pre_segment : string

before segmentation

Return values
string

should return a string with words separated by spaces; in this case, it does nothing
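
Since segment() is only a stub here, the following is a minimal sketch of what a segmenter with the described behavior could look like. It uses greedy longest-match against a word list. The function name segmentSketch() and the $dictionary parameter are assumptions made for this illustration and are not part of the class; greedy matching is also not guaranteed to find the best split in general.

```php
<?php
// Illustrative sketch only; segmentSketch() and $dictionary are hypothetical.

/**
 * Splits $pre_segment into space-separated words by repeatedly taking the
 * longest dictionary word that starts at the current position. Characters
 * that start no dictionary word are emitted on their own.
 */
function segmentSketch(string $pre_segment, array $dictionary): string
{
    $chars = preg_split('//u', $pre_segment, -1, PREG_SPLIT_NO_EMPTY);
    $len = count($chars);
    $lookup = array_flip($dictionary);
    $words = [];
    $pos = 0;
    while ($pos < $len) {
        $match = $chars[$pos];
        $match_len = 1;
        // Try the longest candidate first, down to two characters.
        for ($end = $len; $end > $pos + 1; $end--) {
            $candidate = implode('', array_slice($chars, $pos, $end - $pos));
            if (isset($lookup[$candidate])) {
                $match = $candidate;
                $match_len = $end - $pos;
                break;
            }
        }
        $words[] = $match;
        $pos += $match_len;
    }
    return implode(' ', $words);
}

$dictionary = ['this', 'is', 'a', 'bunch', 'of', 'words'];
echo segmentSketch('thisisabunchofwords', $dictionary), "\n";
// prints "this is a bunch of words"
```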

stem()

Computes the stem of a Spanish word

public static stem(string $word) : string
Parameters
$word : string

the string to stem

Return values
string

the stem of $word

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation)

public static stopwordsRemover(mixed $data) : mixed
Parameters
$data : mixed

either a string or an array of strings to remove stop words from

Return values
mixed

$data with no stop words
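
The behavior described above (accepting either a string or an array of strings) can be sketched as follows. This is a standalone illustration under stated assumptions, not the class's actual implementation: removeStopWordsSketch() and STOP_WORDS_SAMPLE are hypothetical names, the list is a small subset of $stop_words, and tokens are split on whitespace only.

```php
<?php
// Illustrative sketch only; names below are hypothetical, and the list is a
// small subset of the documented $stop_words property.
const STOP_WORDS_SAMPLE = ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los'];

/**
 * Removes stop words from a string, or from each string in an array.
 *
 * @param mixed $data a string or an array of strings
 * @return mixed $data with the sample stop words removed
 */
function removeStopWordsSketch($data)
{
    if (is_array($data)) {
        return array_map('removeStopWordsSketch', $data);
    }
    $tokens = preg_split('/\s+/u', trim($data), -1, PREG_SPLIT_NO_EMPTY);
    $kept = array_filter($tokens, function ($token) {
        // strtolower() only lowercases ASCII letters, which is enough for
        // this unaccented sample list.
        return !in_array(strtolower($token), STOP_WORDS_SAMPLE, true);
    });
    return implode(' ', $kept);
}

echo removeStopWordsSketch('El gato y el perro en la casa'), "\n";
// prints "gato perro casa"
```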

computeRegions()

Computes the three regions of the word, $rv, $r1, and $r2, used in the rest of the stemmer. $rv is defined as follows: if the second letter is a consonant, $rv is the region after the next following vowel; if the first two letters are vowels, $rv is the region after the next consonant; otherwise (the consonant-vowel case), $rv is the region after the third letter. If these positions cannot be found, $rv is the end of the word.

private static computeRegions() : mixed

$r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel. $r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel.

Return values
mixed
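
The region definitions above can be sketched as a standalone function. This is an illustration that follows the prose definitions, not the class's private implementation: all function names are hypothetical, the word is passed as an argument rather than read from $buffer, and indices count UTF-8 characters, which may differ from how the class stores $rv_index, $r1_index, and $r2_index.

```php
<?php
// Illustrative sketch only; the function names are hypothetical and this is
// not Yioop's private computeRegions() implementation.

/** Spanish vowels, mirroring the documented $vowel property. */
const SPANISH_VOWELS = ['a', 'e', 'i', 'o', 'u', 'á', 'é', 'í', 'ó', 'ú', 'ü'];

function isSpanishVowel(string $char): bool
{
    return in_array($char, SPANISH_VOWELS, true);
}

/**
 * Index just past the first non-vowel that follows a vowel, scanning from
 * $start; the word length (empty region) if no such non-vowel exists.
 */
function regionStart(array $chars, int $start): int
{
    $len = count($chars);
    for ($i = $start + 1; $i < $len; $i++) {
        if (isSpanishVowel($chars[$i - 1]) && !isSpanishVowel($chars[$i])) {
            return $i + 1;
        }
    }
    return $len;
}

/** Start index of RV, following the three cases in the description. */
function rvStart(array $chars): int
{
    $len = count($chars);
    if ($len < 2) {
        return $len;
    }
    if (!isSpanishVowel($chars[1])) {
        // Second letter is a consonant: RV starts after the next vowel.
        $want_vowel = true;
    } elseif (isSpanishVowel($chars[0])) {
        // First two letters are vowels: RV starts after the next consonant.
        $want_vowel = false;
    } else {
        // Consonant-vowel case: RV starts after the third letter.
        return min(3, $len);
    }
    for ($i = 2; $i < $len; $i++) {
        if (isSpanishVowel($chars[$i]) === $want_vowel) {
            return $i + 1;
        }
    }
    return $len;
}

function computeRegionsSketch(string $word): array
{
    // Split into UTF-8 characters so accented vowels count as one letter.
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    $r1_index = regionStart($chars, 0);
    $r2_index = regionStart($chars, $r1_index);
    $rv_index = rvStart($chars);
    $tail = function (int $i) use ($chars): string {
        return implode('', array_slice($chars, $i));
    };
    return [
        'rv' => $tail($rv_index), 'rv_index' => $rv_index,
        'r1' => $tail($r1_index), 'r1_index' => $r1_index,
        'r2' => $tail($r2_index), 'r2_index' => $r2_index,
    ];
}
```

For example, computeRegionsSketch('trabajo') gives rv = "bajo", r1 = "ajo", and r2 = "o": the second letter is a consonant, so rv starts after the first following vowel, while r1 starts after the "b" that follows the first "a".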

removeAccents()

Removes accents as the final stemming step

private static removeAccents() : mixed
Return values
mixed
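
As a rough illustration of this step, the sketch below replaces acute-accented vowels with their plain forms, which is what the final step of the Snowball Spanish algorithm does. The function name removeAcuteAccentsSketch() is hypothetical, and unlike the private method it takes the word as an argument instead of working on $buffer; whether the class handles exactly these characters is an assumption.

```php
<?php
// Illustrative sketch only; removeAcuteAccentsSketch() is a hypothetical
// name and takes its input as a parameter rather than using $buffer.

/**
 * Replaces á, é, í, ó, ú with a, e, i, o, u. Other characters, including
 * ü and ñ, are left unchanged.
 */
function removeAcuteAccentsSketch(string $text): string
{
    // The array form of strtr() replaces whole UTF-8 byte sequences.
    return strtr($text, [
        'á' => 'a',
        'é' => 'e',
        'í' => 'i',
        'ó' => 'o',
        'ú' => 'u',
    ]);
}

echo removeAcuteAccentsSketch('canción'), "\n"; // prints "cancion"
```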

step0()

Remove attached pronouns

private static step0() : mixed
Return values
mixed

step1()

Standard suffix removal

private static step1() : mixed
Return values
mixed

step2a()

Stem verb suffixes beginning with y

private static step2a() : mixed
Return values
mixed

step2b()

Stem other verb suffixes

private static step2b() : mixed
Return values
mixed

step3()

Delete residual suffixes

private static step3() : mixed
Return values
mixed

        
