Tokenizer
Spanish specific tokenization code. Typically, tokenizer.php either contains a stemmer for the language in question or it specifies how many characters in a char gram
This class has a collection of methods for Spanish locale-specific tokenization. In particular, it has a stemmer and a stop word remover (mainly for use in word cloud creation). The stemmer is my stab at re-implementing the stemmer algorithm given at http://snowball.tartarus.org. Here, given a word, its stem is the part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, and tallest. A stemmer takes a word and tries to produce its stem.
Table of Contents
- $no_stem_list : array<string|int, mixed>
- Words we don't want to be stemmed
- $stop_words : mixed
- A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries
- $buffer : string
- Storage used in computing the stem
- $r1 : string
- $r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.
- $r1_index : int
- Position in $word to stem of $r1
- $r2 : string
- $r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel
- $r2_index : int
- Position in $word to stem of $r2
- $rv : string
- $rv is approximately the string after the first vowel in the $word we want to stem
- $rv_index : int
- Position in $word to stem of $rv
- $vowel : string
- Spanish vowels
- segment() : string
- Stub function which could be used for a word segmenter.
- stem() : string
- Computes the stem of a Spanish word
- stopwordsRemover() : mixed
- Removes the stop words from the page (used for Word Cloud generation)
- computeRegions() : mixed
- Computes the three regions of the word, $rv, $r1, and $r2, used in the rest of the stemmer. $rv is defined as follows: if the second letter is a consonant, $rv is the region after the next following vowel; if the first two letters are vowels, $rv is the region after the next consonant; otherwise (the consonant-vowel case), $rv is the region after the third letter. $rv is the end of the word if these positions cannot be found.
- removeAccents() : mixed
- Un-accent end
- step0() : mixed
- Remove attached pronouns
- step1() : mixed
- Standard suffix removal
- step2a() : mixed
- Stem verb suffixes beginning with y
- step2b() : mixed
- Stem other verb suffixes
- step3() : mixed
- Delete residual suffixes
Properties
$no_stem_list
Words we don't want to be stemmed
public
static array<string|int, mixed>
$no_stem_list
= []
$stop_words
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries
public
static mixed
$stop_words
= ["de", "la", "que", "el", "en", "y", "a", "los", "del", "se", "las", "por", "un", "para", "con", "no", "una", "su", "al", "lo", "como", "más", "pero", "sus", "le", "ya", "o", "este", "sí", "porque", "esta", "entre", "cuando", "muy", "sin", "sobre", "también", "me", "hasta", "hay", "donde", "quien", "desde", "todo", "nos", "durante", "todos", "uno", "les", "ni", "contra", "otros", "ese", "eso", "ante", "ellos", "e", "esto", "mí", "antes", "algunos", "qué", "unos", "yo", "otro", "otras", "otra", "él", "tanto", "esa", "estos", "mucho", "quienes", "nada", "muchos", "cual", "poco", "ella", "estar", "estas", "algunas", "algo", "nosotros", "mi", "mis", "tú", "te", "ti", "tu", "tus", "ellas", "nosotras", "vosotros", "vosotras", "os", "mío", "mía", "míos", "mías", "tuyo", "tuya", "tuyos", "tuyas", "suyo", "suya", "suyos", "suyas", "nuestro", "nuestra", "nuestros", "nuestras", "vuestro", "vuestra", "vuestros", "vuestras", "esos", "esas", "estoy", "estás", "está", "estamos", "estáis", "están", "esté", "estés", "estemos", "estéis", "estén", "estaré", "estarás", "estará", "estaremos", "estaréis", "estarán", "estaría", "estarías", "estaríamos", "estaríais", "estarían", "estaba", "estabas", "estábamos", "estabais", "estaban", "estuve", "estuviste", "estuvo", "estuvimos", "estuvisteis", "estuvieron", "estuviera", "estuvieras", "estuviéramos", "estuvierais", "estuvieran", "estuviese", "estuvieses", "estuviésemos", "estuvieseis", "estuviesen", "estando", "estado", "estada", "estados", "estadas", "estad", "he", "has", "ha", "hemos", "habéis", "han", "haya", "hayas", "hayamos", "hayáis", "hayan", "habré", "habrás", "habrá", "habremos", "habréis", "habrán", "habría", "habrías", "habríamos", "habríais", "habrían", "había", "habías", "habíamos", "habíais", 'http', 'https', "habían", "hube", "hubiste", "hubo", "hubimos", "hubisteis", "hubieron", "hubiera", "hubieras", "hubiéramos", "hubierais", "hubieran", "hubiese", "hubieses", "hubiésemos", "hubieseis", "hubiesen", "habiendo", 
"habido", "habida", "habidos", "habidas", "soy", "eres", "es", "somos", "sois", "son", "sea", "seas", "seamos", "seáis", "sean", "seré", "serás", "será", "seremos", "seréis", "serán", "sería", "serías", "seríamos", "seríais", "serían", "era", "eras", "éramos", "erais", "eran", "fui", "fuiste", "fue", "fuimos", "fuisteis", "fueron", "fuera", "fueras", "fuéramos", "fuerais", "fueran", "fuese", "fueses", "fuésemos", "fueseis", "fuesen", "siendo", "sido", "sed", "tengo", "tienes", "tiene", "tenemos", "tenéis", "tienen", "tenga", "tengas", "tengamos", "tengáis", "tengan", "tendré", "tendrás", "tendrá", "tendremos", "tendréis", "tendrán", "tendría", "tendrías", "tendríamos", "tendríais", "tendrían", "tenía", "tenías", "teníamos", "teníais", "tenían", "tuve", "tuviste", "tuvo", "tuvimos", "tuvisteis", "tuvieron", "tuviera", "tuvieras", "tuviéramos", "tuvierais", "tuvieran", "tuviese", "tuvieses", "tuviésemos", "tuvieseis", "tuviesen", "teniendo", "tenido", "tenida", "tenidos", "tenidas", "tened"]
$buffer
Storage used in computing the stem
private
static string
$buffer
= ""
$r1
$r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.
private
static string
$r1
$r1_index
Position in $word to stem of $r1
private
static int
$r1_index
$r2
$r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel
private
static string
$r2
$r2_index
Position in $word to stem of $r2
private
static int
$r2_index
$rv
$rv is approximately the string after the first vowel in the $word we want to stem
private
static string
$rv
$rv_index
Position in $word to stem of $rv
private
static int
$rv_index
$vowel
Spanish vowels
private
static string
$vowel
= 'aeiouáéíóúü'
Methods
segment()
Stub function which could be used for a word segmenter.
public
static segment(string $pre_segment) : string
Given the input "thisisabunchofwords", such a segmenter would output "this is a bunch of words".
Parameters
- $pre_segment : string
- before segmentation
Return values
string: should return the string with words separated by spaces; in this case it does nothing and returns the input unchanged
stem()
Computes the stem of a Spanish word
public
static stem(string $word) : string
Parameters
- $word : string
- the string to stem
Return values
string: the stem of $word
stopwordsRemover()
Removes the stop words from the page (used for Word Cloud generation)
public
static stopwordsRemover(mixed $data) : mixed
Parameters
- $data : mixed
- either a string or an array of strings to remove stop words from
Return values
mixed: $data with the stop words removed
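The behaviour described above (accepting either a string or an array of strings) can be sketched as follows. This is a minimal illustration, not the class's actual implementation: the function name removeStopWords() and its $stop_words parameter are hypothetical, and the sketch assumes lowercase, whitespace-separated input.

```php
<?php
// Illustrative only: removeStopWords() is a hypothetical helper, not the
// documented stopwordsRemover(). It assumes lowercase input words.

/**
 * Remove stop words from a string, or from each string in an array.
 *
 * @param string|array $data text, or array of texts, to filter
 * @param array $stop_words list of words to drop
 * @return string|array $data with the stop words removed
 */
function removeStopWords($data, array $stop_words)
{
    if (is_array($data)) {
        // Apply the same filtering to every element, keeping the keys.
        $out = [];
        foreach ($data as $key => $text) {
            $out[$key] = removeStopWords($text, $stop_words);
        }
        return $out;
    }
    // Flip the list so each membership test is a hash lookup.
    $lookup = array_flip($stop_words);
    $words = preg_split('/\s+/u', trim($data), -1, PREG_SPLIT_NO_EMPTY);
    $kept = [];
    foreach ($words as $word) {
        if (!isset($lookup[$word])) {
            $kept[] = $word;
        }
    }
    return implode(' ', $kept);
}
```

With the stop words "el" and "y", the input "el gato y el perro" becomes "gato perro". Splitting on whitespace keeps the sketch short; punctuation attached to a word would prevent a match.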
computeRegions()
Computes the three regions of the word, $rv, $r1, and $r2, used in the rest of the stemmer. $rv is defined as follows: if the second letter is a consonant, $rv is the region after the next following vowel; if the first two letters are vowels, $rv is the region after the next consonant; otherwise (the consonant-vowel case), $rv is the region after the third letter. $rv is the end of the word if these positions cannot be found.
private
static computeRegions() : mixed
$r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel. $r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel
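The region definitions above can be sketched as a standalone function. This is only an illustration under stated assumptions: esIsVowel(), esRegionAfter() and esRegions() are hypothetical names, the input is assumed to be a lowercase, valid UTF-8 word, and the function returns start offsets rather than setting the class's static properties.

```php
<?php
// Illustrative only: these helpers are hypothetical and are not methods
// of the documented class.

/** True if the single UTF-8 character $c is a Spanish vowel. */
function esIsVowel(string $c): bool
{
    static $vowels = ['a', 'e', 'i', 'o', 'u', 'á', 'é', 'í', 'ó', 'ú', 'ü'];
    return in_array($c, $vowels, true);
}

/**
 * Index just past the first non-vowel that follows a vowel, scanning
 * from $start; the word length if there is no such non-vowel.
 */
function esRegionAfter(array $chars, int $start): int
{
    $n = count($chars);
    for ($i = $start + 1; $i < $n; $i++) {
        if (esIsVowel($chars[$i - 1]) && !esIsVowel($chars[$i])) {
            return $i + 1;
        }
    }
    return $n;
}

/** Start offsets (in characters) of rv, r1 and r2 for a word. */
function esRegions(string $word): array
{
    // Split into UTF-8 characters so an accented vowel is one letter.
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    $r1 = esRegionAfter($chars, 0);
    $r2 = esRegionAfter($chars, $r1); // same rule, applied inside r1
    $rv = $n; // default: end of the word
    if ($n >= 2) {
        if (!esIsVowel($chars[1])) {
            // Second letter is a consonant: after the next vowel.
            for ($i = 2; $i < $n; $i++) {
                if (esIsVowel($chars[$i])) {
                    $rv = $i + 1;
                    break;
                }
            }
        } elseif (esIsVowel($chars[0])) {
            // First two letters are vowels: after the next consonant.
            for ($i = 2; $i < $n; $i++) {
                if (!esIsVowel($chars[$i])) {
                    $rv = $i + 1;
                    break;
                }
            }
        } else {
            // Consonant-vowel case: after the third letter.
            $rv = min(3, $n);
        }
    }
    return ['rv' => $rv, 'r1' => $r1, 'r2' => $r2];
}
```

For "trabajo" this gives rv = 3 ("bajo"), r1 = 4 ("ajo") and r2 = 6 ("o").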
Return values
mixed
removeAccents()
Un-accent end
private
static removeAccents() : mixed
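In the Snowball Spanish algorithm that this stemmer re-implements, the final step removes acute accents from the vowels. A minimal sketch of that idea, assuming this method does the same, is shown here. The function stripAcuteAccents() is a hypothetical name, and it works on a string argument rather than on the class's private $buffer.

```php
<?php
// Illustrative only: stripAcuteAccents() is a hypothetical helper, not
// the documented removeAccents().

/** Replace acute-accented vowels with their unaccented forms. */
function stripAcuteAccents(string $word): string
{
    // strtr() with an array replaces whole UTF-8 byte sequences, so the
    // characters ü and ñ are left untouched.
    return strtr($word, [
        'á' => 'a',
        'é' => 'e',
        'í' => 'i',
        'ó' => 'o',
        'ú' => 'u',
    ]);
}
```

For example, "canción" becomes "cancion", while "pingüino" is returned unchanged.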
Return values
mixed
step0()
Remove attached pronouns
private
static step0() : mixed
Return values
mixed
step1()
Standard suffix removal
private
static step1() : mixed
Return values
mixed
step2a()
Stem verb suffixes beginning with y
private
static step2a() : mixed
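The Snowball description of this step is: find the longest of the suffixes ya, ye, yan, yen, yeron, yendo, yo, yó, yas, yes, yais, yamos that lies in RV, and delete it if it is preceded by u. The sketch below follows that description and may differ from this class's code. The function stripYSuffix() and its $rv_index parameter are hypothetical, and the input is assumed to be a lowercase, valid UTF-8 word.

```php
<?php
// Illustrative only: stripYSuffix() is a hypothetical helper, not the
// documented step2a().

/**
 * Delete a verb suffix beginning with y when it lies in RV and is
 * preceded by u.
 *
 * @param string $word lowercase word to process
 * @param int $rv_index character offset at which RV starts
 * @return string the word with the suffix removed, or unchanged
 */
function stripYSuffix(string $word, int $rv_index): string
{
    // Listed longest first so the first match is the longest one.
    $suffixes = ['yeron', 'yendo', 'yamos', 'yais', 'yan', 'yen',
        'yas', 'yes', 'ya', 'ye', 'yo', 'yó'];
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    foreach ($suffixes as $suffix) {
        $len = count(preg_split('//u', $suffix, -1, PREG_SPLIT_NO_EMPTY));
        $start = $n - $len;
        // The suffix must fit in RV and leave a letter before it.
        if ($start < $rv_index || $start < 1) {
            continue;
        }
        if (implode('', array_slice($chars, $start)) !== $suffix) {
            continue;
        }
        // Longest suffix found: delete it only when preceded by u.
        if ($chars[$start - 1] === 'u') {
            return implode('', array_slice($chars, 0, $start));
        }
        return $word;
    }
    return $word;
}
```

With RV starting at offset 3, "contribuyendo" becomes "contribu", while "playa" is unchanged because its suffix is not preceded by u.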
Return values
mixed
step2b()
Stem other verb suffixes
private
static step2b() : mixed
Return values
mixed
step3()
Delete residual suffixes
private
static step3() : mixed