Yioop_V9.5_Source_Code_Documentation

StochasticTermSegmenter

Class for segmenting terms using Stochastic Finite State Word Segmentation

Tags
author

Xianghong Sun and Chris Pollett (tweaks to adding new language)

Table of Contents

MAX_TERM_LENGTH  = 7
Maximum character length of a term
$dictionary  : array<string|int, mixed>
A dictionary that contains statistical information on terms for a language. A non-empty dictionary should have two fields: N, the number of terms in the dictionary, and dic, a trie built from nested PHP arrays that holds the terms. The leaves of the trie store the frequency counts of the terms.
$dictionary_path  : string
Path on disk where the segmenter dictionary should be stored
$lang  : string
The language currently being used e.g. zh-CN, ja
$named_entity_tagger  : object
Holds $tokenizer's instance of NamedEntityContextTagger
$non_char_preg  : string
Regular expression used to determine whether none of the characters in a term belong to the current language. Recommended expressions: Chinese: \p{Han}; Japanese: \x{4E00}-\x{9FBF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}; Korean: \x{3130}-\x{318F}\x{AC00}-\x{D7AF}
$tokenizer  : object
Holds instance of Tokenizer for $lang language
$unknown_term_score  : float
Default score for any unknown term
$cache  : array<string|int, mixed>
Cache of subtries of the dictionary trie, used to speed up lookups
$cache_pct  : number
Fraction of the whole dictionary trie that may be cached. The value should be between 0 and 1.0; set it to a small number on memory-limited machines. For comparison, in a Chinese segmentation test on the pku dataset, setting it to 0 versus 1 gave a peak memory usage of 26.288MB versus 151.46MB and a running time of 43.803s versus 1.540s. With the default value of 0.06, the time and peak memory were 5.094s and 98.97MB.
__construct()  : mixed
Constructs an instance of this class used for segmenting string with respect to words in a locale using a probabilistic approach to evaluate segmentation possibilities.
add()  : mixed
Adds a (term, frequency) pair to an array based trie
getScore()  : float
Calculates a score for a term based on its frequency versus that of the whole trie.
isException()  : bool
Check if the term passed in is an exception term. Not all valid terms should be indexed.
isPunctuation()  : bool
Check if the term passed in is a punctuation character. isPunctuationImpl should be defined in the constructor if needed.
notCurrentLang()  : bool
Check if all the chars in the term are NOT from the current language
segmentFiles()  : string
Segments the text in a list of files
segmentSentence()  : array<string|int, mixed>
Segments a single sentence into an array of words.
segmentText()  : string
Segments text into terms separated by space
train()  : mixed
Generate a term dictionary file for later segmentation

Constants

Properties

$dictionary

A dictionary that contains statistical information on terms for a language. A non-empty dictionary should have two fields: N, the number of terms in the dictionary, and dic, a trie built from nested PHP arrays that holds the terms. The leaves of the trie store the frequency counts of the terms.

public array<string|int, mixed> $dictionary
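
As a rough illustration of the layout described above, the sketch below builds a tiny dictionary by hand and looks a term up in it. The top-level keys N and dic come from the description; the one-character-per-level nesting, the '$' key used to hold a frequency, and the lookupFrequency helper are assumptions made for this example and may not match how the class actually stores its leaves.

```php
<?php
// Hypothetical dictionary holding two terms: "ab" (frequency 3) and
// "abc" (frequency 1). The '$' leaf marker is an assumption for illustration.
$dictionary = [
    "N" => 2, // number of terms in the dictionary, per the description
    "dic" => [
        "a" => [
            "b" => [
                '$' => 3,          // frequency of "ab"
                "c" => ['$' => 1], // frequency of "abc"
            ],
        ],
    ],
];

/**
 * Walks the nested arrays one character at a time and returns the
 * stored frequency, or null if the term is not in the trie.
 */
function lookupFrequency(array $dictionary, string $term): ?int
{
    $node = $dictionary["dic"];
    // Split on UTF-8 character boundaries rather than bytes
    foreach (preg_split('//u', $term, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        if (!isset($node[$char])) {
            return null;
        }
        $node = $node[$char];
    }
    return $node['$'] ?? null;
}
```

A prefix such as "a" returns null here because no frequency is stored at that node, which is one way a trie can distinguish prefixes from complete terms.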

$dictionary_path

Path on disk where the segmenter dictionary should be stored

public string $dictionary_path

$named_entity_tagger

Holds $tokenizer's instance of NamedEntityContextTagger

public object $named_entity_tagger

$non_char_preg

Regular expression used to determine whether none of the characters in a term belong to the current language. Recommended expressions: Chinese: \p{Han}; Japanese: \x{4E00}-\x{9FBF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}; Korean: \x{3130}-\x{318F}\x{AC00}-\x{D7AF}

public string $non_char_preg

$cache

Cache of subtries of the dictionary trie, used to speed up lookups

private array<string|int, mixed> $cache = []

$cache_pct

Fraction of the whole dictionary trie that may be cached. The value should be between 0 and 1.0; set it to a small number on memory-limited machines. For comparison, in a Chinese segmentation test on the pku dataset, setting it to 0 versus 1 gave a peak memory usage of 26.288MB versus 151.46MB and a running time of 43.803s versus 1.540s. With the default value of 0.06, the time and peak memory were 5.094s and 98.97MB.

private number $cache_pct

from 0 - 1.0

Methods

__construct()

Constructs an instance of this class used for segmenting string with respect to words in a locale using a probabilistic approach to evaluate segmentation possibilities.

public __construct(string $lang[, float $cache_pct = 0.06 ]) : mixed
Parameters
$lang : string

locale this instance will do segmentation for

$cache_pct : float = 0.06

percentage of whole trie that can be cached for faster look-up

Return values
mixed

add()

Adds a (term, frequency) pair to an array based trie

public add(string $term, string $frequency, array<string|int, mixed> &$trie) : mixed
Parameters
$term : string

the term to be inserted

$frequency : string

the frequency to be inserted

$trie : array<string|int, mixed>

array based trie we want to insert the key value pair into

Return values
mixed
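
A minimal sketch of inserting a (term, frequency) pair into an array-based trie follows. It assumes one trie level per UTF-8 character and a '$' key for the stored count; both conventions, and the name trieAdd, are hypothetical. The sketch also takes the frequency as an int and overwrites any existing count, which may differ from what add() does.

```php
<?php
/**
 * Inserts $term into $trie, one character per level, and stores
 * $frequency at the node reached by the last character.
 * The '$' key used for the stored count is a hypothetical convention.
 */
function trieAdd(string $term, int $frequency, array &$trie): void
{
    $node = &$trie;
    foreach (preg_split('//u', $term, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        if (!isset($node[$char])) {
            $node[$char] = []; // create the child level on first use
        }
        $node = &$node[$char]; // descend by rebinding the reference
    }
    $node['$'] = $frequency;
}

$trie = [];
trieAdd("中国", 5, $trie);
trieAdd("中", 2, $trie);
```

Because the trie is passed by reference, as in the documented signature, the caller's array is modified in place and nothing needs to be returned.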

getScore()

Calculates a score for a term based on its frequency versus that of the whole trie.

public getScore(int $frequency) : float
Parameters
$frequency : int

the frequency of a word, as an integer

Return values
float

the score of the term.
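
The documentation does not give the scoring formula. One common choice for this kind of segmenter, shown here purely as an assumption, is the negative logarithm of the term's relative frequency, so that frequent terms get low scores and scores can be summed along a candidate segmentation. The function name termScore and the $total parameter (standing in for the dictionary's N field) are hypothetical.

```php
<?php
/**
 * Hypothetical score: negative log of the term's relative frequency.
 * $total stands in for the dictionary's N field. Lower is better.
 */
function termScore(int $frequency, int $total): float
{
    return -log($frequency / $total);
}
```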

isException()

Check if the term passed in is an exception term. Not all valid terms should be indexed.

public isException(string $term) : bool

For example, there are infinitely many combinations of numbers. isExceptionImpl should be defined in the constructor if needed.

Parameters
$term : string

the string to be checked

Return values
bool

true if $term is an exception term, false otherwise
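
To make the numbers example concrete, here is a sketch of what an exception check could look like if written as a closure. The variable name $is_exception_impl and the rule itself (terms made up only of decimal digits) are assumptions for illustration; the checks actually installed by the constructor may differ per language.

```php
<?php
// Hypothetical implementation: treat terms made up only of digits
// (ASCII or other Unicode decimal digits) as exception terms.
$is_exception_impl = function (string $term): bool {
    return preg_match('/^\p{Nd}+$/u', $term) === 1;
};
```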

isPunctuation()

Check if the term passed in is a punctuation character. isPunctuationImpl should be defined in the constructor if needed.

public isPunctuation(string $term) : bool
Parameters
$term : string

the string to be checked

Return values
bool

true if $term is some kind of punctuation, false otherwise

notCurrentLang()

Check if all the chars in the term are NOT from the current language

public notCurrentLang(string $term) : bool
Parameters
$term : string

the string to be checked

Return values
bool

true if all the chars in $term are NOT from the current language, false otherwise
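
The check described here can be sketched with the character classes recommended for $non_char_preg. The helper below, noCharInLanguage, is hypothetical; it assumes the expression is a character-class body such as \p{Han} and wraps it in brackets itself, which may not be how the class applies $non_char_preg internally.

```php
<?php
/**
 * Returns true when no character of $term matches the character
 * class body $char_class (e.g. '\p{Han}' for Chinese).
 */
function noCharInLanguage(string $term, string $char_class): bool
{
    // With the /u modifier the pattern and subject are treated as UTF-8
    return preg_match('/[' . $char_class . ']/u', $term) === 0;
}

$chinese = '\p{Han}';
$japanese = '\x{4E00}-\x{9FBF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}';
```

A term that mixes scripts, such as "abc中", is not reported as foreign under this definition, since at least one of its characters belongs to the language.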

segmentFiles()

Segments the text in a list of files

public segmentFiles(mixed $text_files[, bool $return_string = false ]) : string
Parameters
$text_files : mixed

can be a file name or a list of file names to be segmented

$return_string : bool = false

if true, return the segmented string; otherwise print it to stdout, in which case the user can redirect it to a file with > filename

Return values
string

segmented words separated by spaces, or true/false

segmentSentence()

Segments a single sentence into an array of words.

public segmentSentence(string $sentence) : array<string|int, mixed>

Must NOT contain any new line characters.

Parameters
$sentence : string

the string to be segmented; it must not contain newline characters

Return values
array<string|int, mixed>

array of segmented words
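
The page does not spell out the segmentation algorithm, so the following is only a generic dynamic-programming sketch of probabilistic segmentation, not the class's implementation. It considers every candidate term of up to MAX_TERM_LENGTH characters ending at each position and keeps the split with the lowest total score. The flat $frequencies map (used instead of a trie for brevity), the negative-log score, the default unknown score of 20.0, and the name segmentSketch are all assumptions.

```php
<?php
const MAX_TERM_LENGTH = 7; // value taken from the constant documented above

/**
 * Splits $sentence into the sequence of terms with the lowest total
 * score, where a known term scores -log(frequency / total).
 * $frequencies is a flat term => count map standing in for the trie.
 */
function segmentSketch(string $sentence, array $frequencies,
    float $unknown_score = 20.0): array
{
    $chars = preg_split('//u', $sentence, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    $total = array_sum($frequencies);
    $best = [0 => 0.0]; // $best[$i]: lowest score for the first $i chars
    $back = [];         // $back[$i]: start index of the last term
    for ($i = 1; $i <= $n; $i++) {
        $best[$i] = INF;
        for ($len = 1; $len <= min(MAX_TERM_LENGTH, $i); $len++) {
            $term = implode("", array_slice($chars, $i - $len, $len));
            if (isset($frequencies[$term])) {
                $score = -log($frequencies[$term] / $total);
            } elseif ($len == 1) {
                $score = $unknown_score; // fallback for unknown characters
            } else {
                continue; // unknown multi-character candidates are skipped
            }
            if ($best[$i - $len] + $score < $best[$i]) {
                $best[$i] = $best[$i - $len] + $score;
                $back[$i] = $i - $len;
            }
        }
    }
    // Follow the back pointers from the end to recover the terms
    $terms = [];
    for ($i = $n; $i > 0; $i = $back[$i]) {
        $start = $back[$i];
        array_unshift($terms,
            implode("", array_slice($chars, $start, $i - $start)));
    }
    return $terms;
}

$frequencies = ["中国" => 10, "中" => 3, "国" => 2, "人" => 5];
$words = segmentSketch("中国人", $frequencies);
```

With these counts the two-character term 中国 scores lower than 中 followed by 国, so the sketch keeps it whole and returns it followed by 人.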

segmentText()

Segments text into terms separated by space

public segmentText(string $text[, string $normalize = false ]) : string
Parameters
$text : string

to be segmented

$normalize : string = false

whether to return the normalized form, e.g. 乾隆 -> 干隆

Return values
string

segmented terms separated by spaces

train()

Generate a term dictionary file for later segmentation

public train(mixed $text_files[, string $format = "default" ]) : mixed
Parameters
$text_files : mixed

a file name or an array of file names to train on; words in the files must be separated by spaces

$format : string = "default"

currently only "default" and "CTB" are supported

Return values
mixed
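
Since the training input is text whose words are already separated by spaces, the core of building a dictionary can be pictured as counting term occurrences. The sketch below does only that for an in-memory string; the function name countTerms and the shape of its return value are hypothetical, and it does not read files, handle the CTB format, or write a dictionary file as train() is described as doing.

```php
<?php
/**
 * Counts how often each whitespace-separated term occurs in $text.
 * Returns a term => count map plus the total number of occurrences.
 */
function countTerms(string $text): array
{
    $terms = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values($terms);
    return ["counts" => $counts, "total" => count($terms)];
}

$stats = countTerms("中国 人 中国 北京\n人 中国");
```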

        
