StochasticTermSegmenter
Class for segmenting terms using Stochastic Finite State Word Segmentation
Table of Contents
- MAX_TERM_LENGTH = 7
- Maximum character length of a term
- $dictionary : array<string|int, mixed>
- A dictionary that contains statistical information on terms for a language. A non-empty dictionary should have two fields: N, the number of terms in the dictionary, and dic, a trie built from nested PHP arrays that implements the dictionary. The leaves of the trie hold frequency counts for the terms stored in the trie.
- $dictionary_path : string
- Path on disk where the segmenter dictionary should be stored
- $lang : string
- The language currently being used, e.g. zh-CN, ja
- $named_entity_tagger : object
- Holds $tokenizer's instance of NamedEntityContextTagger
- $non_char_preg : string
- Regular expression used to determine whether none of the characters in a term belong to the current language. Recommended expressions: Chinese: \p{Han}; Japanese: \x{4E00}-\x{9FBF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}; Korean: \x{3130}-\x{318F}\x{AC00}-\x{D7AF}
- $tokenizer : object
- Holds instance of Tokenizer for $lang language
- $unknown_term_score : float
- Default score for any unknown term
- $cache : array<string|int, mixed>
- Cache of subtries of the dictionary trie, used to speed up lookups
- $cache_pct : number
- Fraction of the dictionary trie that may be cached; the value should be between 0 and 1.0. Set it to a small number on memory-limited machines, at some cost in speed. In a test of Chinese segmentation on the pku dataset, peak memory usage was 26.288MB with a value of 0 versus 151.46MB with a value of 1, while the running time was 43.803s versus 1.540s. With the default value of 0.06, the running time was 5.094s and peak memory was 98.97MB.
- __construct() : mixed
- Constructs an instance of this class, used for segmenting a string with respect to the words in a locale using a probabilistic approach to evaluate segmentation possibilities.
- add() : mixed
- Adds a (term, frequency) pair to an array based trie
- getScore() : float
- Calculates a score for a term based on its frequency versus that of the whole trie.
- isException() : bool
- Check if the term passed in is an exception term. Not all valid terms should be indexed.
- isPunctuation() : bool
- Check if the term passed in is a punctuation character. isPunctuationImpl should be defined in the constructor if needed.
- notCurrentLang() : bool
- Check if all the chars in the term are NOT from the current language
- segmentFiles() : string
- Segments the text in a list of files
- segmentSentence() : array<string|int, mixed>
- Segments a single sentence into an array of words.
- segmentText() : string
- Segments text into terms separated by space
- train() : mixed
- Generate a term dictionary file for later segmentation
Constants
MAX_TERM_LENGTH
Maximum character length of a term
public
mixed
MAX_TERM_LENGTH
= 7
Properties
$dictionary
A dictionary that contains statistical information on terms for a language. A non-empty dictionary should have two fields: N, the number of terms in the dictionary, and dic, a trie built from nested PHP arrays that implements the dictionary. The leaves of the trie hold frequency counts for the terms stored in the trie.
public
array<string|int, mixed>
$dictionary
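The layout described above can be sketched as follows. This is a minimal illustration, not the class's own code: the helper names `sketchTrieAdd` and `sketchTrieLookup` are hypothetical, and the leaf key `'$'` used to hold a frequency count is an assumption, since the documentation only says that leaves carry frequency counts.

```php
<?php
// Hypothetical key under which a node stores a term's frequency (an assumption).
// A term that itself contains this character would collide with it.
const SKETCH_FREQ_KEY = '$';

/** Inserts a (term, frequency) pair into a nested-array trie, one character per level. */
function sketchTrieAdd(string $term, int $frequency, array &$trie): void
{
    $node = &$trie;
    // Split into UTF-8 characters without relying on the mbstring extension.
    foreach (preg_split('//u', $term, -1, PREG_SPLIT_NO_EMPTY) ?: [] as $char) {
        if (!isset($node[$char])) {
            $node[$char] = [];
        }
        $node = &$node[$char];
    }
    // Accumulate the count if the term was already present.
    $node[SKETCH_FREQ_KEY] = ($node[SKETCH_FREQ_KEY] ?? 0) + $frequency;
}

/** Returns the stored frequency of a term, or null if the term is absent. */
function sketchTrieLookup(string $term, array $trie): ?int
{
    $node = $trie;
    foreach (preg_split('//u', $term, -1, PREG_SPLIT_NO_EMPTY) ?: [] as $char) {
        if (!isset($node[$char])) {
            return null;
        }
        $node = $node[$char];
    }
    return $node[SKETCH_FREQ_KEY] ?? null;
}

// Build a tiny dictionary with the two documented fields, N and dic.
// Here N counts distinct inserted terms, which is one reading of the description.
$dictionary = ['N' => 0, 'dic' => []];
foreach ([['北京', 5], ['北京大学', 2], ['大学', 3]] as [$term, $count]) {
    sketchTrieAdd($term, $count, $dictionary['dic']);
    $dictionary['N']++;
}
```

Sharing prefixes this way means 北京 and 北京大学 reuse the same two top levels of the array, which is what makes longest-match style lookups cheap.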
$dictionary_path
Path on disk where the segmenter dictionary should be stored
public
string
$dictionary_path
$lang
The language currently being used, e.g. zh-CN, ja
public
string
$lang
$named_entity_tagger
Holds $tokenizer's instance of NamedEntityContextTagger
public
object
$named_entity_tagger
$non_char_preg
Regular expression used to determine whether none of the characters in a term belong to the current language. Recommended expressions: Chinese: \p{Han}; Japanese: \x{4E00}-\x{9FBF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}; Korean: \x{3130}-\x{318F}\x{AC00}-\x{D7AF}
public
string
$non_char_preg
$tokenizer
Holds instance of Tokenizer for $lang language
public
object
$tokenizer
$unknown_term_score
Default score for any unknown term
public
float
$unknown_term_score
$cache
Cache of subtries of the dictionary trie, used to speed up lookups
private
array<string|int, mixed>
$cache
= []
$cache_pct
Fraction of the dictionary trie that may be cached; the value should be between 0 and 1.0. Set it to a small number on memory-limited machines, at some cost in speed. In a test of Chinese segmentation on the pku dataset, peak memory usage was 26.288MB with a value of 0 versus 151.46MB with a value of 1, while the running time was 43.803s versus 1.540s. With the default value of 0.06, the running time was 5.094s and peak memory was 98.97MB.
private
number
$cache_pct
from 0 - 1.0
Methods
__construct()
Constructs an instance of this class, used for segmenting a string with respect to the words in a locale using a probabilistic approach to evaluate segmentation possibilities.
public
__construct(string $lang[, float $cache_pct = 0.06 ]) : mixed
Parameters
- $lang : string
-
locale this instance will do segmentation for
- $cache_pct : float = 0.06
-
percentage of whole trie that can be cached for faster look-up
Return values
mixed
add()
Adds a (term, frequency) pair to an array based trie
public
add(string $term, string $frequency, array<string|int, mixed> &$trie) : mixed
Parameters
- $term : string
-
the term to be inserted
- $frequency : string
-
the frequency to be inserted
- $trie : array<string|int, mixed>
-
array based trie we want to insert the key value pair into
Return values
mixed
getScore()
Calculates a score for a term based on its frequency versus that of the whole trie.
public
getScore(int $frequency) : float
Parameters
- $frequency : int
-
the frequency of a word, as an integer
Return values
float —the score of the term.
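The documentation does not state the scoring formula. A common choice in stochastic segmentation, shown here purely as an assumption, is the negative log of the term's relative frequency, so that frequent terms get a lower (better) cost. The function name `sketchScore` and its second parameter, which stands in for the dictionary's N field, are hypothetical.

```php
<?php
/**
 * Assumed scoring rule: negative log relative frequency.
 * $totalTerms plays the role of the dictionary's N field.
 */
function sketchScore(int $frequency, int $totalTerms): float
{
    if ($frequency <= 0 || $totalTerms <= 0) {
        throw new InvalidArgumentException('counts must be positive');
    }
    return -log($frequency / $totalTerms);
}
```

Under this convention scores add up along a candidate segmentation, and the segmentation with the smallest total is the most probable one.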
isException()
Check if the term passed in is an exception term. Not all valid terms should be indexed.
public
isException(string $term) : bool
For example, there are infinitely many possible combinations of digits. isExceptionImpl should be defined in the constructor if needed.
Parameters
- $term : string
-
the string to be checked
Return values
bool —true if $term is an exception term, false otherwise
isPunctuation()
Check if the term passed in is a punctuation character. isPunctuationImpl should be defined in the constructor if needed.
public
isPunctuation(string $term) : bool
Parameters
- $term : string
-
the string to be checked
Return values
bool —true if $term is some kind of punctuation, false otherwise
notCurrentLang()
Check if all the chars in the term are NOT from the current language
public
notCurrentLang(string $term) : bool
Parameters
- $term : string
-
the string to be checked
Return values
bool —true if all the chars in $term are NOT from the current language, false otherwise
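One way such a check could be written with the recommended character ranges is sketched below. It assumes the range string is placed inside a negated character class; how the class actually builds its pattern from $non_char_preg is not documented here, and `sketchNotCurrentLang` is a hypothetical name.

```php
<?php
/**
 * Returns true when every character of $term lies outside $charClass,
 * where $charClass is a range string such as '\p{Han}'.
 */
function sketchNotCurrentLang(string $term, string $charClass): bool
{
    // [^...] negates the class; the u modifier makes PCRE treat input as UTF-8.
    return preg_match('/^[^' . $charClass . ']+$/u', $term) === 1;
}

// Ranges copied from the recommendations for $non_char_preg.
$chinese = '\p{Han}';
$japanese = '\x{4E00}-\x{9FBF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}';
```

A term that mixes scripts, such as abc北, is not reported as foreign, because at least one of its characters belongs to the language.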
segmentFiles()
Segments the text in a list of files
public
segmentFiles(mixed $text_files[, bool $return_string = false ]) : string
Parameters
- $text_files : mixed
-
can be a file name or a list of file names to be segmented
- $return_string : bool = false
-
return the segmented string if true; otherwise print it to stdout, where the user can redirect it to a file with > filename
Return values
string —segmented words separated by spaces, or true/false
segmentSentence()
Segments a single sentence into an array of words.
public
segmentSentence(string $sentence) : array<string|int, mixed>
The sentence must NOT contain any newline characters.
Parameters
- $sentence : string
-
the string to be segmented; it must not contain newlines
Return values
array<string|int, mixed> —of segmented words
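The general technique behind probabilistic segmentation can be sketched with dynamic programming over character positions. This is an illustration under stated assumptions, not this class's implementation: it uses a flat term-to-frequency map instead of the trie, scores known terms by negative log relative frequency, and charges a fixed cost for unknown single characters (mirroring the role of $unknown_term_score). Only the limit of 7 characters per term comes from MAX_TERM_LENGTH; the names `sketchSegment` and `SKETCH_MAX_TERM_LENGTH` are hypothetical.

```php
<?php
const SKETCH_MAX_TERM_LENGTH = 7; // mirrors MAX_TERM_LENGTH

/**
 * Splits $sentence into the lowest-cost sequence of terms.
 * $frequencies maps term => count; $unknownScore is the cost of an unknown character.
 */
function sketchSegment(string $sentence, array $frequencies, float $unknownScore): array
{
    $chars = preg_split('//u', $sentence, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $n = count($chars);
    $total = array_sum($frequencies);
    $best = [0 => 0.0]; // $best[$i] = minimal cost of segmenting the first $i characters
    $back = [];         // $back[$i] = start index of the last term in that segmentation
    for ($end = 1; $end <= $n; $end++) {
        $best[$end] = INF;
        $maxLen = min(SKETCH_MAX_TERM_LENGTH, $end);
        for ($len = 1; $len <= $maxLen; $len++) {
            $start = $end - $len;
            $term = implode('', array_slice($chars, $start, $len));
            if (isset($frequencies[$term])) {
                $cost = -log($frequencies[$term] / $total);
            } elseif ($len === 1) {
                $cost = $unknownScore; // unknown single character
            } else {
                continue; // unknown multi-character strings are not candidates
            }
            if ($best[$start] + $cost < $best[$end]) {
                $best[$end] = $best[$start] + $cost;
                $back[$end] = $start;
            }
        }
    }
    // Walk the back-pointers from the end of the sentence to recover the terms.
    $terms = [];
    for ($end = $n; $end > 0; $end = $back[$end]) {
        $start = $back[$end];
        array_unshift($terms, implode('', array_slice($chars, $start, $end - $start)));
    }
    return $terms;
}
```

Because every single character always has a finite cost, each position is reachable and the back-pointer walk always terminates.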
segmentText()
Segments text into terms separated by space
public
segmentText(string $text[, string $normalize = false ]) : string
Parameters
- $text : string
-
to be segmented
- $normalize : string = false
-
whether to return the normalized form, e.g. 乾隆 -> 干隆
Return values
string —segmented terms with space
train()
Generate a term dictionary file for later segmentation
public
train(mixed $text_files[, string $format = "default" ]) : mixed
Parameters
- $text_files : mixed
-
a file name or an array of file names to train on; words in the files must be separated by spaces
- $format : string = "default"
-
the format of the training files; currently only default and CTB are supported
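The counting step that training on space-segmented text implies can be sketched as below. It is a simplified illustration: it takes lines that are already in memory instead of reading files, ignores the CTB format, and returns a flat map in place of the trie. The name `sketchCountTerms` is hypothetical.

```php
<?php
/**
 * Counts how often each whitespace-separated term occurs in the given lines.
 * Returns an array mapping term => frequency.
 */
function sketchCountTerms(array $lines): array
{
    $frequencies = [];
    foreach ($lines as $line) {
        // Terms are expected to be pre-segmented by spaces.
        $terms = preg_split('/\s+/u', trim($line), -1, PREG_SPLIT_NO_EMPTY) ?: [];
        foreach ($terms as $term) {
            $frequencies[$term] = ($frequencies[$term] ?? 0) + 1;
        }
    }
    return $frequencies;
}
```

Each resulting (term, frequency) pair is the kind of input that add() is documented to insert into the trie.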