Yioop_V9.5_Source_Code

StochasticTermSegmenter
in package Application

Class for segmenting terms using Stochastic Finite State Word Segmentation

MAX_TERM_LENGTH

Maximum character length of a term


    public mixed MAX_TERM_LENGTH = 7

$dictionary

A dictionary that contains statistical information on terms for a language. A non-empty dictionary should have two fields: N, the number of terms in the dictionary; and dic, a trie, implemented with nested PHP arrays, that stores the terms. The leaves of the trie hold the frequency counts of the stored terms.


    public array<string|int, mixed> $dictionary
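
To make the shape concrete, here is a minimal sketch of such a dictionary written out by hand. The `'$'` key used to mark a leaf that holds a frequency count and the helper `lookupFrequency()` are assumptions made for illustration; the keys the class actually uses may differ.

```php
<?php
// Hypothetical dictionary holding two terms: "中" (frequency 2) and "中国" (frequency 5).
// The '$' leaf marker is an assumption of this sketch, not taken from the class.
$dictionary = [
    "N" => 2,
    "dic" => [
        "中" => [
            '$' => 2,
            "国" => ['$' => 5],
        ],
    ],
];

/**
 * Walks the trie one character at a time and returns the frequency
 * stored for $term, or null if the term is not in the trie.
 */
function lookupFrequency(array $trie, string $term): ?int
{
    $node = $trie;
    // Split on UTF-8 character boundaries rather than bytes.
    foreach (preg_split('//u', $term, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        if (!isset($node[$char]) || !is_array($node[$char])) {
            return null;
        }
        $node = $node[$char];
    }
    return $node['$'] ?? null;
}

var_dump(lookupFrequency($dictionary["dic"], "中国"));
var_dump(lookupFrequency($dictionary["dic"], "国"));
```

Each character of a term selects one level of nesting, so terms that share a prefix share the same path through the arrays.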

$dictionary_path

Path on disk where the segmenter dictionary is stored


    public string $dictionary_path

$lang

The language currently being used, e.g., zh-CN, ja


    public string $lang

$named_entity_tagger

Holds $tokenizer's instance of NamedEntityContextTagger


    public object $named_entity_tagger

$non_char_preg

Regular expression used to determine whether none of the characters in a term belong to the current language. Recommended expressions:

Chinese: \p{Han}
Japanese: \x{4E00}-\x{9FBF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}
Korean: \x{3130}-\x{318F}\x{AC00}-\x{D7AF}


    public string $non_char_preg

$tokenizer

Holds instance of Tokenizer for $lang language


    public object $tokenizer

$unknown_term_score

Default score for any unknown term


    public float $unknown_term_score

$cache

Cache of subtries of the dictionary trie, used to speed up lookups


    private array<string|int, mixed> $cache = []

$cache_pct

Fraction of the dictionary trie that may be cached. The value should be between 0 and 1.0; set it to a small number when running on memory-limited machines. The trade-off is memory against speed. In a test of Chinese segmentation on the pku dataset:

cache_pct = 0: peak memory 26.288MB, time 43.803s
cache_pct = 0.06 (default): peak memory 98.97MB, time 5.094s
cache_pct = 1: peak memory 151.46MB, time 1.540s


    private number $cache_pct

__construct()

Constructs an instance of this class used for segmenting string with respect to words in a locale using a probabilistic approach to evaluate segmentation possibilities.


    public __construct(string $lang[, float $cache_pct = 0.06 ]) : mixed

Parameters

$lang : string: locale this instance will do segmentation for
$cache_pct : float = 0.06: percentage of whole trie that can be cached for faster look-up

Return values

mixed —

add()

Adds a (term, frequency) pair to an array-based trie


    public add(string $term, string $frequency, array<string|int, mixed> &$trie) : mixed

Parameters

$term : string: the term to be inserted
$frequency : string: the frequency to be inserted
$trie : array<string|int, mixed>: array-based trie into which the (term, frequency) pair is inserted

Return values

mixed —
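
As an illustration of inserting into an array-based trie, the sketch below creates one nested array per character and stores the count at the end of the path. The function name `addTerm()`, the `'$'` leaf key, the choice to accumulate repeated counts rather than overwrite them, and the `int` type for the frequency are all assumptions of this sketch, not details confirmed by the class.

```php
<?php
/**
 * Inserts $term into $trie one character per level and adds $frequency
 * to the count stored at the end of the path.
 * The '$' leaf key is an assumption made for this sketch.
 */
function addTerm(string $term, int $frequency, array &$trie): void
{
    $node = &$trie;
    foreach (preg_split('//u', $term, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        if (!isset($node[$char])) {
            $node[$char] = [];
        }
        // Rebind the reference so the next character is added one level deeper.
        $node = &$node[$char];
    }
    $node['$'] = ($node['$'] ?? 0) + $frequency;
}

$trie = [];
addTerm("中国", 5, $trie);
addTerm("中", 2, $trie);
addTerm("中国", 1, $trie);
print_r($trie);
```

Passing the trie by reference, as the documented signature does, lets the caller build up one shared structure across many calls.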

getScore()

Calculates a score for a term based on its frequency versus that of the whole trie.


    public getScore(int $frequency) : float

Parameters

$frequency : int: the frequency of a word

Return values

float —

the score of the term.
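
The formula is not given here. One common choice in stochastic segmentation is the negative log of the relative frequency, so that rarer terms cost more and the costs of consecutive terms can be added. The sketch below assumes that formula; the function name `score()` and the `$total` argument (standing in for the dictionary's N field) are hypothetical.

```php
<?php
/**
 * Assumed scoring rule: negative log of the term's relative frequency.
 * $total stands in for the dictionary-wide count; both names are
 * hypothetical and not part of the documented class.
 */
function score(int $frequency, int $total): float
{
    return -log($frequency / $total);
}

printf("%.4f\n", score(10, 100));
printf("%.4f\n", score(1, 100));
```

Under this rule a lower score means a more probable term, so a segmenter would look for the split with the smallest total.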

isException()

Checks whether the term passed in is an exception term. Not all valid terms should be indexed; for example, there are infinitely many combinations of numbers. isExceptionImpl should be defined in the constructor if needed.


    public isException(string $term) : bool

Parameters

$term : string: the string to be checked

Return values

bool —

true if $term is an exception term, false otherwise

isPunctuation()

Checks whether the term passed in is a punctuation character. isPunctuationImpl should be defined in the constructor if needed.


    public isPunctuation(string $term) : bool

Parameters

$term : string: the string to be checked

Return values

bool —

true if $term is some kind of punctuation, false otherwise

notCurrentLang()

Check if all the chars in the term are NOT from the current language


    public notCurrentLang(string $term) : bool

Parameters

$term : string: the string to be checked

Return values

bool —

true if all the chars in $term are NOT from the current language, false otherwise
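
A check of this kind can be sketched with a negated character class built from an expression such as the ones recommended for $non_char_preg. How the class actually assembles its pattern is not shown in this page, so the pattern below and the function name `noneInLanguage()` are assumptions.

```php
<?php
/**
 * Returns true when no character of $term falls in $char_class,
 * e.g. '\p{Han}' for Chinese. The way the pattern is assembled here
 * is an assumption of this sketch.
 */
function noneInLanguage(string $term, string $char_class): bool
{
    // [^...] negates the class; the u modifier enables UTF-8 and \p{..}.
    return preg_match('/^[^' . $char_class . ']+$/u', $term) === 1;
}

var_dump(noneInLanguage("hello", '\p{Han}'));   // bool(true)
var_dump(noneInLanguage("你好", '\p{Han}'));     // bool(false)
var_dump(noneInLanguage("hello你", '\p{Han}'));  // bool(false)
```

A term mixing scripts returns false here, because the check only succeeds when every character lies outside the language's range.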

segmentFiles()

Segments the text in a list of files


    public segmentFiles(mixed $text_files[, bool $return_string = false ]) : string

Parameters

$text_files : mixed: a file name or a list of file names to be segmented
$return_string : bool = false: if true, return the segmented string; otherwise print it to stdout, where the user can redirect it to a file with > filename

Return values

string —

the segmented words separated by spaces, or true/false

segmentSentence()

Segments a single sentence into an array of words. The sentence must NOT contain any newline characters.


    public segmentSentence(string $sentence) : array<string|int, mixed>

Parameters

$sentence : string: a string without newlines to be segmented

Return values

array<string|int, mixed> —

array of segmented words
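
The general idea of probabilistic segmentation can be sketched as a dynamic program that picks the split with the lowest total cost. This is a simplified illustration, not the class's implementation: it assumes a flat term => frequency array instead of the trie, a negative log relative frequency cost, and a made-up cost of 20.0 for unknown single characters. It also ignores named entities, punctuation and exception terms. The 7-character limit mirrors MAX_TERM_LENGTH.

```php
<?php
/**
 * Returns the lowest-cost split of $sentence into terms.
 * A known term costs -log(frequency / total); an unknown single
 * character costs $unknown_score (a hypothetical default).
 */
function segment(string $sentence, array $freq, float $unknown_score = 20.0, int $max_len = 7): array
{
    $chars = preg_split('//u', $sentence, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    $total = array_sum($freq);
    $best = [0 => 0.0]; // $best[$i]: lowest cost of segmenting the first $i characters
    $back = [];         // $back[$i]: start index of the last term in that segmentation
    for ($end = 1; $end <= $n; $end++) {
        $best[$end] = INF;
        for ($start = max(0, $end - $max_len); $start < $end; $start++) {
            $term = implode('', array_slice($chars, $start, $end - $start));
            if (isset($freq[$term])) {
                $cost = -log($freq[$term] / $total);
            } elseif ($end - $start === 1) {
                $cost = $unknown_score;
            } else {
                continue; // unknown multi-character strings are not candidates
            }
            if ($best[$start] + $cost < $best[$end]) {
                $best[$end] = $best[$start] + $cost;
                $back[$end] = $start;
            }
        }
    }
    // Follow the back pointers from the end of the sentence to recover the terms.
    $terms = [];
    for ($end = $n; $end > 0; $end = $back[$end]) {
        array_unshift($terms, implode('', array_slice($chars, $back[$end], $end - $back[$end])));
    }
    return $terms;
}

// Made-up frequencies, for illustration only.
$freq = ["北京" => 50, "大学" => 40, "北京大学" => 30, "生" => 10, "大学生" => 20, "学生" => 25];
print_r(segment("北京大学生", $freq));
```

With these made-up counts the split 北京 / 大学生 has a lower total cost than 北京大学 / 生, so the function returns the former.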

segmentText()

Segments text into terms separated by spaces


    public segmentText(string $text[, bool $normalize = false ]) : string

Parameters

$text : string: the text to be segmented
$normalize : bool = false: whether to return the normalized form, e.g. 乾隆 -> 干隆

Return values

string —

the segmented terms separated by spaces

train()

Generates a term dictionary file for later segmentation


    public train(mixed $text_files[, string $format = "default" ]) : mixed

Parameters

$text_files : mixed: a file name or an array of file names to train on; words in the files must be separated by spaces
$format : string = "default": only "default" and "CTB" are currently supported

Return values

mixed —
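
Since the training input consists of words already separated by spaces, the counting step at the heart of training might look like the sketch below. It works on a string rather than on files, and it does not cover the CTB format or writing the dictionary file; the function name `countTerms()` is hypothetical.

```php
<?php
/**
 * Counts how often each whitespace-separated term occurs in
 * pre-segmented text. A hypothetical helper, not part of the class.
 */
function countTerms(string $segmented_text): array
{
    $terms = preg_split('/\s+/u', $segmented_text, -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values($terms);
}

print_r(countTerms("北京 大学 北京\n学生"));
```

Counts gathered this way are the kind of (term, frequency) pairs that add() inserts into the trie.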
