Yioop_V9.5_Source_Code

PartOfSpeechContextTagger extends ContextTagger in package Application

Machine learning based Part of Speech tagger.

A PartOfSpeechContextTagger can be used to train a tagger for a language according to some dataset. Once training is complete it can be used to predict the tags for terms in a string or array of terms.

MIN_ENTROPY_CHANGE

Minimum amount by which the entropy must decrease between epochs; if it decreases by less, training stops


    public
        mixed
    MIN_ENTROPY_CHANGE
    = 1.0E-6
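As a rough illustration of how such a threshold can end training early, the sketch below stops a loop once the drop in entropy between consecutive epochs falls below it. The function `epochsUntilConverged` and the list of per-epoch entropies are hypothetical and not part of Yioop; only the default value 1.0E-6 comes from the constant documented here.

```php
<?php
/**
 * Returns how many epochs run before the entropy stops decreasing
 * by at least $min_change. $entropies simulates the entropy measured
 * after each epoch (hypothetical input, for illustration only).
 */
function epochsUntilConverged(array $entropies, float $min_change = 1.0E-6): int
{
    $previous = INF;
    $epochs = 0;
    foreach ($entropies as $entropy) {
        $epochs++;
        // Stop once the improvement over the previous epoch is too small
        if ($previous - $entropy < $min_change) {
            break;
        }
        $previous = $entropy;
    }
    return $epochs;
}

echo epochsUntilConverged([2.0, 1.5, 1.2, 1.1999999, 1.0]), "\n";
```

In this toy run the fourth epoch improves on the third by only about 1.0E-7, so the loop ends there and the fifth value is never examined.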

$bias

The bias vector for features we are training


    public
        array<string|int, mixed>
    $bias

Determined during training

$lang

Locale tag of language this recognizer is for


    public
        string
    $lang

$max_w

Maximum allowed value for a weight component


    public
        float
    $max_w

$min_w

Minimum allowed value for a weight component


    public
        float
    $min_w

$tag_feature

The weights for features involving the two tags preceding the current word, whose tag we are trying to determine. Determined during training


    public
        array<string|int, mixed>
    $tag_feature

$tag_set

Array of the possible tags for a term, stored as [tag => tag index]


    public
        array<string|int, mixed>
    $tag_set

$tagger_file

The name of the file where the tagging model should be stored and read from


    public
        string
    $tagger_file
     = "tagger.txt.gz"

$tagger_path

Complete file system path to the file where the tagging model should be stored and read from


    public
        string
    $tagger_path
     = ""

$tokenizer

Tokenizer for the language this tagger tags for


    public
        Tokenizer
    $tokenizer

$word_feature

2D weights for features involving the two words before the current word and the two words after it. For a given word position, there is a vector that gives the weight of each term in the complete training term set, unknown term set, and rule based tag term set. Determined during training


    public
        array<string|int, mixed>
    $word_feature

__construct()

Constructor for the part of speech tagger.


    public
                    __construct(string $lang) : mixed

Sets the language this tagger tags for and sets up the path for where it should be stored

Parameters

$lang : string: locale tag of the language this tagger is for

Return values

mixed —

getB()

Get the bias value for a tag


    public
                    getB(int $tag_index) : float

Parameters

$tag_index : int: the index of tag's value within the bias string

Return values

float —

bias value for tag

getIndex()

Given a sentence (array $terms), find the key for the term at position $index


    public
                    getIndex(int $index, array<string|int, mixed> $terms) : mixed

Parameters

$index : int: position of term to get key for
$terms : array<string|int, mixed>: an array of terms typically from and in the order of a sentence

Return values

mixed —

key position in the word_feature weights and bias arrays. It could be an int, the term itself, or the simple rule based part of speech the term belongs to

getKey()

Maps a term to a corresponding key if the term matches some simple pattern such as being a number


    public
                    getKey(string $term) : mixed

Parameters

$term : string: is the term to be checked

Return values

mixed —

either the int key for those matrices or just the term itself if the tokenizer does not have the method getPosKey for the current language
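For intuition, the sketch below maps terms that match a simple pattern to one shared key and leaves every other term unchanged. The function `simplePatternKey` and the key name `NUM` are hypothetical; which patterns and keys Yioop actually uses depends on the tokenizer's getPosKey method and is not stated on this page.

```php
<?php
/**
 * Hypothetical illustration: numeric terms share one key, so a model
 * need not learn a separate weight for every number it encounters.
 */
function simplePatternKey(string $term): string
{
    // Matches integers and decimals such as "42" or "3.14"
    if (preg_match('/^\d+(\.\d+)?$/', $term) === 1) {
        return "NUM";
    }
    // No pattern matched, so the term serves as its own key
    return $term;
}

echo simplePatternKey("42"), " ", simplePatternKey("中国"), "\n";
```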

getT()

Get the tag feature value for tag


    public
                    getT(int $key, int $tag_index) : float

Parameters

$key : int: key in the tag_feature set corresponding to a part of speech
$tag_index : int: the index of tag's value within the tag feature string

Return values

float —

tag feature value for tag

getW()

Get the weight value for term at position for tag


    public
                    getW(string $term, int $position, int $tag_index) : float

Parameters

$term : string: to get weight of
$position : int: of term within the current 5-gram
$tag_index : int: index of the particular tag we are trying to see the term's weight for

Return values

float —

loadWeights()

Load the trained data from disk


    public
                    loadWeights([bool $for_training = false ]) : mixed

Parameters

$for_training : bool = false: whether we are continuing to train (true) or whether we are using the loaded data for prediction

Return values

mixed —

packB()

Pack the bias vector represented as an array into a string


    public
                    packB() : string

Return values

string —

the bias vector packed as a string

packT()

Pack the tag_feature represented as an array into a string


    public
                    packT(int $key) : string

Parameters

$key : int: key in the tag_feature set corresponding to a part of speech

Return values

string —

packed tag_feature vector

packW()

Pack the weights matrix to a string for a particular part of speech key


    public
                    packW(int $key) : string

Parameters

$key : int: index corresponding to a part of speech according to $this->tag_set

Return values

string —

the packed weights matrix

predict()

Predicts the part of speech tag for each term in a sentence


    public
                    predict(mixed $sentence) : array<string|int, mixed>

Parameters

$sentence : mixed: an array of segmented words/terms or a string with words/terms separated by spaces

Return values

array<string|int, mixed> —

of tags for these terms

processTexts()

Converts training data from the format tagged sentence with terms of the form term_tag into a pair of arrays [[terms_in_sentence], [tags_in_sentence]]


    public
            static        processTexts(mixed $text_files[, string $term_tag_separator = "_" ][, function $term_callback = null ][, function $tag_callback = null ][, bool $tag_on_array_chars = false ]) : array<string|int, mixed>

Parameters

$text_files : mixed: can be a file or an array of file names
$term_tag_separator : string = "_": separator used to separate term and tag for terms in input sentence
$term_callback : function = null: callback function applied to a term before adding term to sentence term array
$tag_callback : function = null: callback function applied to a part of speech tag before adding tag to sentence tag array
$tag_on_array_chars : bool = false: for some kinds of text processing it is better to assume the tags are applied to each char within a term rather than at the term level. For example, we might want to use the chars within a term for named entity tagging. If true, this flag says to do this; otherwise tags are applied at the term level

Return values

array<string|int, mixed> —

of separated sentences, each sentence having the format [[terms...], [tags...]]. Currently, the training data needs to be in Chinese Treebank format: a term followed by an underscore followed by the tag, e.g. "新_VA 的_DEC 南斯拉夫_NR 会国_NN". Adapting to other languages requires some modifications
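The conversion described above can be sketched for a single sentence as follows. The function `splitTaggedSentence` is a hypothetical stand-in: it does not read files or apply the term and tag callbacks, and splitting at the last separator is an assumption made here so that a term may itself contain the separator.

```php
<?php
/**
 * Splits one tagged sentence such as "新_VA 的_DEC" into
 * [[terms...], [tags...]]. Illustration only, not Yioop's code.
 */
function splitTaggedSentence(string $sentence, string $term_tag_separator = "_"): array
{
    $terms = [];
    $tags = [];
    $pairs = preg_split('/\s+/u', trim($sentence), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($pairs as $pair) {
        // Split at the last separator so a term may contain the separator
        $pos = strrpos($pair, $term_tag_separator);
        if ($pos === false) {
            continue; // skip tokens that carry no tag
        }
        $terms[] = substr($pair, 0, $pos);
        $tags[] = substr($pair, $pos + strlen($term_tag_separator));
    }
    return [$terms, $tags];
}

print_r(splitTaggedSentence("新_VA 的_DEC 南斯拉夫_NR"));
```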

saveWeights()

Save the trained weights to disk


    public
                    saveWeights() : mixed

Return values

mixed —

setB()

Set the bias value for tag


    public
                    setB(int $tag_index, float $value) : mixed

Parameters

$tag_index : int: the index of tag's value within the bias string
$value : float: bias value to associate to tag

Return values

mixed —

tag()

Tags a sequence of strings according to this tagger's predict method returning the tagged result as a string.


    public
                    tag(string $text[, string $tag_separator = "_" ]) : string

This function is mainly used to facilitate unit testing of taggers.

Parameters

$text : string: to be tagged
$tag_separator : string = "_": terms in the output string will be the terms from the input texts followed by $tag_separator followed by their tag. So if $tag_separator == "_", then a term 中国 in the input texts might be 中国_NR in the output string

Return values

string —

single string where terms in the input texts have been tagged. For example output might look like: 中国_NR 人民_NN 将_AD 满怀信心_VV 地_DEV 开创_VV 新_VA 的_DEC 业绩_NN 。_PU
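The output format can be illustrated with a small stand-alone sketch. The function `tagWith` and the lookup-table predictor are hypothetical; the sketch also assumes the input is already separated by spaces, whereas the real method relies on the tagger's own predict method and tokenizer.

```php
<?php
/**
 * Joins terms and their tags into "term<separator>tag" tokens.
 * $predict is a hypothetical stand-in for a tagger's predict() method:
 * it receives an array of terms and returns an array of tags.
 */
function tagWith(callable $predict, string $text, string $tag_separator = "_"): string
{
    $terms = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $tags = $predict($terms);
    $tagged = [];
    foreach ($terms as $i => $term) {
        $tagged[] = $term . $tag_separator . $tags[$i];
    }
    return implode(" ", $tagged);
}

// Toy predictor backed by a lookup table instead of a trained model
$lookup = ["中国" => "NR", "人民" => "NN"];
$toy_predict = function (array $terms) use ($lookup): array {
    return array_map(function ($term) use ($lookup) {
        return $lookup[$term] ?? "NN";
    }, $terms);
};

echo tagWith($toy_predict, "中国 人民"), "\n";
```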

train()

Uses text files containing sentences to create a matrix so that, given a term together with the two terms before it and the two terms after it, the odds of each of its possible parts of speech can be calculated.


    public
                    train(mixed $text_files[, string $term_tag_separator = "-" ][, float $learning_rate = 0.1 ][, mixed $num_epochs = 1200 ][, function $term_callback = null ][, function $tag_callback = null ][, bool $resume = false ]) : mixed

Parameters

$text_files : mixed: with training data. These can be a file or an array of file names. For now these files are assumed to be in Chinese Treebank format.
$term_tag_separator : string = "-": separator used to separate term and tag for terms in input sentence
$learning_rate : float = 0.1: learning rate when cycling over data trying to minimize the cross-entropy loss in the prediction of the tag of the middle term.
$num_epochs : mixed = 1200
$term_callback : function = null: callback function applied to a term before adding term to sentence term array as part of processing and training with a sentence.
$tag_callback : function = null: callback function applied to a part of speech tag before adding tag to sentence tag array as part of processing and training with a sentence.
$resume : bool = false: if true, read the weight file and continue training; if false, start from the beginning

Return values

mixed —
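To make the role of the learning rate and the cross-entropy loss concrete, here is a minimal sketch of one gradient-descent step on the scores of three tags. The functions `softmax`, `crossEntropy`, and `gradientStep` are illustrative names introduced here; how train() actually organizes its weights and updates is not described on this page, so treat this as the generic technique rather than Yioop's implementation.

```php
<?php
/** Converts raw tag scores into probabilities that sum to 1. */
function softmax(array $scores): array
{
    $max = max($scores); // subtracting the max improves numerical stability
    $exps = array_map(function ($s) use ($max) {
        return exp($s - $max);
    }, $scores);
    $sum = array_sum($exps);
    return array_map(function ($e) use ($sum) {
        return $e / $sum;
    }, $exps);
}

/** Cross-entropy loss when the correct tag has index $gold. */
function crossEntropy(array $probs, int $gold): float
{
    return -log($probs[$gold]);
}

/**
 * One gradient-descent step on the scores. For softmax with
 * cross-entropy, the gradient for tag k is p_k - 1 when k is the
 * gold tag and p_k otherwise.
 */
function gradientStep(array $scores, int $gold, float $learning_rate = 0.1): array
{
    $probs = softmax($scores);
    foreach ($scores as $k => $score) {
        $gradient = $probs[$k] - ($k === $gold ? 1.0 : 0.0);
        $scores[$k] = $score - $learning_rate * $gradient;
    }
    return $scores;
}

$scores = [0.0, 0.0, 0.0];
$before = crossEntropy(softmax($scores), 1);
$after = crossEntropy(softmax(gradientStep($scores, 1)), 1);
printf("loss before: %.4f, after: %.4f\n", $before, $after);
```

Each step raises the score of the correct tag and lowers the others, so the loss on that example decreases; repeating this over all training examples for many epochs is the usual way such a model is fit.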

unpackB()

Unpack the bias represented as a string into an array


    public
                    unpackB() : array<string|int, mixed>

Return values

array<string|int, mixed> —

the bias vector unpacked from a string
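The packing and unpacking of a weight vector can be sketched as a round trip between an array of floats and a binary string. The storage format used below, 16-bit unsigned integers scaled between a minimum and maximum weight, is an assumption chosen for illustration because the class exposes `$min_w` and `$max_w`; the format Yioop actually uses is not stated here, and `packVector` and `unpackVector` are hypothetical names.

```php
<?php
/**
 * Packs floats into a binary string, two bytes per value.
 * Assumed scheme: clip each value to [$min_w, $max_w], then map that
 * range linearly onto the integers 0..65535 (pack code "v").
 */
function packVector(array $values, float $min_w, float $max_w): string
{
    $scale = 65535 / ($max_w - $min_w);
    $ints = [];
    foreach ($values as $value) {
        $clipped = max($min_w, min($max_w, $value));
        $ints[] = (int) round(($clipped - $min_w) * $scale);
    }
    return pack("v*", ...$ints);
}

/** Reverses packVector(), up to the rounding error of quantization. */
function unpackVector(string $packed, float $min_w, float $max_w): array
{
    $step = ($max_w - $min_w) / 65535;
    $values = [];
    // unpack() returns a 1-indexed array, so rebuild a 0-indexed list
    foreach (unpack("v*", $packed) as $int) {
        $values[] = $min_w + $int * $step;
    }
    return $values;
}

$packed = packVector([-1.0, 0.0, 0.25, 5.0], -1.0, 1.0);
print_r(unpackVector($packed, -1.0, 1.0));
```

Under this scheme values outside the allowed range are clipped (5.0 comes back as roughly 1.0), and every value is recovered only to within about one step of the 16-bit grid.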

unpackT()

Unpack the tag_feature represented as a string into an array


    public
                    unpackT(int $key) : array<string|int, mixed>

Parameters

$key : int: key in the tag_feature set corresponding to a part of speech

Return values

array<string|int, mixed> —

unpacked tag_feature vector

unpackW()

Unpack the weight matrix for a given part of speech key. This is a 5 x term_set_size matrix; the 5 rows correspond to the locations -2, -1, 0, 1, 2 in a 5-gram.


    public
                    unpackW(int $key) : array<string|int, mixed>

An (i, j) entry roughly gives the probability of the j-th term in location i having the part of speech given by $key

Parameters

$key : int: key in the word_feature set corresponding to a part of speech

Return values

array<string|int, mixed> —

of weights corresponding to that key
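One way to picture how such a 5-row matrix is used is to add up, for a single tag, the entries selected by the terms at offsets -2 to 2 around the current position. The sketch below does this with a hypothetical layout `$weights[$row][$term]`, where rows 0 to 4 stand for offsets -2 to 2; treating the entries as additive scores and adding a bias are assumptions made for the example, and `scoreForTag` is not a Yioop function.

```php
<?php
/**
 * Sums the weights of the terms in the 5-gram centred on $index.
 * Positions outside the sentence and terms without a stored weight
 * contribute 0. $terms must be a 0-indexed list of terms.
 */
function scoreForTag(array $terms, int $index, array $weights, float $bias = 0.0): float
{
    $score = $bias;
    for ($offset = -2; $offset <= 2; $offset++) {
        $row = $offset + 2; // row 0 is offset -2, row 4 is offset +2
        $term = $terms[$index + $offset] ?? null;
        if ($term !== null) {
            $score += $weights[$row][$term] ?? 0.0;
        }
    }
    return $score;
}

$terms = ["新", "的", "业绩"];
$weights = [
    1 => ["新" => 0.5],    // weight of 新 one position before the current term
    2 => ["的" => 0.25],   // weight of 的 as the current term
    3 => ["业绩" => 1.0],  // weight of 业绩 one position after
];
echo scoreForTag($terms, 1, $weights, 0.125), "\n";
```

Computing this score for every tag and picking the largest (or passing the scores through a softmax) would then give a prediction for the term at `$index`.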
