Yioop_V9.5_Source_Code_Documentation

PartOfSpeechContextTagger extends ContextTagger
in package

Machine learning based Part of Speech tagger.

A PartOfSpeechContextTagger can be used to train a tagger for a language according to some dataset. Once training is complete it can be used to predict the tags for terms in a string or array of terms.

Tags
author

Xianghong Sun (Principal), Chris Pollett (mainly simplifications, and documentation)

Table of Contents

MIN_ENTROPY_CHANGE  = 1.0E-6
Minimum amount the entropy must decrease between epochs; otherwise training stops
$bias  : array<string|int, mixed>
The bias vector for features we are training
$lang  : string
Locale tag of language this recognizer is for
$max_w  : float
Maximum allowed value for a weight component
$min_w  : float
Minimum allowed value for a weight component
$tag_feature  : array<string|int, mixed>
The weights for features involving the two tags prior to the current word whose tag we are trying to determine. Determined during training.
$tag_set  : array<string|int, mixed>
Array of strings for each possible tag for a term associated as [tag => tag index]
$tagger_file  : string
The name of the file where the tagging model should be stored and read from
$tagger_path  : string
Complete file system path to the file where the tagging model should be stored and read from
$tokenizer  : Tokenizer
Tokenizer for the language this tagger tags for
$word_feature  : array<string|int, mixed>
2D weights for features involving the two words before the current word and the two words after it. For a given word position, one has a vector that gives the weight of each term in the complete training term set, unknown term set, and rule-based tag term set. Determined during training.
__construct()  : mixed
Constructor for the part of speech tagger.
getB()  : float
Get the bias value for a tag
getIndex()  : mixed
Given a sentence (array $terms), find the key for the term at position $index
getKey()  : mixed
Maps a term to a corresponding key if the term matches some simple pattern such as being a number
getT()  : float
Get the tag feature value for tag
getW()  : float
Get the weight value for term at position for tag
loadWeights()  : mixed
Load the trained data from disk
packB()  : string
Pack the bias vector represented as an array into a string
packT()  : string
Pack the tag_feature represented as an array into a string
packW()  : string
Pack the weights matrix to a string for a particular part of speech key
predict()  : array<string|int, mixed>
Predicts the part of speech tag for each term in a sentence
processTexts()  : array<string|int, mixed>
Converts training data from tagged sentences with terms of the form term_tag into pairs of arrays [[terms_in_sentence], [tags_in_sentence]]
saveWeights()  : mixed
Save the trained weight to disk
setB()  : mixed
Set the bias value for tag
tag()  : string
Tags a sequence of strings according to this tagger's predict method returning the tagged result as a string.
train()  : mixed
Uses text files containing sentences to create a matrix so that, from a term together with the two terms before it and the two terms after it, the odds of each of its possible parts of speech can be calculated.
unpackB()  : array<string|int, mixed>
Unpack the bias represented as a string into an array
unpackT()  : array<string|int, mixed>
Unpack the tag_feature represented as a string into an array
unpackW()  : array<string|int, mixed>
Unpack the weight matrix for a given part of speech key. This is a 5 x term_set_size matrix; the 5 rows correspond to locations -2, -1, 0, 1, 2 in a 5-gram.

Constants

MIN_ENTROPY_CHANGE

Minimum amount the entropy must decrease between epochs; otherwise training stops

public mixed MIN_ENTROPY_CHANGE = 1.0E-6

Properties

$bias

The bias vector for features we are training

public array<string|int, mixed> $bias

Determined during training

$lang

Locale tag of language this recognizer is for

public string $lang

$max_w

Maximum allowed value for a weight component

public float $max_w

$min_w

Minimum allowed value for a weight component

public float $min_w

$tag_feature

The weights for features involving the two tags prior to the current word whose tag we are trying to determine. Determined during training.

public array<string|int, mixed> $tag_feature

$tag_set

Array of strings for each possible tag for a term associated as [tag => tag index]

public array<string|int, mixed> $tag_set

$tagger_file

The name of the file where the tagging model should be stored and read from

public string $tagger_file = "tagger.txt.gz"

$tagger_path

Complete file system path to the file where the tagging model should be stored and read from

public string $tagger_path = ""

$tokenizer

Tokenizer for the language this tagger tags for

public Tokenizer $tokenizer

$word_feature

2D weights for features involving the two words before the current word and the two words after it. For a given word position, one has a vector that gives the weight of each term in the complete training term set, unknown term set, and rule-based tag term set. Determined during training.

public array<string|int, mixed> $word_feature

Methods

__construct()

Constructor for the part of speech tagger.

public __construct(string $lang) : mixed

Sets the language this tagger tags for and sets up the path for where it should be stored

Parameters
$lang : string

locale tag of the language this tagger is for

Return values
mixed

getB()

Get the bias value for a tag

public getB(int $tag_index) : float
Parameters
$tag_index : int

the index of tag's value within the bias string

Return values
float

bias value for tag

getIndex()

Given a sentence (array $terms), find the key for the term at position $index

public getIndex(int $index, array<string|int, mixed> $terms) : mixed
Parameters
$index : int

position of term to get key for

$terms : array<string|int, mixed>

an array of terms typically from and in the order of a sentence

Return values
mixed

key position in word_feature weights and bias arrays; could be either an int, or the term itself, or the simple rule-based part of speech it belongs to

getKey()

Maps a term to a corresponding key if the term matches some simple pattern such as being a number

public getKey(string $term) : mixed
Parameters
$term : string

is the term to be checked

Return values
mixed

either the int key for those matrices or just the term itself if the tokenizer does not have the method getPosKey for the current language
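
As a rough illustration of mapping a term to a shared key when it matches a simple pattern, here is a minimal Python sketch. It is not Yioop's PHP code: the function name simple_pattern_key, the regular expression, and the "<NUMBER>" key are all hypothetical, and the real patterns come from the tokenizer's getPosKey method for each language.

```python
import re

def simple_pattern_key(term):
    """Return a shared key for numeric-looking terms, else the term itself."""
    # Hypothetical pattern: optional sign, digits, optional ,/. groups
    if re.fullmatch(r"[+-]?\d+([.,]\d+)*", term):
        return "<NUMBER>"  # hypothetical key name, chosen for illustration
    return term
```

Collapsing all numbers onto one key lets the model share weights across terms it would otherwise rarely or never see in training.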

getT()

Get the tag feature value for tag

public getT(int $key, int $tag_index) : float
Parameters
$key : int

in tag_feature set corresponding to a part of speech

$tag_index : int

the index of tag's value within the tag feature string

Return values
float

tag feature value for tag

getW()

Get the weight value for term at position for tag

public getW(string $term, int $position, int $tag_index) : float
Parameters
$term : string

to get weight of

$position : int

of term within the current 5-gram

$tag_index : int

index of the particular tag we are trying to see the term's weight for

Return values
float

loadWeights()

Load the trained data from disk

public loadWeights([bool $for_training = false ]) : mixed
Parameters
$for_training : bool = false

whether we are continuing to train (true) or whether we are using the loaded data for prediction

Return values
mixed

packB()

Pack the bias vector represented as an array into a string

public packB() : string
Return values
string

the bias vector packed as a string

packT()

Pack the tag_feature represented as an array into a string

public packT(int $key) : string
Parameters
$key : int

in tag_feature set corresponding to a part of speech

Return values
string

packed tag_feature vector

packW()

Pack the weights matrix to a string for a particular part of speech key

public packW(int $key) : string
Parameters
$key : int

index corresponding to a part of speech according to $this->tag_set

Return values
string

the packed weights matrix
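
The byte layout used by the pack and unpack methods is not described on this page. The Python sketch below shows one plausible scheme, assuming weights are clamped to the range [$min_w, $max_w] and quantized to unsigned 16-bit integers. The function names and the 16-bit little-endian encoding are assumptions for illustration, not Yioop's actual format.

```python
import struct

def pack_weights(weights, min_w, max_w):
    """Clamp floats to [min_w, max_w], quantize to 16 bits, pack to bytes."""
    span = max_w - min_w
    packed = bytearray()
    for weight in weights:
        clamped = max(min_w, min(max_w, weight))
        level = round(65535 * (clamped - min_w) / span)
        packed += struct.pack("<H", level)  # assumed: little-endian uint16
    return bytes(packed)

def unpack_weights(packed, min_w, max_w):
    """Invert pack_weights, up to quantization error."""
    span = max_w - min_w
    count = len(packed) // 2
    levels = struct.unpack("<%dH" % count, packed)
    return [min_w + span * level / 65535 for level in levels]
```

Under this scheme each weight costs two bytes, and the round trip loses at most half a quantization step, (max_w - min_w) / 131070.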

predict()

Predicts the part of speech tag for each term in a sentence

public predict(mixed $sentence) : array<string|int, mixed>
Parameters
$sentence : mixed

is an array of segmented words/terms or a string with words/terms separated by space

Return values
array<string|int, mixed>

of tags for these terms
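
To make the roles of the bias and word_feature weights concrete, the following Python sketch scores each tag for the middle term of a 5-gram and picks the highest-scoring one. It is a simplified illustration, not Yioop's PHP implementation: the function name and data layout are hypothetical, and the tag_feature contribution from the prior two tags is omitted.

```python
def predict_tag(context, word_weights, bias, tag_names):
    """Pick the best tag for the middle term of a 5-gram context.

    context: the 5 terms at offsets -2..2
    word_weights[position][term]: list of per-tag weights (assumed layout)
    """
    scores = list(bias)  # start each tag's score at its bias
    for position, term in enumerate(context):
        weights = word_weights[position].get(term)
        if weights is None:
            continue  # assumption: unseen terms contribute nothing
        for tag_index, weight in enumerate(weights):
            scores[tag_index] += weight
    best = max(range(len(scores)), key=scores.__getitem__)
    return tag_names[best]
```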

processTexts()

Converts training data from tagged sentences with terms of the form term_tag into pairs of arrays [[terms_in_sentence], [tags_in_sentence]]

public static processTexts(mixed $text_files[, string $term_tag_separator = "_" ][, function $term_callback = null ][, function $tag_callback = null ][, bool $tag_on_array_chars = false ]) : array<string|int, mixed>
Parameters
$text_files : mixed

can be a file or an array of file names

$term_tag_separator : string = "_"

separator used to separate term and tag for terms in input sentence

$term_callback : function = null

callback function applied to a term before adding term to sentence term array

$tag_callback : function = null

callback function applied to a part of speech tag before adding tag to sentence tag array

$tag_on_array_chars : bool = false

for some kinds of text processing it is better to assume the tags are applied to each char within a term rather than at the term level. For example, we might want to use the chars within a term for named entity tagging. If true, this flag says to do this; otherwise, tags are applied at the term level

Return values
array<string|int, mixed>

of separated sentences, each sentence having the format [[terms...], [tags...]]. Currently, the training data needs to fit the Chinese Treebank format: a term followed by an underscore and then the tag, e.g. "新_VA 的_DEC 南斯拉夫_NR 会国_NN". To adapt to other languages, some modifications are needed.
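
The conversion for a single sentence can be sketched in Python as follows. This is an illustration rather than Yioop's PHP code: the function name is hypothetical, it takes one sentence string instead of files, it ignores the callbacks, and splitting on the last separator is an assumption.

```python
def split_tagged_sentence(sentence, term_tag_separator="_"):
    """Split 'term_tag term_tag ...' into [[terms...], [tags...]]."""
    terms, tags = [], []
    for token in sentence.split():
        # Assumption: split on the last separator so that terms
        # which themselves contain the separator stay intact
        term, found, tag = token.rpartition(term_tag_separator)
        if not found:
            continue  # skip tokens that carry no tag
        terms.append(term)
        tags.append(tag)
    return [terms, tags]
```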

saveWeights()

Save the trained weight to disk

public saveWeights() : mixed
Return values
mixed

setB()

Set the bias value for tag

public setB(int $tag_index, float $value) : mixed
Parameters
$tag_index : int

the index of tag's value within the bias string

$value : float

bias value to associate to tag

Return values
mixed

tag()

Tags a sequence of strings according to this tagger's predict method returning the tagged result as a string.

public tag(string $text[, string $tag_separator = "_" ]) : string

This function is mainly used to facilitate unit testing of taggers.

Parameters
$text : string

to be tagged

$tag_separator : string = "_"

terms in the output string will be the terms from the input texts followed by $tag_separator followed by their tag. So if $tag_separator == "_", then a term 中国 in the input texts might be 中国_NR in the output string

Return values
string

single string where terms in the input texts have been tagged. For example output might look like: 中国_NR 人民_NN 将_AD 满怀信心_VV 地_DEV 开创_VV 新_VA 的_DEC 业绩_NN 。_PU
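
The output format can be illustrated with a short Python sketch that joins already-predicted tags to their terms. The function name is hypothetical, and the sketch assumes the terms and tags are parallel lists.

```python
def join_tagged(terms, tags, tag_separator="_"):
    """Render parallel term and tag lists as 'term_tag term_tag ...'."""
    return " ".join(
        term + tag_separator + tag for term, tag in zip(terms, tags)
    )
```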

train()

Uses text files containing sentences to create a matrix so that, from a term together with the two terms before it and the two terms after it, the odds of each of its possible parts of speech can be calculated.

public train(mixed $text_files[, string $term_tag_separator = "-" ][, float $learning_rate = 0.1 ][, mixed $num_epochs = 1200 ][, function $term_callback = null ][, function $tag_callback = null ][, bool $resume = false ]) : mixed
Parameters
$text_files : mixed

with training data. These can be a file or an array of file names. For now these files are assumed to be in Chinese Treebank format.

$term_tag_separator : string = "-"

separator used to separate term and tag for terms in input sentence

$learning_rate : float = 0.1

learning rate when cycling over data trying to minimize the cross-entropy loss in the prediction of the tag of the middle term.

$num_epochs : mixed = 1200
$term_callback : function = null

callback function applied to a term before adding term to sentence term array as part of processing and training with a sentence.

$tag_callback : function = null

callback function applied to a part of speech tag before adding tag to sentence tag array as part of processing and training with a sentence.

$resume : bool = false

if true, read the weight file and continue training; if false, start from the beginning

Return values
mixed
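
The interplay of $learning_rate, $num_epochs, and MIN_ENTROPY_CHANGE can be sketched with a deliberately tiny Python model that learns only a bias vector. It assumes a softmax over tag scores trained by per-example gradient steps on the cross-entropy loss. The function names are hypothetical, and Yioop's actual training also updates the word and tag feature weights.

```python
import math

MIN_ENTROPY_CHANGE = 1.0e-6

def softmax(scores):
    """Convert raw scores to probabilities, shifted for numeric stability."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

def train_bias(tag_indices, num_tags, learning_rate=0.1, num_epochs=1200):
    """Fit a bias vector to observed tag indices; returns (bias, epochs_run)."""
    bias = [0.0] * num_tags
    previous_entropy = float("inf")
    epochs_run = 0
    for epoch in range(num_epochs):
        entropy = 0.0
        for target in tag_indices:
            probabilities = softmax(bias)
            entropy -= math.log(probabilities[target])
            for tag_index in range(num_tags):
                hit = 1.0 if tag_index == target else 0.0
                # Gradient of cross-entropy w.r.t. a softmax input
                bias[tag_index] -= learning_rate * (probabilities[tag_index] - hit)
        epochs_run = epoch + 1
        # Stop once an epoch fails to lower the entropy by enough
        if previous_entropy - entropy < MIN_ENTROPY_CHANGE:
            break
        previous_entropy = entropy
    return bias, epochs_run
```

In this sketch $num_epochs is only an upper bound: training ends earlier once an epoch's total entropy stops dropping by at least MIN_ENTROPY_CHANGE.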

unpackB()

Unpack the bias represented as a string into an array

public unpackB() : array<string|int, mixed>
Return values
array<string|int, mixed>

the bias vector unpacked from a string

unpackT()

Unpack the tag_feature represented as a string into an array

public unpackT(int $key) : array<string|int, mixed>
Parameters
$key : int

in tag_feature set corresponding to a part of speech

Return values
array<string|int, mixed>

unpacked tag_feature vector

unpackW()

Unpack the weight matrix for a given part of speech key. This is a 5 x term_set_size matrix; the 5 rows correspond to locations -2, -1, 0, 1, 2 in a 5-gram.

public unpackW(int $key) : array<string|int, mixed>

An (i, j) entry roughly gives the probability of the jth term in location i having the part of speech given by $key

Parameters
$key : int

in word_feature set corresponding to a part of speech

Return values
array<string|int, mixed>

of weights corresponding to that key
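
The five rows correspond to a window around the term being tagged. The Python sketch below extracts such a window. The function name is hypothetical, and padding positions that fall outside the sentence with a placeholder is an assumption, not something this page specifies.

```python
def five_gram_context(terms, index, pad=None):
    """Return the terms at offsets -2..2 around index, padding past the ends."""
    context = []
    for offset in (-2, -1, 0, 1, 2):
        position = index + offset
        inside = 0 <= position < len(terms)
        context.append(terms[position] if inside else pad)
    return context
```

Row i of the unpacked matrix would then be looked up with the term at offset i - 2 in this window.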


        
