ContextTagger
in package
Abstract, base context tagger class.
A context tagger applies a sequence of labels to a sequence of terms or characters of text based on the surrounding context. Context taggers typically use the n-gram context of a term, such as the n/2 terms before and after it, and possibly the earlier tags from the same phrase or sentence, to make a prediction.
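As a rough illustration of the n-gram context described above, the sketch below collects the terms within two positions of a target term. The helper contextWindow() and its use of null as padding are assumptions made for this example; they are not part of ContextTagger.

```php
<?php
/**
 * Hypothetical helper (not part of ContextTagger): returns the terms at
 * offsets -$radius..$radius around $index, keyed by offset. Offsets that
 * fall outside the sentence are filled with null.
 */
function contextWindow(array $terms, int $index, int $radius = 2): array
{
    $window = [];
    for ($offset = -$radius; $offset <= $radius; $offset++) {
        $window[$offset] = $terms[$index + $offset] ?? null;
    }
    return $window;
}

// Window around "quick": offset -2 is past the start, so it is null.
print_r(contextWindow(["the", "quick", "brown", "fox"], 1));
```

With a radius of 2 this yields the 5-gram window that the weight matrix rows (-2, -1, 0, 1, 2) refer to.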
Table of Contents
- $bias : array<string|int, mixed>
- The bias vector for features we are training
- $lang : string
- Locale tag of language this recognizer is for
- $max_w : float
- Maximum allowed value for a weight component
- $min_w : float
- Minimum allowed value for a weight component
- $tag_feature : array<string|int, mixed>
- The weights for features involving the two tags preceding the current word, whose tag we are trying to determine. Determined during training.
- $tag_set : array<string|int, mixed>
- Array of strings for each possible tag for a term associated as [tag => tag index]
- $tagger_file : string
- The name of the file where the tagging model should be stored and read from
- $tagger_path : string
- Complete file system path to the file where the tagging model should be stored and read from
- $tokenizer : Tokenizer
- Tokenizer for the language this tagger tags for
- $word_feature : array<string|int, mixed>
- 2D weights for features involving the two words before the current word and the two words after it. For a given word position, one has a vector that gives the weight of each term in the complete training term set, unknown term set, and rule-based tag term set. Determined during training.
- __construct() : mixed
- Constructor for the ContextTagger.
- getB() : float
- Get the bias value for a tag
- getIndex() : mixed
- Given a sentence (array $terms), find the key for the term at position $index
- getKey() : mixed
- Maps a term to a corresponding key if the term matches some simple pattern such as being a number
- getT() : float
- Get the tag feature value for tag
- getW() : float
- Get the weight value for term at position for tag
- loadWeights() : mixed
- Load the trained data from disk
- packB() : string
- Pack the bias vector represented as an array into a string
- packT() : string
- Pack the tag_feature represented as an array into a string
- packW() : string
- Pack the weights matrix to a string for a particular part of speech key
- predict() : array<string|int, mixed>
- Predicts a tagging for all elements of $sentence
- processTexts() : array<string|int, mixed>
- Converts training data from tagged sentences, whose terms have the form term_tag, into a pair of arrays [[terms_in_sentence], [tags_in_sentence]]
- saveWeights() : mixed
- Save the trained weights to disk
- setB() : mixed
- Set the bias value for tag
- tag() : string
- Tags a sequence of strings according to this tagger's predict method returning the tagged result as a string.
- train() : mixed
- Uses text files to train a tagger for terms or chars in a document
- unpackB() : array<string|int, mixed>
- Unpack the bias represented as a string into an array
- unpackT() : array<string|int, mixed>
- Unpack the tag_feature represented as a string into an array
- unpackW() : array<string|int, mixed>
- Unpack the weight matrix for a given part of speech key. This is a 5 x term_set_size matrix whose 5 rows correspond to locations -2, -1, 0, 1, 2 in a 5-gram.
Properties
$bias
The bias vector for features we are training
public
array<string|int, mixed>
$bias
Determined during training
$lang
Locale tag of language this recognizer is for
public
string
$lang
$max_w
Maximum allowed value for a weight component
public
float
$max_w
$min_w
Minimum allowed value for a weight component
public
float
$min_w
$tag_feature
The weights for features involving the two tags preceding the current word, whose tag we are trying to determine. Determined during training.
public
array<string|int, mixed>
$tag_feature
$tag_set
Array of strings for each possible tag for a term associated as [tag => tag index]
public
array<string|int, mixed>
$tag_set
$tagger_file
The name of the file where the tagging model should be stored and read from
public
string
$tagger_file
= "tagger.txt.gz"
$tagger_path
Complete file system path to the file where the tagging model should be stored and read from
public
string
$tagger_path
= ""
$tokenizer
Tokenizer for the language this tagger tags for
public
Tokenizer
$tokenizer
$word_feature
2D weights for features involving the two words before the current word and the two words after it. For a given word position, one has a vector that gives the weight of each term in the complete training term set, unknown term set, and rule-based tag term set. Determined during training.
public
array<string|int, mixed>
$word_feature
Methods
__construct()
Constructor for the ContextTagger.
public
__construct(string $lang) : mixed
Sets the language this tagger tags for and sets up the path for where it should be stored
Parameters
- $lang : string
-
locale tag of the language this tagger is for
Return values
mixed
getB()
Get the bias value for a tag
public
getB(int $tag_index) : float
Parameters
- $tag_index : int
-
the index of tag's value within the bias string
Return values
float —bias value for tag
getIndex()
Given a sentence (array $terms), find the key for the term at position $index
public
getIndex(int $index, array<string|int, mixed> $terms) : mixed
Parameters
- $index : int
-
position of term to get key for
- $terms : array<string|int, mixed>
-
an array of terms typically from and in the order of a sentence
Return values
mixed —key position in the word_feature weights and bias arrays; could be either an int, the term itself, or the simple rule-based part of speech it belongs to
getKey()
Maps a term to a corresponding key if the term matches some simple pattern such as being a number
public
getKey(string $term) : mixed
Parameters
- $term : string
-
is the term to be checked
Return values
mixed —either the int key for those matrices, or just the term itself if the tokenizer does not have the method getPosKey for the current language
getT()
Get the tag feature value for tag
public
getT(int $key, int $tag_index) : float
Parameters
- $key : int
-
in tag_feature set corresponding to a part of speech
- $tag_index : int
-
the index of tag's value within the tag feature string
Return values
float —tag feature value for tag
getW()
Get the weight value for term at position for tag
public
getW(string $term, int $position, int $tag_index) : float
Parameters
- $term : string
-
to get weight of
- $position : int
-
of term within the current 5-gram
- $tag_index : int
-
index of the particular tag we are trying to see the term's weight for
Return values
float
loadWeights()
Load the trained data from disk
public
loadWeights([bool $for_training = false ]) : mixed
Parameters
- $for_training : bool = false
-
whether we are continuing to train (true) or whether we are using the loaded data for prediction
Return values
mixed
packB()
Pack the bias vector represented as an array into a string
public
packB() : string
Return values
string —the bias vector packed as a string
packT()
Pack the tag_feature represented as an array into a string
public
packT(int $key) : string
Parameters
- $key : int
-
in tag_feature set corresponding to a part of speech
Return values
string —packed tag_feature vector
packW()
Pack the weights matrix to a string for a particular part of speech key
public
packW(int $key) : string
Parameters
- $key : int
-
index corresponding to a part of speech according to $this->tag_set
Return values
string —the packed weights matrix
predict()
Predicts a tagging for all elements of $sentence
public
abstract predict(mixed $sentence) : array<string|int, mixed>
Parameters
- $sentence : mixed
-
is an array of segmented terms/chars or a string that will be split on white space
Return values
array<string|int, mixed> —predicted tags. The ith entry in the returned results is the tag of ith element of $sentence
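Since predict() is abstract, this reference does not say how a subclass combines the bias and word_feature weights. The sketch below shows one plausible linear scoring scheme under stated assumptions: the nested layout $word_feature[tag_index][position][term] and the helper names scoreTags() and bestTag() are hypothetical, and tag_feature weights are left out for brevity.

```php
<?php
/**
 * Hypothetical scoring step: for each tag, add the bias to the weights of
 * the terms found at each offset of the context window.
 * Assumed layout: $word_feature[$tag_index][$position][$term] => float.
 */
function scoreTags(array $window, array $word_feature, array $bias): array
{
    $scores = [];
    foreach ($bias as $tag_index => $b) {
        $score = $b;
        foreach ($window as $position => $term) {
            if ($term === null) {
                continue; // padding outside the sentence adds nothing
            }
            // Terms without a stored weight contribute 0 in this sketch.
            $score += $word_feature[$tag_index][$position][$term] ?? 0.0;
        }
        $scores[$tag_index] = $score;
    }
    return $scores;
}

/** Returns the tag index with the highest score. */
function bestTag(array $scores): int
{
    return array_keys($scores, max($scores))[0];
}

$bias = [0 => 0.1, 1 => 0.3];
$word_feature = [
    0 => [-1 => ["the" => 0.5], 0 => ["dog" => 1.5]],
    1 => [-1 => ["the" => -0.5], 0 => ["dog" => 0.2]],
];
$window = [-1 => "the", 0 => "dog", 1 => null];
echo bestTag(scoreTags($window, $word_feature, $bias)), "\n";
```

A concrete subclass would map the winning index back to a tag string through $tag_set.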
processTexts()
Converts training data from tagged sentences, whose terms have the form term_tag, into a pair of arrays [[terms_in_sentence], [tags_in_sentence]]
public
static processTexts(mixed $text_files[, string $term_tag_separator = "_" ][, function $term_callback = null ][, function $tag_callback = null ][, bool $tag_on_array_chars = false ]) : array<string|int, mixed>
Parameters
- $text_files : mixed
-
can be a file or an array of file names
- $term_tag_separator : string = "_"
-
separator used to separate term and tag for terms in input sentence
- $term_callback : function = null
-
callback function applied to a term before adding term to sentence term array
- $tag_callback : function = null
-
callback function applied to a part of speech tag before adding tag to sentence tag array
- $tag_on_array_chars : bool = false
-
for some kinds of text processing it is better to assume the tags are applied to each char within a term rather than at the term level. For example, we might want to use the chars within a term for named entity tagging. If true, this flag says to do this; otherwise, tags are applied at the term level
Return values
array<string|int, mixed> —of separated sentences, each sentence having the format [[terms...], [tags...]]. Currently, the training data needs to fit the Chinese Treebank format: a term followed by an underscore and then the tag, e.g. "新_VA 的_DEC 南斯拉夫_NR 会国_NN". To adapt to other languages, some modifications are needed.
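To make the [[terms...], [tags...]] format concrete, here is a minimal sketch that splits one tagged sentence. The function splitTaggedSentence() is hypothetical and only mimics the documented output shape; it does not read files or apply the callbacks that processTexts() accepts, and splitting on the last separator in each token is an assumption.

```php
<?php
/**
 * Hypothetical helper: splits a sentence of term_tag tokens into
 * [[terms...], [tags...]]. The tag is taken to start after the LAST
 * separator in each token (an assumption of this sketch).
 */
function splitTaggedSentence(string $sentence, string $separator = "_"): array
{
    $terms = [];
    $tags = [];
    $tokens = preg_split('/\s+/u', trim($sentence), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        $pos = strrpos($token, $separator);
        if ($pos === false) {
            continue; // skip tokens that carry no tag
        }
        $terms[] = substr($token, 0, $pos);
        $tags[] = substr($token, $pos + strlen($separator));
    }
    return [$terms, $tags];
}

[$terms, $tags] = splitTaggedSentence("新_VA 的_DEC 南斯拉夫_NR 会国_NN");
print_r($terms); // 新, 的, 南斯拉夫, 会国
print_r($tags);  // VA, DEC, NR, NN
```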
saveWeights()
Save the trained weights to disk
public
saveWeights() : mixed
Return values
mixed
setB()
Set the bias value for tag
public
setB(int $tag_index, float $value) : mixed
Parameters
- $tag_index : int
-
the index of tag's value within the bias string
- $value : float
-
bias value to associate to tag
Return values
mixed
tag()
Tags a sequence of strings according to this tagger's predict method returning the tagged result as a string.
public
tag(string $text[, string $tag_separator = "_" ]) : string
This function is mainly used to facilitate unit testing of taggers.
Parameters
- $text : string
-
to be tagged
- $tag_separator : string = "_"
-
terms in the output string will be the terms from the input texts followed by $tag_separator followed by their tag. So if $tag_separator == "_", then a term 中国 in the input texts might be 中国_NR in the output string
Return values
string —single string where terms in the input texts have been tagged. For example output might look like: 中国_NR 人民_NN 将_AD 满怀信心_VV 地_DEV 开创_VV 新_VA 的_DEC 业绩_NN 。_PU
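The output format described above can be reproduced with a small sketch that joins terms with their predicted tags. The function joinTagged() is a hypothetical stand-in: it assumes the terms and tags are already available as parallel arrays and says nothing about how tag() tokenizes its input.

```php
<?php
/**
 * Hypothetical helper: joins parallel arrays of terms and tags into a
 * single string of term<separator>tag tokens separated by spaces.
 */
function joinTagged(array $terms, array $tags, string $tag_separator = "_"): string
{
    $pieces = [];
    foreach ($terms as $i => $term) {
        $pieces[] = $term . $tag_separator . $tags[$i];
    }
    return implode(" ", $pieces);
}

echo joinTagged(["中国", "人民", "将"], ["NR", "NN", "AD"]), "\n";
// prints: 中国_NR 人民_NN 将_AD
```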
train()
Uses text files to train a tagger for terms or chars in a document
public
abstract train(mixed $text_files[, string $term_tag_separator = "-" ][, float $learning_rate = 0.1 ][, int $num_epoch = 1200 ][, function $term_callback = null ][, function $tag_callback = null ][, mixed $resume = false ]) : mixed
Parameters
- $text_files : mixed
-
with training data. These can be a file or an array of file names.
- $term_tag_separator : string = "-"
-
separator used to separate term and tag for terms in input sentence
- $learning_rate : float = 0.1
-
learning rate when cycling over data trying to minimize the cross-entropy loss in the prediction of the tag of the middle term.
- $num_epoch : int = 1200
-
number of times to cycle through the complete data set. Default value of 1200 seems to avoid overfitting
- $term_callback : function = null
-
callback function applied to a term before adding term to sentence term array as part of processing and training with a sentence.
- $tag_callback : function = null
-
callback function applied to a part of speech tag before adding tag to sentence tag array as part of processing and training with a sentence.
- $resume : mixed = false
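Because train() is abstract, the update rule is left to subclasses. As a sketch of what a single step that lowers the cross-entropy loss might look like, the code below applies one gradient update to a bias vector under a softmax over tag scores. The function names, the choice of softmax, and the clamping of each component to [$min_w, $max_w] are assumptions for illustration, not the documented algorithm.

```php
<?php
/** Numerically stable softmax over an array of scores (keys preserved). */
function softmax(array $scores): array
{
    $max = max($scores);
    $exp = array_map(fn($s) => exp($s - $max), $scores);
    $sum = array_sum($exp);
    return array_map(fn($e) => $e / $sum, $exp);
}

/**
 * Hypothetical single update of the bias vector. For softmax with
 * cross-entropy loss, the gradient with respect to the bias of a tag is
 * (predicted probability - 1) for the true tag and the predicted
 * probability for every other tag.
 */
function updateBias(
    array $bias,
    array $scores,
    int $true_tag,
    float $learning_rate,
    float $min_w,
    float $max_w
): array {
    $probs = softmax($scores);
    foreach ($bias as $tag_index => $b) {
        $gradient = $probs[$tag_index] - ($tag_index === $true_tag ? 1.0 : 0.0);
        $updated = $b - $learning_rate * $gradient;
        // Assumed use of $min_w / $max_w: keep each component in range.
        $bias[$tag_index] = max($min_w, min($max_w, $updated));
    }
    return $bias;
}

print_r(updateBias([0.0, 0.0], [0.0, 0.0], 0, 0.1, -1.0, 1.0));
```

With two equally scored tags and a learning rate of 0.1, the bias of the true tag rises by 0.05 and the other falls by 0.05.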
Return values
mixed
unpackB()
Unpack the bias represented as a string into an array
public
unpackB() : array<string|int, mixed>
Return values
array<string|int, mixed> —the bias vector unpacked from a string
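This reference does not specify the string format used by packB() and unpackB(). As an illustration of the general idea only, the sketch below round-trips an array of floats through a binary string with PHP's pack() and unpack(). The helper names and the little-endian double format code "e" are assumptions and may differ from what the class actually stores.

```php
<?php
/** Hypothetical helper: packs floats into a string, 8 bytes per value. */
function packFloats(array $values): string
{
    if ($values === []) {
        return "";
    }
    // "e*" = little-endian doubles, repeated for every argument.
    return pack("e*", ...array_values($values));
}

/** Hypothetical helper: reverses packFloats(), returning a 0-indexed array. */
function unpackFloats(string $packed): array
{
    if ($packed === "") {
        return [];
    }
    // unpack() returns a 1-indexed array, so reindex from 0.
    return array_values(unpack("e*", $packed));
}

$bias = [0.1, -2.5, 3.0];
$packed = packFloats($bias);
echo strlen($packed), "\n";          // prints: 24
var_dump(unpackFloats($packed) === $bias); // bool(true)
```

Storing doubles keeps the round trip exact; a more compact format would trade precision for size.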
unpackT()
Unpack the tag_feature represented as a string into an array
public
unpackT(int $key) : array<string|int, mixed>
Parameters
- $key : int
-
in tag_feature set corresponding to a part of speech
Return values
array<string|int, mixed> —unpacked tag_feature vector
unpackW()
Unpack the weight matrix for a given part of speech key. This is a 5 x term_set_size matrix whose 5 rows correspond to locations -2, -1, 0, 1, 2 in a 5-gram.
public
unpackW(int $key) : array<string|int, mixed>
An (i, j) entry roughly gives the probability of the j-th term in location i having the part of speech given by $key
Parameters
- $key : int
-
in word_feature set corresponding to a part of speech
Return values
array<string|int, mixed> —of weights corresponding to that key
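The 5 x term_set_size shape can be pictured with a short sketch that turns a flat list of weights into five rows keyed by 5-gram offset. The helper reshapeWeights() and the row-major ordering of the flat list are assumptions for illustration; the stored layout may differ.

```php
<?php
/**
 * Hypothetical helper: reshapes a flat, row-major list of weights into a
 * 5 x $term_set_size matrix whose rows are keyed by offsets -2..2.
 */
function reshapeWeights(array $flat, int $term_set_size): array
{
    if ($term_set_size < 1 || count($flat) !== 5 * $term_set_size) {
        throw new InvalidArgumentException("expected 5 * term_set_size weights");
    }
    $rows = array_chunk($flat, $term_set_size);
    return array_combine(range(-2, 2), $rows);
}

// Two terms in the term set, so ten weights in total.
$matrix = reshapeWeights([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 2);
print_r($matrix[0]); // weights for the current term position: 0.4, 0.5
```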