NamedEntityContextTagger
extends ContextTagger
Machine learning based named entity recognizer.
NamedEntityContextTagger is used by StochasticTermSegmenter to help segment sentences in which no term separators, such as spaces, are used.
Table of Contents
- MAX_ENTITY_LENGTH = 10
- Maximum character length of a named entity
- MIN_ENTROPY_CHANGE = 1.0E-6
- Minimum amount by which the entropy must decrease between epochs; otherwise training stops
- $bias : array<string|int, mixed>
- The bias vector for features we are training
- $lang : string
- Locale tag of language this recognizer is for
- $max_w : float
- Maximum allowed value for a weight component
- $min_w : float
- Minimum allowed value for a weight component
- $tag_feature : array<string|int, mixed>
- The weights for features involving the prior two tags to the current word whose tag we are trying to determine. Determined during training.
- $tag_set : array<string|int, mixed>
- Array of strings for each possible tag for a term associated as [tag => tag index]
- $tagger_file : string
- The name of the file where the tagging model should be stored and read from
- $tagger_path : string
- Complete file system path to the file where the tagging model should be stored and read from
- $tokenizer : Tokenizer
- Tokenizer for the language this tagger tags for
- $word_feature : array<string|int, mixed>
- 2D weights for features involving the prior two words to the current word and the next two words after the current word. For a given word position, one has a vector that gives the weight of each term in the complete training term set, unknown term set, and rule-based tag term set. Determined during training.
- __construct() : mixed
- Constructor for the NamedEntityContextTagger.
- getB() : float
- Get the bias value for a tag
- getIndex() : mixed
- Given a sentence (array $terms), find the key for the term at position $index
- getKey() : mixed
- Maps a term to a corresponding key if the term matches some simple pattern such as being a number
- getT() : float
- Get the tag feature value for tag
- getW() : float
- Get the weight value for term at position for tag
- loadWeights() : mixed
- Load the trained data from disk
- packB() : string
- Pack the bias vector represented as an array into a string
- packT() : string
- Pack the tag_feature represented as an array into a string
- packW() : string
- Pack the weights matrix to a string for a particular part of speech key
- predict() : array<string|int, mixed>
- Predicts the named entities that exist in a sentence.
- processTexts() : array<string|int, mixed>
- Converts training data from tagged sentences, with terms of the form term_tag, into pairs of arrays [[terms_in_sentence], [tags_in_sentence]]
- saveWeights() : mixed
- Save the trained weights to disk
- setB() : mixed
- Set the bias value for tag
- tag() : string
- Tags a sequence of strings according to this tagger's predict method returning the tagged result as a string.
- train() : mixed
- Uses text files containing tagged sentences to build a weight matrix so that, given a term, the context of two chars before and two chars after it, and the two tags before it, the odds that a named entity has been found can be calculated. The training file should consist of tagged terms separated by white space. If the separator is '-', then non-named-entity examples should look like term-o, and named entity examples might look like term-nr or term-nt, where nr = proper noun, ns = place name, nt = temporal noun. A $tag_callback can help map more general datasets into this format.
- unpackB() : array<string|int, mixed>
- Unpack the bias represented as a string into an array
- unpackT() : array<string|int, mixed>
- Unpack the tag_feature represented as a string into an array
- unpackW() : array<string|int, mixed>
- Unpack the weight matrix for a given part of speech key. This is a 5 x term_set_size matrix; the 5 rows correspond to the locations -2, -1, 0, 1, 2 in a 5-gram.
Constants
MAX_ENTITY_LENGTH
Maximum character length of a named entity
public
mixed
MAX_ENTITY_LENGTH
= 10
MIN_ENTROPY_CHANGE
Minimum amount by which the entropy must decrease between epochs; otherwise training stops
public
mixed
MIN_ENTROPY_CHANGE
= 1.0E-6
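The stopping rule that MIN_ENTROPY_CHANGE describes can be sketched as follows. This is a minimal illustration, not the class's actual training loop: trainWithEarlyStop() and the $run_epoch callback are hypothetical names, and the loss values are made up.

```php
<?php
/**
 * Hypothetical helper: runs up to $num_epochs epochs and stops early once
 * the entropy fails to drop by at least $min_change between epochs.
 * $run_epoch is an assumed callback that performs one pass over the data
 * and returns the entropy measured for that pass.
 * Returns the number of the last epoch that was run.
 */
function trainWithEarlyStop(callable $run_epoch, int $num_epochs,
    float $min_change = 1.0E-6): int
{
    $previous = INF;
    for ($epoch = 1; $epoch <= $num_epochs; $epoch++) {
        $entropy = $run_epoch($epoch);
        if ($previous - $entropy < $min_change) {
            return $epoch; // improvement too small, so stop training
        }
        $previous = $entropy;
    }
    return $num_epochs;
}

// Toy loss curve (made-up numbers) that stops improving at epoch 4
$losses = [1 => 2.0, 2 => 1.0, 3 => 0.5, 4 => 0.5, 5 => 0.4];
$stopped_at = trainWithEarlyStop(fn($e) => $losses[$e], 5);
```

With this toy curve the loop stops at the fourth epoch, because the entropy did not decrease between epochs 3 and 4.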
Properties
$bias
The bias vector for features we are training
public
array<string|int, mixed>
$bias
Determined during training
$lang
Locale tag of language this recognizer is for
public
string
$lang
$max_w
Maximum allowed value for a weight component
public
float
$max_w
$min_w
Minimum allowed value for a weight component
public
float
$min_w
$tag_feature
The weights for features involving the prior two tags to the current word whose tag we are trying to determine. Determined during training.
public
array<string|int, mixed>
$tag_feature
$tag_set
Array of strings for each possible tag for a term associated as [tag => tag index]
public
array<string|int, mixed>
$tag_set
$tagger_file
The name of the file where the tagging model should be stored and read from
public
string
$tagger_file
= "tagger.txt.gz"
$tagger_path
Complete file system path to the file where the tagging model should be stored and read from
public
string
$tagger_path
= ""
$tokenizer
Tokenizer for the language this tagger tags for
public
Tokenizer
$tokenizer
$word_feature
2D weights for features involving the prior two words to the current word and the next two words after the current word. For a given word position, one has a vector that gives the weight of each term in the complete training term set, unknown term set, and rule-based tag term set. Determined during training.
public
array<string|int, mixed>
$word_feature
Methods
__construct()
Constructor for the NamedEntityContextTagger.
public
__construct(string $lang) : mixed
Sets the language this tagger tags for and sets up the path for where it should be stored
Parameters
- $lang : string
-
locale tag of the language this tagger is for
Return values
mixed
getB()
Get the bias value for a tag
public
getB(int $tag_index) : float
Parameters
- $tag_index : int
-
the index of tag's value within the bias string
Return values
float —bias value for tag
getIndex()
Given a sentence (array $terms), find the key for the term at position $index
public
getIndex(int $index, array<string|int, mixed> $terms) : mixed
Parameters
- $index : int
-
position of term to get key for
- $terms : array<string|int, mixed>
-
an array of terms typically from and in the order of a sentence
Return values
mixed —key position in the word_feature weights and bias arrays; could be an int, the term itself, or the simple rule-based part of speech it belongs to
getKey()
Maps a term to a corresponding key if the term matches some simple pattern such as being a number
public
getKey(string $term) : mixed
Parameters
- $term : string
-
is the term to be checked
Return values
mixed —either the int key for those matrices, or just the term itself if the tokenizer does not have the method getPosKey for the current language
getT()
Get the tag feature value for tag
public
getT(int $key, int $tag_index) : float
Parameters
- $key : int
-
in tag_feature set corresponding to a part of speech
- $tag_index : int
-
the index of tag's value within the tag feature string
Return values
float —tag feature value for tag
getW()
Get the weight value for term at position for tag
public
getW(string $term, int $position, int $tag_index) : float
Parameters
- $term : string
-
to get weight of
- $position : int
-
of term within the current 5-gram
- $tag_index : int
-
index of the particular tag we are trying to see the term's weight for
Return values
float
loadWeights()
Load the trained data from disk
public
loadWeights([bool $for_training = false ]) : mixed
Parameters
- $for_training : bool = false
-
whether we are continuing to train (true) or whether we are using the loaded data for prediction
Return values
mixed
packB()
Pack the bias vector represented as an array into a string
public
packB() : string
Return values
string —the bias vector packed as a string
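The pack and unpack methods store numeric arrays as strings. The binary layout the class actually uses is not stated here, so the following is only a sketch of the general idea using PHP's built-in pack() and unpack() with 32-bit floats. The function names packFloats(), unpackFloats(), and readFloatAt() are hypothetical, as is the choice of the "f" format.

```php
<?php
// Hypothetical stand-ins for packB()/unpackB(); the "f" (machine float)
// format is an assumption, not the documented on-disk encoding.
function packFloats(array $values): string
{
    return pack("f*", ...$values);
}

function unpackFloats(string $packed): array
{
    // unpack() returns a 1-indexed array, so re-index from 0
    return array_values(unpack("f*", $packed));
}

// Reads a single value by index without unpacking the whole string,
// similar in spirit to looking up one tag's value with getB().
function readFloatAt(string $packed, int $index): float
{
    $size = strlen(pack("f", 0.0)); // bytes per packed float
    return unpack("f", $packed, $index * $size)[1];
}

// Values chosen to be exactly representable as 32-bit floats
$bias = [0.5, -1.25, 2.0];
$packed = packFloats($bias);
$restored = unpackFloats($packed);
$second = readFloatAt($packed, 1);
```

Keeping the vector packed and reading single entries by offset avoids holding a large PHP array in memory, which is one plausible reason for a string representation.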
packT()
Pack the tag_feature represented as an array into a string
public
packT(int $key) : string
Parameters
- $key : int
-
in tag_feature set corresponding to a part of speech
Return values
string —packed tag_feature vector
packW()
Pack the weights matrix to a string for a particular part of speech key
public
packW(int $key) : string
Parameters
- $key : int
-
index corresponding to a part of speech according to $this->tag_set
Return values
string —the packed weights matrix
predict()
Predicts the named entities that exist in a sentence.
public
predict(mixed $sentence) : array<string|int, mixed>
Parameters
- $sentence : mixed
-
is an array of segmented words/terms or a string that will be split on white space
Return values
array<string|int, mixed> —all predicted named entities together with a tag indicating kind of named entity ex. [["郑振铎","nr"],["国民党","nt"]]
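As an illustration of working with this return format, the sketch below groups entities by their tag. The sample array is copied from the example above rather than produced by a real call to predict(), and groupByTag() is a hypothetical helper.

```php
<?php
// Sample value in the documented return format of predict()
$entities = [["郑振铎", "nr"], ["国民党", "nt"]];

// Hypothetical helper: collects entity strings under their tag
function groupByTag(array $entities): array
{
    $grouped = [];
    foreach ($entities as [$entity, $tag]) {
        $grouped[$tag][] = $entity;
    }
    return $grouped;
}

$by_tag = groupByTag($entities);
```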
processTexts()
Converts training data from tagged sentences, with terms of the form term_tag, into pairs of arrays [[terms_in_sentence], [tags_in_sentence]]
public
static processTexts(mixed $text_files[, string $term_tag_separator = "_" ][, function $term_callback = null ][, function $tag_callback = null ][, bool $tag_on_array_chars = false ]) : array<string|int, mixed>
Parameters
- $text_files : mixed
-
can be a file or an array of file names
- $term_tag_separator : string = "_"
-
separator used to separate term and tag for terms in input sentence
- $term_callback : function = null
-
callback function applied to a term before adding term to sentence term array
- $tag_callback : function = null
-
callback function applied to a part of speech tag before adding tag to sentence tag array
- $tag_on_array_chars : bool = false
-
for some kinds of text processing it is better to assume the tags are applied to each char within a term rather than at the term level. For example, we might want to use the chars within a term for named entity tagging. If true, this flag says to do this; otherwise, tags stay at the term level
Return values
array<string|int, mixed> —of separated sentences, each sentence having the format [[terms...], [tags...]]. Currently, the training data needs to fit the Chinese Treebank format: a term followed by an underscore and then the tag, e.g. "新_VA 的_DEC 南斯拉夫_NR 会国_NN". Adapting to other languages requires some modifications
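The conversion described above can be sketched for a single sentence as follows. This is not the implementation of processTexts(): splitTaggedSentence() and expandToChars() are hypothetical helpers, and the per-character expansion is only one assumed reading of what $tag_on_array_chars does.

```php
<?php
// Hypothetical helper: splits "term_tag term_tag ..." into [[terms], [tags]]
function splitTaggedSentence(string $sentence, string $separator = "_"): array
{
    $terms = [];
    $tags = [];
    $tokens = preg_split('/\s+/u', trim($sentence), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        // split on the last separator so a term containing it stays intact
        $pos = strrpos($token, $separator);
        if ($pos === false) {
            continue; // skip tokens that carry no tag
        }
        $terms[] = substr($token, 0, $pos);
        $tags[] = substr($token, $pos + strlen($separator));
    }
    return [$terms, $tags];
}

// Hypothetical helper: repeats each term's tag for every char of the term
function expandToChars(array $terms, array $tags): array
{
    $chars = [];
    $char_tags = [];
    foreach ($terms as $i => $term) {
        foreach (preg_split('//u', $term, -1, PREG_SPLIT_NO_EMPTY) as $char) {
            $chars[] = $char;
            $char_tags[] = $tags[$i];
        }
    }
    return [$chars, $char_tags];
}

[$terms, $tags] = splitTaggedSentence("新_VA 的_DEC 南斯拉夫_NR 会国_NN");
[$chars, $char_tags] = expandToChars(["南斯", "的"], ["NR", "DEC"]);
```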
saveWeights()
Save the trained weights to disk
public
saveWeights() : mixed
Return values
mixed
setB()
Set the bias value for tag
public
setB(int $tag_index, float $value) : mixed
Parameters
- $tag_index : int
-
the index of tag's value within the bias string
- $value : float
-
bias value to associate to tag
Return values
mixed
tag()
Tags a sequence of strings according to this tagger's predict method returning the tagged result as a string.
public
tag(string $text[, string $tag_separator = "_" ]) : string
This function is mainly used to facilitate unit testing of taggers.
Parameters
- $text : string
-
to be tagged
- $tag_separator : string = "_"
-
terms in the output string will be the terms from the input texts followed by $tag_separator followed by their tag. So if $tag_separator == "_", then a term 中国 in the input texts might be 中国_NR in the output string
Return values
string —single string where terms in the input texts have been tagged. For example output might look like: 中国_NR 人民_NN 将_AD 满怀信心_VV 地_DEV 开创_VV 新_VA 的_DEC 业绩_NN 。_PU
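The output format can be illustrated by joining terms and tags by hand. joinTagged() is a hypothetical helper used only to show the shape of the string. It does not call the tagger, and the tags are taken from the example above.

```php
<?php
// Hypothetical helper: builds "term<sep>tag term<sep>tag ..." from two
// parallel arrays, mirroring the documented output format of tag()
function joinTagged(array $terms, array $tags,
    string $tag_separator = "_"): string
{
    $out = [];
    foreach ($terms as $i => $term) {
        $out[] = $term . $tag_separator . $tags[$i];
    }
    return implode(" ", $out);
}

$tagged = joinTagged(["中国", "人民", "将"], ["NR", "NN", "AD"]);
$dashed = joinTagged(["中国"], ["NR"], "-");
```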
train()
Uses text files containing tagged sentences to build a weight matrix so that, given a term, the context of two chars before and two chars after it, and the two tags before it, the odds that a named entity has been found can be calculated. The training file should consist of tagged terms separated by white space. If the separator is '-', then non-named-entity examples should look like term-o, and named entity examples might look like term-nr or term-nt, where nr = proper noun, ns = place name, nt = temporal noun. A $tag_callback can help map more general datasets into this format.
public
train(mixed $text_files[, string $term_tag_separator = "-" ][, float $learning_rate = 0.1 ][, int $num_epochs = 1200 ][, function $term_callback = null ][, function $tag_callback = null ][, mixed $resume = false ]) : mixed
Parameters
- $text_files : mixed
-
with training data. These can be a file or an array of file names.
- $term_tag_separator : string = "-"
-
separator used to separate term and tag for terms in input sentence
- $learning_rate : float = 0.1
-
learning rate when cycling over data trying to minimize the cross-entropy loss in the prediction of the tag of the middle term.
- $num_epochs : int = 1200
-
number of times to cycle through the complete data set. Default value of 1200 seems to avoid overfitting
- $term_callback : function = null
-
callback function applied to a term before adding term to sentence term array as part of processing and training with a sentence.
- $tag_callback : function = null
-
callback function applied to a part of speech tag before adding tag to sentence tag array as part of processing and training with a sentence.
- $resume : mixed = false
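A $tag_callback for mapping a more general dataset into the nr/ns/nt/o scheme might look like the sketch below. It assumes the callback receives one tag string and returns the replacement tag. The source labels PER, LOC, and TIME are hypothetical and stand in for whatever labels a given dataset uses.

```php
<?php
// Hypothetical tag callback: only the target tags nr, ns, nt, and o come
// from the description of train(); the source labels are made up.
$tag_callback = function (string $tag): string {
    $map = ["PER" => "nr", "LOC" => "ns", "TIME" => "nt"];
    // anything that is not a known entity label becomes the non-entity tag
    return $map[strtoupper($tag)] ?? "o";
};

$mapped = array_map($tag_callback, ["PER", "VV", "loc", "TIME"]);
```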
Return values
mixed
unpackB()
Unpack the bias represented as a string into an array
public
unpackB() : array<string|int, mixed>
Return values
array<string|int, mixed> —the bias vector unpacked from a string
unpackT()
Unpack the tag_feature represented as a string into an array
public
unpackT(int $key) : array<string|int, mixed>
Parameters
- $key : int
-
in tag_feature set corresponding to a part of speech
Return values
array<string|int, mixed> —unpacked tag_feature vector
unpackW()
Unpack the weight matrix for a given part of speech key. This is a 5 x term_set_size matrix; the 5 rows correspond to the locations -2, -1, 0, 1, 2 in a 5-gram.
public
unpackW(int $key) : array<string|int, mixed>
An (i, j) entry roughly gives the probability of the j-th term in location i having the part of speech given by $key
Parameters
- $key : int
-
in word_feature set corresponding to a part of speech
Return values
array<string|int, mixed> —of weights corresponding to that key
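The row layout of the unpacked matrix can be sketched with a toy example. The numbers are made up, and weightAt() is a hypothetical helper. It assumes that rows 0 through 4 hold the 5-gram locations -2 through 2 in that order, as the description suggests.

```php
<?php
// Toy 5 x 3 matrix: five context rows, term set of size 3 (made-up values)
$weights = [
    [0.1, 0.2, 0.3], // location -2
    [0.4, 0.5, 0.6], // location -1
    [0.7, 0.8, 0.9], // location 0 (the current term)
    [1.0, 1.1, 1.2], // location +1
    [1.3, 1.4, 1.5], // location +2
];

// Hypothetical helper: maps a 5-gram location in [-2, 2] to its row
function weightAt(array $weights, int $location, int $term_index): float
{
    if ($location < -2 || $location > 2) {
        throw new InvalidArgumentException("location must be in [-2, 2]");
    }
    return $weights[$location + 2][$term_index];
}

$w = weightAt($weights, -1, 2);
```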