Yioop_V9.5_Source_Code_Documentation

ContextTagger
in package

Abstract, base context tagger class.

A context tagger applies a sequence of labels to a sequence of terms or characters of text based on the surrounding context. Context taggers typically use the n-gram context of a term, such as the n/2 terms before and after it, and possibly the earlier tags from the same phrase or sentence, to make a prediction.
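As a rough illustration of the n-gram context described above, the following standalone sketch collects the two terms before and after a position, padding at the sentence boundaries. The function name contextWindow and the "<pad>" token are hypothetical and are not part of ContextTagger.

```php
<?php
/**
 * Hypothetical helper (not part of Yioop): returns the terms at offsets
 * -$radius..$radius around position $index, keyed by offset.
 */
function contextWindow(array $terms, int $index, int $radius = 2,
    string $pad = "<pad>"): array
{
    $window = [];
    for ($offset = -$radius; $offset <= $radius; $offset++) {
        // positions that fall outside the sentence get the padding token
        $window[$offset] = $terms[$index + $offset] ?? $pad;
    }
    return $window;
}

// 5-gram context of the first term of a four-term sentence
$window = contextWindow(["the", "quick", "brown", "fox"], 0);
```

With $radius = 2 the window has five entries, matching the -2, -1, 0, 1, 2 locations that the weight matrices of this class refer to.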

Tags
author

Chris Pollett

Table of Contents

$bias  : array<string|int, mixed>
The bias vector for features we are training
$lang  : string
Locale tag of language this recognizer is for
$max_w  : float
Maximum allowed value for a weight component
$min_w  : float
Minimum allowed value for a weight component
$tag_feature  : array<string|int, mixed>
The weights for features involving the two tags prior to the current word whose tag we are trying to determine. Determined during training.
$tag_set  : array<string|int, mixed>
Array of strings for each possible tag for a term associated as [tag => tag index]
$tagger_file  : string
The name of the file where the tagging model should be stored and read from
$tagger_path  : string
Complete file system path to the file where the tagging model should be stored and read from
$tokenizer  : Tokenizer
Tokenizer for the language this tagger tags for
$word_feature  : array<string|int, mixed>
2D weights for features involving the two words before the current word and the two words after it. For a given word position, one has a vector that gives the weight of each term in the complete training term set, unknown term set, and rule-based tag term set. Determined during training.
__construct()  : mixed
Constructor for the ContextTagger.
getB()  : float
Get the bias value for a tag
getIndex()  : mixed
Given a sentence (array $terms), find the key for the term at position $index
getKey()  : mixed
Maps a term to a corresponding key if the term matches some simple pattern such as being a number
getT()  : float
Get the tag feature value for tag
getW()  : float
Get the weight value for term at position for tag
loadWeights()  : mixed
Load the trained data from disk
packB()  : string
Pack the bias vector represented as an array into a string
packT()  : string
Pack the tag_feature represented as an array into a string
packW()  : string
Pack the weights matrix to a string for a particular part of speech key
predict()  : array<string|int, mixed>
Predicts a tagging for all elements of $sentence
processTexts()  : array<string|int, mixed>
Converts training data from tagged sentences, whose terms have the form term_tag, into a pair of arrays [[terms_in_sentence], [tags_in_sentence]]
saveWeights()  : mixed
Save the trained weights to disk
setB()  : mixed
Set the bias value for tag
tag()  : string
Tags a sequence of strings according to this tagger's predict method returning the tagged result as a string.
train()  : mixed
Uses text files to train a tagger for terms or chars in a document
unpackB()  : array<string|int, mixed>
Unpack the bias represented as a string into an array
unpackT()  : array<string|int, mixed>
Unpack the tag_feature represented as a string into an array
unpackW()  : array<string|int, mixed>
Unpack the weight matrix for a given part of speech key. This is a 5 x term_set_size matrix; the 5 rows correspond to the locations -2, -1, 0, 1, 2 in a 5-gram.

Properties

$bias

The bias vector for features we are training

public array<string|int, mixed> $bias

Determined during training

$lang

Locale tag of language this recognizer is for

public string $lang

$max_w

Maximum allowed value for a weight component

public float $max_w

$min_w

Minimum allowed value for a weight component

public float $min_w

$tag_feature

The weights for features involving the two tags prior to the current word whose tag we are trying to determine. Determined during training.

public array<string|int, mixed> $tag_feature

$tag_set

Array of strings for each possible tag for a term associated as [tag => tag index]

public array<string|int, mixed> $tag_set

$tagger_file

The name of the file where the tagging model should be stored and read from

public string $tagger_file = "tagger.txt.gz"

$tagger_path

Complete file system path to the file where the tagging model should be stored and read from

public string $tagger_path = ""

$tokenizer

Tokenizer for the language this tagger tags for

public Tokenizer $tokenizer

$word_feature

2D weights for features involving the two words before the current word and the two words after it. For a given word position, one has a vector that gives the weight of each term in the complete training term set, unknown term set, and rule-based tag term set. Determined during training.

public array<string|int, mixed> $word_feature

Methods

__construct()

Constructor for the ContextTagger.

public __construct(string $lang) : mixed

Sets the language this tagger tags for and sets up the path for where it should be stored

Parameters
$lang : string

locale tag of the language this tagger tags for

Return values
mixed

getB()

Get the bias value for a tag

public getB(int $tag_index) : float
Parameters
$tag_index : int

the index of tag's value within the bias string

Return values
float

bias value for tag

getIndex()

Given a sentence (array $terms), find the key for the term at position $index

public getIndex(int $index, array<string|int, mixed> $terms) : mixed
Parameters
$index : int

position of term to get key for

$terms : array<string|int, mixed>

an array of terms typically from and in the order of a sentence

Return values
mixed

key position in the word_feature weights and bias arrays. This could be an int, the term itself, or the simple rule-based part of speech the term belongs to

getKey()

Maps a term to a corresponding key if the term matches some simple pattern such as being a number

public getKey(string $term) : mixed
Parameters
$term : string

is the term to be checked

Return values
mixed

either the int key for those matrices, or just the term itself if the tokenizer does not have the method getPosKey for the current language
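The fallback described above can be sketched as follows. DemoTokenizer, its numeric rule, the "NUM" key, and the function demoGetKey are all hypothetical stand-ins; how a real Yioop tokenizer implements getPosKey, and what it returns, is not specified on this page.

```php
<?php
// Hypothetical stand-in for a language tokenizer; real tokenizers may differ.
class DemoTokenizer
{
    public static function getPosKey(string $term)
    {
        // assumed rule: every numeric term shares one key
        return is_numeric($term) ? "NUM" : $term;
    }
}

/**
 * Hypothetical sketch: use the tokenizer's getPosKey if it has one,
 * otherwise fall back to the term itself.
 */
function demoGetKey(string $term, ?string $tokenizer_class)
{
    if ($tokenizer_class !== null &&
        method_exists($tokenizer_class, "getPosKey")) {
        return $tokenizer_class::getPosKey($term);
    }
    return $term;
}

$key = demoGetKey("42", "DemoTokenizer");
```

Mapping whole families of terms (such as numbers) to one key keeps the weight tables small, since the tagger does not need a separate row for every number it might see.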

getT()

Get the tag feature value for tag

public getT(int $key, int $tag_index) : float
Parameters
$key : int

in tag_feature set corresponding to a part of speech

$tag_index : int

the index of tag's value within the tag feature string

Return values
float

tag feature value for tag

getW()

Get the weight value for term at position for tag

public getW(string $term, int $position, int $tag_index) : float
Parameters
$term : string

to get weight of

$position : int

of term within the current 5-gram

$tag_index : int

index of the particular tag we are trying to see the term's weight for

Return values
float

loadWeights()

Load the trained data from disk

public loadWeights([bool $for_training = false ]) : mixed
Parameters
$for_training : bool = false

whether we are continuing to train (true) or whether we are using the loaded data for prediction

Return values
mixed

packB()

Pack the bias vector represented as an array into a string

public packB() : string
Return values
string

the bias vector packed as a string
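To illustrate what packing a vector into a string can look like, here is a minimal round-trip sketch using PHP's built-in pack and unpack with 64-bit doubles. The function names packFloats and unpackFloats are hypothetical, and the byte layout chosen here is an assumption; the encoding ContextTagger actually uses is not described on this page.

```php
<?php
/** Hypothetical helper: packs an array of floats into a binary string. */
function packFloats(array $values): string
{
    if (empty($values)) {
        return "";
    }
    // "E*" encodes every argument as a big-endian 64-bit double
    return pack("E*", ...array_values($values));
}

/** Hypothetical helper: inverse of packFloats. */
function unpackFloats(string $packed): array
{
    if ($packed === "") {
        return [];
    }
    // unpack() returns a 1-based array, so reindex from 0
    return array_values(unpack("E*", $packed));
}

$packed = packFloats([0.5, -1.25, 3.0]);
$restored = unpackFloats($packed);
```

A packed string uses a fixed number of bytes per component (8 in this sketch), which is typically more compact in memory than a PHP array of floats.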

packT()

Pack the tag_feature represented as an array into a string

public packT(int $key) : string
Parameters
$key : int

in tag_feature set corresponding to a part of speech

Return values
string

packed tag_feature vector

packW()

Pack the weights matrix to a string for a particular part of speech key

public packW(int $key) : string
Parameters
$key : int

index corresponding to a part of speech according to $this->tag_set

Return values
string

the packed weights matrix

predict()

Predicts a tagging for all elements of $sentence

public abstract predict(mixed $sentence) : array<string|int, mixed>
Parameters
$sentence : mixed

is an array of segmented terms/chars or a string that will be split on white space

Return values
array<string|int, mixed>

predicted tags. The ith entry in the returned results is the tag of ith element of $sentence
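Because predict() is abstract, each subclass decides how to score tags. One common approach for taggers of this kind is to add a per-tag bias to the weights of the terms in the surrounding window and pick the highest-scoring tag. The sketch below shows that idea only; the functions scoreTags and bestTag and the nested array layout [term][position][tag index] are assumptions for illustration, not the layout Yioop uses.

```php
<?php
/**
 * Hypothetical scorer: starts from the bias of each tag and adds the
 * weight of every (term, position) pair found in the context window.
 * $window maps offsets such as -2..2 to terms.
 */
function scoreTags(array $window, array $word_feature, array $bias): array
{
    $scores = $bias;
    foreach ($window as $position => $term) {
        // unseen (term, position) pairs contribute nothing
        $weights = $word_feature[$term][$position] ?? [];
        foreach ($weights as $tag_index => $weight) {
            $scores[$tag_index] += $weight;
        }
    }
    return $scores;
}

/** Returns the tag whose index has the highest score. */
function bestTag(array $scores, array $tag_set): string
{
    $best_index = array_keys($scores, max($scores))[0];
    // $tag_set has the form [tag => tag index], so search by value
    return array_search($best_index, $tag_set, true);
}

$tag_set = ["NN" => 0, "VV" => 1];
$bias = [0.1, 0.0];
$word_feature = [
    "run" => [0 => [0 => 0.2, 1 => 0.9]],
    "I" => [-1 => [1 => 0.3]],
];
$window = [-1 => "I", 0 => "run", 1 => "<pad>"];
$tag = bestTag(scoreTags($window, $word_feature, $bias), $tag_set);
```

A fuller model would also add the tag_feature weights for the previously predicted tags before taking the maximum.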

processTexts()

Converts training data from tagged sentences, whose terms have the form term_tag, into a pair of arrays [[terms_in_sentence], [tags_in_sentence]]

public static processTexts(mixed $text_files[, string $term_tag_separator = "_" ][, function $term_callback = null ][, function $tag_callback = null ][, bool $tag_on_array_chars = false ]) : array<string|int, mixed>
Parameters
$text_files : mixed

can be a file or an array of file names

$term_tag_separator : string = "_"

separator used to separate term and tag for terms in input sentence

$term_callback : function = null

callback function applied to a term before adding term to sentence term array

$tag_callback : function = null

callback function applied to a part of speech tag before adding tag to sentence tag array

$tag_on_array_chars : bool = false

for some kinds of text processing it is better to assume the tags are applied to each char within a term rather than at the term level. For example, we might want to use the chars within a term for named entity tagging. If true, this flag says to do this; otherwise, tags are applied at the term level

Return values
array<string|int, mixed>

of separated sentences, each sentence having the format [[terms...], [tags...]]. Currently, the training data needs to fit the Chinese Treebank format: a term followed by an underscore followed by the tag, e.g. "新_VA 的_DEC 南斯拉夫_NR 会国_NN". To adapt to other languages, some modifications are needed
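The conversion of a single tagged sentence can be sketched as below. The function splitTaggedSentence is hypothetical and handles only one in-memory sentence, whereas processTexts reads files and supports callbacks. Splitting at the last separator, and skipping tokens without one, are assumptions made here.

```php
<?php
/**
 * Hypothetical helper: splits "term_tag term_tag ..." into
 * [[terms...], [tags...]].
 */
function splitTaggedSentence(string $sentence,
    string $term_tag_separator = "_"): array
{
    $terms = [];
    $tags = [];
    $tokens = preg_split('/\s+/u', trim($sentence), -1,
        PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        // split at the last separator so a term containing it stays whole
        $pos = strrpos($token, $term_tag_separator);
        if ($pos === false) {
            continue; // assumed behavior: skip tokens with no tag
        }
        $terms[] = substr($token, 0, $pos);
        $tags[] = substr($token, $pos + strlen($term_tag_separator));
    }
    return [$terms, $tags];
}

[$terms, $tags] = splitTaggedSentence("新_VA 的_DEC 南斯拉夫_NR 会国_NN");
```

Byte-based strrpos and substr are safe here for UTF-8 input as long as the separator is itself valid UTF-8, since the split always lands on a character boundary.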

saveWeights()

Save the trained weights to disk

public saveWeights() : mixed
Return values
mixed

setB()

Set the bias value for tag

public setB(int $tag_index, float $value) : mixed
Parameters
$tag_index : int

the index of tag's value within the bias string

$value : float

bias value to associate to tag

Return values
mixed

tag()

Tags a sequence of strings according to this tagger's predict method returning the tagged result as a string.

public tag(string $text[, string $tag_separator = "_" ]) : string

This function is mainly used to facilitate unit testing of taggers.

Parameters
$text : string

to be tagged

$tag_separator : string = "_"

terms in the output string will be the terms from the input texts followed by $tag_separator followed by their tag. So if $tag_separator == "_", then a term 中国 in the input texts might be 中国_NR in the output string

Return values
string

single string where terms in the input texts have been tagged. For example output might look like: 中国_NR 人民_NN 将_AD 满怀信心_VV 地_DEV 开创_VV 新_VA 的_DEC 业绩_NN 。_PU
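The output format described above amounts to joining each term with its predicted tag. A minimal sketch, assuming the terms and tags are already available as parallel arrays (joinTagged is a hypothetical name, and the tags here are supplied by hand rather than by predict()):

```php
<?php
/**
 * Hypothetical helper: joins parallel arrays of terms and tags into a
 * single space-separated string of term<separator>tag items.
 */
function joinTagged(array $terms, array $tags,
    string $tag_separator = "_"): string
{
    if (count($terms) !== count($tags)) {
        throw new InvalidArgumentException(
            "Each term needs exactly one tag");
    }
    $tagged = [];
    foreach (array_values($terms) as $i => $term) {
        $tagged[] = $term . $tag_separator . array_values($tags)[$i];
    }
    return implode(" ", $tagged);
}

$output = joinTagged(["中国", "人民", "将"], ["NR", "NN", "AD"]);
```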

train()

Uses text files to train a tagger for terms or chars in a document

public abstract train(mixed $text_files[, string $term_tag_separator = "-" ][, float $learning_rate = 0.1 ][, int $num_epoch = 1200 ][, function $term_callback = null ][, function $tag_callback = null ][, mixed $resume = false ]) : mixed
Parameters
$text_files : mixed

with training data. These can be a file or an array of file names.

$term_tag_separator : string = "-"

separator used to separate term and tag for terms in input sentence

$learning_rate : float = 0.1

learning rate when cycling over data trying to minimize the cross-entropy loss in the prediction of the tag of the middle term.

$num_epoch : int = 1200

number of times to cycle through the complete data set. Default value of 1200 seems to avoid overfitting

$term_callback : function = null

callback function applied to a term before adding term to sentence term array as part of processing and training with a sentence.

$tag_callback : function = null

callback function applied to a part of speech tag before adding tag to sentence tag array as part of processing and training with a sentence.

$resume : mixed = false
Return values
mixed
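Since train() is abstract, the actual update rule lives in subclasses. The parameter descriptions mention a learning rate and a cross-entropy loss, so the sketch below shows one generic softmax gradient step on a bias vector, with the result clamped to a minimum and maximum weight. The function names, the default bounds of -1.0 and 1.0, and the restriction to the bias vector are assumptions for illustration, not Yioop's training routine.

```php
<?php
/** Softmax over raw scores, shifted by the max for numerical stability. */
function softmax(array $scores): array
{
    $max = max($scores);
    $exp = array_map(fn($s) => exp($s - $max), $scores);
    $sum = array_sum($exp);
    return array_map(fn($e) => $e / $sum, $exp);
}

/**
 * Hypothetical single-example update: moves each bias component against
 * the cross-entropy gradient (probability minus 0/1 target) and keeps
 * it within [$min_w, $max_w]. $gold is the index of the correct tag.
 */
function sgdStepBias(array $bias, array $scores, int $gold,
    float $learning_rate = 0.1, float $min_w = -1.0,
    float $max_w = 1.0): array
{
    $probabilities = softmax($scores);
    foreach ($bias as $tag_index => $b) {
        $target = ($tag_index === $gold) ? 1.0 : 0.0;
        $gradient = $probabilities[$tag_index] - $target;
        $updated = $b - $learning_rate * $gradient;
        $bias[$tag_index] = max($min_w, min($max_w, $updated));
    }
    return $bias;
}

// two tags, equal scores, correct tag is index 0
$bias = sgdStepBias([0.0, 0.0], [0.0, 0.0], 0);
```

The step raises the bias of the correct tag and lowers the others; repeating it over every training example for $num_epoch passes gives a basic stochastic gradient descent loop.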

unpackB()

Unpack the bias represented as a string into an array

public unpackB() : array<string|int, mixed>
Return values
array<string|int, mixed>

the bias vector unpacked from a string

unpackT()

Unpack the tag_feature represented as a string into an array

public unpackT(int $key) : array<string|int, mixed>
Parameters
$key : int

in tag_feature set corresponding to a part of speech

Return values
array<string|int, mixed>

unpacked tag_feature vector

unpackW()

Unpack the weight matrix for a given part of speech key. This is a 5 x term_set_size matrix; the 5 rows correspond to the locations -2, -1, 0, 1, 2 in a 5-gram.

public unpackW(int $key) : array<string|int, mixed>

An (i, j) entry roughly gives the probability of the jth term in location i having the part of speech given by $key

Parameters
$key : int

in word_feature set corresponding to a part of speech

Return values
array<string|int, mixed>

of weights corresponding to that key
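To make the 5 x term_set_size shape concrete, the sketch below reshapes a flat list of weights into five rows keyed by the 5-gram offsets -2 to 2. The function reshapeWeights and the row-major ordering of the flat list are assumptions; the stored layout that unpackW decodes is not described on this page.

```php
<?php
/**
 * Hypothetical helper: turns a flat, row-major list of
 * 5 * $term_set_size weights into rows keyed by offsets -2..2.
 */
function reshapeWeights(array $flat, int $term_set_size): array
{
    if ($term_set_size < 1 || count($flat) !== 5 * $term_set_size) {
        throw new InvalidArgumentException(
            "Expected exactly 5 * term_set_size weights");
    }
    $rows = array_chunk($flat, $term_set_size);
    $matrix = [];
    foreach ([-2, -1, 0, 1, 2] as $row => $offset) {
        // row 0 holds offset -2, row 4 holds offset 2
        $matrix[$offset] = $rows[$row];
    }
    return $matrix;
}

// a toy term set of size 2 gives a 5 x 2 matrix
$matrix = reshapeWeights(range(1, 10), 2);
```

Here $matrix[0][1] is the weight of term 1 at the centre of the 5-gram, which corresponds to the (i, j) indexing described for unpackW.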


        
