Yioop_V9.5_Source_Code_Documentation

ClassifierTool

Class used to encapsulate all the activities of the ClassifierTool.php command line script. This script allows one to automate the building and testing of classifiers, providing an alternative to the web interface when a labeled training set is available.

Tags
author

Shawn Tice

Table of Contents

$options  : array<string|int, mixed>
Options to be used by activities and constructed classifiers. These options can be overridden by supplying an appropriate flag on the command line, where nesting is denoted by a period (e.g., cls.chi2.max).
$classifier_controller  : object
Reference to a classifier controller, used to manipulate crawl mixes in the same way that the controller that handles web requests does.
$crawl_model  : object
Reference to a crawl model object, also used to manipulate crawl mixes.
__construct()  : mixed
Initializes the classifier controller and crawl model used to manage crawl mixes, which in turn are used to iterate over labeled examples.
deleteClassifier()  : mixed
Deletes an existing classifier, specified by its label.
isTestPoint()  : bool
Determines whether to run a classification test after a certain number of documents have been added to the training set. Whether or not to test is determined by the `test_interval' option, which may be either null, an integer, or a string. In the first case, testing only occurs after all training examples have been added; in the second case, testing occurs each time an additional constant number of training examples have been added; and in the final case, testing occurs on a fixed schedule of comma-separated offsets, such as "10,25,50,100".
loadDataset()  : array<string|int, mixed>
Fetches the summaries for pages in the indices specified by the passed dataset name. This method looks for existing indexes with names matching the dataset name prefix, and with suffix either "pos" or "neg" (ignoring case). The pages in these indexes are shuffled into one large array, and augmented with a TRUE_LABEL field that records which set they came from originally. The shuffled array is then split according to the `split' option, and all pages up to (but not including) the split index are used for the training set; the remaining pages are used for the test set.
log()  : mixed
Writes out logging information according to a detail level. The first argument is an integer (potentially negative) indicating the level of detail for the log message, where larger numbers indicate greater detail. Each message is prefixed with a character according to its level of detail, but if the detail level is greater than the level specified by the `debug' option then nothing is printed. Levels -2 (errors) and -1 (log of set options) are always printed; levels 0 and above are used for normal messages.
logOptions()  : mixed
Logs the current options using the log method of this class. This method is used to explicitly state which settings were used for a given run of an activity. The detail level passed to the log method is -1.
main()  : mixed
Parses the options, and if an appropriate activity exists, calls the activity, passing in the label and dataset to be used; otherwise, prints an error and exits.
makeFreshClassifier()  : object
Creates a new classifier for a label, first deleting any existing classifier with the same label.
parseOptions()  : array<string|int, mixed>
Parses the command-line options, returns the required arguments, and updates the member variable $options with any parameters. If any of the required arguments (activity, dataset, or label) are missing, then a message is printed and the program exits. The optional arguments used to set parameters directly modify the class state through the setOptions method.
runActiveTrainAndTest()  : mixed
Like the TrainAndTest activity, but uses active training in order to choose the documents to add to the training set. The method simulates the process that an actual user would go through in order to label documents for addition to the training set, then tests performance at the specified intervals.
runTrainAndTest()  : mixed
Trains a classifier on a data set, testing at the specified intervals.
setDefault()  : mixed
Sets a default value for a runtime parameter. This method is used by activities to specify default values that may be overridden by passing the appropriate command-line flag.
setOptions()  : mixed
Sets one or more options of the form NAME=VALUE according to a converter such as intval, floatval, and so on. The options may be passed in either as a string (a single option) or as an array of strings, where each string corresponds to an option of the same type (e.g., int).
testClassifier()  : mixed
Finalizes the current classifier, uses it to classify all test documents, and logs the classification error. The current classifier is saved to disk after finalizing (though not before), and left in `classify' mode. The iterator over the test dataset is reset for the next round of testing (if any).

Properties

$options

Options to be used by activities and constructed classifiers. These options can be overridden by supplying an appropriate flag on the command line, where nesting is denoted by a period (e.g., cls.chi2.max).

public array<string|int, mixed> $options = ['debug' => 0, 'max_train' => null, 'test_interval' => null, 'split' => 3000, 'cls' => ['use_nb' => false, 'chi2' => ['max' => 200]]]

The supported options are:

debug: An integer, the level of debug statements to print. Larger integers specify more detailed debug output; the default value of 0 indicates no debug output.

max_train: An integer, the maximum number of examples to use when training a classifier. The default value of null indicates that all available training examples should be used.

test_interval: An integer, the number of new training examples to be added before a round of testing on ALL test instances is executed. With an interval of 5, for example, the classifier would be finalized and used to classify all test instances after every five new training examples. The error is reported for each round of testing. As described under isTestPoint(), the value may also be a string of comma-separated offsets such as "10,25,50,100". The default value of null indicates that testing should only occur after all training examples have been added.

split: An integer, the number of examples from the entire set of labeled examples to use for training. The remainder are used for testing.

cls.use_nb: A boolean, whether or not to use the Naive Bayes classification algorithm instead of the logistic regression one in order to finalize the classifier. The default value is false, indicating that logistic regression should be used.

cls.chi2.max: An integer, the maximum number of features to use when training the classifier. The default is a relatively conservative 200.
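
The period-separated naming used above (e.g., cls.chi2.max) maps onto the nested $options array. The following is a minimal sketch of that mapping, not the tool's actual code: setNestedOption is a hypothetical helper introduced only for illustration.

```php
<?php
/**
 * Hypothetical helper (not part of ClassifierTool): stores $value in a
 * nested array at the location named by a period-separated path.
 */
function setNestedOption(array &$options, string $path, $value): void
{
    $keys = explode('.', $path);
    $last = array_pop($keys);
    $node = &$options;
    // Walk down to the parent of the final key, creating levels as needed.
    foreach ($keys as $key) {
        if (!isset($node[$key]) || !is_array($node[$key])) {
            $node[$key] = [];
        }
        $node = &$node[$key];
    }
    $node[$last] = $value;
}

// The default options as listed in this documentation.
$options = ['debug' => 0, 'max_train' => null, 'test_interval' => null,
    'split' => 3000, 'cls' => ['use_nb' => false, 'chi2' => ['max' => 200]]];

setNestedOption($options, 'cls.chi2.max', 500);
setNestedOption($options, 'debug', 2);
```

After these two calls, $options['cls']['chi2']['max'] holds 500 and $options['debug'] holds 2, while the other defaults are unchanged.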

$classifier_controller

Reference to a classifier controller, used to manipulate crawl mixes in the same way that the controller that handles web requests does.

protected object $classifier_controller

$crawl_model

Reference to a crawl model object, also used to manipulate crawl mixes.

protected object $crawl_model

Methods

__construct()

Initializes the classifier controller and crawl model used to manage crawl mixes, which in turn are used to iterate over labeled examples.

public __construct() : mixed
Return values
mixed

deleteClassifier()

Deletes an existing classifier, specified by its label.

public deleteClassifier(string $label) : mixed
Parameters
$label : string

class label of the existing classifier

Return values
mixed

isTestPoint()

Determines whether to run a classification test after a certain number of documents have been added to the training set. Whether or not to test is determined by the `test_interval' option, which may be either null, an integer, or a string. In the first case, testing only occurs after all training examples have been added; in the second case, testing occurs each time an additional constant number of training examples have been added; and in the final case, testing occurs on a fixed schedule of comma-separated offsets, such as "10,25,50,100".

public isTestPoint(int $i, int $total) : bool
Parameters
$i : int

the size of the current training set

$total : int

the total number of documents available to be added to the training set

Return values
bool

true if the `test_interval' option specifies that a round of testing should occur for the current training offset, and false otherwise
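
The three cases described above can be sketched as follows. This is an illustration under stated assumptions rather than the method's real body: isTestPointSketch is a hypothetical standalone function that takes the option value as a third argument, and it assumes the integer case fires only at exact multiples of the interval.

```php
<?php
/**
 * Hypothetical standalone version of the test-point decision.
 * $test_interval may be null, an int, or a comma-separated string.
 */
function isTestPointSketch(int $i, int $total, $test_interval): bool
{
    if ($test_interval === null) {
        // Test only once, after every training example has been added.
        return $i == $total;
    }
    if (is_int($test_interval)) {
        // Test after each additional block of $test_interval examples
        // (assumption: exact multiples only, and never at size zero).
        return $test_interval > 0 && $i > 0 && $i % $test_interval == 0;
    }
    // Fixed schedule such as "10,25,50,100".
    $offsets = array_map('intval', explode(',', $test_interval));
    return in_array($i, $offsets, true);
}
```

For instance, with the schedule "10,25,50,100" a training set of size 25 triggers a test round, while a size of 30 does not.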

loadDataset()

Fetches the summaries for pages in the indices specified by the passed dataset name. This method looks for existing indexes with names matching the dataset name prefix, and with suffix either "pos" or "neg" (ignoring case). The pages in these indexes are shuffled into one large array, and augmented with a TRUE_LABEL field that records which set they came from originally. The shuffled array is then split according to the `split' option, and all pages up to (but not including) the split index are used for the training set; the remaining pages are used for the test set.

public loadDataset(string $dataset_name, string $class_label) : array<string|int, mixed>
Parameters
$dataset_name : string

prefix of index names to draw examples from

$class_label : string

class label of the classifier the examples will be used to train (used to name the crawl mix that iterates over each index)

Return values
array<string|int, mixed>

training and test datasets in an associative array with keys `train' and `test', where each dataset is wrapped up in a PageIterator that implements the CrawlMixIterator interface.
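
The label, shuffle, and split steps can be sketched on plain arrays. This is a simplified illustration: splitLabeledExamples is a hypothetical function, the TRUE_LABEL values 1 and 0 are assumptions, and the real method reads pages from indexes and wraps each dataset in a PageIterator rather than returning raw arrays.

```php
<?php
/**
 * Hypothetical sketch: merges positive and negative pages, records their
 * origin in TRUE_LABEL, shuffles, and splits at index $split.
 */
function splitLabeledExamples(array $pos, array $neg, int $split): array
{
    $pages = [];
    foreach ($pos as $page) {
        $page['TRUE_LABEL'] = 1; // assumed encoding for the "pos" index
        $pages[] = $page;
    }
    foreach ($neg as $page) {
        $page['TRUE_LABEL'] = 0; // assumed encoding for the "neg" index
        $pages[] = $page;
    }
    shuffle($pages);
    return [
        // Pages up to (but not including) the split index train the classifier.
        'train' => array_slice($pages, 0, $split),
        // The remaining pages form the test set.
        'test' => array_slice($pages, $split),
    ];
}

$pos = [['TITLE' => 'a'], ['TITLE' => 'b'], ['TITLE' => 'c']];
$neg = [['TITLE' => 'x'], ['TITLE' => 'y']];
$data = splitLabeledExamples($pos, $neg, 3);
```

With five pages and a split of 3, the `train' set receives three pages and the `test' set receives the remaining two.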

log()

Writes out logging information according to a detail level. The first argument is an integer (potentially negative) indicating the level of detail for the log message, where larger numbers indicate greater detail. Each message is prefixed with a character according to its level of detail, but if the detail level is greater than the level specified by the `debug' option then nothing is printed. The available detail levels are treated as follows:

public log() : mixed

-2: Used for errors; always printed; prefix '! '
-1: Used for log of set options; always printed; prefix '# '
0+: Used for normal messages; prefix '> '

The second argument is a printf-style string template specifying the message, and each following (optional) argument is used by the template. A newline is added automatically to each message.

Return values
mixed
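
The level and prefix rules for log() can be sketched as below. formatLogMessage is a hypothetical function: it takes the `debug' level as an explicit argument and returns the formatted line instead of printing it, purely so the behavior is easy to inspect. Treating every negative level as always printed is an assumption based on the level descriptions.

```php
<?php
/**
 * Hypothetical sketch of the log() rules: returns the line that would be
 * printed, or an empty string when the message is suppressed.
 */
function formatLogMessage(int $debug, int $level, string $format, ...$args): string
{
    // Normal messages are dropped when more detailed than the debug level.
    if ($level >= 0 && $level > $debug) {
        return '';
    }
    if ($level == -2) {
        $prefix = '! '; // errors
    } elseif ($level == -1) {
        $prefix = '# '; // log of set options
    } else {
        $prefix = '> '; // normal messages
    }
    // printf-style template; a newline is appended automatically.
    return $prefix . vsprintf($format, $args) . "\n";
}
```

Under these assumptions, an error such as formatLogMessage(0, -2, 'missing %s', 'label') is always produced, while a level-1 message is produced only when the debug level is at least 1.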

logOptions()

Logs the current options using the log method of this class. This method is used to explicitly state which settings were used for a given run of an activity. The detail level passed to the log method is -1.

public logOptions([string $root = null ][, string $prefix = '' ]) : mixed
Parameters
$root : string = null

folder to write to

$prefix : string = ''

prefix (such as Warning) to put at the start of each log message

Return values
mixed

main()

Parses the options, and if an appropriate activity exists, calls the activity, passing in the label and dataset to be used; otherwise, prints an error and exits.

public main() : mixed
Return values
mixed

makeFreshClassifier()

Creates a new classifier for a label, first deleting any existing classifier with the same label.

public makeFreshClassifier(string $label) : object
Parameters
$label : string

class label of the new classifier

Return values
object

created classifier instance

parseOptions()

Parses the command-line options, returns the required arguments, and updates the member variable $options with any parameters. If any of the required arguments (activity, dataset, or label) are missing, then a message is printed and the program exits. The optional arguments used to set parameters directly modify the class state through the setOptions method.

public parseOptions() : array<string|int, mixed>
Return values
array<string|int, mixed>

the parsed activity, dataset, and label

runActiveTrainAndTest()

Like the TrainAndTest activity, but uses active training in order to choose the documents to add to the training set. The method simulates the process that an actual user would go through in order to label documents for addition to the training set, then tests performance at the specified intervals.

public runActiveTrainAndTest(string $label, string $dataset_name) : mixed
Parameters
$label : string

class label of the new classifier

$dataset_name : string

name of the dataset to train and test on

Return values
mixed

runTrainAndTest()

Trains a classifier on a data set, testing at the specified intervals.

public runTrainAndTest(string $label, string $dataset_name) : mixed

The testing interval is set by the test_interval parameter. Each time this activity is run, a new classifier is created (replacing an old one with the same label, if necessary), and that classifier is kept after the run finishes.

Parameters
$label : string

class label of the new classifier

$dataset_name : string

name of the dataset to train and test on

Return values
mixed

setDefault()

Sets a default value for a runtime parameter. This method is used by activities to specify default values that may be overridden by passing the appropriate command-line flag.

public setDefault(string $name, string $value) : mixed
Parameters
$name : string

should end with name of runtime parameter to set

$value : string

what to set it to

Return values
mixed

setOptions()

Sets one or more options of the form NAME=VALUE according to a converter such as intval, floatval, and so on. The options may be passed in either as a string (a single option) or as an array of strings, where each string corresponds to an option of the same type (e.g., int).

public setOptions(string|array<string|int, mixed> $opts[, string $converter = null ]) : mixed
Parameters
$opts : string|array<string|int, mixed>

single option in the format NAME=VALUE, or array of options, each for the same target type (e.g., int)

$converter : string = null

the name of a function that takes a string and casts it to a particular type (e.g., intval, floatval)

Return values
mixed
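
The NAME=VALUE handling with a converter can be sketched as follows. parseNameValueOptions is a hypothetical function that returns the converted pairs instead of updating $options, and silently skipping strings without an equals sign is an assumption, not documented behavior.

```php
<?php
/**
 * Hypothetical sketch: parses one option string or an array of them,
 * casting each value with the named converter function (e.g., intval).
 */
function parseNameValueOptions($opts, ?string $converter = null): array
{
    $parsed = [];
    // A single string is treated as a one-element list.
    foreach ((array) $opts as $opt) {
        if (strpos($opt, '=') === false) {
            continue; // assumption: ignore malformed entries
        }
        [$name, $value] = explode('=', $opt, 2);
        $parsed[$name] = ($converter === null)
            ? $value
            : call_user_func($converter, $value);
    }
    return $parsed;
}
```

For example, parseNameValueOptions(['cls.chi2.max=500', 'debug=2'], 'intval') yields integer values for both names, which could then be stored under the corresponding nested keys.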

testClassifier()

Finalizes the current classifier, uses it to classify all test documents, and logs the classification error. The current classifier is saved to disk after finalizing (though not before), and left in `classify' mode. The iterator over the test dataset is reset for the next round of testing (if any).

public testClassifier(object $classifier, array<string|int, mixed> $data) : mixed
Parameters
$classifier : object

classifier instance to test

$data : array<string|int, mixed>

the array of training and test datasets, constructed by loadDataset, of which only the `test' dataset is used.

Return values
mixed
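
The documentation does not define how the logged classification error is computed. One common definition, assumed here purely for illustration, is the fraction of test documents whose predicted label differs from the true label. classificationError is a hypothetical function and does not reflect the tool's actual code.

```php
<?php
/**
 * Hypothetical sketch: fraction of predictions that disagree with the
 * true labels (assumed definition of "classification error").
 */
function classificationError(array $predicted, array $actual): float
{
    $total = count($actual);
    if ($total == 0) {
        return 0.0; // nothing to test, so report no error
    }
    $wrong = 0;
    foreach ($actual as $i => $label) {
        if ($predicted[$i] !== $label) {
            $wrong++;
        }
    }
    return $wrong / $total;
}
```

With predictions [1, 0, 1, 1] against true labels [1, 0, 0, 1], one of four documents is misclassified, giving an error of 0.25.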

        
