ClassifierTool
Class used to encapsulate all the activities of the ClassifierTool.php command line script. This script allows one to automate the building and testing of classifiers, providing an alternative to the web interface when a labeled training set is available.
Table of Contents
- $options : array<string|int, mixed>
- Options to be used by activities and constructed classifiers. These options can be overridden by supplying an appropriate flag on the command line, where nesting is denoted by a period (e.g., cls.chi2.max).
- $classifier_controller : object
- Reference to a classifier controller, used to manipulate crawl mixes in the same way that the controller that handles web requests does.
- $crawl_model : object
- Reference to a crawl model object, also used to manipulate crawl mixes.
- __construct() : mixed
- Initializes the classifier controller and crawl model that will be used to manage crawl mixes, used for iterating over labeled examples.
- deleteClassifier() : mixed
- Deletes an existing classifier, specified by its label.
- isTestPoint() : bool
- Determines whether to run a classification test after a certain number of documents have been added to the training set. Whether or not to test is determined by the `test_interval' option, which may be either null, an integer, or a string. In the first case, testing only occurs after all training examples have been added; in the second case, testing occurs each time an additional constant number of training examples have been added; and in the final case, testing occurs on a fixed schedule of comma-separated offsets, such as "10,25,50,100".
- loadDataset() : array<string|int, mixed>
- Fetches the summaries for pages in the indices specified by the passed dataset name. This method looks for existing indexes with names matching the dataset name prefix, and with suffix either "pos" or "neg" (ignoring case). The pages in these indexes are shuffled into one large array, and augmented with a TRUE_LABEL field that records which set they came from originally. The shuffled array is then split according to the `split' option, and all pages up to (but not including) the split index are used for the training set; the remaining pages are used for the test set.
- log() : mixed
- Writes out logging information according to a detail level. The first argument is an integer (potentially negative) indicating the level of detail for the log message, where larger numbers indicate greater detail. Each message is prefixed with a character according to its level of detail, but if the detail level is greater than the level specified by the `debug' option then nothing is printed.
- logOptions() : mixed
- Logs the current options using the log method of this class. This method is used to explicitly state which settings were used for a given run of an activity. The detail level passed to the log method is -1.
- main() : mixed
- Parses the options, and if an appropriate activity exists, calls the activity, passing in the label and dataset to be used; otherwise, prints an error and exits.
- makeFreshClassifier() : object
- Creates a new classifier for a label, first deleting any existing classifier with the same label.
- parseOptions() : array<string|int, mixed>
- Parses the command-line options, returns the required arguments, and updates the member variable $options with any parameters. If any of the required arguments (activity, dataset, or label) are missing, then a message is printed and the program exits. The optional arguments used to set parameters directly modify the class state through the setOptions method.
- runActiveTrainAndTest() : mixed
- Like the TrainAndTest activity, but uses active training in order to choose the documents to add to the training set. The method simulates the process that an actual user would go through in order to label documents for addition to the training set, then tests performance at the specified intervals.
- runTrainAndTest() : mixed
- Trains a classifier on a data set, testing at the specified intervals.
- setDefault() : mixed
- Sets a default value for a runtime parameter. This method is used by activities to specify default values that may be overridden by passing the appropriate command-line flag.
- setOptions() : mixed
- Sets one or more options of the form NAME=VALUE according to a converter such as intval, floatval, and so on. The options may be passed in either as a string (a single option) or as an array of strings, where each string corresponds to an option of the same type (e.g., int).
- testClassifier() : mixed
- Finalizes the current classifier, uses it to classify all test documents, and logs the classification error. The current classifier is saved to disk after finalizing (though not before), and left in `classify' mode. The iterator over the test dataset is reset for the next round of testing (if any).
Properties
$options
Options to be used by activities and constructed classifiers. These options can be overridden by supplying an appropriate flag on the command line, where nesting is denoted by a period (e.g., cls.chi2.max).
public
array<string|int, mixed>
$options
= ['debug' => 0, 'max_train' => null, 'test_interval' => null, 'split' => 3000, 'cls' => ['use_nb' => false, 'chi2' => ['max' => 200]]]
The supported options are:
debug: An integer, the level of debug statements to print. Larger integers specify more detailed debug output; the default value of 0 indicates no debug output.
max_train: An integer, the maximum number of examples to use when training a classifier. The default value of null indicates that all available training examples should be used.
test_interval: An integer, the number of new training examples to be added before a round of testing on ALL test instances is to be executed, or a string of comma-separated training-set offsets (such as "10,25,50,100") at which to test. With an interval of 5, for example, after adding five new training examples, the classifier would be finalized and used to classify all test instances. The error is reported for each round of testing. The default value of null indicates that testing should only occur after all training examples have been added.
split: An integer, the number of examples from the entire set of labeled examples to use for training. The remainder are used for testing.
cls.use_nb: A boolean, whether or not to use the Naive Bayes classification algorithm instead of the logistic regression one in order to finalize the classifier. The default value is false, indicating that logistic regression should be used.
cls.chi2.max: An integer, the maximum number of features to use when training the classifier. The default is a relatively conservative 200.
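To illustrate the period-delimited nesting, the sketch below shows one way a name such as cls.chi2.max could be resolved against the nested $options array. The helper name setNestedOption is hypothetical and is not part of ClassifierTool; the class's own handling may differ.

```php
<?php
/**
 * Hypothetical helper (not part of ClassifierTool): sets a value in a
 * nested array using a period-delimited name such as "cls.chi2.max".
 */
function setNestedOption(array &$options, string $name, $value): void
{
    $target = &$options;
    foreach (explode('.', $name) as $key) {
        // Create an array at this level if there is none to descend into.
        if (!isset($target[$key]) || !is_array($target[$key])) {
            $target[$key] = [];
        }
        $target = &$target[$key];
    }
    // $target now refers to the leaf entry; overwrite it with the value.
    $target = $value;
}

// Start from the documented defaults and override two of them.
$options = ['debug' => 0, 'max_train' => null, 'test_interval' => null,
    'split' => 3000, 'cls' => ['use_nb' => false, 'chi2' => ['max' => 200]]];
setNestedOption($options, 'cls.chi2.max', 500);
setNestedOption($options, 'debug', 2);
```

Sibling entries such as cls.use_nb are left untouched, since only the path named by the flag is walked.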
$classifier_controller
Reference to a classifier controller, used to manipulate crawl mixes in the same way that the controller that handles web requests does.
protected
object
$classifier_controller
$crawl_model
Reference to a crawl model object, also used to manipulate crawl mixes.
protected
object
$crawl_model
Methods
__construct()
Initializes the classifier controller and crawl model that will be used to manage crawl mixes, used for iterating over labeled examples.
public
__construct() : mixed
Return values
mixed
deleteClassifier()
Deletes an existing classifier, specified by its label.
public
deleteClassifier(string $label) : mixed
Parameters
- $label : string
-
class label of the existing classifier
Return values
mixed
isTestPoint()
Determines whether to run a classification test after a certain number of documents have been added to the training set. Whether or not to test is determined by the `test_interval' option, which may be either null, an integer, or a string. In the first case, testing only occurs after all training examples have been added; in the second case, testing occurs each time an additional constant number of training examples have been added; and in the final case, testing occurs on a fixed schedule of comma-separated offsets, such as "10,25,50,100".
public
isTestPoint(int $i, int $total) : bool
Parameters
- $i : int
-
the size of the current training set
- $total : int
-
the total number of documents available to be added to the training set
Return values
bool —true if the `test_interval' option specifies that a round of testing should occur for the current training offset, and false otherwise
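The three cases can be sketched as a stand-alone function. This is a minimal illustration, not the class's code: the name isTestPointSketch is hypothetical, the option is passed as a third argument rather than read from $options, and whether a final round of testing also runs at $total in the integer and string cases is not stated above, so the sketch does not assume it.

```php
<?php
/**
 * Hypothetical stand-alone version of the test-point decision; the
 * $test_interval argument stands in for the `test_interval' option.
 */
function isTestPointSketch(int $i, int $total, $test_interval): bool
{
    if ($test_interval === null) {
        // Test only once every training example has been added.
        return $i === $total;
    }
    if (is_int($test_interval)) {
        // Test each time another $test_interval examples have been added.
        return $test_interval > 0 && $i > 0 && $i % $test_interval === 0;
    }
    // Otherwise treat the option as a string of comma-separated offsets.
    $offsets = array_map('intval', explode(',', $test_interval));
    return in_array($i, $offsets, true);
}
```

With a schedule of "10,25,50,100", for instance, the sketch returns true at a training-set size of 25 and false at 30.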
loadDataset()
Fetches the summaries for pages in the indices specified by the passed dataset name. This method looks for existing indexes with names matching the dataset name prefix, and with suffix either "pos" or "neg" (ignoring case). The pages in these indexes are shuffled into one large array, and augmented with a TRUE_LABEL field that records which set they came from originally. The shuffled array is then split according to the `split' option, and all pages up to (but not including) the split index are used for the training set; the remaining pages are used for the test set.
public
loadDataset(string $dataset_name, string $class_label) : array<string|int, mixed>
Parameters
- $dataset_name : string
-
prefix of index names to draw examples from
- $class_label : string
-
class label of the classifier the examples will be used to train (used to name the crawl mix that iterates over each index)
Return values
array<string|int, mixed> —training and test datasets in an associative array with
keys `train' and `test', where each dataset is wrapped up in a PageIterator that implements the CrawlMixIterator interface.
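The shuffle-and-split step described above can be sketched with plain arrays. The function name shuffleAndSplit, the label values 1 and -1, and the TITLE field are assumptions made for illustration; the real method reads pages from indexes and wraps each dataset in a PageIterator, which this sketch omits.

```php
<?php
/**
 * Hypothetical sketch of the shuffle-and-split step using plain arrays.
 * The values 1 and -1 are placeholder contents for the TRUE_LABEL field.
 */
function shuffleAndSplit(array $pos_pages, array $neg_pages, int $split): array
{
    $pages = [];
    foreach ($pos_pages as $page) {
        $page['TRUE_LABEL'] = 1;   // came from the "pos" index
        $pages[] = $page;
    }
    foreach ($neg_pages as $page) {
        $page['TRUE_LABEL'] = -1;  // came from the "neg" index
        $pages[] = $page;
    }
    shuffle($pages);
    return [
        // Pages up to (but not including) the split index are for training.
        'train' => array_slice($pages, 0, $split),
        // The remaining pages are for testing.
        'test' => array_slice($pages, $split),
    ];
}

$pos = [['TITLE' => 'a'], ['TITLE' => 'b'], ['TITLE' => 'c']];
$neg = [['TITLE' => 'x'], ['TITLE' => 'y']];
$data = shuffleAndSplit($pos, $neg, 3);
```

Because the pages are shuffled before the split, the training and test sets each receive a random mix of positive and negative examples.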
log()
Writes out logging information according to a detail level. The first argument is an integer (potentially negative) indicating the level of detail for the log message, where larger numbers indicate greater detail. Each message is prefixed with a character according to its level of detail, but if the detail level is greater than the level specified by the `debug' option then nothing is printed.
public
log() : mixed
The available detail levels are treated as follows:
-2: Used for errors; always printed; prefix '! '
-1: Used for log of set options; always printed; prefix '# '
0+: Used for normal messages; prefix '> '
The second argument is a printf-style string template specifying the message, and each following (optional) argument is used by the template. A newline is added automatically to each message.
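The filtering and prefixing rules can be sketched as follows. The function formatLogLine is hypothetical: it takes the `debug' level as an explicit argument and returns the formatted line (or null when the message is suppressed) instead of printing it, and it treats any level below -1 as an error, which is an assumption.

```php
<?php
/**
 * Hypothetical sketch of the level-based filtering and prefixing; returns
 * the formatted line, or null when the message would be suppressed.
 */
function formatLogLine(int $debug, int $level, string $template, ...$args): ?string
{
    // Negative levels are always printed; others only up to $debug.
    if ($level >= 0 && $level > $debug) {
        return null;
    }
    if ($level <= -2) {
        $prefix = '! ';   // errors
    } elseif ($level === -1) {
        $prefix = '# ';   // log of set options
    } else {
        $prefix = '> ';   // normal messages
    }
    // A newline is added automatically to each message.
    return $prefix . vsprintf($template, $args) . "\n";
}
```

Under these assumptions, formatLogLine(0, 1, 'x') returns null because detail level 1 exceeds a debug level of 0, while an error at level -2 is formatted regardless of the debug level.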
Return values
mixed
logOptions()
Logs the current options using the log method of this class. This method is used to explicitly state which settings were used for a given run of an activity. The detail level passed to the log method is -1.
public
logOptions([string $root = null ][, string $prefix = '' ]) : mixed
Parameters
- $root : string = null
-
folder to write to
- $prefix : string = ''
-
prefix (such as Warning) to put at the start of each log message
Return values
mixed
main()
Parses the options, and if an appropriate activity exists, calls the activity, passing in the label and dataset to be used; otherwise, prints an error and exits.
public
main() : mixed
Return values
mixed
makeFreshClassifier()
Creates a new classifier for a label, first deleting any existing classifier with the same label.
public
makeFreshClassifier(string $label) : object
Parameters
- $label : string
-
class label of the new classifier
Return values
object —created classifier instance
parseOptions()
Parses the command-line options, returns the required arguments, and updates the member variable $options with any parameters. If any of the required arguments (activity, dataset, or label) are missing, then a message is printed and the program exits. The optional arguments used to set parameters directly modify the class state through the setOptions method.
public
parseOptions() : array<string|int, mixed>
Return values
array<string|int, mixed> —the parsed activity, dataset, and label
runActiveTrainAndTest()
Like the TrainAndTest activity, but uses active training in order to choose the documents to add to the training set. The method simulates the process that an actual user would go through in order to label documents for addition to the training set, then tests performance at the specified intervals.
public
runActiveTrainAndTest(string $label, string $dataset_name) : mixed
Parameters
- $label : string
-
class label of the new classifier
- $dataset_name : string
-
name of the dataset to train and test on
Return values
mixed
runTrainAndTest()
Trains a classifier on a data set, testing at the specified intervals.
public
runTrainAndTest(string $label, string $dataset_name) : mixed
The testing interval is set by the test_interval parameter. Each time this activity is run a new classifier is created (replacing an old one with the same label, if necessary), and the classifier remains at the end.
Parameters
- $label : string
-
class label of the new classifier
- $dataset_name : string
-
name of the dataset to train and test on
Return values
mixed
setDefault()
Sets a default value for a runtime parameter. This method is used by activities to specify default values that may be overridden by passing the appropriate command-line flag.
public
setDefault(string $name, string $value) : mixed
Parameters
- $name : string
-
should end with name of runtime parameter to set
- $value : string
-
what to set it to
Return values
mixed
setOptions()
Sets one or more options of the form NAME=VALUE according to a converter such as intval, floatval, and so on. The options may be passed in either as a string (a single option) or as an array of strings, where each string corresponds to an option of the same type (e.g., int).
public
setOptions(string|array<string|int, mixed> $opts[, string $converter = null ]) : mixed
Parameters
- $opts : string|array<string|int, mixed>
-
single option in the format NAME=VALUE, or array of options, each for the same target type (e.g., int)
- $converter : string = null
-
the name of a function that takes a string and casts it to a particular type (e.g., intval, floatval)
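As a rough illustration of the NAME=VALUE format and the role of the converter, the sketch below parses one or more option strings into an array. The function parseOptionStrings is hypothetical; unlike setOptions it returns the parsed values rather than updating $options, and it keeps names flat instead of resolving period-delimited nesting.

```php
<?php
/**
 * Hypothetical sketch: parses NAME=VALUE strings and applies an optional
 * converter such as intval or floatval to each value.
 */
function parseOptionStrings($opts, ?string $converter = null): array
{
    $parsed = [];
    // A single string is treated as a one-element array of options.
    foreach ((array) $opts as $opt) {
        // Split on the first '=' only, so values may themselves contain '='.
        $parts = explode('=', $opt, 2);
        if (count($parts) !== 2) {
            continue; // Skip strings that are not of the form NAME=VALUE.
        }
        [$name, $value] = $parts;
        $parsed[$name] = $converter === null ? $value : $converter($value);
    }
    return $parsed;
}

$single = parseOptionStrings('debug=2', 'intval');
$several = parseOptionStrings(['split=100', 'max_train=50'], 'intval');
```

Passing the converter by name keeps all options in one call on the same target type, which matches the description of $opts above.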
Return values
mixed
testClassifier()
Finalizes the current classifier, uses it to classify all test documents, and logs the classification error. The current classifier is saved to disk after finalizing (though not before), and left in `classify' mode. The iterator over the test dataset is reset for the next round of testing (if any).
public
testClassifier(object $classifier, array<string|int, mixed> $data) : mixed
Parameters
- $classifier : object
-
classifier instance to test
- $data : array<string|int, mixed>
-
the array of training and test datasets, constructed by loadDataset, of which only the `test' dataset is used.
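The error computation can be sketched as below, assuming that "classification error" means the fraction of test pages whose predicted label differs from TRUE_LABEL; the documentation above does not define the metric precisely. The function classificationError, the $classify callable, and the SCORE field are all hypothetical stand-ins, and finalizing and saving the classifier are omitted.

```php
<?php
/**
 * Hypothetical sketch: fraction of test pages whose predicted label
 * differs from TRUE_LABEL. $classify is a stand-in for a classifier.
 */
function classificationError(callable $classify, array $test_pages): float
{
    if (count($test_pages) === 0) {
        return 0.0;
    }
    $wrong = 0;
    foreach ($test_pages as $page) {
        if ($classify($page) !== $page['TRUE_LABEL']) {
            $wrong++;
        }
    }
    return $wrong / count($test_pages);
}

// Toy test set: SCORE is a made-up field that the stand-in thresholds.
$pages = [
    ['SCORE' => 0.9, 'TRUE_LABEL' => 1],
    ['SCORE' => 0.2, 'TRUE_LABEL' => 1],
    ['SCORE' => 0.1, 'TRUE_LABEL' => -1],
    ['SCORE' => 0.7, 'TRUE_LABEL' => -1],
];
$classify = function (array $page): int {
    return $page['SCORE'] >= 0.5 ? 1 : -1;
};
$error = classificationError($classify, $pages);
```

In this toy set the stand-in misclassifies the second and fourth pages, so half of the test pages are counted as errors.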