Classifier
implements CrawlConstants
The primary interface for building and using classifiers. An instance of this class represents a single classifier in memory, but the class also provides static methods to manage classifiers on disk.
A single classifier is a tool for determining the likelihood that a document is a positive instance of a particular class. To do this, a classifier goes through a training phase on a labeled training set, where it learns weights for document features (terms, for our purposes). To classify a new document, the learned weights for all terms in the document are combined to yield a pseudo-probability that the document belongs to the class.
A classifier is composed of a candidate buffer, a training set, a set of features, and a classification algorithm. In addition to the set of all features, there is a restricted set of features used for training and classification. There are also two classification algorithms: a Naive Bayes algorithm used during labeling, and a logistic regression algorithm used to train the final classifier. In general, a fresh classifier will first go through a labeling phase where a collection of labeled training documents is built up out of existing crawl indexes, and then a finalization phase where the logistic regression algorithm will be trained on the training set established in the first phase. After finalization, the classifier may be used to classify new web pages during a crawl.
During the labeling phase, the classifier fills a buffer of candidate pages from the user-selected index (optionally restricted by a query), and tries to pick the best one to present to the user to be labeled (here `best' means the one that, once labeled, is most likely to improve classification accuracy). Each labeled document is removed from the buffer, converted to a feature vector (described next), and added to the training set. The expanded training set is then used to train an intermediate Naive Bayes classification algorithm that is in turn used to more accurately identify good candidates for the next round of labeling. This phase continues until the user gets tired of labeling documents, or is happy with the estimated classification accuracy.
Instead of passing around terms everywhere, each document that goes into the training set is first mapped through a Features instance that maps terms to feature indices (e.g. "Pythagorean" => 1, "theorem" => 2, etc.). These feature indices are used internally by the classification algorithms, and by the algorithms that try to pick out the most informative features. In addition to keeping track of the mapping between terms and feature indices, a Features instance keeps term and label statistics (such as how often a term occurs in documents with a particular label) used to weight features within a document and to select informative features. Finally, subclasses of the Features class weight features in different ways, presenting more or less of everything that's known about the frequency or informativeness of a feature to classification algorithms.
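The term-to-index mapping described above can be sketched as follows. This is a minimal illustration only: the class name TermIndex and its members are hypothetical stand-ins, not the actual Features API, and the label statistics kept here are just one plausible form of the statistics mentioned in the text.

```php
<?php
// Hypothetical stand-in for a Features instance: maps terms to integer
// feature indices and tracks per-label document counts for each feature.
class TermIndex
{
    /** @var array<string, int> term => feature index (1-based) */
    public $vocab = [];
    /** @var array<int, array<int, int>> feature index => [label => doc count] */
    public $label_counts = [];

    // Returns the index for a term, assigning the next free index if unseen.
    public function index(string $term): int
    {
        if (!isset($this->vocab[$term])) {
            $this->vocab[$term] = count($this->vocab) + 1;
        }
        return $this->vocab[$term];
    }

    // Converts a term => frequency map into a feature vector
    // (feature index => frequency) and records label statistics.
    public function addExample(array $term_freqs, int $label): array
    {
        $vector = [];
        foreach ($term_freqs as $term => $freq) {
            $i = $this->index((string)$term);
            $vector[$i] = $freq;
            if (!isset($this->label_counts[$i][$label])) {
                $this->label_counts[$i][$label] = 0;
            }
            $this->label_counts[$i][$label]++;
        }
        return $vector;
    }
}

$features = new TermIndex();
$vector = $features->addExample(['Pythagorean' => 2, 'theorem' => 1], 1);
```

Here $vector holds frequencies keyed by feature index rather than by term, which is the form the classification algorithms are described as consuming.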
Once a sufficiently useful training set has been built, a FeatureSelection instance is used to choose the most informative features and copy them into a reduced Features instance that has a much smaller vocabulary, and thus a much smaller memory footprint. For efficiency, this is the Features instance used to train classification algorithms and to classify web pages.
Finalization is just the process of training a logistic regression classification algorithm on the full training set. This results in a set of feature weights that can be used to efficiently assign a pseudo-probability to the proposition that a new web page is a positive instance of the class that the classifier has been trained to recognize. Training logistic regression on a large training set can take a long time, so this phase is carried out asynchronously by a daemon launched in response to the finalization request.
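To illustrate how learned feature weights can yield a pseudo-probability, here is a small sketch of logistic scoring. The function names, the bias term, and the example weights are all assumptions made for illustration; the source does not show how the actual logistic regression algorithm stores or applies its weights.

```php
<?php
// Logistic (sigmoid) function: squashes any real number into (0, 1).
function sigmoid(float $z): float
{
    return 1.0 / (1.0 + exp(-$z));
}

// Combines weights with a document's feature vector.
// $weights: feature index => weight; $bias: intercept term (assumed);
// $vector: feature index => feature value for the document.
function score(array $weights, float $bias, array $vector): float
{
    $z = $bias;
    foreach ($vector as $index => $value) {
        // Features without a learned weight contribute nothing.
        $z += ($weights[$index] ?? 0.0) * $value;
    }
    return sigmoid($z);
}

$weights = [1 => 1.5, 2 => -0.5];   // made-up weights for illustration
$p = score($weights, 0.0, [1 => 1.0, 2 => 1.0, 7 => 3.0]);
$is_positive = $p >= 0.5;           // 0.5 mirrors the THRESHOLD constant
```

Feature 7 has no weight in this example, which mimics a term dropped by feature selection: it is simply ignored at classification time.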
Because the full Features instance, buffer, and training set are only needed during the labeling and finalization phases, and because they can get very large and take up a lot of space in memory, this class separates its large instance members into separate files when serializing to disk. When a classifier is first loaded into memory from disk it brings along only its summary statistics, since these are all that are needed to, for example, display a list of classifiers. In order to actually add new documents to the training set, finalize, or classify, the classifier must first be explicitly told to load the relevant data structures from disk; this is accomplished by methods like prepareToLabel and prepareToClassify. These methods load in the relevant serialized structures, and mark the associated data members for storage back to disk when (or if) the classifier is serialized again.
Interfaces, Classes, Traits and Enums
- CrawlConstants
- Shared constants and enums used by components that are involved in the crawling process
Table of Contents
- BUFFER_SIZE = 51
- The maximum number of candidate documents to consider at once in order to find the best candidate.
- COMMITTEE_SIZE = 3
- The number of Naive Bayes instances to use to calculate disagreement during candidate selection.
- DENSITY_BETA = 3.0
- Beta parameter used in the computation of a candidate document's density (sharpness of the KL-divergence).
- DENSITY_LAMBDA = 0.5
- Lambda parameter used in the computation of a candidate document's density (smoothing for 0-frequency terms).
- FINALIZED = 2
- Indicates that a classifier has been finalized, and is ready to be used for classification.
- FINALIZING = 1
- Indicates that a classifier is currently being finalized (this may take a while).
- MAX_DISAGREEMENT = 1.63652
- The maximum disagreement score between candidates. This number depends on committee size, and is used to provide a slightly more user-friendly estimate of how much disagreement a document causes (between 0 and 1).
- THRESHOLD = 0.5
- Threshold used to convert a pseudo-probability to a hard classification decision. Documents with pseudo-probability >= THRESHOLD are classified as positive instances.
- UNFINALIZED = 0
- Indicates that a classifier needs to be finalized before it can be used.
- $accuracy : float
- The estimated classification accuracy. This member may be null if the accuracy has not yet been estimated, or out of date if examples have been added to the training set since the last accuracy update, but no new estimate has been computed.
- $buffer : array<string|int, mixed>
- The current pool of candidates for labeling. The first element in the buffer is always the active document, and as active documents are labeled and removed, the pool is refreshed with new candidates (if there are more pages to be drawn from the active index). The buffer is represented as an associative array with three fields: 'docs', the candidate page summaries; 'densities', an array of densities computed for the documents in the candidate pool; and 'stats', statistics about the terms and documents in the current pool.
- $class_label : string
- The label applied to positive instances of the class learned by this classifier (e.g., `spam').
- $docs : array<string|int, mixed>
- The training set, broken up into two fields of an associative array: 'features', an array of document feature vectors; and 'labels', the labels assigned to each document.
- $final_algorithm : object
- The finalized classification algorithm that will be used to classify new web pages. Will usually be logistic regression, but may be Naive Bayes, if set by the options. During labeling, this field is a reference to the Naive Bayes classification algorithm (so that that algorithm will be used by the `classify' method), but it won't be saved to disk as such.
- $final_features : object
- The Features subclass instance used to map documents at classification time to the feature vectors expected by classification algorithms. This will generally be a reduced feature set, just like that used during labeling, but potentially larger than the set used by Naive Bayes.
- $finalized : int
- Finalization status, as determined by one of the three finalization constants.
- $fresh : bool
- Whether or not this classifier has had any training examples added to it, and consequently whether or not its Naive Bayes classification algorithm has ever been trained.
- $full_features : object
- The Features subclass instance used to manage the full set of features seen across all documents in the training set.
- $label_algorithm : object
- The NaiveBayes classification algorithm used during training to tentatively classify documents presented to the user for labeling.
- $label_features : object
- The Features subclass instance used to manage the reduced set of features used only by Naive Bayes classification algorithms during the labeling phase.
- $lang : string
- Language of documents in the training set (also how new documents will be treated).
- $loaded_properties : array<string|int, mixed>
- The names of properties set by one of the prepareTo* methods; these properties will be saved back to disk during serialization, while all other properties not listed by the __sleep method will be discarded.
- $negative : int
- The number of negative examples in the training set.
- $options : array<string|int, mixed>
- Default per-classifier options, which may be overridden when constructing a new classifier.
- $positive : int
- The number of positive examples in the training set.
- $timestamp : int
- Creation time as a UNIX timestamp.
- $total : int
- The total number of examples in the training set (sum of positive and negative).
- __construct() : mixed
- Initializes a new classifier with a class label, and options to override the defaults. The timestamp associated with the classifier is taken from the time of construction.
- __sleep() : array<string|int, mixed>
- Magic method that determines which member data will be stored when serializing this class. Only lightweight summary data are stored with the serialized version of this class. The heavier-weight properties are stored in individual, compressed files.
- addAllDocuments() : int
- Iterates entirely through a crawl mix iterator, adding each document (that hasn't already been labeled) to the training set with a single label. This function works by running through the iterator, filling up the candidate buffer with all unlabeled documents, then repeatedly dropping the first buffer document and adding it to the training set.
- addBufferDoc() : mixed
- Adds a page to the end of the candidate buffer, keeping the associated statistics up to date. During active training, each document in the buffer is tokenized, and the terms weighted by frequency; the term frequencies across documents in the buffer are tracked as well. With no active training, the buffer is simply an array of page summaries.
- classify() : float
- Classifies a page summary using the current final classification algorithm and features, and returns the classification score. This method is also used during the labeling phase to provide a tentative label for candidates, and in this case the final algorithm is actually a reference to a Naive Bayes instance and final_features is a reference to label_features; neither of these gets saved to disk, however.
- cleanLabel() : mixed
- Removes all but alphanumeric characters and underscores from a label, so that it may be easily saved to disk and used in queries as a meta word.
- computeBufferDensities() : mixed
- Computes from scratch the buffer densities of the documents in the current candidate pool. This is an expensive operation that requires the computation of the KL-divergence between each ordered pair of documents in the pool, approximately O(N^2) computations, total (where N is the number of documents in the pool). The densities are saved in the buffer data structure.
- deleteClassifier() : mixed
- Deletes the directory corresponding to a class label, and all of its contents. In the case that there is no classifier with the passed in label, does nothing.
- dropBufferDoc() : mixed
- Removes the document at the front of the candidate buffer. During active training the cross-document statistics for terms occurring in the document being removed are maintained.
- finalize() : mixed
- Trains the final classification algorithm on the full training set, using a subset of the full feature set. The final algorithm will usually be logistic regression, but can be set to Naive Bayes with the appropriate runtime option. Once finalization completes, updates the `finalized' attribute.
- findNextDocumentToLabel() : array<string|int, mixed>
- Finds the next best document for labeling amongst the documents in the candidate buffer, moves that candidate to the front of the buffer, and returns it. The best candidate is the one with the maximum product of disagreement and density, where the density has already been calculated for each document in the current pool, and the disagreement is the KL-divergence between the classification scores obtained from a committee of Naive Bayes classifiers, each sampled from the current set of features.
- getClassifier() : object
- Returns the minimal classifier instance corresponding to a class label, or null if no such classifier exists on disk.
- getClassifierList() : array<string|int, mixed>
- Returns an array of classifier instances currently stored in the classifiers directory. The array maps class labels to their corresponding classifiers, and each classifier is a minimal instance, containing only summary statistics.
- getCrawlMixName() : string
- Returns a name for the crawl mix associated with a class label.
- initBuffer() : int
- Drops any existing candidate buffer, re-initializes the buffer structure, then calls refreshBuffer to fill it. Takes an optional buffer size, which can be used to limit the buffer to something other than the number imposed by the runtime parameter. Returns the final buffer size.
- klDivergenceToMean() : float
- Calculates the KL-divergence to the mean for a collection of discrete two-element probability distributions. Each distribution is specified by a single probability, p, since the second probability is just 1 - p. The KL-divergence to the mean is used as a measure of disagreement between members of a committee of classifiers, where each member assigns a classification score to the same document.
- labelDocument() : bool
- Updates the buffer and training set to reflect the label given to a new document. The label may be -1, 1, or 0, where the first two correspond to a negative or positive example, and the last to a skip. The handling for a skip is necessarily different from that for a positive or negative label, and matters are further complicated by the possibility that we may be changing a label for a document that's already in the training set, rather than adding a new document. This function returns true if the new label resulted in a change to the training set, and false otherwise (i.e., if the user simply skipped labeling the candidate document).
- labelPage() : mixed
- Given a page summary (passed by reference) and a list of classifiers, augments the summary meta words with the class label of each classifier that scores the summary above a threshold. This static method is used by fetchers to classify downloaded pages. In addition to the class label, the pseudo-probability that the document belongs to the class is recorded as well. This is recorded both as the score rounded down to the nearest multiple of ten, and as "<n>plus" for each multiple of ten, n, less than the score and greater than or equal to the threshold.
- loadClassifiersData() : array<string|int, mixed>
- Given a list of class labels, returns an array mapping each class label to an array of data necessary for initializing a classifier for that label. This static method is used to prepare a collection of classifiers for distribution to fetchers, so that each fetcher can classify pages as it downloads them. The only extra properties passed along in addition to the base classification data are the final features and final algorithm, both necessary for classifying new documents.
- loadProperties() : mixed
- Loads class attributes from compressed, serialized files on disk, and stores their names so that they will be saved back to disk later. Each property (if it has been previously set) is stored in its own file under the classifier's data directory, named after the property. The file is compressed using gzip, but without gzip headers, so it can't actually be decompressed by the standard gzip utility. If a file doesn't exist, then the instance property is left untouched. The property names are passed as a variable number of arguments.
- makeKey() : string
- Returns a key that can be used internally to refer to a particular page summary.
- moveBufferDocToFront() : mixed
- Moves a document in the candidate buffer up to the front, in preparation for a label request. The document is specified by its index in the buffer.
- newClassifierFromData() : object
- The dual of loadClassifiersData, this static method reconstitutes a Classifier instance from an array containing the necessary data. This gets called by each fetcher, using the data that it receives from the name server when establishing a new crawl.
- prepareToClassify() : mixed
- Prepare to classify new web pages. This operation requires only the final features and classification algorithm, which are expected to be defined after the finalization phase.
- prepareToFinalize() : mixed
- Prepare to train a final classification algorithm on the full training set. This operation requires the full training set and features, but not the candidate buffer used during labeling. Note that any existing final features and classification algorithm are simply zeroed out; they are only loaded from disk so that they will be written back after finalization completes.
- prepareToLabel() : mixed
- Prepare this classifier instance for labeling. This operation requires all of the heavyweight member data save the final features and algorithm. Note that these properties are set to references to the Naive Bayes features and algorithm, so that Naive Bayes will be used to tentatively classify documents during labeling (purely to give the user some feedback on how the training set is performing).
- refreshBuffer() : int
- Adds as many new documents to the candidate buffer as necessary to reach the specified buffer size, which defaults to the runtime parameter.
- setClassifier() : mixed
- Stores a classifier instance to disk, first separating it out into individual files containing serialized and compressed property data. The basic classifier information, such as class label and summary statistics, is stored uncompressed in a file called `classifier.txt'.
- storeLoadedProperties() : mixed
- Stores the data associated with each property name listed in the loaded_properties instance attribute back to disk. The data for each property is stored in its own serialized and compressed file, and made world-writable.
- tokenizeDescription() : array<string|int, mixed>
- Tokenizes a string into a map from terms to within-string frequencies.
- train() : mixed
- Trains the Naive Bayes classification algorithm used during labeling on the current training set, and optionally updates the estimated accuracy.
- updateAccuracy() : mixed
- Estimates current classification accuracy using a Naive Bayes classification algorithm. Accuracy is estimated by splitting the current training set into fifths, reserving four fifths for training, and the remaining fifth for testing. A fresh classifier is trained and tested on these splits, and the total accuracy recorded. Then the splits are rotated so that the previous testing fifth becomes part of the training set, and one of the blocks from the previous training set becomes the testing set. A new classifier is trained and tested on the new splits, and, again, the accuracy recorded. This process is repeated until all blocks have been used for testing, and the average accuracy recorded.
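The rotation of training and testing blocks described for updateAccuracy is k-fold cross-validation with five folds. The sketch below is a generic version under stated assumptions: the function name and the callable-based interface are hypothetical, examples are assigned to folds by index modulo the fold count (the source does not say how the fifths are formed), and any training routine can be plugged in where the source uses Naive Bayes.

```php
<?php
// Estimates accuracy by k-fold cross-validation.
// $train:   callable(array $vectors, array $labels) returning a model
// $predict: callable($model, $vector) returning a predicted label
function crossValidate(array $vectors, array $labels, callable $train,
    callable $predict, int $folds = 5): float
{
    $vectors = array_values($vectors);
    $labels = array_values($labels);
    $n = count($vectors);
    $accuracies = [];
    for ($k = 0; $k < $folds; $k++) {
        $train_x = $train_y = $test_x = $test_y = [];
        for ($i = 0; $i < $n; $i++) {
            // Example i is in the testing block when i mod folds == k.
            if ($i % $folds == $k) {
                $test_x[] = $vectors[$i];
                $test_y[] = $labels[$i];
            } else {
                $train_x[] = $vectors[$i];
                $train_y[] = $labels[$i];
            }
        }
        if (!$test_x) {
            continue; // fewer examples than folds
        }
        // A fresh model is trained for every split.
        $model = $train($train_x, $train_y);
        $correct = 0;
        foreach ($test_x as $j => $x) {
            if ($predict($model, $x) === $test_y[$j]) {
                $correct++;
            }
        }
        $accuracies[] = $correct / count($test_x);
    }
    return $accuracies ? array_sum($accuracies) / count($accuracies) : 0.0;
}

// Toy "classifier" that always predicts the majority training label.
$train = function (array $x, array $y) {
    return array_sum($y) >= 0 ? 1 : -1;
};
$predict = function ($model, $x) {
    return $model;
};
$labels = [1, 1, 1, 1, 1, 1, 1, 1, -1, -1];
$accuracy = crossValidate(range(0, 9), $labels, $train, $predict);
```

Every example is used for testing exactly once, so the averaged figure is less sensitive to a lucky or unlucky split than a single held-out set would be.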
Constants
BUFFER_SIZE
The maximum number of candidate documents to consider at once in order to find the best candidate.
public
mixed
BUFFER_SIZE
= 51
COMMITTEE_SIZE
The number of Naive Bayes instances to use to calculate disagreement during candidate selection.
public
mixed
COMMITTEE_SIZE
= 3
DENSITY_BETA
Beta parameter used in the computation of a candidate document's density (sharpness of the KL-divergence).
public
mixed
DENSITY_BETA
= 3.0
DENSITY_LAMBDA
Lambda parameter used in the computation of a candidate document's density (smoothing for 0-frequency terms).
public
mixed
DENSITY_LAMBDA
= 0.5
FINALIZED
Indicates that a classifier has been finalized, and is ready to be used for classification.
public
mixed
FINALIZED
= 2
FINALIZING
Indicates that a classifier is currently being finalized (this may take a while).
public
mixed
FINALIZING
= 1
MAX_DISAGREEMENT
The maximum disagreement score between candidates. This number depends on committee size, and is used to provide a slightly more user-friendly estimate of how much disagreement a document causes (between 0 and 1).
public
mixed
MAX_DISAGREEMENT
= 1.63652
THRESHOLD
Threshold used to convert a pseudo-probability to a hard classification decision. Documents with pseudo-probability >= THRESHOLD are classified as positive instances.
public
mixed
THRESHOLD
= 0.5
UNFINALIZED
Indicates that a classifier needs to be finalized before it can be used.
public
mixed
UNFINALIZED
= 0
Properties
$accuracy
The estimated classification accuracy. This member may be null if the accuracy has not yet been estimated, or out of date if examples have been added to the training set since the last accuracy update, but no new estimate has been computed.
public
float
$accuracy
$buffer
The current pool of candidates for labeling. The first element in the buffer is always the active document, and as active documents are labeled and removed, the pool is refreshed with new candidates (if there are more pages to be drawn from the active index). The buffer is represented as an associative array with three fields: 'docs', the candidate page summaries; 'densities', an array of densities computed for the documents in the candidate pool; and 'stats', statistics about the terms and documents in the current pool.
public
array<string|int, mixed>
$buffer
$class_label
The label applied to positive instances of the class learned by this classifier (e.g., `spam').
public
string
$class_label
$docs
The training set, broken up into two fields of an associative array: 'features', an array of document feature vectors; and 'labels', the labels assigned to each document.
public
array<string|int, mixed>
$docs
$final_algorithm
The finalized classification algorithm that will be used to classify new web pages. Will usually be logistic regression, but may be Naive Bayes, if set by the options. During labeling, this field is a reference to the Naive Bayes classification algorithm (so that that algorithm will be used by the `classify' method), but it won't be saved to disk as such.
public
object
$final_algorithm
$final_features
The Features subclass instance used to map documents at classification time to the feature vectors expected by classification algorithms. This will generally be a reduced feature set, just like that used during labeling, but potentially larger than the set used by Naive Bayes.
public
object
$final_features
$finalized
Finalization status, as determined by one of the three finalization constants.
public
int
$finalized
= 0
$fresh
Whether or not this classifier has had any training examples added to it, and consequently whether or not its Naive Bayes classification algorithm has ever been trained.
public
bool
$fresh
= true
$full_features
The Features subclass instance used to manage the full set of features seen across all documents in the training set.
public
object
$full_features
$label_algorithm
The NaiveBayes classification algorithm used during training to tentatively classify documents presented to the user for labeling.
public
object
$label_algorithm
$label_features
The Features subclass instance used to manage the reduced set of features used only by Naive Bayes classification algorithms during the labeling phase.
public
object
$label_features
$lang
Language of documents in the training set (also how new documents will be treated).
public
string
$lang
$loaded_properties
The names of properties set by one of the prepareTo* methods; these properties will be saved back to disk during serialization, while all other properties not listed by the __sleep method will be discarded.
public
array<string|int, mixed>
$loaded_properties
= []
$negative
The number of negative examples in the training set.
public
int
$negative
= 0
$options
Default per-classifier options, which may be overridden when constructing a new classifier. The supported options are:
public
array<string|int, mixed>
$options
= ['density' => ['lambda' => 0.5, 'beta' => 3.0], 'threshold' => 0.5, 'label_fs' => ['max' => 30], 'final_fs' => ['max' => 200], 'final_algo' => 'lr']
float density.lambda: Lambda parameter used in the computation of a candidate document's density (smoothing for 0-frequency terms).
float density.beta: Beta parameter used in the computation of a candidate document's density (sharpness of the KL-divergence).
int label_fs.max: Use the `label_fs.max' most informative features to train the Naive Bayes classifiers used during labeling to compute disagreement for a document.
float threshold: Threshold used to convert a pseudo-probability to a hard classification decision. Documents with pseudo-probability >= `threshold' are classified as positive instances.
string final_algo: Algorithm to use for finalization; 'lr' for logistic regression, or 'nb' for Naive Bayes; default 'lr'.
int final_fs.max: Use the `final_fs.max' most informative features to train the final classifier.
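As a rough illustration of overriding these defaults, the sketch below merges a partial options array over the documented default values. Whether the constructor merges nested options recursively is an assumption; array_replace_recursive is simply one standard way to get that behavior, and the override values are made up.

```php
<?php
// Defaults copied from the documented value of the $options property.
$defaults = [
    'density' => ['lambda' => 0.5, 'beta' => 3.0],
    'threshold' => 0.5,
    'label_fs' => ['max' => 30],
    'final_fs' => ['max' => 200],
    'final_algo' => 'lr',
];

// Hypothetical overrides: finalize with Naive Bayes, soften the density.
$overrides = ['final_algo' => 'nb', 'density' => ['beta' => 2.0]];

// Nested keys are replaced individually, so density.lambda keeps its default.
$options = array_replace_recursive($defaults, $overrides);
```

With a plain array_merge the whole 'density' entry would be replaced and the default lambda lost, which is why a recursive merge is used in this sketch.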
$positive
The number of positive examples in the training set.
public
int
$positive
= 0
$timestamp
Creation time as a UNIX timestamp.
public
int
$timestamp
$total
The total number of examples in the training set (sum of positive and negative).
public
int
$total
= 0
Methods
__construct()
Initializes a new classifier with a class label, and options to override the defaults. The timestamp associated with the classifier is taken from the time of construction.
public
__construct(string $label[, array<string|int, mixed> $options = [] ]) : mixed
Parameters
- $label : string
-
class label applied to positive instances of the class this classifier is trained to recognize
- $options : array<string|int, mixed> = []
-
optional associative array of options that will override the default options
Return values
mixed
__sleep()
Magic method that determines which member data will be stored when serializing this class. Only lightweight summary data are stored with the serialized version of this class. The heavier-weight properties are stored in individual, compressed files.
public
__sleep() : array<string|int, mixed>
Return values
array<string|int, mixed> —names of properties to store when serializing this instance
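The effect of restricting serialization to summary data can be seen in a small sketch. TinyClassifier is a hypothetical miniature, not the real class: it only shows how __sleep limits what serialize() stores, so that a heavyweight property comes back with its default value after a round trip.

```php
<?php
// Hypothetical miniature of the pattern: only summary fields survive
// serialize(); heavyweight fields would be written to their own files.
class TinyClassifier
{
    public $class_label;
    public $positive = 0;
    public $negative = 0;
    public $docs = [];   // heavyweight: stands in for the training set

    public function __construct(string $label)
    {
        $this->class_label = $label;
    }

    // Names of the properties that serialize() is allowed to store.
    public function __sleep()
    {
        return ['class_label', 'positive', 'negative'];
    }
}

$classifier = new TinyClassifier('spam');
$classifier->positive = 2;
$classifier->docs = ['features' => [[1 => 1.0]], 'labels' => [1]];

// Round trip: the summary survives, the training set does not.
$copy = unserialize(serialize($classifier));
```

After the round trip, $copy carries the label and counts but an empty $docs, which is the lightweight form suitable for listing classifiers.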
addAllDocuments()
Iterates entirely through a crawl mix iterator, adding each document (that hasn't already been labeled) to the training set with a single label. This function works by running through the iterator, filling up the candidate buffer with all unlabeled documents, then repeatedly dropping the first buffer document and adding it to the training set.
public
addAllDocuments(object $mix_iterator, int $label[, int $limit = INF ]) : int
Returns the total number of newly-labeled documents.
Parameters
- $mix_iterator : object
-
crawl mix iterator to draw documents from
- $label : int
-
label to apply to every document; -1 or 1, but NOT 0
- $limit : int = INF
-
optional upper bound on the number of documents to add; defaults to no limit
Return values
int —total number of newly-labeled documents
addBufferDoc()
Adds a page to the end of the candidate buffer, keeping the associated statistics up to date. During active training, each document in the buffer is tokenized, and the terms weighted by frequency; the term frequencies across documents in the buffer are tracked as well. With no active training, the buffer is simply an array of page summaries.
public
addBufferDoc(array<string|int, mixed> $page[, bool $is_active = true ]) : mixed
Parameters
- $page : array<string|int, mixed>
-
page summary for the document to add to the buffer
- $is_active : bool = true
-
whether this operation is part of active training, in which case some extra statistics must be maintained
Return values
mixed
classify()
Classifies a page summary using the current final classification algorithm and features, and returns the classification score. This method is also used during the labeling phase to provide a tentative label for candidates, and in this case the final algorithm is actually a reference to a Naive Bayes instance and final_features is a reference to label_features; neither of these gets saved to disk, however.
public
classify(array<string|int, mixed> $page) : float
Parameters
- $page : array<string|int, mixed>
-
page summary array for the page to be classified
Return values
float —pseudo-probability that the page is a positive instance of the target class
cleanLabel()
Removes all but alphanumeric characters and underscores from a label, so that it may be easily saved to disk and used in queries as a meta word.
public
static cleanLabel(string $label) : mixed
Parameters
- $label : string
-
class label to clean
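A minimal sketch of this kind of label cleaning, assuming "alphanumeric" means ASCII letters and digits (the source does not say whether other alphabets are accepted). The function name is hypothetical.

```php
<?php
// Keeps only ASCII letters, digits, and underscores (an assumption about
// what "alphanumeric" covers); everything else is removed.
function cleanLabelSketch(string $label): string
{
    return preg_replace('/[^a-zA-Z0-9_]/', '', $label);
}

$clean = cleanLabelSketch('spam/../etc: 2024!');
```

Stripping slashes and dots is what makes the result safe to use as a directory name, and stripping spaces and punctuation keeps it usable as a single query term.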
Return values
mixed
computeBufferDensities()
Computes from scratch the buffer densities of the documents in the current candidate pool. This is an expensive operation that requires the computation of the KL-divergence between each ordered pair of documents in the pool, approximately O(N^2) computations, total (where N is the number of documents in the pool). The densities are saved in the buffer data structure.
public
computeBufferDensities() : mixed
The density of a document is approximated by its average overlap with every other document in the candidate buffer, where the overlap between two documents is itself approximated using the exponentiated negative KL-divergence between them. The KL-divergence is smoothed to handle features (terms) that occur in one distribution (document) but not the other, then multiplied by a negative constant and exponentiated to convert it into a kind of linear overlap score.
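The density computation can be sketched as below. The smoothing scheme is an assumption: the second document's distribution is mixed with the pool-wide term distribution using lambda, which is one common way to avoid zero probabilities, but the source does not give the exact formula. All function names are hypothetical.

```php
<?php
// Normalizes a term => count map into a term => probability map.
function toDistribution(array $counts): array
{
    $total = array_sum($counts);
    $dist = [];
    foreach ($counts as $term => $count) {
        $dist[$term] = $count / $total;
    }
    return $dist;
}

// KL(p || lambda * q + (1 - lambda) * pool). Mixing q with the pool-wide
// distribution keeps the denominator nonzero for terms missing from q.
function smoothedKl(array $p, array $q, array $pool, float $lambda): float
{
    $kl = 0.0;
    foreach ($p as $term => $p_t) {
        if ($p_t <= 0.0) {
            continue; // convention: 0 * log(0 / x) = 0
        }
        $q_t = $lambda * ($q[$term] ?? 0.0) + (1.0 - $lambda) * $pool[$term];
        $kl += $p_t * log($p_t / $q_t);
    }
    return $kl;
}

// Density of each document: its average overlap exp(-beta * KL) with
// every other document in the pool. One KL per ordered pair: O(N^2).
function bufferDensities(array $docs, float $lambda = 0.5,
    float $beta = 3.0): array
{
    $pool_counts = [];
    foreach ($docs as $counts) {
        foreach ($counts as $term => $count) {
            $pool_counts[$term] = ($pool_counts[$term] ?? 0) + $count;
        }
    }
    $pool = toDistribution($pool_counts);
    $dists = array_map('toDistribution', $docs);
    $n = count($dists);
    $densities = [];
    foreach ($dists as $i => $p) {
        $sum = 0.0;
        foreach ($dists as $j => $q) {
            if ($i !== $j) {
                $sum += exp(-$beta * smoothedKl($p, $q, $pool, $lambda));
            }
        }
        $densities[$i] = $n > 1 ? $sum / ($n - 1) : 0.0;
    }
    return $densities;
}

$densities = bufferDensities([
    ['cat' => 2, 'dog' => 1],
    ['cat' => 1, 'dog' => 2],
    ['stock' => 3, 'bond' => 1],
]);
```

In this toy pool the two pet documents resemble each other and so receive a higher density than the finance document, which shares no terms with the rest.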
Return values
mixed
deleteClassifier()
Deletes the directory corresponding to a class label, and all of its contents. In the case that there is no classifier with the passed in label, does nothing.
public
static deleteClassifier(string $label) : mixed
Parameters
- $label : string
-
class label of the classifier to be deleted
Return values
mixed
dropBufferDoc()
Removes the document at the front of the candidate buffer. During active training the cross-document statistics for terms occurring in the document being removed are maintained.
public
dropBufferDoc([bool $is_active = true ]) : mixed
Parameters
- $is_active : bool = true
-
whether this operation is part of active training, in which case some extra statistics must be maintained
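The bookkeeping done by addBufferDoc and dropBufferDoc during active training can be sketched with a simplified buffer. The structure here is an assumption: the real buffer also carries 'densities' and richer statistics, and the tokenization rule (lowercasing, splitting on non-word characters) is invented for the example.

```php
<?php
// Tokenizes text into a term => within-string frequency map.
// (Lowercasing and splitting on non-word characters are assumptions.)
function termFrequencies(string $text): array
{
    $terms = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values($terms);
}

// Appends a document to the end of the buffer and adds its term counts
// to the pool-wide statistics.
function addDoc(array &$buffer, string $text): void
{
    $freqs = termFrequencies($text);
    $buffer['docs'][] = $freqs;
    foreach ($freqs as $term => $count) {
        $buffer['stats'][$term] = ($buffer['stats'][$term] ?? 0) + $count;
    }
}

// Removes the document at the front of the buffer and subtracts its
// term counts, so the statistics always describe the current pool.
function dropFrontDoc(array &$buffer): void
{
    $freqs = array_shift($buffer['docs']);
    if ($freqs === null) {
        return; // buffer was already empty
    }
    foreach ($freqs as $term => $count) {
        $buffer['stats'][$term] -= $count;
        if ($buffer['stats'][$term] <= 0) {
            unset($buffer['stats'][$term]);
        }
    }
}

$buffer = ['docs' => [], 'stats' => []];
addDoc($buffer, 'Cheap pills, cheap watches');
addDoc($buffer, 'Meeting notes and cheap coffee');
dropFrontDoc($buffer);
```

Updating the statistics incrementally on each add and drop avoids rescanning every document in the pool whenever the buffer changes.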
Return values
mixed
finalize()
Trains the final classification algorithm on the full training set, using a subset of the full feature set. The final algorithm will usually be logistic regression, but can be set to Naive Bayes with the appropriate runtime option. Once finalization completes, updates the `finalized' attribute.
public
finalize() : mixed
Return values
mixed
findNextDocumentToLabel()
Finds the next best document for labeling amongst the documents in the candidate buffer, moves that candidate to the front of the buffer, and returns it. The best candidate is the one with the maximum product of disagreement and density, where the density has already been calculated for each document in the current pool, and the disagreement is the KL-divergence between the classification scores obtained from a committee of Naive Bayes classifiers, each sampled from the current set of features.
public
findNextDocumentToLabel() : array<string|int, mixed>
Return values
array<string|int, mixed> —two-element array containing first the best candidate, and second the disagreement score, obtained by dividing the disagreement for the document by the maximum disagreement possible for the committee size
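The selection rule can be sketched as below, assuming the disagreement and density scores have already been computed and are indexed like the candidate buffer. The helper name is hypothetical, and the committee sampling itself is left out.

```php
<?php
/**
 * Returns the buffer index that maximizes disagreement * density
 * (hypothetical helper; both arrays are indexed like the buffer).
 */
function pickBestCandidate(array $disagreements, array $densities): int
{
    $best_index = -1;
    $best_score = -INF;
    foreach ($disagreements as $i => $disagreement) {
        $score = $disagreement * $densities[$i];
        if ($score > $best_score) {
            $best_score = $score;
            $best_index = $i;
        }
    }
    return $best_index;
}
```

Multiplying the two scores favors documents that the committee disagrees about and that also resemble many other candidates, so an isolated outlier with high disagreement does not automatically win.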
getClassifier()
Returns the minimal classifier instance corresponding to a class label, or null if no such classifier exists on disk.
public
static getClassifier(string $label) : object
Parameters
- $label : string
-
classifier's class label
Return values
object —classifier instance with the relevant class label, or null if no such classifier exists on disk
getClassifierList()
Returns an array of classifier instances currently stored in the classifiers directory. The array maps class labels to their corresponding classifiers, and each classifier is a minimal instance, containing only summary statistics.
public
static getClassifierList() : array<string|int, mixed>
Return values
array<string|int, mixed> —associative array of class labels mapped to their corresponding classifier instances
getCrawlMixName()
Returns a name for the crawl mix associated with a class label.
public
static getCrawlMixName(string $label) : string
Parameters
- $label : string
-
class label associated with the crawl mix
Return values
string —name that can be used for the crawl mix associated with $label
initBuffer()
Drops any existing candidate buffer, re-initializes the buffer structure, then calls refreshBuffer to fill it. Takes an optional buffer size, which can be used to limit the buffer to something other than the number imposed by the runtime parameter. Returns the final buffer size.
public
initBuffer(object $mix_iterator[, int $buffer_size = null ]) : int
Parameters
- $mix_iterator : object
-
crawl mix iterator to draw documents from
- $buffer_size : int = null
-
optional buffer size to use; defaults to the runtime parameter
Return values
int —final buffer size
klDivergenceToMean()
Calculates the KL-divergence to the mean for a collection of discrete two-element probability distributions. Each distribution is specified by a single probability, p, since the second probability is just 1 - p. The KL-divergence to the mean is used as a measure of disagreement between members of a committee of classifiers, where each member assigns a classification score to the same document.
public
static klDivergenceToMean(array<string|int, mixed> $ps) : float
Parameters
- $ps : array<string|int, mixed>
-
probabilities describing several discrete two-element probability distributions
Return values
float —KL-divergence to the mean for the collection of distributions
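For reference, a sketch of the usual definition of KL-divergence to the mean for two-element distributions follows. The function name is hypothetical, and details such as the natural logarithm and averaging over the committee are assumptions that may differ from the method documented here.

```php
<?php
/**
 * KL-divergence to the mean for two-element distributions, each given
 * by a single probability p (the other outcome has probability 1 - p).
 * Outcomes with zero probability contribute nothing, following the
 * convention 0 * log(0) = 0.
 */
function klToMean(array $ps): float
{
    $k = count($ps);
    if ($k === 0) {
        return 0.0;
    }
    $mean = array_sum($ps) / $k;
    $total = 0.0;
    foreach ($ps as $p) {
        // Sum over both outcomes of the distribution (p, 1 - p).
        foreach ([[$p, $mean], [1 - $p, 1 - $mean]] as [$x, $m]) {
            if ($x > 0) {
                $total += $x * log($x / $m);
            }
        }
    }
    return $total / $k;
}
```

A committee whose members all assign the same score yields zero, and the value grows as the scores spread apart.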
labelDocument()
Updates the buffer and training set to reflect the label given to a new document. The label may be -1, 1, or 0, where the first two correspond to a negative or positive example, and the last to a skip. The handling for a skip is necessarily different from that for a positive or negative label, and matters are further complicated by the possibility that we may be changing a label for a document that's already in the training set, rather than adding a new document. This function returns true if the new label resulted in a change to the training set, and false otherwise (i.e., if the user simply skipped labeling the candidate document).
public
labelDocument(string $key, int $label[, bool $is_active = true ]) : bool
When updating an existing document, we will either need to swap the label in the training set and update the statistics stored by the Features instance (since now the features are associated with a different label), or drop the document from the training set and (again) update the statistics stored by the Features instance. In either case the negative and positive counts must be updated as well.
When working with a new document, we need to remove it from the candidate buffer, and if the label is non-zero then we also need to add the document to the training set. That involves tokenizing the document, passing the tokens through the full_features instance, and storing the resulting feature vector, plus the new label in the docs attribute. The positive and negative counts must be updated as well.
Finally, if this operation is occurring during active labeling (when the user is providing labels one at a time), that information needs to be passed along to dropBufferDoc, which can avoid doing some work in the non-active case.
Parameters
- $key : string
-
key used to select the document from the docs array
- $label : int
-
new label (-1, 1, or 0)
- $is_active : bool = true
-
whether this operation is being carried out during active labeling
Return values
bool —true if the training set was modified, and false otherwise
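The bookkeeping described for labelDocument() can be illustrated with a heavily simplified sketch. Everything here is hypothetical: `$state` stands in for the instance attributes, the buffer is keyed by document key rather than ordered, tokenization and the Features statistics are omitted, and it is assumed that relabeling an existing document with 0 is what drops it from the training set.

```php
<?php
/**
 * Applies a label (-1, 1, or 0 for skip) to the document with key $key.
 * $state holds 'buffer' and 'docs' arrays keyed by document key, plus
 * 'positive' and 'negative' counts. Returns true if the training set
 * changed.
 */
function applyLabel(array &$state, string $key, int $label): bool
{
    $count = [1 => 'positive', -1 => 'negative'];
    if (isset($state['docs'][$key])) {
        // Relabeling a document that is already in the training set.
        $old = $state['docs'][$key]['label'];
        if ($old === $label) {
            return false;
        }
        $state[$count[$old]]--;
        if ($label === 0) {
            unset($state['docs'][$key]);
        } else {
            $state['docs'][$key]['label'] = $label;
            $state[$count[$label]]++;
        }
        return true;
    }
    // New document: it leaves the candidate buffer either way.
    $doc = $state['buffer'][$key] ?? null;
    unset($state['buffer'][$key]);
    if ($label === 0 || $doc === null) {
        return false;
    }
    $state['docs'][$key] = ['features' => $doc, 'label' => $label];
    $state[$count[$label]]++;
    return true;
}
```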
labelPage()
Given a page summary (passed by reference) and a list of classifiers, augments the summary meta words with the class label of each classifier that scores the summary above a threshold. This static method is used by fetchers to classify downloaded pages. In addition to the class label, the pseudo-probability that the document belongs to the class is recorded as well. This is recorded both as the score rounded down to the nearest multiple of ten, and as "<n>plus" for each multiple of ten, n, less than the score and greater than or equal to the threshold.
public
static labelPage(array<string|int, mixed> &$summary, array<string|int, mixed> $classifiers, array<string|int, mixed> &$active_classifiers, array<string|int, mixed> &$active_rankers) : mixed
As an example, suppose that a classifier with class label `label' has determined that a document is a positive example with pseudo-probability 0.87 and threshold 0.5. The following meta words are added to the summary: class:label, class:label:80, class:label:80plus, class:label:70plus, class:label:60plus, and class:label:50plus.
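The meta word scheme in this example can be reproduced with a short sketch. The helper name is hypothetical, and the handling of scores that fall exactly on a multiple of ten is an assumption.

```php
<?php
/**
 * Builds the meta words for one classifier's score (hypothetical helper).
 * Scores and thresholds are pseudo-probabilities in [0, 1]; the small
 * epsilon guards against floating-point error when scaling.
 */
function classMetaWords(string $label, float $score, float $threshold): array
{
    if ($score < $threshold) {
        return [];
    }
    // Score rounded down, and threshold rounded up, to a multiple of ten.
    $top = (int) floor($score * 10 + 1e-9) * 10;
    $bottom = (int) ceil($threshold * 10 - 1e-9) * 10;
    $words = ["class:$label", "class:$label:$top"];
    for ($n = $top; $n >= $bottom; $n -= 10) {
        $words[] = "class:$label:{$n}plus";
    }
    return $words;
}
```

Calling `classMetaWords('label', 0.87, 0.5)` yields the six meta words listed in the example above.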
Parameters
- $summary : array<string|int, mixed>
-
page summary to classify, passed by reference
- $classifiers : array<string|int, mixed>
-
list of Classifier instances, each prepared for classifying (via the prepareToClassify method)
- $active_classifiers : array<string|int, mixed>
- $active_rankers : array<string|int, mixed>
Return values
mixed
loadClassifiersData()
Given a list of class labels, returns an array mapping each class label to an array of data necessary for initializing a classifier for that label. This static method is used to prepare a collection of classifiers for distribution to fetchers, so that each fetcher can classify pages as it downloads them. The only extra properties passed along in addition to the base classification data are the final features and final algorithm, both necessary for classifying new documents.
public
static loadClassifiersData(array<string|int, mixed> $labels) : array<string|int, mixed>
Parameters
- $labels : array<string|int, mixed>
-
flat array of class labels for which to load data
Return values
array<string|int, mixed> —associative array mapping class labels to arrays of data necessary for initializing the associated classifier
loadProperties()
Loads class attributes from compressed, serialized files on disk, and stores their names so that they will be saved back to disk later. Each property (if it has been previously set) is stored in its own file under the classifier's data directory, named after the property. The file is compressed using gzip, but without gzip headers, so it can't actually be decompressed by the standard gzip utility. If a file doesn't exist, then the instance property is left untouched. The property names are passed as a variable number of arguments.
public
loadProperties() : mixed
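One way to get the storage format described above is PHP's `gzcompress` and `gzuncompress`, which use the zlib format rather than a gzip-wrapped stream. The sketch below assumes those functions, a `.txt` file extension, and hypothetical helper names; the class may do this differently.

```php
<?php
/**
 * Writes one property to its own file as serialized, zlib-compressed
 * data. The output carries no gzip header, so the gzip command-line
 * tool will not read it.
 */
function storeProperty(string $dir, string $name, $value): void
{
    file_put_contents("$dir/$name.txt", gzcompress(serialize($value)));
}

/**
 * Reads a property back, or returns $default when no file exists so
 * that the caller can leave the current value untouched.
 */
function loadProperty(string $dir, string $name, $default = null)
{
    $path = "$dir/$name.txt";
    if (!file_exists($path)) {
        return $default;
    }
    return unserialize(gzuncompress(file_get_contents($path)));
}
```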
Return values
mixed
makeKey()
Returns a key that can be used internally to refer to a particular page summary.
public
static makeKey(array<string|int, mixed> $page) : string
Parameters
- $page : array<string|int, mixed>
-
page summary to return a key for
Return values
string —key that uniquely identifies the page summary
moveBufferDocToFront()
Moves a document in the candidate buffer up to the front, in preparation for a label request. The document is specified by its index in the buffer.
public
moveBufferDocToFront(int $i) : mixed
Parameters
- $i : int
-
document index within the candidate buffer
Return values
mixed
newClassifierFromData()
The dual of loadClassifiersData, this static method reconstitutes a Classifier instance from an array containing the necessary data. This gets called by each fetcher, using the data that it receives from the name server when establishing a new crawl.
public
static newClassifierFromData(array<string|int, mixed> $data) : object
Parameters
- $data : array<string|int, mixed>
-
associative array mapping property names to their serialized and compressed data
Return values
object —Classifier instance built from the passed-in data
prepareToClassify()
Prepares the classifier to classify new web pages. This operation requires only the final features and classification algorithm, which are expected to be defined after the finalization phase.
public
prepareToClassify() : mixed
Return values
mixed
prepareToFinalize()
Prepares the classifier to train a final classification algorithm on the full training set. This operation requires the full training set and features, but not the candidate buffer used during labeling. Note that any existing final features and classification algorithm are simply zeroed out; they are only loaded from disk so that they will be written back after finalization completes.
public
prepareToFinalize() : mixed
Return values
mixed
prepareToLabel()
Prepares this classifier instance for labeling. This operation requires all of the heavyweight member data save the final features and algorithm. Note that these properties are set to references to the Naive Bayes features and algorithm, so that Naive Bayes will be used to tentatively classify documents during labeling (purely to give the user some feedback on how the training set is performing).
public
prepareToLabel() : mixed
Return values
mixed
refreshBuffer()
Adds as many new documents to the candidate buffer as necessary to reach the specified buffer size, which defaults to the runtime parameter.
public
refreshBuffer(object $mix_iterator[, int $buffer_size = null ]) : int
Returns the final buffer size, which may be less than that requested if the iterator doesn't return enough documents.
Parameters
- $mix_iterator : object
-
crawl mix iterator to draw documents from
- $buffer_size : int = null
-
optional buffer size to use; defaults to the runtime parameter
Return values
int —final buffer size
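The top-up behavior of refreshBuffer() can be sketched as follows. The helper is hypothetical, and a plain PHP `Iterator` stands in for the crawl mix iterator, whose real interface is not shown here.

```php
<?php
/**
 * Tops up $buffer from $source until it holds $buffer_size documents
 * or the source runs dry, then returns the resulting buffer size.
 */
function fillBuffer(array &$buffer, Iterator $source, int $buffer_size): int
{
    while (count($buffer) < $buffer_size && $source->valid()) {
        $buffer[] = $source->current();
        $source->next();
    }
    return count($buffer);
}
```

Because the loop also stops when the source is exhausted, the returned size can be smaller than the size requested.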
setClassifier()
Stores a classifier instance to disk, first separating it out into individual files containing serialized and compressed property data. The basic classifier information, such as class label and summary statistics, is stored uncompressed in a file called `classifier.txt'.
public
static setClassifier(object $classifier) : mixed
The classifier directory and all of its contents are made world-writable so that they can be manipulated without hassle from the command line.
Parameters
- $classifier : object
-
Classifier instance to store to disk
Return values
mixed
storeLoadedProperties()
Stores the data associated with each property name listed in the loaded_properties instance attribute back to disk. The data for each property is stored in its own serialized and compressed file, and made world-writable.
public
storeLoadedProperties() : mixed
Return values
mixed
tokenizeDescription()
Tokenizes a string into a map from terms to within-string frequencies.
public
tokenizeDescription(string $description) : array<string|int, mixed>
Parameters
- $description : string
-
string to tokenize
Return values
array<string|int, mixed> —associative array mapping terms to their within-string frequencies
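The shape of the returned map can be illustrated with a stand-in tokenizer. The helper below is hypothetical: it lowercases ASCII letters and splits on anything that is not a letter or digit, which may well differ from the tokenization the class performs.

```php
<?php
/**
 * Maps each term in a string to the number of times it occurs
 * (simplified stand-in for the class's tokenizer).
 */
function termFrequencies(string $description): array
{
    $terms = preg_split('/[^\p{L}\p{N}]+/u', strtolower($description),
        -1, PREG_SPLIT_NO_EMPTY) ?: [];
    return array_count_values($terms);
}
```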
train()
Trains the Naive Bayes classification algorithm used during labeling on the current training set, and optionally updates the estimated accuracy.
public
train([bool $update_accuracy = false ]) : mixed
Parameters
- $update_accuracy : bool = false
-
optional parameter specifying whether or not to update the accuracy estimate after training completes; defaults to false
Return values
mixed
updateAccuracy()
Estimates current classification accuracy using a Naive Bayes classification algorithm. Accuracy is estimated by splitting the current training set into fifths, reserving four fifths for training, and the remaining fifth for testing. A fresh classifier is trained and tested on these splits, and the total accuracy recorded. Then the splits are rotated so that the previous testing fifth becomes part of the training set, and one of the blocks from the previous training set becomes the testing set. A new classifier is trained and tested on the new splits, and, again, the accuracy recorded. This process is repeated until all blocks have been used for testing, and the average accuracy recorded.
public
updateAccuracy([object $X = null ][, array<string|int, mixed> $y = null ]) : mixed
Parameters
- $X : object = null
-
optional sparse matrix representing the already-mapped training set to use; if not provided, the current training set is mapped using the label_features property
- $y : array<string|int, mixed> = null
-
optional array of document labels corresponding to the training set; if not provided the current training set labels are used
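The rotation of training and testing splits described for updateAccuracy() is a five-fold cross-validation, which can be sketched generically as below. The helper and its `$train_and_test` callback are hypothetical, and the round-robin assignment of documents to blocks is an assumption.

```php
<?php
/**
 * Averages accuracy over $folds rotations of a train/test split.
 * $train_and_test is a callback that trains a fresh classifier on the
 * training documents and returns the fraction of test documents it
 * labels correctly.
 */
function crossValidate(array $docs, callable $train_and_test, int $folds = 5): float
{
    $blocks = array_fill(0, $folds, []);
    foreach (array_values($docs) as $i => $doc) {
        $blocks[$i % $folds][] = $doc;
    }
    $total = 0.0;
    $used = 0;
    foreach ($blocks as $held_out => $test) {
        if (!$test) {
            continue; // fewer documents than folds
        }
        $train = [];
        foreach ($blocks as $b => $block) {
            if ($b !== $held_out) {
                $train = array_merge($train, $block);
            }
        }
        $total += $train_and_test($train, $test);
        $used++;
    }
    return $used > 0 ? $total / $used : 0.0;
}
```

Each document is used for testing exactly once, so the averaged figure reflects performance on data the corresponding classifier never saw during training.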