Yioop_V9.5_Source_Code_Documentation

Classifier
in package
implements CrawlConstants

The primary interface for building and using classifiers. An instance of this class represents a single classifier in memory, but the class also provides static methods to manage classifiers on disk.

A single classifier is a tool for determining the likelihood that a document is a positive instance of a particular class. In order to do this, a classifier goes through a training phase on a labeled training set where it learns weights for document features (terms, for our purposes). To classify a new document, the learned weights for all terms in the document are combined in order to yield a pseudo-probability that the document belongs to the class.

A classifier is composed of a candidate buffer, a training set, a set of features, and a classification algorithm. In addition to the set of all features, there is a restricted set of features used for training and classification. There are also two classification algorithms: a Naive Bayes algorithm used during labeling, and a logistic regression algorithm used to train the final classifier. In general, a fresh classifier will first go through a labeling phase where a collection of labeled training documents is built up out of existing crawl indexes, and then a finalization phase where the logistic regression algorithm will be trained on the training set established in the first phase. After finalization, the classifier may be used to classify new web pages during a crawl.

During the labeling phase, the classifier fills a buffer of candidate pages from the user-selected index (optionally restricted by a query), and tries to pick the best one to present to the user to be labeled (here `best' means the one that, once labeled, is most likely to improve classification accuracy). Each labeled document is removed from the buffer, converted to a feature vector (described next), and added to the training set. The expanded training set is then used to train an intermediate Naive Bayes classification algorithm that is in turn used to more accurately identify good candidates for the next round of labeling. This phase continues until the user gets tired of labeling documents, or is happy with the estimated classification accuracy.

Instead of passing around terms everywhere, each document that goes into the training set is first mapped through a Features instance that maps terms to feature indices (e.g. "Pythagorean" => 1, "theorem" => 2, etc.). These feature indices are used internally by the classification algorithms, and by the algorithms that try to pick out the most informative features. In addition to keeping track of the mapping between terms and feature indices, a Features instance keeps term and label statistics (such as how often a term occurs in documents with a particular label) used to weight features within a document and to select informative features. Finally, subclasses of the Features class weight features in different ways, presenting more or less of everything that's known about the frequency or informativeness of a feature to classification algorithms.
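
The term-to-index mapping and per-label statistics described above can be pictured with a minimal sketch. The class name TermIndexSketch and its members are hypothetical and are not part of Yioop's Features class; the sketch only assumes that indices are assigned in order of first appearance, starting at 1, as in the "Pythagorean" => 1, "theorem" => 2 example.

```php
<?php
/**
 * Hypothetical illustration only; this is not Yioop's Features class.
 * Maps terms to integer feature indices and tracks, for each feature,
 * how many documents with each label contained it.
 */
class TermIndexSketch
{
    /** @var array<string, int> term => feature index (indices start at 1) */
    public $vocab = [];
    /** @var array<int, array<int, int>> feature index => [label => count] */
    public $doc_counts = [];

    /**
     * Converts a term => frequency map into a feature-index => frequency
     * vector, registering unseen terms and updating the label statistics.
     */
    public function addExample(array $term_freqs, int $label): array
    {
        $vector = [];
        foreach ($term_freqs as $term => $freq) {
            if (!isset($this->vocab[$term])) {
                $this->vocab[$term] = count($this->vocab) + 1;
            }
            $index = $this->vocab[$term];
            $vector[$index] = $freq;
            $this->doc_counts[$index][$label] =
                ($this->doc_counts[$index][$label] ?? 0) + 1;
        }
        return $vector;
    }
}

$features = new TermIndexSketch();
// A positive example, then a negative example sharing the term "theorem"
$first = $features->addExample(['Pythagorean' => 2, 'theorem' => 1], 1);
$second = $features->addExample(['theorem' => 3, 'proof' => 1], -1);
```

Here $first uses indices 1 and 2, while $second reuses index 2 for "theorem" and assigns index 3 to "proof".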

Once a sufficiently-useful training set has been built, a FeatureSelection instance is used to choose the most informative features, and copy these into a reduced Features instance that has a much smaller vocabulary, and thus a much smaller memory footprint. For efficiency, this is the Features instance used to train classification algorithms, and to classify web pages. Finalization is just the process of training a logistic regression classification algorithm on the full training set. This results in a set of feature weights that can be used to efficiently assign a pseudo-probability to the proposition that a new web page is a positive instance of the class that the classifier has been trained to recognize. Training logistic regression on a large training set can take a long time, so this phase is carried out asynchronously, by a daemon launched in response to the finalization request.

Because the full Features instance, buffer, and training set are only needed during the labeling and finalization phases, and because they can get very large and take up a lot of space in memory, this class separates its large instance members into separate files when serializing to disk. When a classifier is first loaded into memory from disk it brings along only its summary statistics, since these are all that are needed to, for example, display a list of classifiers. In order to actually add new documents to the training set, finalize, or classify, the classifier must first be explicitly told to load the relevant data structures from disk; this is accomplished by methods like prepareToLabel and prepareToClassify. These methods load in the relevant serialized structures, and mark the associated data members for storage back to disk when (or if) the classifier is serialized again.

Tags
author

Shawn Tice

Interfaces, Classes, Traits and Enums

CrawlConstants
Shared constants and enums used by components that are involved in the crawling process

Table of Contents

BUFFER_SIZE  = 51
The maximum number of candidate documents to consider at once in order to find the best candidate.
COMMITTEE_SIZE  = 3
The number of Naive Bayes instances to use to calculate disagreement during candidate selection.
DENSITY_BETA  = 3.0
Beta parameter used in the computation of a candidate document's density (sharpness of the KL-divergence).
DENSITY_LAMBDA  = 0.5
Lambda parameter used in the computation of a candidate document's density (smoothing for 0-frequency terms).
FINALIZED  = 2
Indicates that a classifier has been finalized, and is ready to be used for classification.
FINALIZING  = 1
Indicates that a classifier is currently being finalized (this may take a while).
MAX_DISAGREEMENT  = 1.63652
The maximum disagreement score between candidates. This number depends on committee size, and is used to provide a slightly more user-friendly estimate of how much disagreement a document causes (between 0 and 1).
THRESHOLD  = 0.5
Threshold used to convert a pseudo-probability to a hard classification decision. Documents with pseudo-probability >= THRESHOLD are classified as positive instances.
UNFINALIZED  = 0
Indicates that a classifier needs to be finalized before it can be used.
$accuracy  : float
The estimated classification accuracy. This member may be null if the accuracy has not yet been estimated, or out of date if examples have been added to the training set since the last accuracy update, but no new estimate has been computed.
$buffer  : array<string|int, mixed>
The current pool of candidates for labeling. The first element in the buffer is always the active document, and as active documents are labeled and removed, the pool is refreshed with new candidates (if there are more pages to be drawn from the active index). The buffer is represented as an associative array with three fields: 'docs', the candidate page summaries; 'densities', an array of densities computed for the documents in the candidate pool; and 'stats', statistics about the terms and documents in the current pool.
$class_label  : string
The label applied to positive instances of the class learned by this classifier (e.g., `spam').
$docs  : array<string|int, mixed>
The training set, broken up into two fields of an associative array: 'features', an array of document feature vectors; and 'labels', the labels assigned to each document.
$final_algorithm  : object
The finalized classification algorithm that will be used to classify new web pages. Will usually be logistic regression, but may be Naive Bayes, if set by the options. During labeling, this field is a reference to the Naive Bayes classification algorithm (so that that algorithm will be used by the `classify' method), but it won't be saved to disk as such.
$final_features  : object
The Features subclass instance used to map documents at classification time to the feature vectors expected by classification algorithms. This will generally be a reduced feature set, just like that used during labeling, but potentially larger than the set used by Naive Bayes.
$finalized  : int
Finalization status, as determined by one of the three finalization constants.
$fresh  : bool
Whether or not this classifier has had any training examples added to it, and consequently whether or not its Naive Bayes classification algorithm has ever been trained.
$full_features  : object
The Features subclass instance used to manage the full set of features seen across all documents in the training set.
$label_algorithm  : object
The NaiveBayes classification algorithm used during training to tentatively classify documents presented to the user for labeling.
$label_features  : object
The Features subclass instance used to manage the reduced set of features used only by Naive Bayes classification algorithms during the labeling phase.
$lang  : string
Language of documents in the training set (also how new documents will be treated).
$loaded_properties  : array<string|int, mixed>
The names of properties set by one of the prepareTo* methods; these properties will be saved back to disk during serialization, while all other properties not listed by the __sleep method will be discarded.
$negative  : int
The number of negative examples in the training set.
$options  : array<string|int, mixed>
Default per-classifier options, which may be overridden when constructing a new classifier.
$positive  : int
The number of positive examples in the training set.
$timestamp  : int
Creation time as a UNIX timestamp.
$total  : int
The total number of examples in the training set (sum of positive and negative).
__construct()  : mixed
Initializes a new classifier with a class label, and options to override the defaults. The timestamp associated with the classifier is taken from the time of construction.
__sleep()  : array<string|int, mixed>
Magic method that determines which member data will be stored when serializing this class. Only lightweight summary data are stored with the serialized version of this class. The heavier-weight properties are stored in individual, compressed files.
addAllDocuments()  : int
Iterates entirely through a crawl mix iterator, adding each document (that hasn't already been labeled) to the training set with a single label. This function works by running through the iterator, filling up the candidate buffer with all unlabeled documents, then repeatedly dropping the first buffer document and adding it to the training set.
addBufferDoc()  : mixed
Adds a page to the end of the candidate buffer, keeping the associated statistics up to date. During active training, each document in the buffer is tokenized, and the terms weighted by frequency; the term frequencies across documents in the buffer are tracked as well. With no active training, the buffer is simply an array of page summaries.
classify()  : float
Classifies a page summary using the current final classification algorithm and features, and returns the classification score. This method is also used during the labeling phase to provide a tentative label for candidates, and in this case the final algorithm is actually a reference to a Naive Bayes instance and final_features is a reference to label_features; neither of these gets saved to disk, however.
cleanLabel()  : mixed
Removes all but alphanumeric characters and underscores from a label, so that it may be easily saved to disk and used in queries as a meta word.
computeBufferDensities()  : mixed
Computes from scratch the buffer densities of the documents in the current candidate pool. This is an expensive operation that requires the computation of the KL-divergence between each ordered pair of documents in the pool, approximately O(N^2) computations, total (where N is the number of documents in the pool). The densities are saved in the buffer data structure.
deleteClassifier()  : mixed
Deletes the directory corresponding to a class label, and all of its contents. If there is no classifier with the passed-in label, does nothing.
dropBufferDoc()  : mixed
Removes the document at the front of the candidate buffer. During active training the cross-document statistics for terms occurring in the document being removed are maintained.
finalize()  : mixed
Trains the final classification algorithm on the full training set, using a subset of the full feature set. The final algorithm will usually be logistic regression, but can be set to Naive Bayes with the appropriate runtime option. Once finalization completes, updates the `finalized' attribute.
findNextDocumentToLabel()  : array<string|int, mixed>
Finds the next best document for labeling amongst the documents in the candidate buffer, moves that candidate to the front of the buffer, and returns it. The best candidate is the one with the maximum product of disagreement and density, where the density has already been calculated for each document in the current pool, and the disagreement is the KL-divergence between the classification scores obtained from a committee of Naive Bayes classifiers, each sampled from the current set of features.
getClassifier()  : object
Returns the minimal classifier instance corresponding to a class label, or null if no such classifier exists on disk.
getClassifierList()  : array<string|int, mixed>
Returns an array of classifier instances currently stored in the classifiers directory. The array maps class labels to their corresponding classifiers, and each classifier is a minimal instance, containing only summary statistics.
getCrawlMixName()  : string
Returns a name for the crawl mix associated with a class label.
initBuffer()  : int
Drops any existing candidate buffer, re-initializes the buffer structure, then calls refreshBuffer to fill it. Takes an optional buffer size, which can be used to limit the buffer to something other than the number imposed by the runtime parameter. Returns the final buffer size.
klDivergenceToMean()  : float
Calculates the KL-divergence to the mean for a collection of discrete two-element probability distributions. Each distribution is specified by a single probability, p, since the second probability is just 1 - p. The KL-divergence to the mean is used as a measure of disagreement between members of a committee of classifiers, where each member assigns a classification score to the same document.
labelDocument()  : bool
Updates the buffer and training set to reflect the label given to a new document. The label may be -1, 1, or 0, where the first two correspond to a negative or positive example, and the last to a skip. The handling for a skip is necessarily different from that for a positive or negative label, and matters are further complicated by the possibility that we may be changing a label for a document that's already in the training set, rather than adding a new document. This function returns true if the new label resulted in a change to the training set, and false otherwise (i.e., if the user simply skipped labeling the candidate document).
labelPage()  : mixed
Given a page summary (passed by reference) and a list of classifiers, augments the summary meta words with the class label of each classifier that scores the summary above a threshold. This static method is used by fetchers to classify downloaded pages. In addition to the class label, the pseudo-probability that the document belongs to the class is recorded as well. This is recorded both as the score rounded down to the nearest multiple of ten, and as "<n>plus" for each multiple of ten, n, less than the score and greater than or equal to the threshold.
loadClassifiersData()  : array<string|int, mixed>
Given a list of class labels, returns an array mapping each class label to an array of data necessary for initializing a classifier for that label. This static method is used to prepare a collection of classifiers for distribution to fetchers, so that each fetcher can classify pages as it downloads them. The only extra properties passed along in addition to the base classification data are the final features and final algorithm, both necessary for classifying new documents.
loadProperties()  : mixed
Loads class attributes from compressed, serialized files on disk, and stores their names so that they will be saved back to disk later. Each property (if it has been previously set) is stored in its own file under the classifier's data directory, named after the property. The file is compressed using gzip, but without gzip headers, so it can't actually be decompressed by the standard gzip utility. If a file doesn't exist, then the instance property is left untouched. The property names are passed as a variable number of arguments.
makeKey()  : string
Returns a key that can be used internally to refer to a particular page summary.
moveBufferDocToFront()  : mixed
Moves a document in the candidate buffer up to the front, in preparation for a label request. The document is specified by its index in the buffer.
newClassifierFromData()  : object
The dual of loadClassifiersData, this static method reconstitutes a Classifier instance from an array containing the necessary data. This gets called by each fetcher, using the data that it receives from the name server when establishing a new crawl.
prepareToClassify()  : mixed
Prepare to classify new web pages. This operation requires only the final features and classification algorithm, which are expected to be defined after the finalization phase.
prepareToFinalize()  : mixed
Prepare to train a final classification algorithm on the full training set. This operation requires the full training set and features, but not the candidate buffer used during labeling. Note that any existing final features and classification algorithm are simply zeroed out; they are only loaded from disk so that they will be written back after finalization completes.
prepareToLabel()  : mixed
Prepare this classifier instance for labeling. This operation requires all of the heavyweight member data save the final features and algorithm. Note that these properties are set to references to the Naive Bayes features and algorithm, so that Naive Bayes will be used to tentatively classify documents during labeling (purely to give the user some feedback on how the training set is performing).
refreshBuffer()  : int
Adds as many new documents to the candidate buffer as necessary to reach the specified buffer size, which defaults to the runtime parameter.
setClassifier()  : mixed
Stores a classifier instance to disk, first separating it out into individual files containing serialized and compressed property data. The basic classifier information, such as class label and summary statistics, is stored uncompressed in a file called `classifier.txt'.
storeLoadedProperties()  : mixed
Stores the data associated with each property name listed in the loaded_properties instance attribute back to disk. The data for each property is stored in its own serialized and compressed file, and made world-writable.
tokenizeDescription()  : array<string|int, mixed>
Tokenizes a string into a map from terms to within-string frequencies.
train()  : mixed
Trains the Naive Bayes classification algorithm used during labeling on the current training set, and optionally updates the estimated accuracy.
updateAccuracy()  : mixed
Estimates current classification accuracy using a Naive Bayes classification algorithm. Accuracy is estimated by splitting the current training set into fifths, reserving four fifths for training, and the remaining fifth for testing. A fresh classifier is trained and tested on these splits, and the total accuracy recorded. Then the splits are rotated so that the previous testing fifth becomes part of the training set, and one of the blocks from the previous training set becomes the testing set. A new classifier is trained and tested on the new splits, and, again, the accuracy recorded. This process is repeated until all blocks have been used for testing, and the average accuracy recorded.
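
The rotation of training and testing fifths described for updateAccuracy() can be sketched as a generic k-fold loop. Everything here is an assumption made for illustration: the function name crossValidateSketch, the [feature_vector, label] example layout, the callable-based trainer, and the interleaved assignment of examples to blocks (example i goes to block i mod 5) are not taken from Yioop.

```php
<?php
/**
 * Hypothetical sketch of k-fold accuracy estimation; not Yioop's code.
 *
 * @param array $examples list of [feature_vector, label] pairs
 * @param callable $train takes a list of examples and returns a callable
 *      that maps a feature vector to a predicted label
 * @param int $folds number of blocks to split the examples into
 * @return float average accuracy over the folds that could be evaluated
 */
function crossValidateSketch(array $examples, callable $train,
    int $folds = 5): float
{
    $accuracies = [];
    for ($fold = 0; $fold < $folds; $fold++) {
        $train_set = [];
        $test_set = [];
        foreach (array_values($examples) as $i => $example) {
            // Block ($i mod $folds) is held out when it matches $fold
            if ($i % $folds == $fold) {
                $test_set[] = $example;
            } else {
                $train_set[] = $example;
            }
        }
        if (empty($test_set) || empty($train_set)) {
            continue;
        }
        // A fresh model is trained for every fold
        $predict = $train($train_set);
        $correct = 0;
        foreach ($test_set as [$vector, $label]) {
            if ($predict($vector) === $label) {
                $correct++;
            }
        }
        $accuracies[] = $correct / count($test_set);
    }
    return empty($accuracies) ? 0.0 :
        array_sum($accuracies) / count($accuracies);
}

// Toy trainer: always predicts the most common label in its training set
$majority = function (array $train_set): callable {
    $counts = [];
    foreach ($train_set as [$vector, $label]) {
        $counts[$label] = ($counts[$label] ?? 0) + 1;
    }
    arsort($counts);
    $winner = array_key_first($counts);
    return function ($vector) use ($winner) {
        return $winner;
    };
};
$examples = [];
for ($i = 0; $i < 10; $i++) {
    $examples[] = [[$i], 1]; // ten positive examples
}
$accuracy = crossValidateSketch($examples, $majority);
```

Because every example in this toy set is positive, the majority-label trainer is correct on every fold.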

Constants

BUFFER_SIZE

The maximum number of candidate documents to consider at once in order to find the best candidate.

public mixed BUFFER_SIZE = 51

COMMITTEE_SIZE

The number of Naive Bayes instances to use to calculate disagreement during candidate selection.

public mixed COMMITTEE_SIZE = 3

DENSITY_BETA

Beta parameter used in the computation of a candidate document's density (sharpness of the KL-divergence).

public mixed DENSITY_BETA = 3.0

DENSITY_LAMBDA

Lambda parameter used in the computation of a candidate document's density (smoothing for 0-frequency terms).

public mixed DENSITY_LAMBDA = 0.5

FINALIZED

Indicates that a classifier has been finalized, and is ready to be used for classification.

public mixed FINALIZED = 2

FINALIZING

Indicates that a classifier is currently being finalized (this may take a while).

public mixed FINALIZING = 1

MAX_DISAGREEMENT

The maximum disagreement score between candidates. This number depends on committee size, and is used to provide a slightly more user-friendly estimate of how much disagreement a document causes (between 0 and 1).

public mixed MAX_DISAGREEMENT = 1.63652

THRESHOLD

Threshold used to convert a pseudo-probability to a hard classification decision. Documents with pseudo-probability >= THRESHOLD are classified as positive instances.

public mixed THRESHOLD = 0.5

UNFINALIZED

Indicates that a classifier needs to be finalized before it can be used.

public mixed UNFINALIZED = 0

Properties

$accuracy

The estimated classification accuracy. This member may be null if the accuracy has not yet been estimated, or out of date if examples have been added to the training set since the last accuracy update, but no new estimate has been computed.

public float $accuracy

$buffer

The current pool of candidates for labeling. The first element in the buffer is always the active document, and as active documents are labeled and removed, the pool is refreshed with new candidates (if there are more pages to be drawn from the active index). The buffer is represented as an associative array with three fields: 'docs', the candidate page summaries; 'densities', an array of densities computed for the documents in the candidate pool; and 'stats', statistics about the terms and documents in the current pool.

public array<string|int, mixed> $buffer

$class_label

The label applied to positive instances of the class learned by this classifier (e.g., `spam').

public string $class_label

$docs

The training set, broken up into two fields of an associative array: 'features', an array of document feature vectors; and 'labels', the labels assigned to each document.

public array<string|int, mixed> $docs

$final_algorithm

The finalized classification algorithm that will be used to classify new web pages. Will usually be logistic regression, but may be Naive Bayes, if set by the options. During labeling, this field is a reference to the Naive Bayes classification algorithm (so that that algorithm will be used by the `classify' method), but it won't be saved to disk as such.

public object $final_algorithm

$final_features

The Features subclass instance used to map documents at classification time to the feature vectors expected by classification algorithms. This will generally be a reduced feature set, just like that used during labeling, but potentially larger than the set used by Naive Bayes.

public object $final_features

$finalized

Finalization status, as determined by one of the three finalization constants.

public int $finalized = 0

$fresh

Whether or not this classifier has had any training examples added to it, and consequently whether or not its Naive Bayes classification algorithm has ever been trained.

public bool $fresh = true

$full_features

The Features subclass instance used to manage the full set of features seen across all documents in the training set.

public object $full_features

$label_algorithm

The NaiveBayes classification algorithm used during training to tentatively classify documents presented to the user for labeling.

public object $label_algorithm

$label_features

The Features subclass instance used to manage the reduced set of features used only by Naive Bayes classification algorithms during the labeling phase.

public object $label_features

$lang

Language of documents in the training set (also how new documents will be treated).

public string $lang

$loaded_properties

The names of properties set by one of the prepareTo* methods; these properties will be saved back to disk during serialization, while all other properties not listed by the __sleep method will be discarded.

public array<string|int, mixed> $loaded_properties = []

$negative

The number of negative examples in the training set.

public int $negative = 0

$options

Default per-classifier options, which may be overridden when constructing a new classifier. The supported options are:

public array<string|int, mixed> $options = ['density' => ['lambda' => 0.5, 'beta' => 3.0], 'threshold' => 0.5, 'label_fs' => ['max' => 30], 'final_fs' => ['max' => 200], 'final_algo' => 'lr']

float density.lambda: Lambda parameter used in the computation of a candidate document's density (smoothing for 0-frequency terms).

float density.beta: Beta parameter used in the computation of a candidate document's density (sharpness of the KL-divergence).

int label_fs.max: Use the `label_fs.max' most informative features to train the Naive Bayes classifiers used during labeling to compute disagreement for a document.

float threshold: Threshold used to convert a pseudo-probability to a hard classification decision. Documents with pseudo-probability >= `threshold' are classified as positive instances.

string final_algo: Algorithm to use for finalization; 'lr' for logistic regression, or 'nb' for Naive Bayes; default 'lr'.

int final_fs.max: Use the `final_fs.max' most informative features to train the final classifier.
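
As an illustration of how caller-supplied options might override these defaults, the sketch below merges a partial override array into the default array from the $options declaration. Whether the constructor merges nested keys (so that overriding only density.beta keeps the default density.lambda) is an assumption; array_replace_recursive is used here only to show that behavior, and the override values are hypothetical.

```php
<?php
// Defaults copied from the $options declaration above
$defaults = [
    'density' => ['lambda' => 0.5, 'beta' => 3.0],
    'threshold' => 0.5,
    'label_fs' => ['max' => 30],
    'final_fs' => ['max' => 200],
    'final_algo' => 'lr'
];
// Hypothetical overrides a caller might pass as the $options argument
$overrides = ['density' => ['beta' => 2.0], 'final_algo' => 'nb'];
// Assumption: nested keys are merged rather than replaced wholesale
$merged = array_replace_recursive($defaults, $overrides);
```

With this merge, density.lambda keeps its default of 0.5 while density.beta and final_algo take the overridden values.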

$positive

The number of positive examples in the training set.

public int $positive = 0

$timestamp

Creation time as a UNIX timestamp.

public int $timestamp

$total

The total number of examples in the training set (sum of positive and negative).

public int $total = 0

Methods

__construct()

Initializes a new classifier with a class label, and options to override the defaults. The timestamp associated with the classifier is taken from the time of construction.

public __construct(string $label[, array<string|int, mixed> $options = [] ]) : mixed
Parameters
$label : string

class label applied to positive instances of the class this classifier is trained to recognize

$options : array<string|int, mixed> = []

optional associative array of options that will override the default options

Return values
mixed

__sleep()

Magic method that determines which member data will be stored when serializing this class. Only lightweight summary data are stored with the serialized version of this class. The heavier-weight properties are stored in individual, compressed files.

public __sleep() : array<string|int, mixed>
Return values
array<string|int, mixed>

names of properties to store when serializing this instance

addAllDocuments()

Iterates entirely through a crawl mix iterator, adding each document (that hasn't already been labeled) to the training set with a single label. This function works by running through the iterator, filling up the candidate buffer with all unlabeled documents, then repeatedly dropping the first buffer document and adding it to the training set.

public addAllDocuments(object $mix_iterator, int $label[, int $limit = INF ]) : int

Returns the total number of newly-labeled documents.

Parameters
$mix_iterator : object

crawl mix iterator to draw documents from

$label : int

label to apply to every document; -1 or 1, but NOT 0

$limit : int = INF

optional upper bound on the number of documents to add; defaults to no limit

Return values
int

total number of newly-labeled documents

addBufferDoc()

Adds a page to the end of the candidate buffer, keeping the associated statistics up to date. During active training, each document in the buffer is tokenized, and the terms weighted by frequency; the term frequencies across documents in the buffer are tracked as well. With no active training, the buffer is simply an array of page summaries.

public addBufferDoc(array<string|int, mixed> $page[, bool $is_active = true ]) : mixed
Parameters
$page : array<string|int, mixed>

page summary for the document to add to the buffer

$is_active : bool = true

whether this operation is part of active training, in which case some extra statistics must be maintained

Return values
mixed

classify()

Classifies a page summary using the current final classification algorithm and features, and returns the classification score. This method is also used during the labeling phase to provide a tentative label for candidates, and in this case the final algorithm is actually a reference to a Naive Bayes instance and final_features is a reference to label_features; neither of these gets saved to disk, however.

public classify(array<string|int, mixed> $page) : float
Parameters
$page : array<string|int, mixed>

page summary array for the page to be classified

Return values
float

pseudo-probability that the page is a positive instance of the target class

cleanLabel()

Removes all but alphanumeric characters and underscores from a label, so that it may be easily saved to disk and used in queries as a meta word.

public static cleanLabel(string $label) : mixed
Parameters
$label : string

class label to clean

Return values
mixed
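
The cleaning rule described above can be expressed as a single regular-expression replacement. The function name cleanLabelSketch is hypothetical, and restricting "alphanumeric" to ASCII letters and digits is an assumption of this sketch rather than a statement about the actual implementation.

```php
<?php
/**
 * Hypothetical stand-in for the behavior described above; not Yioop's code.
 * Keeps ASCII letters, digits, and underscores, and drops everything else.
 */
function cleanLabelSketch(string $label): string
{
    return preg_replace('/[^a-zA-Z0-9_]/', '', $label);
}

$cleaned = cleanLabelSketch('my spam-filter #2!');
```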

computeBufferDensities()

Computes from scratch the buffer densities of the documents in the current candidate pool. This is an expensive operation that requires the computation of the KL-divergence between each ordered pair of documents in the pool, approximately O(N^2) computations, total (where N is the number of documents in the pool). The densities are saved in the buffer data structure.

public computeBufferDensities() : mixed

The density of a document is approximated by its average overlap with every other document in the candidate buffer, where the overlap between two documents is itself approximated using the exponentiated negative KL-divergence between them. The KL-divergence is smoothed to deal with features (terms) that occur in one distribution (document) but not the other, and then multiplied by a negative constant and exponentiated in order to convert it to a kind of linear overlap score.

Return values
mixed
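
A rough sketch of the density computation follows, under several assumptions that are not confirmed by this documentation: each document is represented as a term => probability map, smoothing mixes the second document's distribution with a pool-wide distribution using weight lambda (which must be strictly less than 1), natural logarithms are used, and the direction of each KL-divergence is chosen arbitrarily. The function names are hypothetical.

```php
<?php
/**
 * Hypothetical sketch; not Yioop's code. Smoothed KL-divergence from
 * distribution $p to distribution $q, both term => probability maps.
 */
function smoothedKlSketch(array $p, array $q, array $background,
    float $lambda): float
{
    $kl = 0.0;
    foreach ($p as $term => $p_t) {
        if ($p_t <= 0) {
            continue;
        }
        // Assumed smoothing: mix $q with the pool-wide distribution so
        // that terms missing from $q still get non-zero probability
        // (requires $lambda < 1)
        $q_t = $lambda * ($q[$term] ?? 0.0) +
            (1 - $lambda) * ($background[$term] ?? 0.0);
        $kl += $p_t * log($p_t / $q_t);
    }
    return $kl;
}

/**
 * Density of each document: its average overlap with every other
 * document, where overlap = exp(-beta * smoothed KL-divergence).
 */
function bufferDensitiesSketch(array $docs, float $lambda = 0.5,
    float $beta = 3.0): array
{
    $n = count($docs);
    // Pool-wide term distribution: average of the document distributions
    $background = [];
    foreach ($docs as $doc) {
        foreach ($doc as $term => $prob) {
            $background[$term] = ($background[$term] ?? 0.0) + $prob / $n;
        }
    }
    $densities = [];
    foreach ($docs as $i => $doc_i) {
        $total = 0.0;
        foreach ($docs as $j => $doc_j) {
            if ($i === $j) {
                continue;
            }
            $total += exp(-$beta *
                smoothedKlSketch($doc_j, $doc_i, $background, $lambda));
        }
        $densities[$i] = ($n > 1) ? $total / ($n - 1) : 0.0;
    }
    return $densities;
}

$pool = [
    ['cat' => 0.5, 'dog' => 0.5],
    ['cat' => 0.5, 'dog' => 0.5],
    ['fish' => 1.0]
];
$densities = bufferDensitiesSketch($pool);
```

In this toy pool the two identical documents receive equal densities, and both are denser than the outlier that shares no terms with them. The nested loop over ordered pairs is what makes the computation roughly O(N^2).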

deleteClassifier()

Deletes the directory corresponding to a class label, and all of its contents. If there is no classifier with the passed-in label, does nothing.

public static deleteClassifier(string $label) : mixed
Parameters
$label : string

class label of the classifier to be deleted

Return values
mixed

dropBufferDoc()

Removes the document at the front of the candidate buffer. During active training the cross-document statistics for terms occurring in the document being removed are maintained.

public dropBufferDoc([bool $is_active = true ]) : mixed
Parameters
$is_active : bool = true

whether this operation is part of active training, in which case some extra statistics must be maintained

Return values
mixed

finalize()

Trains the final classification algorithm on the full training set, using a subset of the full feature set. The final algorithm will usually be logistic regression, but can be set to Naive Bayes with the appropriate runtime option. Once finalization completes, updates the `finalized' attribute.

public finalize() : mixed
Return values
mixed

findNextDocumentToLabel()

Finds the next best document for labeling amongst the documents in the candidate buffer, moves that candidate to the front of the buffer, and returns it. The best candidate is the one with the maximum product of disagreement and density, where the density has already been calculated for each document in the current pool, and the disagreement is the KL-divergence between the classification scores obtained from a committee of Naive Bayes classifiers, each sampled from the current set of features.

public findNextDocumentToLabel() : array<string|int, mixed>
Return values
array<string|int, mixed>

two-element array containing first the best candidate, and second the disagreement score, obtained by dividing the disagreement for the document by the maximum disagreement possible for the committee size
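
The selection rule can be sketched as below. This is a simplified illustration under stated assumptions: the helper name pickBestCandidate is hypothetical, each buffer entry is assumed to already carry precomputed 'density' and 'disagreement' values, and the second returned element is the raw product rather than the normalized disagreement score that the real method returns.

```php
<?php
/**
 * Picks the candidate maximizing disagreement * density and moves it to
 * the front of the buffer. Hypothetical sketch, not Yioop's code.
 */
function pickBestCandidate(array &$buffer): ?array
{
    $buffer = array_values($buffer); // ensure indices 0..n-1
    $best = null;
    $best_score = -INF;
    foreach ($buffer as $i => $doc) {
        $score = $doc['disagreement'] * $doc['density'];
        if ($score > $best_score) {
            $best_score = $score;
            $best = $i;
        }
    }
    if ($best === null) {
        return null; // empty buffer
    }
    // Remove the winner from its position and reinsert it at the front.
    [$doc] = array_splice($buffer, $best, 1);
    array_unshift($buffer, $doc);
    return [$doc, $best_score];
}
```

Weighting disagreement by density favors documents that the committee is unsure about and that also resemble many other candidates, so one label is likely to be informative about several documents.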

getClassifier()

Returns the minimal classifier instance corresponding to a class label, or null if no such classifier exists on disk.

public static getClassifier(string $label) : object
Parameters
$label : string

classifier's class label

Return values
object

classifier instance with the relevant class label, or null if no such classifier exists on disk

getClassifierList()

Returns an array of classifier instances currently stored in the classifiers directory. The array maps class labels to their corresponding classifiers, and each classifier is a minimal instance, containing only summary statistics.

public static getClassifierList() : array<string|int, mixed>
Return values
array<string|int, mixed>

associative array of class labels mapped to their corresponding classifier instances

getCrawlMixName()

Returns a name for the crawl mix associated with a class label.

public static getCrawlMixName(string $label) : string
Parameters
$label : string

class label associated with the crawl mix

Return values
string

name that can be used for the crawl mix associated with $label

initBuffer()

Drops any existing candidate buffer, re-initializes the buffer structure, then calls refreshBuffer to fill it. Takes an optional buffer size, which can be used to limit the buffer to something other than the number imposed by the runtime parameter. Returns the final buffer size.

public initBuffer(object $mix_iterator[, int $buffer_size = null ]) : int
Parameters
$mix_iterator : object

crawl mix iterator to draw documents from

$buffer_size : int = null

optional buffer size to use; defaults to the runtime parameter

Return values
int

final buffer size

klDivergenceToMean()

Calculates the KL-divergence to the mean for a collection of discrete two-element probability distributions. Each distribution is specified by a single probability, p, since the second probability is just 1 - p. The KL-divergence to the mean is used as a measure of disagreement between members of a committee of classifiers, where each member assigns a classification score to the same document.

public static klDivergenceToMean(array<string|int, mixed> $ps) : float
Parameters
$ps : array<string|int, mixed>

probabilities describing several discrete two-element probability distributions

Return values
float

KL-divergence to the mean for the collection of distributions
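
As a rough sketch of the calculation, assuming the usual definition (the average, over committee members, of the KL-divergence from each member's distribution to the mean distribution): the function name klToMean and the clamping constant $eps used to avoid log(0) are assumptions, not part of the documented API.

```php
<?php
/**
 * KL-divergence to the mean for two-element distributions, each given
 * by a single probability p. Hypothetical stand-alone sketch.
 */
function klToMean(array $ps, float $eps = 1.0e-9): float
{
    $k = count($ps);
    if ($k === 0) {
        return 0.0;
    }
    // Keep probabilities strictly inside (0, 1) so the logs are finite.
    $clamp = fn(float $p): float => min(max($p, $eps), 1.0 - $eps);
    $mean = $clamp(array_sum($ps) / $k);
    $total = 0.0;
    foreach ($ps as $p) {
        $p = $clamp((float) $p);
        $total += $p * log($p / $mean)
            + (1.0 - $p) * log((1.0 - $p) / (1.0 - $mean));
    }
    return $total / $k;
}
```

With this definition a committee that fully agrees scores 0, and the score grows as the members' probabilities spread apart.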

labelDocument()

Updates the buffer and training set to reflect the label given to a new document. The label may be -1, 1, or 0, where the first two correspond to a negative or positive example, and the last to a skip. The handling for a skip is necessarily different from that for a positive or negative label, and matters are further complicated by the possibility that we may be changing a label for a document that's already in the training set, rather than adding a new document. This function returns true if the new label resulted in a change to the training set, and false otherwise (i.e., if the user simply skipped labeling the candidate document).

public labelDocument(string $key, int $label[, bool $is_active = true ]) : bool

When updating an existing document, we will either need to swap the label in the training set and update the statistics stored by the Features instance (since now the features are associated with a different label), or drop the document from the training set and (again) update the statistics stored by the Features instance. In either case the negative and positive counts must be updated as well.

When working with a new document, we need to remove it from the candidate buffer, and if the label is non-zero then we also need to add the document to the training set. That involves tokenizing the document, passing the tokens through the full_features instance, and storing the resulting feature vector, plus the new label in the docs attribute. The positive and negative counts must be updated as well.

Finally, if this operation is occurring during active labeling (when the user is providing labels one at a time), that information needs to be passed along to dropBufferDoc, which can avoid doing some work in the non-active case.

Parameters
$key : string

key used to select the document from the docs array

$label : int

new label (-1, 1, or 0)

$is_active : bool = true

whether this operation is being carried out during active labeling

Return values
bool

true if the training set was modified, and false otherwise
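
The case analysis above can be sketched with a simplified state array. Everything here is hypothetical: the function name applyLabel, the shape of $state ('docs' mapping keys to labels, plus 'positive' and 'negative' counts), and the choice to return false when an existing label is re-applied unchanged. The candidate buffer and the feature statistics are left out entirely.

```php
<?php
/**
 * Applies a label (-1, 1, or 0 for skip) to a document key and reports
 * whether the training set changed. Hypothetical sketch only.
 */
function applyLabel(array &$state, string $key, int $label): bool
{
    $count_of = [1 => 'positive', -1 => 'negative'];
    if (isset($state['docs'][$key])) {
        // Existing document: swap its label or drop it.
        $old = $state['docs'][$key];
        if ($old === $label) {
            return false; // assumption: re-applying a label changes nothing
        }
        $state[$count_of[$old]]--;
        if ($label === 0) {
            unset($state['docs'][$key]);
        } else {
            $state['docs'][$key] = $label;
            $state[$count_of[$label]]++;
        }
        return true;
    }
    if ($label === 0) {
        return false; // new candidate was skipped
    }
    // New document: add it to the training set.
    $state['docs'][$key] = $label;
    $state[$count_of[$label]]++;
    return true;
}
```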

labelPage()

Given a page summary (passed by reference) and a list of classifiers, augments the summary meta words with the class label of each classifier that scores the summary above a threshold. This static method is used by fetchers to classify downloaded pages. In addition to the class label, the pseudo-probability that the document belongs to the class is recorded as well. The score is treated as a percentage, and is recorded both rounded down to the nearest multiple of ten, and as "<n>plus" for each multiple of ten, n, less than the score and greater than or equal to the threshold.

public static labelPage(array<string|int, mixed> &$summary, array<string|int, mixed> $classifiers, array<string|int, mixed> &$active_classifiers, array<string|int, mixed> &$active_rankers) : mixed

As an example, suppose that a classifier with class label `label' has determined that a document is a positive example with pseudo-probability 0.87 and threshold 0.5. The following meta words are added to the summary: class:label, class:label:80, class:label:80plus, class:label:70plus, class:label:60plus, and class:label:50plus.

Parameters
$summary : array<string|int, mixed>

page summary to classify, passed by reference

$classifiers : array<string|int, mixed>

list of Classifier instances, each prepared for classifying (via the prepareToClassify method)

$active_classifiers : array<string|int, mixed>
$active_rankers : array<string|int, mixed>
Return values
mixed
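
The meta word scheme can be reproduced with a short sketch. The helper name classMetaWords is hypothetical, and it only builds the list of words for one classifier; the real method also updates the page summary in place.

```php
<?php
/**
 * Builds the meta words for one classifier's score, following the rule
 * described for labelPage. Hypothetical helper, not Yioop's code.
 */
function classMetaWords(string $label, float $score, float $threshold): array
{
    if ($score < $threshold) {
        return [];
    }
    // Score as a percentage, rounded down to a multiple of ten.
    $bucket = (int) floor($score * 10) * 10;
    $words = ["class:$label", "class:$label:$bucket"];
    // Smallest multiple of ten that is >= the threshold.
    $lowest = (int) ceil($threshold * 10) * 10;
    for ($n = $bucket; $n >= $lowest; $n -= 10) {
        if ($n < $score * 100) {
            $words[] = "class:$label:{$n}plus";
        }
    }
    return $words;
}
```

Calling classMetaWords('label', 0.87, 0.5) yields the six meta words listed in the example above, in the same order.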

loadClassifiersData()

Given a list of class labels, returns an array mapping each class label to an array of data necessary for initializing a classifier for that label. This static method is used to prepare a collection of classifiers for distribution to fetchers, so that each fetcher can classify pages as it downloads them. The only extra properties passed along in addition to the base classification data are the final features and final algorithm, both necessary for classifying new documents.

public static loadClassifiersData(array<string|int, mixed> $labels) : array<string|int, mixed>
Parameters
$labels : array<string|int, mixed>

flat array of class labels for which to load data

Return values
array<string|int, mixed>

associative array mapping class labels to arrays of data necessary for initializing the associated classifier

loadProperties()

Loads class attributes from compressed, serialized files on disk, and stores their names so that they will be saved back to disk later. Each property (if it has been previously set) is stored in its own file under the classifier's data directory, named after the property. The file is compressed using gzip, but without gzip headers, so it can't actually be decompressed by the standard gzip utility. If a file doesn't exist, then the instance property is left untouched. The property names are passed as a variable number of arguments.

public loadProperties() : mixed
Return values
mixed

makeKey()

Returns a key that can be used internally to refer to a particular page summary.

public static makeKey(array<string|int, mixed> $page) : string
Parameters
$page : array<string|int, mixed>

page summary to return a key for

Return values
string

key that uniquely identifies the page summary

moveBufferDocToFront()

Moves a document in the candidate buffer up to the front, in preparation for a label request. The document is specified by its index in the buffer.

public moveBufferDocToFront(int $i) : mixed
Parameters
$i : int

document index within the candidate buffer

Return values
mixed

newClassifierFromData()

The dual of loadClassifiersData, this static method reconstitutes a Classifier instance from an array containing the necessary data. This gets called by each fetcher, using the data that it receives from the name server when establishing a new crawl.

public static newClassifierFromData(array<string|int, mixed> $data) : object
Parameters
$data : array<string|int, mixed>

associative array mapping property names to their serialized and compressed data

Return values
object

Classifier instance built from the passed-in data

prepareToClassify()

Prepare to classify new web pages. This operation requires only the final features and classification algorithm, which are expected to be defined after the finalization phase.

public prepareToClassify() : mixed
Return values
mixed

prepareToFinalize()

Prepare to train a final classification algorithm on the full training set. This operation requires the full training set and features, but not the candidate buffer used during labeling. Note that any existing final features and classification algorithm are simply zeroed out; they are only loaded from disk so that they will be written back after finalization completes.

public prepareToFinalize() : mixed
Return values
mixed

prepareToLabel()

Prepare this classifier instance for labeling. This operation requires all of the heavyweight member data save the final features and algorithm. Note that these properties are set to references to the Naive Bayes features and algorithm, so that Naive Bayes will be used to tentatively classify documents during labeling (purely to give the user some feedback on how the training set is performing).

public prepareToLabel() : mixed
Return values
mixed

refreshBuffer()

Adds as many new documents to the candidate buffer as necessary to reach the specified buffer size, which defaults to the runtime parameter.

public refreshBuffer(object $mix_iterator[, int $buffer_size = null ]) : int

Returns the final buffer size, which may be less than that requested if the iterator doesn't return enough documents.

Parameters
$mix_iterator : object

crawl mix iterator to draw documents from

$buffer_size : int = null

optional buffer size to use; defaults to the runtime parameter

Return values
int

final buffer size
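
A minimal sketch of the top-up behavior, using a plain PHP Iterator as a stand-in for the crawl mix iterator. The function name fillBuffer and the flat array buffer are assumptions made for illustration.

```php
<?php
/**
 * Tops up a buffer from an iterator until it holds $size documents or
 * the iterator is exhausted, then returns the final buffer size.
 * Hypothetical sketch, not Yioop's code.
 */
function fillBuffer(array &$buffer, Iterator $docs, int $size): int
{
    while (count($buffer) < $size && $docs->valid()) {
        $buffer[] = $docs->current();
        $docs->next();
    }
    return count($buffer);
}
```

As in the description, the returned size can be smaller than the requested one when the iterator runs out of documents first.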

setClassifier()

Stores a classifier instance to disk, first separating it out into individual files containing serialized and compressed property data. The basic classifier information, such as class label and summary statistics, is stored uncompressed in a file called `classifier.txt'.

public static setClassifier(object $classifier) : mixed

The classifier directory and all of its contents are made world-writable so that they can be manipulated without hassle from the command line.

Parameters
$classifier : object

Classifier instance to store to disk

Return values
mixed

storeLoadedProperties()

Stores the data associated with each property name listed in the loaded_properties instance attribute back to disk. The data for each property is stored in its own serialized and compressed file, and made world-writable.

public storeLoadedProperties() : mixed
Return values
mixed
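
The store and load cycle for a single property might look like the sketch below. The helper names, the "<dir>/<name>.txt" file naming, the 0666 mode, and the use of gzcompress()/gzuncompress() (which write a zlib stream rather than a gzip file) are all assumptions; the actual methods may use different zlib functions and paths.

```php
<?php
/**
 * Writes one property to its own serialized, compressed file.
 * Hypothetical helper; requires PHP's zlib extension.
 */
function storeProperty(string $dir, string $name, $value): void
{
    $file = "$dir/$name.txt";
    file_put_contents($file, gzcompress(serialize($value)));
    chmod($file, 0666); // world-writable
}

/**
 * Reads one property back, or returns $current unchanged if no file
 * exists for it. Hypothetical helper.
 */
function loadProperty(string $dir, string $name, $current = null)
{
    $file = "$dir/$name.txt";
    if (!file_exists($file)) {
        return $current; // missing file: leave the value untouched
    }
    return unserialize(gzuncompress(file_get_contents($file)));
}
```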

tokenizeDescription()

Tokenizes a string into a map from terms to within-string frequencies.

public tokenizeDescription(string $description) : array<string|int, mixed>
Parameters
$description : string

string to tokenize

Return values
array<string|int, mixed>

associative array mapping terms to their within-string frequencies
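
The shape of the returned map can be illustrated with a simple stand-in. The function name termFrequencies is hypothetical, and the tokenization rule (ASCII lowercasing, splitting on anything that is not a letter or digit) is an assumption; the real tokenizer may stem or normalize terms differently.

```php
<?php
/**
 * Maps each term in a string to the number of times it occurs.
 * Hypothetical sketch, not Yioop's tokenizer.
 */
function termFrequencies(string $text): array
{
    $tokens = preg_split(
        '/[^\p{L}\p{N}]+/u',
        strtolower($text),
        -1,
        PREG_SPLIT_NO_EMPTY
    );
    // preg_split returns false on invalid UTF-8; treat that as no terms.
    return array_count_values($tokens ?: []);
}
```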

train()

Trains the Naive Bayes classification algorithm used during labeling on the current training set, and optionally updates the estimated accuracy.

public train([bool $update_accuracy = false ]) : mixed
Parameters
$update_accuracy : bool = false

optional parameter specifying whether or not to update the accuracy estimate after training completes; defaults to false

Return values
mixed

updateAccuracy()

Estimates current classification accuracy using a Naive Bayes classification algorithm and five-fold cross-validation. The current training set is split into fifths, with four fifths reserved for training and the remaining fifth for testing. A fresh classifier is trained and tested on these splits, and its accuracy is recorded. The splits are then rotated so that the previous testing fifth becomes part of the training set, and one of the blocks from the previous training set becomes the testing set. A new classifier is trained and tested on the new splits, and its accuracy is again recorded. This process repeats until every block has been used for testing, and the average accuracy is then recorded.

public updateAccuracy([object $X = null ][, array<string|int, mixed> $y = null ]) : mixed
Parameters
$X : object = null

optional sparse matrix representing the already-mapped training set to use; if not provided, the current training set is mapped using the label_features property

$y : array<string|int, mixed> = null

optional array of document labels corresponding to the training set; if not provided the current training set labels are used

Return values
mixed
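
The rotation of training and testing splits can be sketched generically. This is not Yioop's implementation: the function name crossValidate and the $train callback (assumed to take documents and labels and return a predictor callable) are hypothetical, and for brevity the folds are interleaved by index rather than cut into contiguous blocks.

```php
<?php
/**
 * Estimates accuracy by k-fold cross-validation: each fold is used once
 * for testing while the rest is used for training, and the per-fold
 * accuracies are averaged. Hypothetical sketch.
 */
function crossValidate(array $X, array $y, callable $train, int $k = 5): float
{
    $n = count($X);
    $accuracies = [];
    for ($fold = 0; $fold < $k; $fold++) {
        $train_X = $train_y = $test_X = $test_y = [];
        for ($i = 0; $i < $n; $i++) {
            if ($i % $k === $fold) {
                $test_X[] = $X[$i];
                $test_y[] = $y[$i];
            } else {
                $train_X[] = $X[$i];
                $train_y[] = $y[$i];
            }
        }
        if (!$test_X || !$train_X) {
            continue; // not enough documents for this fold
        }
        $predict = $train($train_X, $train_y); // fresh classifier per fold
        $correct = 0;
        foreach ($test_X as $j => $doc) {
            if ($predict($doc) === $test_y[$j]) {
                $correct++;
            }
        }
        $accuracies[] = $correct / count($test_X);
    }
    return $accuracies ? array_sum($accuracies) / count($accuracies) : 0.0;
}
```

Training a fresh classifier for every fold keeps each test block unseen by the model that is scored on it.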

        
