Yioop_V9.5_Source_Code

Classifier
in package Application

implements CrawlConstants

The primary interface for building and using classifiers. An instance of this class represents a single classifier in memory, but the class also provides static methods to manage classifiers on disk.

A single classifier is a tool for determining the likelihood that a document is a positive instance of a particular class. In order to do this, a classifier goes through a training phase on a labeled training set where it learns weights for document features (terms, for our purposes). To classify a new document, the learned weights for all terms in the document are combined in order to yield a pseudo-probability that the document belongs to the class.

A classifier is composed of a candidate buffer, a training set, a set of features, and a classification algorithm. In addition to the set of all features, there is a restricted set of features used for training and classification. There are also two classification algorithms: a Naive Bayes algorithm used during labeling, and a logistic regression algorithm used to train the final classifier. In general, a fresh classifier will first go through a labeling phase where a collection of labeled training documents is built up out of existing crawl indexes, and then a finalization phase where the logistic regression algorithm will be trained on the training set established in the first phase. After finalization, the classifier may be used to classify new web pages during a crawl.

During the labeling phase, the classifier fills a buffer of candidate pages from the user-selected index (optionally restricted by a query), and tries to pick the best one to present to the user to be labeled (here `best' means the one that, once labeled, is most likely to improve classification accuracy). Each labeled document is removed from the buffer, converted to a feature vector (described next), and added to the training set. The expanded training set is then used to train an intermediate Naive Bayes classification algorithm that is in turn used to more accurately identify good candidates for the next round of labeling. This phase continues until the user gets tired of labeling documents, or is happy with the estimated classification accuracy.

Instead of passing around terms everywhere, each document that goes into the training set is first mapped through a Features instance that maps terms to feature indices (e.g. "Pythagorean" => 1, "theorem" => 2, etc.). These feature indices are used internally by the classification algorithms, and by the algorithms that try to pick out the most informative features. In addition to keeping track of the mapping between terms and feature indices, a Features instance keeps term and label statistics (such as how often a term occurs in documents with a particular label) used to weight features within a document and to select informative features. Finally, subclasses of the Features class weight features in different ways, presenting more or less of everything that's known about the frequency or informativeness of a feature to classification algorithms.
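
The term-to-index mapping and label statistics described above can be pictured with a small sketch. This is an illustration only: the class TermIndexSketch, its properties, and its addExample method are hypothetical names, not the Yioop Features API, and the real class may number and count features differently.

```php
<?php
// Hypothetical illustration only; not the Yioop Features class.
class TermIndexSketch
{
    /** @var array<string, int> term => feature index (indices start at 1) */
    public $vocab = [];
    /** @var array<int, array<int, int>> feature index => [label => doc count] */
    public $label_counts = [];

    /**
     * Maps a term => frequency array to a feature index => frequency
     * vector, assigning new indices to unseen terms and recording how
     * many documents with the given label contain each term.
     */
    public function addExample(array $term_freqs, int $label): array
    {
        $vector = [];
        foreach ($term_freqs as $term => $freq) {
            if (!isset($this->vocab[$term])) {
                $this->vocab[$term] = count($this->vocab) + 1;
            }
            $index = $this->vocab[$term];
            $vector[$index] = $freq;
            $this->label_counts[$index][$label] =
                ($this->label_counts[$index][$label] ?? 0) + 1;
        }
        return $vector;
    }
}

$features = new TermIndexSketch();
$vector = $features->addExample(['Pythagorean' => 2, 'theorem' => 1], 1);
print_r($vector); // feature indices 1 and 2 with frequencies 2 and 1
```

Only the integer-keyed vector would then be handed to a classification algorithm; the vocabulary stays inside the mapping object.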

Once a sufficiently-useful training set has been built, a FeatureSelection instance is used to choose the most informative features, and copy these into a reduced Features instance that has a much smaller vocabulary, and thus a much smaller memory footprint. For efficiency, this is the Features instance used to train classification algorithms, and to classify web pages. Finalization is just the process of training a logistic regression classification algorithm on the full training set. This results in a set of feature weights that can be used to efficiently assign a pseudo-probability to the proposition that a new web page is a positive instance of the class that the classifier has been trained to recognize. Training logistic regression on a large training set can take a long time, so this phase is carried out asynchronously, by a daemon launched in response to the finalization request.
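
As a rough picture of how learned feature weights yield a pseudo-probability, the sketch below applies the standard logistic (sigmoid) function to a weighted sum over a sparse feature vector. The function name logisticScoreSketch, the bias term, and the example weights are assumptions made for illustration; the source does not show how Yioop's logistic regression class computes its score.

```php
<?php
/**
 * Hypothetical scoring function: logistic sigmoid of the weighted sum of
 * a sparse feature vector (feature index => weight in the document).
 */
function logisticScoreSketch(array $weights, array $vector,
    float $bias = 0.0): float
{
    $sum = $bias;
    foreach ($vector as $index => $value) {
        // Features without a learned weight contribute nothing.
        $sum += ($weights[$index] ?? 0.0) * $value;
    }
    return 1.0 / (1.0 + exp(-$sum));
}

$weights = [1 => 1.5, 2 => -0.5]; // made-up learned weights
$score = logisticScoreSketch($weights, [1 => 1.0, 2 => 1.0]);
// Compare against the documented default threshold of 0.5.
var_dump($score >= 0.5);
```

Scoring touches only the features present in the document, which is why a reduced feature set keeps classification cheap.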

Because the full Features instance, buffer, and training set are only needed during the labeling and finalization phases, and because they can get very large and take up a lot of space in memory, this class separates its large instance members into separate files when serializing to disk. When a classifier is first loaded into memory from disk it brings along only its summary statistics, since these are all that are needed to, for example, display a list of classifiers. In order to actually add new documents to the training set, finalize, or classify, the classifier must first be explicitly told to load the relevant data structures from disk; this is accomplished by methods like prepareToLabel and prepareToClassify. These methods load in the relevant serialized structures, and mark the associated data members for storage back to disk when (or if) the classifier is serialized again.

Interfaces, Classes, Traits and Enums

CrawlConstants: Shared constants and enums used by components that are involved in the crawling process

BUFFER_SIZE = 51: The maximum number of candidate documents to consider at once in order to find the best candidate.
COMMITTEE_SIZE = 3: The number of Naive Bayes instances to use to calculate disagreement during candidate selection.
DENSITY_BETA = 3.0: Beta parameter used in the computation of a candidate document's density (sharpness of the KL-divergence).
DENSITY_LAMBDA = 0.5: Lambda parameter used in the computation of a candidate document's density (smoothing for 0-frequency terms).
FINALIZED = 2: Indicates that a classifier has been finalized, and is ready to be used for classification.
FINALIZING = 1: Indicates that a classifier is currently being finalized (this may take a while).
MAX_DISAGREEMENT = 1.63652: The maximum disagreement score between candidates. This number depends on committee size, and is used to provide a slightly more user-friendly estimate of how much disagreement a document causes (between 0 and 1).
THRESHOLD = 0.5: Threshold used to convert a pseudo-probability to a hard classification decision. Documents with pseudo-probability >= THRESHOLD are classified as positive instances.
UNFINALIZED = 0: Indicates that a classifier needs to be finalized before it can be used.
$accuracy : float: The estimated classification accuracy. This member may be null if the accuracy has not yet been estimated, or out of date if examples have been added to the training set since the last accuracy update, but no new estimate has been computed.
$buffer : array<string|int, mixed>: The current pool of candidates for labeling. The first element in the buffer is always the active document, and as active documents are labeled and removed, the pool is refreshed with new candidates (if there are more pages to be drawn from the active index). The buffer is represented as an associative array with three fields: 'docs', the candidate page summaries; 'densities', an array of densities computed for the documents in the candidate pool; and 'stats', statistics about the terms and documents in the current pool.
$class_label : string: The label applied to positive instances of the class learned by this classifier (e.g., `spam').
$docs : array<string|int, mixed>: The training set, broken up into two fields of an associative array: 'features', an array of document feature vectors; and 'labels', the labels assigned to each document.
$final_algorithm : object: The finalized classification algorithm that will be used to classify new web pages. Will usually be logistic regression, but may be Naive Bayes, if set by the options. During labeling, this field is a reference to the Naive Bayes classification algorithm (so that that algorithm will be used by the `classify' method), but it won't be saved to disk as such.
$final_features : object: The Features subclass instance used to map documents at classification time to the feature vectors expected by classification algorithms. This will generally be a reduced feature set, just like that used during labeling, but potentially larger than the set used by Naive Bayes.
$finalized : int: Finalization status, as determined by one of the three finalization constants.
$fresh : bool: Whether or not this classifier has had any training examples added to it, and consequently whether or not its Naive Bayes classification algorithm has ever been trained.
$full_features : object: The Features subclass instance used to manage the full set of features seen across all documents in the training set.
$label_algorithm : object: The NaiveBayes classification algorithm used during training to tentatively classify documents presented to the user for labeling.
$label_features : object: The Features subclass instance used to manage the reduced set of features used only by Naive Bayes classification algorithms during the labeling phase.
$lang : string: Language of documents in the training set (also how new documents will be treated).
$loaded_properties : array<string|int, mixed>: The names of properties set by one of the prepareTo* methods; these properties will be saved back to disk during serialization, while all other properties not listed by the __sleep method will be discarded.
$negative : int: The number of negative examples in the training set.
$options : array<string|int, mixed>: Default per-classifier options, which may be overridden when constructing a new classifier. The supported options are listed in the detailed $options entry.
$positive : int: The number of positive examples in the training set.
$timestamp : int: Creation time as a UNIX timestamp.
$total : int: The total number of examples in the training set (sum of positive and negative).
__construct() : mixed: Initializes a new classifier with a class label, and options to override the defaults. The timestamp associated with the classifier is taken from the time of construction.
__sleep() : array<string|int, mixed>: Magic method that determines which member data will be stored when serializing this class. Only lightweight summary data are stored with the serialized version of this class. The heavier-weight properties are stored in individual, compressed files.
addAllDocuments() : int: Iterates entirely through a crawl mix iterator, adding each document (that hasn't already been labeled) to the training set with a single label. This function works by running through the iterator, filling up the candidate buffer with all unlabeled documents, then repeatedly dropping the first buffer document and adding it to the training set.
addBufferDoc() : mixed: Adds a page to the end of the candidate buffer, keeping the associated statistics up to date. During active training, each document in the buffer is tokenized, and the terms weighted by frequency; the term frequencies across documents in the buffer are tracked as well. With no active training, the buffer is simply an array of page summaries.
classify() : float: Classifies a page summary using the current final classification algorithm and features, and returns the classification score. This method is also used during the labeling phase to provide a tentative label for candidates, and in this case the final algorithm is actually a reference to a Naive Bayes instance and final_features is a reference to label_features; neither of these gets saved to disk, however.
cleanLabel() : mixed: Removes all but alphanumeric characters and underscores from a label, so that it may be easily saved to disk and used in queries as a meta word.
computeBufferDensities() : mixed: Computes from scratch the buffer densities of the documents in the current candidate pool. This is an expensive operation that requires the computation of the KL-divergence between each ordered pair of documents in the pool, approximately O(N^2) computations, total (where N is the number of documents in the pool). The densities are saved in the buffer data structure.
deleteClassifier() : mixed: Deletes the directory corresponding to a class label, and all of its contents. In the case that there is no classifier with the passed in label, does nothing.
dropBufferDoc() : mixed: Removes the document at the front of the candidate buffer. During active training the cross-document statistics for terms occurring in the document being removed are maintained.
finalize() : mixed: Trains the final classification algorithm on the full training set, using a subset of the full feature set. The final algorithm will usually be logistic regression, but can be set to Naive Bayes with the appropriate runtime option. Once finalization completes, updates the `finalized' attribute.
findNextDocumentToLabel() : array<string|int, mixed>: Finds the next best document for labeling amongst the documents in the candidate buffer, moves that candidate to the front of the buffer, and returns it. The best candidate is the one with the maximum product of disagreement and density, where the density has already been calculated for each document in the current pool, and the disagreement is the KL-divergence between the classification scores obtained from a committee of Naive Bayes classifiers, each sampled from the current set of features.
getClassifier() : object: Returns the minimal classifier instance corresponding to a class label, or null if no such classifier exists on disk.
getClassifierList() : array<string|int, mixed>: Returns an array of classifier instances currently stored in the classifiers directory. The array maps class labels to their corresponding classifiers, and each classifier is a minimal instance, containing only summary statistics.
getCrawlMixName() : string: Returns a name for the crawl mix associated with a class label.
initBuffer() : int: Drops any existing candidate buffer, re-initializes the buffer structure, then calls refreshBuffer to fill it. Takes an optional buffer size, which can be used to limit the buffer to something other than the number imposed by the runtime parameter. Returns the final buffer size.
klDivergenceToMean() : float: Calculates the KL-divergence to the mean for a collection of discrete two-element probability distributions. Each distribution is specified by a single probability, p, since the second probability is just 1 - p. The KL-divergence to the mean is used as a measure of disagreement between members of a committee of classifiers, where each member assigns a classification score to the same document.
labelDocument() : bool: Updates the buffer and training set to reflect the label given to a new document. The label may be -1, 1, or 0, where the first two correspond to a negative or positive example, and the last to a skip. The handling for a skip is necessarily different from that for a positive or negative label, and matters are further complicated by the possibility that we may be changing a label for a document that's already in the training set, rather than adding a new document. This function returns true if the new label resulted in a change to the training set, and false otherwise (i.e., if the user simply skipped labeling the candidate document).
labelPage() : mixed: Given a page summary (passed by reference) and a list of classifiers, augments the summary meta words with the class label of each classifier that scores the summary above a threshold. This static method is used by fetchers to classify downloaded pages. In addition to the class label, the pseudo-probability that the document belongs to the class is recorded as well. This is recorded both as the score rounded down to the nearest multiple of ten, and as "<n>plus" for each multiple of ten, n, less than the score and greater than or equal to the threshold.
loadClassifiersData() : array<string|int, mixed>: Given a list of class labels, returns an array mapping each class label to an array of data necessary for initializing a classifier for that label. This static method is used to prepare a collection of classifiers for distribution to fetchers, so that each fetcher can classify pages as it downloads them. The only extra properties passed along in addition to the base classification data are the final features and final algorithm, both necessary for classifying new documents.
loadProperties() : mixed: Loads class attributes from compressed, serialized files on disk, and stores their names so that they will be saved back to disk later. Each property (if it has been previously set) is stored in its own file under the classifier's data directory, named after the property. The file is compressed using gzip, but without gzip headers, so it can't actually be decompressed by the standard gzip utility. If a file doesn't exist, then the instance property is left untouched. The property names are passed as a variable number of arguments.
makeKey() : string: Returns a key that can be used internally to refer to a particular page summary.
moveBufferDocToFront() : mixed: Moves a document in the candidate buffer up to the front, in preparation for a label request. The document is specified by its index in the buffer.
newClassifierFromData() : object: The dual of loadClassifiersData, this static method reconstitutes a Classifier instance from an array containing the necessary data. This gets called by each fetcher, using the data that it receives from the name server when establishing a new crawl.
prepareToClassify() : mixed: Prepare to classify new web pages. This operation requires only the final features and classification algorithm, which are expected to be defined after the finalization phase.
prepareToFinalize() : mixed: Prepare to train a final classification algorithm on the full training set. This operation requires the full training set and features, but not the candidate buffer used during labeling. Note that any existing final features and classification algorithm are simply zeroed out; they are only loaded from disk so that they will be written back after finalization completes.
prepareToLabel() : mixed: Prepare this classifier instance for labeling. This operation requires all of the heavyweight member data save the final features and algorithm. Note that these properties are set to references to the Naive Bayes features and algorithm, so that Naive Bayes will be used to tentatively classify documents during labeling (purely to give the user some feedback on how the training set is performing).
refreshBuffer() : int: Adds as many new documents to the candidate buffer as necessary to reach the specified buffer size, which defaults to the runtime parameter.
setClassifier() : mixed: Stores a classifier instance to disk, first separating it out into individual files containing serialized and compressed property data. The basic classifier information, such as class label and summary statistics, is stored uncompressed in a file called `classifier.txt'.
storeLoadedProperties() : mixed: Stores the data associated with each property name listed in the loaded_properties instance attribute back to disk. The data for each property is stored in its own serialized and compressed file, and made world-writable.
tokenizeDescription() : array<string|int, mixed>: Tokenizes a string into a map from terms to within-string frequencies.
train() : mixed: Trains the Naive Bayes classification algorithm used during labeling on the current training set, and optionally updates the estimated accuracy.
updateAccuracy() : mixed: Estimates current classification accuracy using a Naive Bayes classification algorithm. Accuracy is estimated by splitting the current training set into fifths, reserving four fifths for training, and the remaining fifth for testing. A fresh classifier is trained and tested on these splits, and the total accuracy recorded. Then the splits are rotated so that the previous testing fifth becomes part of the training set, and one of the blocks from the previous training set becomes the testing set. A new classifier is trained and tested on the new splits, and, again, the accuracy recorded. This process is repeated until all blocks have been used for testing, and the average accuracy recorded.
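
The fold rotation that updateAccuracy describes is ordinary five-fold cross-validation, which can be sketched as follows. The helper crossValidateSketch, the round-robin assignment of examples to blocks, and the toy majority-label evaluator are all assumptions for illustration; the source does not say how the training set is divided into fifths.

```php
<?php
/**
 * Hypothetical helper: averages the accuracy returned by $evaluate over
 * $folds train/test splits. $evaluate receives (train set, test set) and
 * must return an accuracy in [0, 1].
 */
function crossValidateSketch(array $examples, callable $evaluate,
    int $folds = 5): float
{
    // Assumption: examples are dealt round-robin into blocks.
    $blocks = array_fill(0, $folds, []);
    foreach (array_values($examples) as $i => $example) {
        $blocks[$i % $folds][] = $example;
    }
    $total = 0.0;
    for ($k = 0; $k < $folds; $k++) {
        // Block $k is held out for testing; the rest form the training set.
        $test = $blocks[$k];
        $train = [];
        foreach ($blocks as $j => $block) {
            if ($j != $k) {
                $train = array_merge($train, $block);
            }
        }
        $total += $evaluate($train, $test);
    }
    return $total / $folds;
}

// Toy evaluator: "trains" by taking the majority label of the training
// blocks, then reports the fraction of test labels that match it.
$majority = function (array $train, array $test): float {
    $guess = (array_sum($train) >= 0) ? 1 : -1;
    $hits = count(array_filter($test, fn($y) => $y == $guess));
    return $hits / count($test);
};
$labels = [1, 1, 1, -1, 1, 1, -1, 1, 1, 1];
$accuracy_estimate = crossValidateSketch($labels, $majority);
echo $accuracy_estimate, "\n"; // prints 0.8
```

Each example is used for testing exactly once, so the averaged figure reflects the whole training set rather than one arbitrary split.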

BUFFER_SIZE

The maximum number of candidate documents to consider at once in order to find the best candidate.


    public
        mixed
    BUFFER_SIZE
    = 51

COMMITTEE_SIZE

The number of Naive Bayes instances to use to calculate disagreement during candidate selection.


    public
        mixed
    COMMITTEE_SIZE
    = 3

DENSITY_BETA

Beta parameter used in the computation of a candidate document's density (sharpness of the KL-divergence).


    public
        mixed
    DENSITY_BETA
    = 3.0

DENSITY_LAMBDA

Lambda parameter used in the computation of a candidate document's density (smoothing for 0-frequency terms).


    public
        mixed
    DENSITY_LAMBDA
    = 0.5

FINALIZED

Indicates that a classifier has been finalized, and is ready to be used for classification.


    public
        mixed
    FINALIZED
    = 2

FINALIZING

Indicates that a classifier is currently being finalized (this may take a while).


    public
        mixed
    FINALIZING
    = 1

MAX_DISAGREEMENT

The maximum disagreement score between candidates. This number depends on committee size, and is used to provide a slightly more user-friendly estimate of how much disagreement a document causes (between 0 and 1).


    public
        mixed
    MAX_DISAGREEMENT
    = 1.63652

THRESHOLD

Threshold used to convert a pseudo-probability to a hard classification decision. Documents with pseudo-probability >= THRESHOLD are classified as positive instances.


    public
        mixed
    THRESHOLD
    = 0.5

UNFINALIZED

Indicates that a classifier needs to be finalized before it can be used.


    public
        mixed
    UNFINALIZED
    = 0

$accuracy

The estimated classification accuracy. This member may be null if the accuracy has not yet been estimated, or out of date if examples have been added to the training set since the last accuracy update, but no new estimate has been computed.


    public
        float
    $accuracy

$buffer

The current pool of candidates for labeling. The first element in the buffer is always the active document, and as active documents are labeled and removed, the pool is refreshed with new candidates (if there are more pages to be drawn from the active index). The buffer is represented as an associative array with three fields: 'docs', the candidate page summaries; 'densities', an array of densities computed for the documents in the candidate pool; and 'stats', statistics about the terms and documents in the current pool.


    public
        array<string|int, mixed>
    $buffer

$class_label

The label applied to positive instances of the class learned by this classifier (e.g., `spam').


    public
        string
    $class_label

$docs

The training set, broken up into two fields of an associative array: 'features', an array of document feature vectors; and 'labels', the labels assigned to each document.


    public
        array<string|int, mixed>
    $docs

$final_algorithm

The finalized classification algorithm that will be used to classify new web pages. Will usually be logistic regression, but may be Naive Bayes, if set by the options. During labeling, this field is a reference to the Naive Bayes classification algorithm (so that that algorithm will be used by the `classify' method), but it won't be saved to disk as such.


    public
        object
    $final_algorithm

$final_features

The Features subclass instance used to map documents at classification time to the feature vectors expected by classification algorithms. This will generally be a reduced feature set, just like that used during labeling, but potentially larger than the set used by Naive Bayes.


    public
        object
    $final_features

$finalized

Finalization status, as determined by one of the three finalization constants.


    public
        int
    $finalized
     = 0

$fresh

Whether or not this classifier has had any training examples added to it, and consequently whether or not its Naive Bayes classification algorithm has ever been trained.


    public
        bool
    $fresh
     = true

$full_features

The Features subclass instance used to manage the full set of features seen across all documents in the training set.


    public
        object
    $full_features

$label_algorithm

The NaiveBayes classification algorithm used during training to tentatively classify documents presented to the user for labeling.


    public
        object
    $label_algorithm

$label_features

The Features subclass instance used to manage the reduced set of features used only by Naive Bayes classification algorithms during the labeling phase.


    public
        object
    $label_features

$lang

Language of documents in the training set (also how new documents will be treated).


    public
        string
    $lang

$loaded_properties

The names of properties set by one of the prepareTo* methods; these properties will be saved back to disk during serialization, while all other properties not listed by the __sleep method will be discarded.


    public
        array<string|int, mixed>
    $loaded_properties
     = []

$negative

The number of negative examples in the training set.


    public
        int
    $negative
     = 0

$options

Default per-classifier options, which may be overridden when constructing a new classifier. The supported options are:


    public
        array<string|int, mixed>
    $options
     = ['density' => ['lambda' => 0.5, 'beta' => 3.0], 'threshold' => 0.5, 'label_fs' => ['max' => 30], 'final_fs' => ['max' => 200], 'final_algo' => 'lr']

float density.lambda: Lambda parameter used in the computation of a candidate document's density (smoothing for 0-frequency terms).

float density.beta: Beta parameter used in the computation of a candidate document's density (sharpness of the KL-divergence).

int label_fs.max: Use the label_fs.max most informative features to train the Naive Bayes classifiers used during labeling to compute disagreement for a document.

float threshold: Threshold used to convert a pseudo-probability to a hard classification decision. Documents with pseudo-probability >= `threshold' are classified as positive instances.

string final_algo: Algorithm to use for finalization; 'lr' for logistic regression, or 'nb' for Naive Bayes; default 'lr'.

int final_fs.max: Use the final_fs.max most informative features to train the final classifier.
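
One way that caller-supplied options could be layered over these defaults is shown below. The defaults are copied from the documented value of $options; the use of array_replace_recursive, so that nested keys are overridden one at a time, is an assumption, since the source does not show how the constructor merges the two arrays.

```php
<?php
// Defaults copied from the documented $options value.
$defaults = [
    'density' => ['lambda' => 0.5, 'beta' => 3.0],
    'threshold' => 0.5,
    'label_fs' => ['max' => 30],
    'final_fs' => ['max' => 200],
    'final_algo' => 'lr',
];
// Assumption: nested keys are overridden individually, so 'lambda'
// survives even though only 'beta' is supplied here.
$overrides = ['density' => ['beta' => 2.0], 'final_algo' => 'nb'];
$options = array_replace_recursive($defaults, $overrides);
print_r($options);
```

Per the documented constructor signature, an array like $overrides would be passed as the second argument when creating a classifier.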

$positive

The number of positive examples in the training set.


    public
        int
    $positive
     = 0

$timestamp

Creation time as a UNIX timestamp.


    public
        int
    $timestamp

$total

The total number of examples in the training set (sum of positive and negative).


    public
        int
    $total
     = 0

__construct()

Initializes a new classifier with a class label, and options to override the defaults. The timestamp associated with the classifier is taken from the time of construction.


    public
                    __construct(string $label[, array<string|int, mixed> $options = [] ]) : mixed

Parameters

$label : string: class label applied to positive instances of the class this classifier is trained to recognize
$options : array<string|int, mixed> = []: optional associative array of options that will override the default options

Return values

mixed —

__sleep()

Magic method that determines which member data will be stored when serializing this class. Only lightweight summary data are stored with the serialized version of this class. The heavier-weight properties are stored in individual, compressed files.


    public
                    __sleep() : array<string|int, mixed>

Return values

array<string|int, mixed> —

names of properties to store when serializing this instance
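
The effect of restricting serialization through __sleep can be seen in a miniature example. The class SleepSketch and its property values are hypothetical and stand in for the pattern only; the actual list of properties that Classifier returns is not reproduced here.

```php
<?php
// Hypothetical miniature of the pattern; not the Yioop Classifier.
class SleepSketch
{
    public $class_label = 'spam';
    public $positive = 12;
    public $negative = 30;
    // Stands in for a heavyweight member such as the training set.
    public $docs = ['features' => [], 'labels' => []];

    /** Lists only the lightweight properties for serialize(). */
    public function __sleep()
    {
        return ['class_label', 'positive', 'negative'];
    }
}

$original = new SleepSketch();
$original->docs['labels'][] = 1;
$copy = unserialize(serialize($original));
// The omitted property comes back with its declared default value.
var_dump($copy->docs['labels']);
```

Because the heavyweight member is dropped from the serialized form, it has to be written and read separately, which is the role the prepareTo* methods and per-property files play in this class.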

addAllDocuments()

Iterates entirely through a crawl mix iterator, adding each document (that hasn't already been labeled) to the training set with a single label. This function works by running through the iterator, filling up the candidate buffer with all unlabeled documents, then repeatedly dropping the first buffer document and adding it to the training set.


    public
                    addAllDocuments(object $mix_iterator, int $label[, int $limit = INF ]) : int

Returns the total number of newly-labeled documents.

Parameters

$mix_iterator : object: crawl mix iterator to draw documents from
$label : int: label to apply to every document; -1 or 1, but NOT 0
$limit : int = INF: optional upper bound on the number of documents to add; defaults to no limit

Return values

int —

total number of newly-labeled documents

addBufferDoc()

Adds a page to the end of the candidate buffer, keeping the associated statistics up to date. During active training, each document in the buffer is tokenized, and the terms weighted by frequency; the term frequencies across documents in the buffer are tracked as well. With no active training, the buffer is simply an array of page summaries.


    public
                    addBufferDoc(array<string|int, mixed> $page[, bool $is_active = true ]) : mixed

Parameters

$page : array<string|int, mixed>: page summary for the document to add to the buffer
$is_active : bool = true: whether this operation is part of active training, in which case some extra statistics must be maintained

Return values

mixed —
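
The bookkeeping needed to keep cross-document term statistics in step with the buffer can be sketched as below. The helper names, the simple regular-expression tokenizer, and the 'term_totals' key under 'stats' are assumptions for illustration; the source names the 'docs', 'densities', and 'stats' fields but does not describe their contents in detail.

```php
<?php
// Hypothetical sketch of a candidate buffer with cross-document term
// statistics; the 'term_totals' key under 'stats' is an assumption.

/** Maps a string to term => within-string frequency. */
function tokenizeSketch(string $text): array
{
    $terms = preg_split('/\W+/', strtolower($text), -1,
        PREG_SPLIT_NO_EMPTY);
    return array_count_values($terms);
}

/** Appends a document and adds its term counts to the pool totals. */
function addDocSketch(array &$buffer, string $text): void
{
    $term_freqs = tokenizeSketch($text);
    $buffer['docs'][] = $term_freqs;
    foreach ($term_freqs as $term => $freq) {
        $buffer['stats']['term_totals'][$term] =
            ($buffer['stats']['term_totals'][$term] ?? 0) + $freq;
    }
}

/** Removes the front document and subtracts its term counts. */
function dropFrontDocSketch(array &$buffer): void
{
    $term_freqs = array_shift($buffer['docs']);
    if ($term_freqs === null) {
        return; // the buffer was already empty
    }
    foreach ($term_freqs as $term => $freq) {
        $buffer['stats']['term_totals'][$term] -= $freq;
        if ($buffer['stats']['term_totals'][$term] <= 0) {
            unset($buffer['stats']['term_totals'][$term]);
        }
    }
}

$buffer = ['docs' => [], 'densities' => [],
    'stats' => ['term_totals' => []]];
addDocSketch($buffer, 'cheap pills cheap');
addDocSketch($buffer, 'cheap flights');
dropFrontDocSketch($buffer);
print_r($buffer['stats']['term_totals']);
```

Updating the totals incrementally on each add and drop avoids rescanning every document in the pool whenever the buffer changes.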

classify()

Classifies a page summary using the current final classification algorithm and features, and returns the classification score. This method is also used during the labeling phase to provide a tentative label for candidates, and in this case the final algorithm is actually a reference to a Naive Bayes instance and final_features is a reference to label_features; neither of these gets saved to disk, however.


    public
                    classify(array<string|int, mixed> $page) : float

Parameters

$page : array<string|int, mixed>: page summary array for the page to be classified

Return values

float —

pseudo-probability that the page is a positive instance of the target class

cleanLabel()

Removes all but alphanumeric characters and underscores from a label, so that it may be easily saved to disk and used in queries as a meta word.


    public
            static        cleanLabel(string $label) : mixed

Parameters

$label : string: class label to clean

Return values

mixed —
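
The cleaning rule can be approximated with a single regular expression. The function cleanLabelSketch below is a hypothetical stand-in, and it assumes that "alphanumeric" means ASCII letters and digits only; the real method may treat non-ASCII characters differently.

```php
<?php
/**
 * Hypothetical re-implementation for illustration: strips every
 * character that is not an ASCII letter, digit, or underscore.
 */
function cleanLabelSketch(string $label): string
{
    return preg_replace('/[^a-zA-Z0-9_]/', '', $label);
}

echo cleanLabelSketch('my spam-label!'), "\n"; // prints "myspamlabel"
```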

computeBufferDensities()

Computes from scratch the buffer densities of the documents in the current candidate pool. This is an expensive operation that requires the computation of the KL-divergence between each ordered pair of documents in the pool, approximately O(N^2) computations, total (where N is the number of documents in the pool). The densities are saved in the buffer data structure.


    public
                    computeBufferDensities() : mixed

The density of a document is approximated by its average overlap with every other document in the candidate buffer, where the overlap between two documents is itself approximated by the exponentiated negative KL-divergence between them. The KL-divergence is smoothed to handle features (terms) that occur in one distribution (document) but not the other, then multiplied by a negative constant and exponentiated to convert it to a kind of linear overlap score.

Return values

mixed —
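
A minimal sketch of such a density computation follows. It assumes one particular smoothing scheme, mixing the second document's term distribution with the pool-wide distribution using lambda, and an arithmetic mean of the overlaps; the function names are hypothetical, and the exact smoothing and averaging used by computeBufferDensities are not spelled out in this documentation.

```php
<?php
// Assumed parameter values, taken from the documented defaults.
const SKETCH_LAMBDA = 0.5;
const SKETCH_BETA = 3.0;

/** Normalizes term => count into term => probability. */
function normalizeSketch(array $counts): array
{
    $total = array_sum($counts);
    return array_map(fn($c) => $c / $total, $counts);
}

/**
 * KL(p || q'), where q' = lambda * q + (1 - lambda) * pool, so that q'
 * is nonzero wherever p is (the pool covers every term in p).
 */
function smoothedKlSketch(array $p, array $q, array $pool): float
{
    $kl = 0.0;
    foreach ($p as $term => $p_t) {
        $q_t = SKETCH_LAMBDA * ($q[$term] ?? 0.0)
            + (1 - SKETCH_LAMBDA) * $pool[$term];
        $kl += $p_t * log($p_t / $q_t);
    }
    return $kl;
}

/** Returns one density per document: mean of exp(-beta * KL) to others. */
function densitiesSketch(array $docs): array
{
    $pool_counts = [];
    foreach ($docs as $doc) {
        foreach ($doc as $term => $count) {
            $pool_counts[$term] = ($pool_counts[$term] ?? 0) + $count;
        }
    }
    $pool = normalizeSketch($pool_counts);
    $dists = array_map('normalizeSketch', $docs);
    $densities = [];
    foreach ($dists as $i => $p) {
        $sum = 0.0;
        foreach ($dists as $j => $q) {
            if ($i !== $j) {
                $sum += exp(-SKETCH_BETA * smoothedKlSketch($p, $q, $pool));
            }
        }
        $densities[$i] = $sum / max(1, count($dists) - 1);
    }
    return $densities;
}

$densities = densitiesSketch([
    ['cheap' => 2, 'pills' => 1],
    ['cheap' => 1, 'pills' => 1],
    ['theorem' => 3],
]);
print_r($densities);
```

In this toy pool the third document shares no terms with the others, so it receives a much lower density than the first two. The nested loop over ordered pairs is the source of the quadratic cost noted above.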

deleteClassifier()

Deletes the directory corresponding to a class label, and all of its contents. In the case that there is no classifier with the passed in label, does nothing.


    public
            static        deleteClassifier(string $label) : mixed

Parameters

$label : string: class label of the classifier to be deleted

Return values

mixed —

dropBufferDoc()

Removes the document at the front of the candidate buffer. During active training the cross-document statistics for terms occurring in the document being removed are maintained.


    public
                    dropBufferDoc([bool $is_active = true ]) : mixed

Parameters

$is_active : bool = true: whether this operation is part of active training, in which case some extra statistics must be maintained

Return values

mixed —

finalize()

Trains the final classification algorithm on the full training set, using a subset of the full feature set. The final algorithm will usually be logistic regression, but can be set to Naive Bayes with the appropriate runtime option. Once finalization completes, updates the `finalized' attribute.


    public
                    finalize() : mixed

Return values

mixed —

findNextDocumentToLabel()

Finds the next best document for labeling amongst the documents in the candidate buffer, moves that candidate to the front of the buffer, and returns it. The best candidate is the one with the maximum product of disagreement and density, where the density has already been calculated for each document in the current pool, and the disagreement is the KL-divergence between the classification scores obtained from a committee of Naive Bayes classifiers, each sampled from the current set of features.


    public
                    findNextDocumentToLabel() : array<string|int, mixed>

Return values

array<string|int, mixed> —

two-element array containing first the best candidate, and second the disagreement score, obtained by dividing the disagreement for the document by the maximum disagreement possible for the committee size
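
The selection rule can be sketched as below. This is a simplified Python illustration, not the Yioop implementation: it assumes each buffer entry already carries a `density` and a committee `disagreement` value (hypothetical field names), and that the maximum possible disagreement is passed in by the caller.

```python
def pick_next_document(buffer, max_disagreement):
    """Move the candidate maximizing disagreement * density to the front.

    Returns a (document, normalized disagreement) pair. The dictionary
    layout of the buffer entries is an assumption made for this example.
    """
    best = max(
        range(len(buffer)),
        key=lambda i: buffer[i]["disagreement"] * buffer[i]["density"],
    )
    doc = buffer.pop(best)
    buffer.insert(0, doc)
    return doc, doc["disagreement"] / max_disagreement


buffer = [
    {"id": "a", "density": 0.9, "disagreement": 0.1},
    {"id": "b", "density": 0.5, "disagreement": 0.6},
    {"id": "c", "density": 0.2, "disagreement": 0.7},
]
doc, score = pick_next_document(buffer, max_disagreement=1.0)
print(doc["id"], score)
```

Weighting disagreement by density keeps the selection from favoring outliers: document `c` has the highest disagreement, but `b` wins because it is also representative of the pool.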

getClassifier()

Returns the minimal classifier instance corresponding to a class label, or null if no such classifier exists on disk.


    public
            static        getClassifier(string $label) : object

Parameters

$label : string: classifier's class label

Return values

object —

classifier instance with the relevant class label, or null if no such classifier exists on disk

getClassifierList()

Returns an array of classifier instances currently stored in the classifiers directory. The array maps class labels to their corresponding classifiers, and each classifier is a minimal instance, containing only summary statistics.


    public
            static        getClassifierList() : array<string|int, mixed>

Return values

array<string|int, mixed> —

associative array of class labels mapped to their corresponding classifier instances

getCrawlMixName()

Returns a name for the crawl mix associated with a class label.


    public
            static        getCrawlMixName(string $label) : string

Parameters

$label : string: class label associated with the crawl mix

Return values

string —

name that can be used for the crawl mix associated with $label

initBuffer()

Drops any existing candidate buffer, re-initializes the buffer structure, then calls refreshBuffer to fill it. Takes an optional buffer size, which can be used to limit the buffer to something other than the number imposed by the runtime parameter. Returns the final buffer size.


    public
                    initBuffer(object $mix_iterator[, int $buffer_size = null ]) : int

Parameters

$mix_iterator : object: crawl mix iterator to draw documents from
$buffer_size : int = null: optional buffer size to use; defaults to the runtime parameter

Return values

int —

final buffer size

klDivergenceToMean()

Calculates the KL-divergence to the mean for a collection of discrete two-element probability distributions. Each distribution is specified by a single probability, p, since the second probability is just 1 - p. The KL-divergence to the mean is used as a measure of disagreement between members of a committee of classifiers, where each member assigns a classification score to the same document.


    public
            static        klDivergenceToMean(array<string|int, mixed> $ps) : float

Parameters

$ps : array<string|int, mixed>: probabilities describing several discrete two-element probability distributions

Return values

float —

KL-divergence to the mean for the collection of distributions
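
A minimal Python sketch of this measure follows. It assumes a non-empty list of probabilities, natural logarithms, and an average over committee members; whether the Yioop method uses the same log base and normalization is not stated here, so treat those details as assumptions.

```python
import math


def kl_divergence_to_mean(ps):
    """Average KL-divergence from each two-element distribution (p, 1-p)
    to the mean distribution (m, 1-m), where m is the mean of ps.

    Assumes ps is non-empty and every p lies in [0, 1].
    """
    k = len(ps)
    mean = sum(ps) / k

    def term(p, m):
        # By convention 0 * log(0 / m) contributes nothing.
        return 0.0 if p == 0.0 else p * math.log(p / m)

    total = 0.0
    for p in ps:
        total += term(p, mean) + term(1.0 - p, 1.0 - mean)
    return total / k


print(kl_divergence_to_mean([0.5, 0.5]))  # full agreement
print(kl_divergence_to_mean([0.0, 1.0]))  # full disagreement
```

When all committee members agree the result is zero, and it grows as their scores spread apart. Under the assumptions above, two members in complete disagreement yield ln 2.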

labelDocument()

Updates the buffer and training set to reflect the label given to a new document. The label may be -1, 1, or 0, where the first two correspond to a negative or positive example, and the last to a skip. The handling for a skip is necessarily different from that for a positive or negative label, and matters are further complicated by the possibility that we may be changing a label for a document that's already in the training set, rather than adding a new document. This function returns true if the new label resulted in a change to the training set, and false otherwise (i.e., if the user simply skipped labeling the candidate document).


    public
                    labelDocument(string $key, int $label[, bool $is_active = true ]) : bool

When updating an existing document, we will either need to swap the label in the training set and update the statistics stored by the Features instance (since now the features are associated with a different label), or drop the document from the training set and (again) update the statistics stored by the Features instance. In either case the negative and positive counts must be updated as well.

When working with a new document, we need to remove it from the candidate buffer, and if the label is non-zero then we also need to add the document to the training set. That involves tokenizing the document, passing the tokens through the full_features instance, and storing the resulting feature vector, plus the new label in the docs attribute. The positive and negative counts must be updated as well.

Finally, if this operation is occurring during active labeling (when the user is providing labels one at a time), that information needs to be passed along to dropBufferDoc, which can avoid doing some work in the non-active case.

Parameters

$key : string: key used to select the document from the docs array
$label : int: new label (-1, 1, or 0)
$is_active : bool = true: whether this operation is being carried out during active labeling

Return values

bool —

true if the training set was modified, and false otherwise

labelPage()

Given a page summary (passed by reference) and a list of classifiers, augments the summary meta words with the class label of each classifier that scores the summary above a threshold. This static method is used by fetchers to classify downloaded pages. In addition to the class label, the pseudo-probability that the document belongs to the class is recorded as well. This is recorded both as the score rounded down to the nearest multiple of ten, and as "<n>plus" for each multiple of ten, n, less than the score and greater than or equal to the threshold.


    public
            static        labelPage(array<string|int, mixed> &$summary, array<string|int, mixed> $classifiers, array<string|int, mixed> &$active_classifiers, array<string|int, mixed> &$active_rankers) : mixed

As an example, suppose that a classifier with class label `label' has determined that a document is a positive example with pseudo-probability 0.87 and threshold 0.5. The following meta words are added to the summary: class:label, class:label:80, class:label:80plus, class:label:70plus, class:label:60plus, and class:label:50plus.

Parameters

$summary : array<string|int, mixed>: page summary to classify, passed by reference
$classifiers : array<string|int, mixed>: list of Classifier instances, each prepared for classifying (via the prepareToClassify method)
$active_classifiers : array<string|int, mixed>
$active_rankers : array<string|int, mixed>

Return values

mixed —
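
The meta word scheme from the example above can be reproduced with a short Python sketch. The function name `class_meta_words` is hypothetical, and the handling of scores that fall exactly on a multiple of ten follows the wording of the description literally, which may not match the PHP code.

```python
def class_meta_words(label, score, threshold):
    """Build the meta words for one classifier's score on one page.

    Works in integer percent to sidestep floating-point rounding.
    Returns an empty list when the score is below the threshold.
    """
    if score < threshold:
        return []
    pct = int(round(score * 100))
    thr = int(round(threshold * 100))
    bucket = (pct // 10) * 10          # score rounded down to a multiple of ten
    first = ((thr + 9) // 10) * 10     # smallest multiple of ten >= threshold
    words = [f"class:{label}", f"class:{label}:{bucket}"]
    for n in reversed(range(first, 101, 10)):
        if n < pct:
            words.append(f"class:{label}:{n}plus")
    return words


print(class_meta_words("label", 0.87, 0.5))
```

The "plus" words let a query such as class:label:70plus match every page scored at 70 or above without enumerating the individual buckets.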

loadClassifiersData()

Given a list of class labels, returns an array mapping each class label to an array of data necessary for initializing a classifier for that label. This static method is used to prepare a collection of classifiers for distribution to fetchers, so that each fetcher can classify pages as it downloads them. The only extra properties passed along in addition to the base classification data are the final features and final algorithm, both necessary for classifying new documents.


    public
            static        loadClassifiersData(array<string|int, mixed> $labels) : array<string|int, mixed>

Parameters

$labels : array<string|int, mixed>: flat array of class labels for which to load data

Return values

array<string|int, mixed> —

associative array mapping class labels to arrays of data necessary for initializing the associated classifier

loadProperties()

Loads class attributes from compressed, serialized files on disk, and stores their names so that they will be saved back to disk later. Each property (if it has been previously set) is stored in its own file under the classifier's data directory, named after the property. The file is compressed using gzip, but without gzip headers, so it can't actually be decompressed by the standard gzip utility. If a file doesn't exist, then the instance property is left untouched. The property names are passed as a variable number of arguments.


    public
                    loadProperties() : mixed

Return values

mixed —
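
The "gzip without gzip headers" storage format corresponds to a raw deflate stream. The Python sketch below illustrates the idea with one file per property; it uses JSON as a stand-in for PHP serialization, and the function names and the file naming (property name with no extension) are assumptions.

```python
import json
import os
import tempfile
import zlib


def store_property(directory, name, value):
    """Serialize a value and write it as a raw deflate stream (no header)."""
    raw = json.dumps(value).encode("utf-8")
    compressor = zlib.compressobj(wbits=-15)  # negative wbits: raw deflate
    data = compressor.compress(raw) + compressor.flush()
    with open(os.path.join(directory, name), "wb") as handle:
        handle.write(data)


def load_property(directory, name, default=None):
    """Read a property back, or return the default if its file is missing."""
    path = os.path.join(directory, name)
    if not os.path.exists(path):
        return default
    with open(path, "rb") as handle:
        data = handle.read()
    return json.loads(zlib.decompress(data, wbits=-15).decode("utf-8"))


with tempfile.TemporaryDirectory() as directory:
    store_property(directory, "docs", {"key1": [1, 0, 2]})
    print(load_property(directory, "docs"))
    print(load_property(directory, "buffer", default="untouched"))
```

Because the stream lacks the gzip header and trailer, the standard gzip utility rejects such files; a reader has to request raw deflate explicitly, as the negative `wbits` value does here.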

makeKey()

Returns a key that can be used internally to refer to a particular page summary.


    public
            static        makeKey(array<string|int, mixed> $page) : string

Parameters

$page : array<string|int, mixed>: page summary to return a key for

Return values

string —

key that uniquely identifies the page summary

moveBufferDocToFront()

Moves a document in the candidate buffer up to the front, in preparation for a label request. The document is specified by its index in the buffer.


    public
                    moveBufferDocToFront(int $i) : mixed

Parameters

$i : int: document index within the candidate buffer

Return values

mixed —

newClassifierFromData()

The dual of loadClassifiersData, this static method reconstitutes a Classifier instance from an array containing the necessary data. This gets called by each fetcher, using the data that it receives from the name server when establishing a new crawl.


    public
            static        newClassifierFromData(array<string|int, mixed> $data) : object

Parameters

$data : array<string|int, mixed>: associative array mapping property names to their serialized and compressed data

Return values

object —

Classifier instance built from the passed-in data

prepareToClassify()

Prepare to classify new web pages. This operation requires only the final features and classification algorithm, which are expected to be defined after the finalization phase.


    public
                    prepareToClassify() : mixed

Return values

mixed —

prepareToFinalize()

Prepare to train a final classification algorithm on the full training set. This operation requires the full training set and features, but not the candidate buffer used during labeling. Note that any existing final features and classification algorithm are simply zeroed out; they are only loaded from disk so that they will be written back after finalization completes.


    public
                    prepareToFinalize() : mixed

Return values

mixed —

prepareToLabel()

Prepare this classifier instance for labeling. This operation requires all of the heavyweight member data save the final features and algorithm. Note that these properties are set to references to the Naive Bayes features and algorithm, so that Naive Bayes will be used to tentatively classify documents during labeling (purely to give the user some feedback on how the training set is performing).


    public
                    prepareToLabel() : mixed

Return values

mixed —

refreshBuffer()

Adds as many new documents to the candidate buffer as necessary to reach the specified buffer size, which defaults to the runtime parameter.


    public
                    refreshBuffer(object $mix_iterator[, int $buffer_size = null ]) : int

Returns the final buffer size, which may be less than that requested if the iterator doesn't return enough documents.

Parameters

$mix_iterator : object: crawl mix iterator to draw documents from
$buffer_size : int = null: optional buffer size to use; defaults to the runtime parameter

Return values

int —

final buffer size
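
The refill behavior, including the case where the iterator runs dry, might look like the following Python sketch. The list-based buffer and the use of a plain Python iterator in place of the crawl mix iterator are simplifications assumed for the example.

```python
def refresh_buffer(buffer, mix_iterator, buffer_size):
    """Top the buffer up to buffer_size, stopping early if documents run out.

    Returns the final buffer size, which may be smaller than requested.
    """
    while len(buffer) < buffer_size:
        doc = next(mix_iterator, None)
        if doc is None:  # iterator exhausted
            break
        buffer.append(doc)
    return len(buffer)


buffer = ["doc1"]
print(refresh_buffer(buffer, iter(["doc2", "doc3"]), 5))  # prints 3
```

Checking the size before pulling from the iterator means a buffer that is already full consumes no documents, so they remain available for a later refresh.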

setClassifier()

Stores a classifier instance to disk, first separating it out into individual files containing serialized and compressed property data. The basic classifier information, such as class label and summary statistics, is stored uncompressed in a file called `classifier.txt'.


    public
            static        setClassifier(object $classifier) : mixed

The classifier directory and all of its contents are made world-writable so that they can be manipulated without hassle from the command line.

Parameters

$classifier : object: Classifier instance to store to disk

Return values

mixed —

storeLoadedProperties()

Stores the data associated with each property name listed in the loaded_properties instance attribute back to disk. The data for each property is stored in its own serialized and compressed file, and made world-writable.


    public
                    storeLoadedProperties() : mixed

Return values

mixed —

tokenizeDescription()

Tokenizes a string into a map from terms to within-string frequencies.


    public
                    tokenizeDescription(string $description) : array<string|int, mixed>

Parameters

$description : string: string to tokenize

Return values

array<string|int, mixed> —

associative array mapping terms to their within-string frequencies
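
The shape of the returned map can be illustrated with a small Python sketch. The splitting rule used here (lowercasing and matching runs of word characters) is an assumption; the actual method may apply language-specific segmentation or stemming.

```python
import re
from collections import Counter


def tokenize_description(description):
    """Map each term to the number of times it occurs in the string."""
    terms = re.findall(r"\w+", description.lower())
    return dict(Counter(terms))


print(tokenize_description("The cat saw the other cat"))
```

These within-string frequencies are the raw material for the feature vectors stored in the training set.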

train()

Trains the Naive Bayes classification algorithm used during labeling on the current training set, and optionally updates the estimated accuracy.


    public
                    train([bool $update_accuracy = false ]) : mixed

Parameters

$update_accuracy : bool = false: optional parameter specifying whether or not to update the accuracy estimate after training completes; defaults to false

Return values

mixed —

updateAccuracy()

Estimates current classification accuracy using a Naive Bayes classification algorithm. Accuracy is estimated by splitting the current training set into fifths, reserving four fifths for training, and the remaining fifth for testing. A fresh classifier is trained and tested on these splits, and the total accuracy recorded. Then the splits are rotated so that the previous testing fifth becomes part of the training set, and one of the blocks from the previous training set becomes the testing set. A new classifier is trained and tested on the new splits, and, again, the accuracy recorded. This process is repeated until all blocks have been used for testing, and the average accuracy recorded.


    public
                    updateAccuracy([object $X = null ][, array<string|int, mixed> $y = null ]) : mixed

Parameters

$X : object = null: optional sparse matrix representing the already-mapped training set to use; if not provided, the current training set is mapped using the label_features property
$y : array<string|int, mixed> = null: optional array of document labels corresponding to the training set; if not provided the current training set labels are used

Return values

mixed —
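
The rotation described above is five-fold cross-validation. The Python sketch below shows the procedure with pluggable `train` and `predict` callables and a trivial majority-label classifier standing in for Naive Bayes; the contiguous block boundaries and all names are assumptions for illustration.

```python
def five_fold_accuracy(docs, labels, train, predict, folds=5):
    """Average test accuracy over `folds` rotations of contiguous blocks."""
    n = len(docs)
    bounds = [k * n // folds for k in range(folds + 1)]
    accuracies = []
    for k in range(folds):
        test_idx = list(range(bounds[k], bounds[k + 1]))
        train_idx = [i for i in range(n) if i < bounds[k] or i >= bounds[k + 1]]
        if not test_idx or not train_idx:
            continue  # too few documents to fill this fold
        # A fresh model is trained for every rotation.
        model = train([docs[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(1 for i in test_idx if predict(model, docs[i]) == labels[i])
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies) if accuracies else 0.0


def train_majority(train_docs, train_labels):
    """Toy stand-in for Naive Bayes: remember the most common label."""
    return max(set(train_labels), key=train_labels.count)


def predict_majority(model, doc):
    return model


docs = list(range(10))
labels = [1] * 8 + [-1] * 2
print(five_fold_accuracy(docs, labels, train_majority, predict_majority))
```

Every document is used for testing exactly once and never appears in the training portion of the same rotation, so the average is not inflated by a model scoring documents it was trained on.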
