Yioop_V9.5_Source_Code

Features
in package Application

Manages a dataset's features, providing a standard interface for converting documents to feature vectors, and for accessing feature statistics.

Each document in the training set is expected to be passed through an instance of a subclass of this abstract class in order to convert it to a feature vector. Terms are replaced with contiguous feature indices (e.g., 'Pythagorean' => 1, 'theorem' => 2, and so on). The value at a feature index is determined by the subclass; one subclass might weight terms according to how often they occur in the document, while another might use a simple binary representation. The feature index 0 is reserved for an intercept term, which always has a value of one.

$feature_map

Maps old feature indices to new ones when a feature subset operation has been applied to restrict the number of features.


    public
        array<string|int, mixed>
    $feature_map

$label_freqs

Maps labels to the number of documents they're assigned to.


    public
        array<string|int, mixed>
    $label_freqs
     = [-1 => 0, 1 => 0]

$top_terms

A list of the top terms according to the last feature subset operation, if any.


    public
        array<string|int, mixed>
    $top_terms
     = []

$var_freqs

Maps terms to how often they occur in documents by label.


    public
        array<string|int, mixed>
    $var_freqs
     = []

$vocab

Maps terms to their feature indices, which start at 1.


    public
        array<string|int, mixed>
    $vocab
     = []

addExample()

Maps a new example to a feature vector, adding any new terms to the vocabulary, and updating term and label statistics. The example should be an array of terms and their counts, and the output simply replaces terms with feature indices.


    public
                    addExample(array<string|int, mixed> $terms, int $label) : array<string|int, mixed>

Parameters

$terms : array<string|int, mixed>: array of terms mapped to the number of times they occur in the example
$label : int: label for this example, either -1 or 1

Return values

array<string|int, mixed> —

input example with terms replaced by feature indices
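The bookkeeping described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Yioop implementation: the function addExampleSketch and the $state array are hypothetical stand-ins for the method and for the vocab, var_freqs and label_freqs properties documented on this page.

```php
<?php
// Hypothetical sketch: $state stands in for the properties of a
// Features instance; the real method operates on $this instead.
function addExampleSketch(array &$state, array $terms, int $label): array
{
    $state['label_freqs'][$label]++;
    $features = [0 => 1]; // index 0 is the intercept, always one
    foreach ($terms as $term => $count) {
        if (!isset($state['vocab'][$term])) {
            // New term: take the next contiguous index, starting at 1
            $state['vocab'][$term] = count($state['vocab']) + 1;
        }
        $j = $state['vocab'][$term];
        $features[$j] = $count;
        // Term statistics count one occurrence per example
        $state['var_freqs'][$j][$label] =
            ($state['var_freqs'][$j][$label] ?? 0) + 1;
    }
    ksort($features);
    return $features;
}

$state = [
    'vocab' => [],
    'var_freqs' => [],
    'label_freqs' => [-1 => 0, 1 => 0],
];
$first = addExampleSketch($state, ['Pythagorean' => 2, 'theorem' => 1], 1);
$second = addExampleSketch($state, ['theorem' => 3, 'proof' => 1], -1);
```

Here $first holds indices 0, 1 and 2, while $second reuses index 2 for 'theorem' and assigns index 3 to the new term 'proof'.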

labelStats()

Returns the positive and negative label counts for the training set.


    public
                    labelStats() : array<string|int, mixed>

Return values

array<string|int, mixed> —

positive and negative label counts indexed by label, either 1 or -1

mapDocument()

Transforms an associative array of terms and their counts within a single document into a feature vector, exactly like a row in the sparse matrix returned by mapTrainingSet. This method is used to transform a tokenized document prior to classification.


    public
    abstract                mapDocument(array<string|int, mixed> $tokens) : array<string|int, mixed>

Parameters

$tokens : array<string|int, mixed>: associative array of terms mapped to their within-document counts

Return values

array<string|int, mixed> —

feature vector corresponding to the tokens, mapped according to the implementation of a particular Features subclass
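Since mapDocument is abstract, its behavior depends on the subclass. As one possibility, a binary representation could look like the sketch below. The function name mapDocumentBinary is hypothetical, and skipping terms that are missing from the vocabulary is an assumption rather than documented behavior.

```php
<?php
// Illustrative only: a binary mapping such as a subclass might use.
// $vocab maps terms to feature indices starting at 1.
function mapDocumentBinary(array $vocab, array $tokens): array
{
    $vector = [0 => 1]; // intercept term
    foreach ($tokens as $term => $count) {
        // Assumption: terms never seen during training are skipped
        if ($count > 0 && isset($vocab[$term])) {
            $vector[$vocab[$term]] = 1; // presence only, count ignored
        }
    }
    ksort($vector);
    return $vector;
}

$vocab = ['Pythagorean' => 1, 'theorem' => 2, 'proof' => 3];
$vector = mapDocumentBinary(
    $vocab,
    ['theorem' => 4, 'lemma' => 1, 'proof' => 2]
);
```

In this example 'lemma' is not in the vocabulary, so only the intercept and the indices for 'theorem' and 'proof' appear in $vector.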

mapToRestrictedFeatures()

Maps the indices of a feature vector to those used by a restricted feature set, dropping any features that aren't in the map. If this Features instance isn't restricted, then the passed-in features are returned unmodified.


    public
                    mapToRestrictedFeatures(array<string|int, mixed> $features) : array<string|int, mixed>

Parameters

$features : array<string|int, mixed>: feature vector mapping feature indices to frequencies

Return values

array<string|int, mixed> —

original feature vector with indices mapped according to the feature_map property, and any features that don't occur in feature_map dropped
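A minimal sketch of this remapping is shown below. The function name mapToRestrictedSketch is hypothetical, the feature map is passed in explicitly instead of being read from the feature_map property, and the assumption that the map carries the intercept as 0 => 0 is mine, not the source's.

```php
<?php
// Hypothetical sketch: $feature_map maps old indices to new ones;
// an empty or null map is treated as "not restricted".
function mapToRestrictedSketch(?array $feature_map, array $features): array
{
    if (empty($feature_map)) {
        return $features; // unrestricted: return the input unmodified
    }
    $mapped = [];
    foreach ($features as $j => $value) {
        if (isset($feature_map[$j])) {
            $mapped[$feature_map[$j]] = $value;
        }
        // Features absent from the map are dropped
    }
    return $mapped;
}

$feature_map = [0 => 0, 2 => 1, 5 => 2];
$features = [0 => 1, 1 => 3, 2 => 1, 5 => 4];
$restricted = mapToRestrictedSketch($feature_map, $features);
$untouched = mapToRestrictedSketch(null, $features);
```

Old feature 1 is not in the map and disappears, while old features 2 and 5 become 1 and 2 in the restricted vector.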

mapTrainingSet()

Given an array of feature vectors mapping feature indices to counts, returns a sparse matrix representing the dataset transformed according to the specific Features subclass. A Features subclass might use simple binary features, but it might also use some form of TF * IDF, which requires the full dataset in order to assign weights to particular document features. This is why the entire training set must be mapped before it is passed to a classification algorithm.


    public
    abstract                mapTrainingSet(array<string|int, mixed> $docs) : object

Parameters

$docs : array<string|int, mixed>: array of training examples represented as feature vectors where the values are per-example counts

Return values

object —

SparseMatrix instance whose rows are the transformed feature vectors
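To show why a weighting such as TF * IDF needs the whole training set, here is a rough sketch. Everything in it is an assumption made for illustration: the function name tfIdfRows is hypothetical, the weight count * log(N / document frequency) is only one common TF * IDF variant, and plain nested arrays are returned in place of a SparseMatrix instance.

```php
<?php
// Illustrative TF * IDF weighting over feature vectors of counts.
// Returns plain arrays; the documented method returns a SparseMatrix.
function tfIdfRows(array $docs): array
{
    $n = count($docs);
    // First pass: number of documents each feature occurs in
    $doc_freqs = [];
    foreach ($docs as $features) {
        foreach ($features as $j => $count) {
            if ($j !== 0 && $count > 0) {
                $doc_freqs[$j] = ($doc_freqs[$j] ?? 0) + 1;
            }
        }
    }
    // Second pass: weight each count by the inverse document frequency
    $rows = [];
    foreach ($docs as $i => $features) {
        $row = [];
        foreach ($features as $j => $count) {
            if ($j === 0) {
                $row[0] = 1; // intercept is left unweighted
            } elseif ($count > 0) {
                $row[$j] = $count * log($n / $doc_freqs[$j]);
            }
        }
        $rows[$i] = $row;
    }
    return $rows;
}

$docs = [
    [0 => 1, 1 => 2, 2 => 1],
    [0 => 1, 2 => 3],
];
$rows = tfIdfRows($docs);
```

Feature 2 occurs in both documents, so under this variant its weight is zero in every row, while feature 1 keeps a positive weight. The document frequencies cannot be known until all examples have been seen, which is the reason for the pass over the full dataset.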

numFeatures()

Returns the number of features, not including the intercept term represented by feature zero. For example, if we had features 0..10, this function would return 10.


    public
                    numFeatures() : int

Return values

int —

the number of features in the training set

restrict()

Given a FeatureSelection instance, returns a clone of this Features instance that uses a restricted feature subset. The new Features instance is augmented with a feature map that it can use to convert feature indices from the larger feature set to indices for the reduced set.


    public
                    restrict(object $fs) : object

Parameters

$fs : object: FeatureSelection instance to be used to select the most informative terms

Return values

object —

new Features instance using the restricted feature set

updateExampleLabel()

Updates the label and term statistics to reflect a label change for an example from the training set. A new label of 0 indicates that the example is being removed entirely. Note that term statistics only count one occurrence of a term per example.


    public
                    updateExampleLabel(array<string|int, mixed> $features, int $old_label, int $new_label) : mixed

Parameters

$features : array<string|int, mixed>: feature vector from when the example was originally added
$old_label : int: old example label in {-1, 1}
$new_label : int: new example label in {-1, 0, 1}, where 0 indicates that the example should be removed entirely

Return values

mixed —
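The statistics update for a label change might be sketched as follows. The function updateLabelSketch and the $state array are hypothetical, and the sketch assumes the example was originally added under $old_label so that its counts are present.

```php
<?php
// Hypothetical sketch: moves one example's contribution from
// $old_label to $new_label; a new label of 0 removes the example.
function updateLabelSketch(
    array &$state,
    array $features,
    int $old_label,
    int $new_label
): void {
    if ($old_label === $new_label) {
        return;
    }
    $state['label_freqs'][$old_label]--;
    if ($new_label !== 0) {
        $state['label_freqs'][$new_label]++;
    }
    foreach (array_keys($features) as $j) {
        if ($j === 0) {
            continue; // the intercept has no term statistics
        }
        // One occurrence per example, whatever the stored count
        $state['var_freqs'][$j][$old_label] =
            ($state['var_freqs'][$j][$old_label] ?? 0) - 1;
        if ($new_label !== 0) {
            $state['var_freqs'][$j][$new_label] =
                ($state['var_freqs'][$j][$new_label] ?? 0) + 1;
        }
    }
}

$state = [
    'label_freqs' => [-1 => 1, 1 => 1],
    'var_freqs' => [
        1 => [1 => 1],
        2 => [1 => 1, -1 => 1],
        3 => [-1 => 1],
    ],
];
// Relabel the first example from 1 to -1, then remove the second
updateLabelSketch($state, [0 => 1, 1 => 2, 2 => 1], 1, -1);
updateLabelSketch($state, [0 => 1, 2 => 3, 3 => 1], -1, 0);
```

After both calls a single example remains, labeled -1, and it contains features 1 and 2.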

varStats()

Returns the statistics for a particular feature and label in the training set. The statistics are counts of how often the term appears or fails to appear in examples with or without the target label. They are returned in a flat array, in the following order:

    0 => # examples where feature present, label matches
    1 => # examples where feature present, label doesn't match
    2 => # examples where feature absent, label matches
    3 => # examples where feature absent, label doesn't match


    public
                    varStats(int $j, int $label) : array<string|int, mixed>

Parameters

$j : int: feature index
$label : int: target label

Return values

array<string|int, mixed> —

feature statistics in 4-element flat array
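The four counts form a 2x2 contingency table and can be derived from the label and term frequencies. The sketch below is one way to do that, not necessarily how the class computes them; the function name varStatsSketch is hypothetical, and the frequencies are passed in as arguments in the shapes described for label_freqs and var_freqs.

```php
<?php
// Hypothetical sketch: derive the 4-element statistics array from
// per-label example counts and per-feature, per-label term counts.
function varStatsSketch(
    array $label_freqs,
    array $var_freqs,
    int $j,
    int $label
): array {
    $total = array_sum($label_freqs);            // all examples
    $with_label = $label_freqs[$label] ?? 0;     // examples with label
    $present = array_sum($var_freqs[$j] ?? []);  // examples with feature
    $present_match = $var_freqs[$j][$label] ?? 0;
    $present_other = $present - $present_match;
    return [
        $present_match,                          // present, matches
        $present_other,                          // present, no match
        $with_label - $present_match,            // absent, matches
        $total - $with_label - $present_other,   // absent, no match
    ];
}

$label_freqs = [-1 => 6, 1 => 4];
$var_freqs = [2 => [1 => 3, -1 => 1]];
$stats = varStatsSketch($label_freqs, $var_freqs, 2, 1);
```

With 10 examples in total, feature 2 appears in 3 of the 4 positive examples and 1 of the 6 negative ones, so the four entries of $stats sum to 10.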
