Yioop_V9.5_Source_Code_Documentation

LassoRegression extends ClassifierAlgorithm

Implements the logistic regression text classification algorithm using lasso regression and a cyclic coordinate descent optimization step.

This algorithm is rather slow to converge for large datasets or a large number of features, but it does provide regularization in order to combat over-fitting, and it outperforms Naive Bayes in tests on the same data set. The algorithm augments a standard cyclic coordinate descent approach by "sleeping" features that don't change significantly during a single step. Each time an optimization step for a feature fails to change the feature weight beyond some threshold, that feature is forced to sit out the next optimization round. The threshold increases over successive rounds, effectively placing an upper limit on the number of iterations over all features while simultaneously limiting the number of features updated in each round. This optimization speeds up convergence, but at the cost of some accuracy.
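The sleeping-feature schedule described above can be sketched as follows. This is a toy Python illustration, not Yioop's PHP code; `step(j)` stands in for one coordinate-descent update and is assumed to return the change in feature j's weight:

```python
def cyclic_cd_with_sleeping(features, step, threshold=1e-3, growth=2.0,
                            max_rounds=50):
    """Cyclic coordinate descent where a feature whose update falls below
    the current threshold sits out the next round. The threshold grows
    each round, bounding the total number of iterations."""
    asleep = set()
    for _ in range(max_rounds):
        awake = [j for j in features if j not in asleep]
        if not awake:
            break                     # every feature slept last round
        asleep = set()
        for j in awake:
            if abs(step(j)) < threshold:
                asleep.add(j)         # sit out the next round only
        threshold *= growth
```

Because the threshold only ever grows, every feature's updates eventually fall below it, so the schedule terminates even without an explicit convergence test.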

Tags
author

Shawn Tice

Table of Contents

$beta  : array<string|int, mixed>
Beta vector of feature weights resulting from the training phase. The dot product of this vector with a feature vector yields the log odds that the feature vector describes a document belonging to the trained-for class.
$debug  : int
Level of detail to be used for logging. Higher values mean more detail.
$epsilon  : float
Threshold used to determine convergence.
$lambda  : float
Lambda parameter for the CLG algorithm.
classify()  : mixed
Returns the pseudo-probability that a new instance is a positive example of the class the beta vector was trained to recognize. It only makes sense to try classification after at least some training has been done on a dataset that includes both positive and negative examples of the target class.
computeApproxLikelihood()  : array<string|int, mixed>
Computes the approximate likelihood of y given a single feature, and returns it as a pair <numerator, denominator>.
estimateLambdaNorm()  : float
Estimates the lambda parameter from the dataset.
log()  : mixed
Writes a message to the log file, depending on the debug level for this subpackage.
score()  : float
Computes an approximate score that can be used to get an idea of how much a given optimization step improved the likelihood of the data set.
train()  : mixed
An adaptation of the Zhang-Oles 2001 CLG algorithm by Genkin et al. to use the Laplace prior for parameter regularization. On completion, the beta vector is optimized to maximize the penalized likelihood of the data set.

Properties

$beta

Beta vector of feature weights resulting from the training phase. The dot product of this vector with a feature vector yields the log odds that the feature vector describes a document belonging to the trained-for class.

public array<string|int, mixed> $beta

$debug

Level of detail to be used for logging. Higher values mean more detail.

public int $debug = 0

$epsilon

Threshold used to determine convergence.

public float $epsilon = 0.001

$lambda

Lambda parameter for the CLG algorithm.

public float $lambda = 1.0

Methods

classify()

Returns the pseudo-probability that a new instance is a positive example of the class the beta vector was trained to recognize. It only makes sense to try classification after at least some training has been done on a dataset that includes both positive and negative examples of the target class.

public classify(array<string|int, mixed> $x) : mixed
Parameters
$x : array<string|int, mixed>

feature vector represented by an associative array mapping features to their weights

Return values
mixed
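Since the beta vector's dot product with a feature vector gives the log odds, the pseudo-probability is presumably the logistic sigmoid of that dot product. A minimal Python sketch under that assumption, with dicts standing in for the PHP associative arrays (all names illustrative):

```python
import math

def classify(beta, x):
    """Pseudo-probability that feature vector x is a positive example of
    the trained-for class. beta and x map feature ids to weights;
    features absent from beta contribute nothing."""
    dot = sum(w * beta.get(f, 0.0) for f, w in x.items())
    return 1.0 / (1.0 + math.exp(-dot))
```

With an untrained (empty) beta the result is 0.5, which is why classification only makes sense after training on both positive and negative examples.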

computeApproxLikelihood()

Computes the approximate likelihood of y given a single feature, and returns it as a pair <numerator, denominator>.

public computeApproxLikelihood(object $Xj, array<string|int, mixed> $y, array<string|int, mixed> $r, float $d) : array<string|int, mixed>
Parameters
$Xj : object

iterator over the non-zero entries in column j of the data

$y : array<string|int, mixed>

labels corresponding to entries in $Xj; each label is 1 if example i has the target label, and -1 otherwise

$r : array<string|int, mixed>

cached dot products of the beta vector and feature weights for each example i

$d : float

trust region for feature j

Return values
array<string|int, mixed>

two-element array containing the numerator and denominator of the likelihood

estimateLambdaNorm()

Estimates the lambda parameter from the dataset.

public estimateLambdaNorm(object $invX) : float
Parameters
$invX : object

inverted X matrix for dataset (essentially a posting list of features in X)

Return values
float

lambda estimate

log()

Writes a message to the log file, depending on the debug level for this subpackage.

public log(string $message) : mixed
Parameters
$message : string

what to write to the log

Return values
mixed

score()

Computes an approximate score that can be used to get an idea of how much a given optimization step improved the likelihood of the data set.

public score(array<string|int, mixed> $r, array<string|int, mixed> $y, array<string|int, mixed> $beta) : float
Parameters
$r : array<string|int, mixed>

cached dot products of the beta vector and feature weights for each example i

$y : array<string|int, mixed>

labels for each example

$beta : array<string|int, mixed>

beta vector of feature weights (used to penalize large weights)

Return values
float

value proportional to the likelihood of the data, penalized by the magnitude of the beta vector
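Given the parameters, the score is presumably the log likelihood of the labels computed from the cached dot products, penalized by the magnitude of beta (the Laplace prior corresponds to an L1 penalty). A hedged Python sketch under that assumption; Yioop's actual score() may differ in constants:

```python
import math

def score(r, y, beta, lam=1.0):
    """Penalized log likelihood sketch. r[i] is the cached dot product
    beta . x_i for example i, y[i] is +1 or -1, and lam scales the
    L1 penalty on the beta vector."""
    log_likelihood = -sum(math.log(1.0 + math.exp(-yi * ri))
                          for yi, ri in zip(y, r))
    return log_likelihood - lam * sum(abs(b) for b in beta)
```

Comparing this value before and after an optimization step shows whether the step improved the (penalized) fit to the data.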

train()

An adaptation of the Zhang-Oles 2001 CLG algorithm by Genkin et al. to use the Laplace prior for parameter regularization. On completion, the beta vector is optimized to maximize the penalized likelihood of the data set.

public train(object $X, array<string|int, mixed> $y) : mixed
Parameters
$X : object

SparseMatrix representing the training dataset

$y : array<string|int, mixed>

array of known labels corresponding to the rows of $X

Return values
mixed
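The CLG loop with a Laplace prior, following the published Genkin et al. formulation, might be sketched like this. Illustrative Python, not Yioop's PHP: `X_cols[j]` plays the role of the SparseMatrix's column-j posting list of `(i, x_ij)` pairs, and the inner helper mirrors what computeApproxLikelihood() is documented to return:

```python
import math

def approx_likelihood(col, y, r, d):
    """Return the (numerator, denominator) pair for a tentative CLG step
    on one feature, given its posting list, labels, cached dot products
    r[i] = beta . x_i, and trust-region width d."""
    num = den = 0.0
    for i, x in col:
        yr = y[i] * r[i]
        num += x * y[i] / (1.0 + math.exp(yr))
        dlt = d * abs(x)
        if abs(yr) <= dlt:
            f = 0.25   # bound on the logistic second derivative
        else:
            f = 1.0 / (2.0 + math.exp(abs(yr) - dlt) + math.exp(dlt - abs(yr)))
        den += x * x * f
    return num, den

def train(X_cols, y, lam=1.0, epsilon=1e-3, max_iters=100):
    """Cyclic CLG with an L1 (Laplace-prior) penalty; sketch only."""
    n = len(X_cols)
    beta = [0.0] * n
    r = [0.0] * len(y)            # cached dot products beta . x_i
    trust = [1.0] * n             # per-feature trust regions
    for _ in range(max_iters):
        total_change = 0.0
        for j in range(n):
            num, den = approx_likelihood(X_cols[j], y, r, trust[j])
            if den == 0.0:
                continue
            if beta[j] == 0.0:    # at zero, try both signs of the penalty
                delta = (num - lam) / den
                if delta <= 0.0:
                    delta = (num + lam) / den
                    if delta >= 0.0:
                        delta = 0.0
            else:
                s = 1.0 if beta[j] > 0 else -1.0
                delta = (num - lam * s) / den
                if (beta[j] + delta) * beta[j] < 0.0:
                    delta = -beta[j]   # don't let the step cross zero
            delta = max(-trust[j], min(trust[j], delta))
            beta[j] += delta
            for i, x in X_cols[j]:
                r[i] += delta * x      # keep cached dot products current
            trust[j] = max(2.0 * abs(delta), trust[j] / 2.0)
            total_change += abs(delta)
        if total_change < epsilon:     # epsilon: convergence threshold
            break
    return beta
```

Keeping the dot products in `r` updated incrementally is what makes the per-feature step cheap: each update touches only the non-zero entries in that feature's column.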

        
