LassoRegression
extends ClassifierAlgorithm
Implements a logistic regression text classification algorithm with lasso (L1) regularization, trained by cyclic coordinate descent.
This algorithm converges slowly on large datasets or large feature sets, but its regularization combats over-fitting, and it outperforms Naive Bayes in tests on the same data sets. The algorithm augments a standard cyclic coordinate descent approach by "sleeping" features that do not change significantly during a single step: each time an optimization step fails to move a feature's weight beyond some threshold, that feature is forced to sit out the next optimization round. The threshold increases over successive rounds, effectively placing an upper limit on the number of iterations over all features while simultaneously limiting the number of features updated in each round. This optimization speeds up convergence, at the cost of some accuracy.
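The sleeping scheme described above can be sketched roughly as follows; `optimize_feature` (which performs one coordinate step and returns the weight change) and the particular threshold schedule are hypothetical stand-ins, not the class's actual code:

```python
def sleeping_descent(features, optimize_feature, rounds=4):
    """Cyclic coordinate descent in which a feature whose weight moves
    less than the current threshold sits out the next round. Returns the
    number of coordinate steps taken per round."""
    sleeping = set()
    threshold = 1e-6                 # hypothetical starting threshold
    steps_per_round = []
    for _ in range(rounds):
        next_sleeping = set()
        steps = 0
        for j in features:
            if j in sleeping:
                continue             # slept features skip this round
            steps += 1
            if abs(optimize_feature(j)) < threshold:
                next_sleeping.add(j) # barely moved: sleep next round
        steps_per_round.append(steps)
        sleeping = next_sleeping
        threshold *= 2               # threshold grows each round
    return steps_per_round
```

Note that a slept feature is only skipped for one round: it is not re-added to the sleep set while sleeping, so it is reconsidered in the round after.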
Table of Contents
- $beta : array<string|int, mixed>
- Beta vector of feature weights resulting from the training phase. The dot product of this vector with a feature vector yields the log odds that the feature vector describes a document belonging to the trained-for class.
- $debug : int
- Level of detail to be used for logging. Higher values mean more detail.
- $epsilon : float
- Threshold used to determine convergence.
- $lambda : float
- Lambda parameter of the CLG algorithm.
- classify() : mixed
- Returns the pseudo-probability that a new instance is a positive example of the class the beta vector was trained to recognize. Classification is only meaningful after at least some training on a dataset that includes both positive and negative examples of the target class.
- computeApproxLikelihood() : array<string|int, mixed>
- Computes the approximate likelihood of y given a single feature, and returns it as a pair <numerator, denominator>.
- estimateLambdaNorm() : float
- Estimates the lambda parameter from the dataset.
- log() : mixed
- Writes a message to the log file, depending on the debug level for this subpackage.
- score() : float
- Computes an approximate score that can be used to get an idea of how much a given optimization step improved the likelihood of the data set.
- train() : mixed
- An adaptation of the Zhang-Oles 2001 CLG algorithm by Genkin et al. to use the Laplace prior for parameter regularization. On completion, the beta vector has been optimized to maximize the penalized likelihood of the data set.
Properties
$beta
Beta vector of feature weights resulting from the training phase. The dot product of this vector with a feature vector yields the log odds that the feature vector describes a document belonging to the trained-for class.
public
array<string|int, mixed>
$beta
$debug
Level of detail to be used for logging. Higher values mean more detail.
public
int
$debug
= 0
$epsilon
Threshold used to determine convergence.
public
float
$epsilon
= 0.001
$lambda
Lambda parameter of the CLG algorithm.
public
float
$lambda
= 1.0
Methods
classify()
Returns the pseudo-probability that a new instance is a positive example of the class the beta vector was trained to recognize. Classification is only meaningful after at least some training on a dataset that includes both positive and negative examples of the target class.
public
classify(array<string|int, mixed> $x) : mixed
Parameters
- $x : array<string|int, mixed>
feature vector represented by an associative array mapping features to their weights
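Conceptually, the returned value is the logistic function applied to the dot product of the beta vector with $x. A minimal sketch of that computation (the dict-of-weights representation is an assumption):

```python
import math

def classify(beta, x):
    """Pseudo-probability that feature vector x (a dict mapping feature
    ids to weights) is a positive example, given trained weights beta."""
    dot = sum(w * beta.get(f, 0.0) for f, w in x.items())
    return 1.0 / (1.0 + math.exp(-dot))   # squash log odds into (0, 1)
```

With an empty (untrained) beta vector the dot product is zero and the result is exactly 0.5, which is why classification only makes sense after training.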
Return values
mixed — pseudo-probability that the instance is a positive example of the target class
computeApproxLikelihood()
Computes the approximate likelihood of y given a single feature, and returns it as a pair <numerator, denominator>.
public
computeApproxLikelihood(object $Xj, array<string|int, mixed> $y, array<string|int, mixed> $r, float $d) : array<string|int, mixed>
Parameters
- $Xj : object
iterator over the non-zero entries in column j of the data
- $y : array<string|int, mixed>
labels corresponding to entries in $Xj; each label is 1 if example i has the target label, and -1 otherwise
- $r : array<string|int, mixed>
cached dot products of the beta vector and feature weights for each example i
- $d : float
trust region for feature j
Return values
array<string|int, mixed> —two-element array containing the numerator and denominator of the likelihood
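In the Genkin et al. formulation this pair yields the tentative update for feature j as numerator/denominator: the numerator accumulates first-derivative terms of the log-likelihood, and the denominator an upper bound on the second derivative over the trust region. A hedged sketch (the bound F is taken from the published algorithm and assumed, not verified against this class):

```python
import math

def compute_approx_likelihood(Xj, y, r, d):
    """Xj iterates over (i, x_ij) non-zero entries of column j; y[i] is
    +1 or -1; r[i] caches beta . x_i; d is the trust-region width.
    Returns the (numerator, denominator) pair for feature j's update."""
    num = den = 0.0
    for i, x_ij in Xj:
        # First derivative of the log-likelihood with respect to beta_j
        num += x_ij * y[i] / (1.0 + math.exp(y[i] * r[i]))
        delta = d * abs(x_ij)
        if abs(r[i]) <= delta:
            f = 0.25                 # worst-case curvature of logistic loss
        else:
            f = 1.0 / (2.0 + math.exp(abs(r[i]) - delta)
                           + math.exp(delta - abs(r[i])))
        den += x_ij * x_ij * f       # upper bound on the second derivative
    return num, den
```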
estimateLambdaNorm()
Estimates the lambda parameter from the dataset.
public
estimateLambdaNorm(object $invX) : float
Parameters
- $invX : object
inverted X matrix for dataset (essentially a posting list of features in X)
Return values
float —lambda estimate
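One common heuristic, proposed by Genkin et al. and assumed (not verified) to be what this method follows, sets the prior variance to the number of features divided by the mean squared example norm, then converts that variance to a Laplace scale parameter:

```python
import math

def estimate_lambda(inv_x, num_features, num_examples):
    """inv_x: iterable of (feature, postings) pairs, where postings lists
    the non-zero (i, x_ij) entries for that feature (a posting-list view
    of the data matrix X). Returns a heuristic estimate of lambda."""
    total_sq = 0.0
    for feature, postings in inv_x:
        for i, x_ij in postings:
            total_sq += x_ij * x_ij          # accumulates sum_i ||x_i||^2
    mean_sq_norm = total_sq / num_examples
    sigma_sq = num_features / mean_sq_norm   # heuristic prior variance
    return math.sqrt(2.0 / sigma_sq)         # Laplace: variance = 2/lambda^2
```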
log()
Writes a message to the log file, depending on the debug level for this subpackage.
public
log(string $message) : mixed
Parameters
- $message : string
what to write to the log
Return values
mixed —
score()
Computes an approximate score that can be used to get an idea of how much a given optimization step improved the likelihood of the data set.
public
score(array<string|int, mixed> $r, array<string|int, mixed> $y, array<string|int, mixed> $beta) : float
Parameters
- $r : array<string|int, mixed>
cached dot products of the beta vector and feature weights for each example i
- $y : array<string|int, mixed>
labels for each example
- $beta : array<string|int, mixed>
beta vector of feature weights (used to penalize large weights)
Return values
float —value proportional to the likelihood of the data, penalized by the magnitude of the beta vector
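One plausible form of such a score is the log-likelihood of the labels under the cached scores, minus an L1 penalty on the weights; this is a hedged sketch, and the `lam` weight on the penalty is an assumption rather than the method's actual signature:

```python
import math

def score(r, y, beta, lam=1.0):
    """Log-likelihood of the labels given cached scores r (r[i] = beta . x_i),
    minus a lambda-weighted L1 penalty on the weights; higher is better."""
    log_likelihood = -sum(math.log(1.0 + math.exp(-y[i] * r[i]))
                          for i in range(len(y)))
    penalty = lam * sum(abs(b) for b in beta)
    return log_likelihood - penalty
```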
train()
An adaptation of the Zhang-Oles 2001 CLG algorithm by Genkin et al. to use the Laplace prior for parameter regularization. On completion, the beta vector has been optimized to maximize the penalized likelihood of the data set.
public
train(object $X, array<string|int, mixed> $y) : mixed
Parameters
- $X : object
SparseMatrix representing the training dataset
- $y : array<string|int, mixed>
array of known labels corresponding to the rows of $X
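Putting the pieces together, the CLG training loop with a Laplace prior can be sketched as follows. This is a hedged sketch, not the class's actual code: it uses the crude global bound 0.25 on the logistic loss's second derivative in place of a trust-region-sensitive bound, and `columns` (per-feature posting lists of `(i, x_ij)` pairs) is an assumed representation of the SparseMatrix.

```python
import math

def train(columns, y, lam=1.0, epsilon=0.001, max_iters=100):
    """Cyclic coordinate descent for L1-penalized logistic regression.
    columns[j] lists the non-zero (i, x_ij) entries of feature j;
    y[i] is +1 or -1. Returns the fitted beta vector."""
    m, n = len(columns), len(y)
    beta = [0.0] * m
    trust = [1.0] * m                # per-feature trust-region widths
    r = [0.0] * n                    # cached dot products beta . x_i
    for _ in range(max_iters):
        total_change = 0.0
        for j in range(m):
            # Numerator: gradient of the log-likelihood for feature j.
            # Denominator: 0.25 bounds the logistic second derivative.
            num = sum(x * y[i] / (1.0 + math.exp(y[i] * r[i]))
                      for i, x in columns[j])
            den = sum(0.25 * x * x for i, x in columns[j])
            if den == 0.0:
                continue
            if beta[j] == 0.0:
                dv = (num - lam) / den           # try moving positive
                if dv <= 0.0:
                    dv = (num + lam) / den       # else try moving negative
                    if dv >= 0.0:
                        dv = 0.0                 # penalty wins: stay at zero
            else:
                s = 1.0 if beta[j] > 0.0 else -1.0
                dv = (num - lam * s) / den
                if (beta[j] + dv) * s < 0.0:     # never cross zero in one step
                    dv = -beta[j]
            dv = max(-trust[j], min(trust[j], dv))
            if dv != 0.0:
                beta[j] += dv
                for i, x in columns[j]:          # keep cached dot products fresh
                    r[i] += dv * x
            total_change += abs(dv)
            trust[j] = max(2.0 * abs(dv), trust[j] / 2.0)
        if total_change < epsilon:               # converged
            break
    return beta
```

The zero-crossing guard implements the Laplace prior's non-differentiability at zero: a weight may move to zero in one step, but not through it.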