Classification Algorithms

This section covers a base classifier and four algorithms: logistic regression, k-nearest neighbors, quadratic discriminant analysis, and linear discriminant analysis.

The code is similar to the Regression code, except the output types are non-negative integers and we use different evaluation metrics. For example, we report a confusion matrix for each classifier, in addition to accuracy.

Base Classifier

This module builds a base class for classification problems, such as logistic regression or k-nearest neighbors classification. The preprocessing (if applicable) is done at this class level.

class classification.Classification(features, output, split_proportion=0.75, number_labels=None, standardized=True)[source]

A class used to represent a classification algorithm.

Parameters
  • features (numpy.ndarray) – Design matrix of explanatory variables.

  • output (numpy.ndarray) – Labels of data corresponding to feature matrix.

  • split_proportion (float) – Proportion of data to use for training; between 0 and 1.

  • number_labels (int) – The number of labels present in the data.

  • standardized (bool) – Whether to center/scale the data (train/test done separately). True by default.

number_labels

The number of labels present in existing and future data.

Type

int

sample_size

The sample size of all given data (train and test).

Type

int

train_size

The sample size of the training data.

Type

int

test_size

The sample size of the test data.

Type

int

train_rows

The list of indices for the train split.

Type

numpy.ndarray

test_rows

The list of indices for the test split.

Type

numpy.ndarray

train_features

The train design matrix.

Type

numpy.ndarray

test_features

The test design matrix.

Type

numpy.ndarray

train_output

The train output data.

Type

numpy.ndarray

test_output

The test output data.

Type

numpy.ndarray

dimension

The number of dimensions of the data, i.e., the number of columns of the design matrix. Does not include the output.

Type

int

standardize()[source]

Separately scale/center the train and test data so each feature (column of observations) has 0 mean and unit variance.
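A minimal sketch of this per-split standardization, assuming numpy and that train and test statistics are fit independently (the helper name and guard for constant columns are illustrative, not the module's internals):

```python
import numpy as np

def standardize_split(train_features, test_features):
    """Center/scale each feature column, fitting statistics separately per split."""
    def scale(matrix):
        mean = matrix.mean(axis=0)
        std = matrix.std(axis=0)
        std[std == 0] = 1.0  # guard against constant columns
        return (matrix - mean) / std
    return scale(train_features), scale(test_features)

train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
scaled_train, _ = standardize_split(train, train.copy())
print(scaled_train.mean(axis=0))  # → approximately [0. 0.]
```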

Logistic Classifier

This module builds a class for logistic regression problems. We compute the solution by directly maximizing the log-likelihood. To do so, we implement the Newton-Raphson method, following the treatment in [1].

References

[1] Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning. Springer New York Inc.

class logisticreg.Logistic(features, output, split_proportion=0.75, threshold=0.5, number_labels=None, standardized=True, tolerance=0.001, max_steps=100)[source]

A class used to represent a logistic regression classifier. We only list non-inherited attributes. We compute the maximum likelihood estimator manually via the Newton-Raphson method rather than using an off-the-shelf optimizer.

Parameters
  • features (numpy.ndarray) – Design matrix of explanatory variables.

  • output (numpy.ndarray) – Labels of data corresponding to feature matrix.

  • split_proportion (float) – Proportion of data to use for training; between 0 and 1.

  • threshold (float) – The minimum probability needed to classify a datapoint as a 1.

  • number_labels (int) – The number of labels present in the data.

  • standardized (bool) – Whether to center/scale the data (train/test done separately). True by default. This isn’t strictly necessary, but it may help the Newton-Raphson algorithm converge [2].

tolerance

One of two stopping criteria for the Newton-Raphson algorithm: iteration stops once the L_2 norm of the change between successive coefficient vectors falls below this value. Regardless, the algorithm terminates once max_steps is reached.

Type

float

max_steps

The maximum number of steps permitted for the Newton-Raphson algorithm. The algorithm terminates earlier, however, if the tolerance is achieved.

Type

int

coefficients

The coefficients in the logistic regression model.

Type

numpy.ndarray

threshold

The minimum probability needed to classify a datapoint as a 1.

Type

float

train_probs

The predicted probabilities each training observation has label 1.

Type

numpy.ndarray

train_predictions

The classified labels for the training data.

Type

numpy.ndarray

test_probs

The predicted probabilities each test observation has label 1.

Type

numpy.ndarray

test_predictions

The labels predicted for the given test data.

Type

numpy.ndarray

train_accuracy

The accuracy of the classifier evaluated on training data.

Type

float

test_accuracy

The accuracy of the classifier evaluated on test data.

Type

float

train_confusion

A confusion matrix of the classifier evaluated on training data.

Type

numpy.ndarray

test_confusion

A confusion matrix of the classifier evaluated on test data.

Type

numpy.ndarray

References

[2] https://stats.stackexchange.com/a/113027s

fit()[source]

Calculate the coefficient estimate on the training data.

Returns

beta_new – The estimated coefficients for logistic regression.

Return type

numpy.ndarray

static newton_raphson_update(features, beta_old, output)[source]

Compute a single step of the Newton-Raphson algorithm to compute the coefficient maximizing the likelihood for the logistic model.

This algorithm is a root finder, i.e., finds a solution to f(x)=0 for some function f. In this case, we set f to be the derivative of the log likelihood (i.e., the score function), to find the MLE.

Parameters
  • beta_old (numpy.ndarray) – The coefficients we would like to update

  • features (numpy.ndarray) – A design matrix of explanatory variables

  • output (numpy.ndarray) – The labels corresponding to our observed features

Returns

beta_new – The updated coefficients.

Return type

numpy.ndarray
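The update above, together with the tolerance and max_steps stopping rules, can be sketched end to end on synthetic data. This is a self-contained illustration of the standard Newton-Raphson step for logistic regression, beta_new = beta_old + (XᵀWX)⁻¹ Xᵀ(y − p), not the module's actual code; the variable names and toy data are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_raphson_update(features, beta_old, output):
    probs = sigmoid(features @ beta_old)      # p_i = P(y_i = 1 | x_i, beta_old)
    weights = np.diag(probs * (1 - probs))    # W in the IRLS formulation
    hessian = features.T @ weights @ features
    score = features.T @ (output - probs)     # gradient of the log-likelihood
    return beta_old + np.linalg.solve(hessian, score)

# Toy data: label is 1 with probability sigmoid(-0.5 + 2x).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
true_beta = np.array([-0.5, 2.0])
y = (rng.uniform(size=200) < sigmoid(X @ true_beta)).astype(float)

beta = np.zeros(2)
for _ in range(100):                                   # max_steps
    beta_next = newton_raphson_update(X, beta, y)
    if np.linalg.norm(beta_next - beta) < 1e-3:        # tolerance
        beta = beta_next
        break
    beta = beta_next
print(beta)  # estimate should recover a clearly positive slope
```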

static probability_estimate(features, coefficients)[source]

Compute P(y = 1|x, beta) = 1/(1+exp(-beta^Tx)). It can be used both for training and for prediction.

Parameters
  • features (numpy.ndarray) – A design matrix of observations to condition on.

  • coefficients (numpy.ndarray) – A vector of possible coefficients

Returns

estimate – The estimated probability of a label of 1 for each observation, conditional on the data and coefficients.

Return type

float
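The formula above is the standard logistic sigmoid applied row-wise to the design matrix; a minimal sketch (the function name mirrors the documented method, but the vectorized form is an assumption):

```python
import numpy as np

def probability_estimate(features, coefficients):
    # P(y = 1 | x, beta) = 1 / (1 + exp(-beta^T x)), applied row-wise
    return 1.0 / (1.0 + np.exp(-features @ coefficients))

X = np.array([[0.0, 0.0], [10.0, 10.0]])
beta = np.array([1.0, 1.0])
probs = probability_estimate(X, beta)
print(probs)  # → [0.5, ~1.0]: beta^T x = 0 gives probability 1/2
```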

static weighted_matrix(probabilities)[source]

Compute weight matrix for the weighted least squares problem used in the Newton-Raphson algorithm of solving logistic regression.

Parameters

probabilities (numpy.ndarray) – An array of probabilities from which we compute the weight matrix.

Returns

wt_matrix – Diagonal matrix with ith entries p(1-p), where p = P(y = 1|X = x_i, beta).

Return type

numpy.ndarray
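A sketch of the weight matrix described above, assuming a dense diagonal construction (the entries p(1 − p) are maximal, 0.25, at p = 0.5, where the classifier is least certain):

```python
import numpy as np

def weighted_matrix(probabilities):
    # Diagonal W with i-th entry p_i * (1 - p_i)
    return np.diag(probabilities * (1.0 - probabilities))

W = weighted_matrix(np.array([0.5, 0.9]))
print(np.diag(W))  # → [0.25 0.09]
```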

See also

Logistic.newton_raphson_update

Implement a single step of the Newton-Raphson algorithm.

K-Nearest Neighbor Classifier

This module builds a class for k-nearest neighbor classification.

class knn_classify.KNNClassify(features, output, split_proportion=0.75, number_labels=None, standardized=True, k=3, classify=True)[source]

A class used to represent a k-nearest neighbor classifier. We only list non-inherited attributes. We include regression functionality as well.

Parameters
  • features (numpy.ndarray) – Design matrix of explanatory variables.

  • output (numpy.ndarray) – Labels of data corresponding to feature matrix.

  • split_proportion (float) – Proportion of data to use for training; between 0 and 1.

  • number_labels (int) – The number of labels present in the data.

  • standardized (bool) – Whether to center/scale the data (train/test done separately). True by default.

  • k (int) – The number of neighbors to use in the algorithm.

  • classify (bool) – Whether we are using this class for classification or regression. True by default. We will use instances with classify == False for a KNNRegression class.

k

The number of neighbors to use in the algorithm.

Type

int

test_predictions

The labels predicted for the given test data (for classification).

Type

numpy.ndarray

test_accuracy

The accuracy of the classifier evaluated on test data (for classification).

Type

float

test_confusion

A confusion matrix of the classifier evaluated on test data (for classification).

Type

numpy.ndarray

test_predictions_reg

The predicted output on test data (for regression).

Type

numpy.ndarray

test_error

The test MSE of model fit using training data (for regression).

Type

float

See also

knnregression.KNNRegression

Class for a k-nearest neighbor regression model.

classify_point(current_location)[source]

Classify a new datapoint based on its k neighbors.

Parameters

current_location (numpy.ndarray) – Point we would like to classify, using its neighbors.

Returns

label_mode – The predicted label (mode of labels of the k-nearest neighbors).

Return type

int

Notes

In the case of ties, we choose the smallest label by default.
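One way to get this tie-breaking behavior with numpy (a sketch, not necessarily the module's implementation): np.bincount followed by np.argmax returns the smallest label among equally common ones, since argmax keeps the first maximum it encounters.

```python
import numpy as np

def label_mode(neighbor_labels):
    # bincount + argmax: ties broken toward the SMALLEST label
    counts = np.bincount(neighbor_labels)
    return int(np.argmax(counts))

print(label_mode(np.array([2, 1, 2])))     # → 2
print(label_mode(np.array([0, 1, 1, 0])))  # → 0 (tie broken toward the smaller label)
```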

See also

estimate_point

Find average output value among neighbors instead of most common label (for regression).

estimate_point(current_location)[source]

Estimate (for a regression context) a new datapoint based on its k neighbors.

Parameters

current_location (numpy.ndarray) – Point we would like to classify, using its neighbors.

Returns

output_estimate – The predicted output value of the current location.

Return type

int

See also

classify_point

Find most common label among neighbors instead of average output value (for classification).

k_neighbors_idx(current_location)[source]

Find row indices (in given data) of the k closest neighbors to a given data point.

Parameters

current_location (numpy.ndarray) – Point we would like to classify, using its neighbors.

Returns

k_nearest_idx – The k indices of the features observations closest to the current point.

Return type

numpy.ndarray

Notes

An efficient numpy procedure (using its broadcasting functionality) to compute all pairwise differences between two collections of data points is given in [1]. We use this as an alternative to a manual nested for-loop.

References

[1] https://sparrow.dev/pairwise-distance-in-numpy/
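The broadcasting trick referenced above can be sketched as follows: expand one collection to shape (n, 1, d) and the other to (1, m, d), so subtraction produces all pairwise differences at once (an illustration under assumed variable names, not the module's code):

```python
import numpy as np

train = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
query = np.array([[0.0, 0.0]])

diffs = train[:, None, :] - query[None, :, :]   # shape (3, 1, 2): all pairwise differences
dists = np.sqrt((diffs ** 2).sum(axis=-1))      # shape (3, 1): Euclidean distances

k = 2
k_nearest_idx = np.argsort(dists[:, 0])[:k]
print(k_nearest_idx)  # → [0 1]: the origin and (3, 4) are closest to the query
```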

predict_class()[source]

Classify many new datapoints based on their k neighbors.

Parameters

test_features (numpy.ndarray) – Points we would like to classify, using their neighbors.

Returns

test_labels – The predicted labels for each test datapoint.

Return type

numpy.ndarray

See also

predict_value

Predict output value instead of label (for regression).

predict_value(test_features, k)[source]

Predict output values for many new datapoints based on their k neighbors (for regression).

Parameters

test_features (numpy.ndarray) – Points we would like to classify, using their neighbors.

Returns

test_estimates – The predicted output for each test datapoint.

Return type

numpy.ndarray

See also

predict_class

Predict label instead of output value (for classification).

Quadratic Discriminant Analysis

This module builds a class for quadratic discriminant analysis classification.

class qda.QDA(features, output, split_proportion=0.75, number_labels=None, standardized=True)[source]

A class used to represent a quadratic discriminant analysis classifier. We only list non-inherited attributes.

Parameters
  • features (numpy.ndarray) – Design matrix of explanatory variables.

  • output (numpy.ndarray) – Labels of data corresponding to feature matrix.

  • split_proportion (float) – Proportion of data to use for training; between 0 and 1.

  • number_labels (int) – The number of labels present in the data.

  • standardized (bool) – Whether to center/scale the data (train/test done separately). True by default.

train_prediction

The labels predicted for the given training data (for classification).

Type

numpy.ndarray

test_predictions

The labels predicted for the given test data (for classification).

Type

numpy.ndarray

train_accuracy

The accuracy of the classifier evaluated on training data (for classification).

Type

float

test_accuracy

The accuracy of the classifier evaluated on test data (for classification).

Type

float

train_confusion

A confusion matrix of the classifier evaluated on training data (for classification).

Type

numpy.ndarray

test_confusion

A confusion matrix of the classifier evaluated on test data (for classification).

Type

numpy.ndarray

See also

lda.LDA

Use the more restrictive linear discriminant analysis

static class_covariance(features_subset)[source]

Calculate a covariance matrix for a mono-labeled feature array.

Parameters

features_subset (numpy.ndarray) – The design matrix of explanatory variables.

Returns

class_cov – The class-specific covariance matrix for QDA.

Return type

numpy.ndarray

discriminant(point, k)[source]

Evaluate the kth quadratic discriminant function at a point.

Parameters
  • point (numpy.ndarray) – The point to evaluate at

  • k (int) – The class label of interest

Returns

discrim – The value of the discriminant function at this point.

Return type

float
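For reference, the standard quadratic discriminant (as in Hastie et al.) is δ_k(x) = −½ log|Σ_k| − ½ (x − μ_k)ᵀ Σ_k⁻¹ (x − μ_k) + log π_k, and the predicted class is the one maximizing δ_k. A sketch assuming this is the form used here (the function name and parameters are illustrative):

```python
import numpy as np

def quadratic_discriminant(point, mean, cov, prior):
    # delta_k(x) = -1/2 log|Sigma_k| - 1/2 (x - mu_k)^T Sigma_k^{-1} (x - mu_k) + log pi_k
    diff = point - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(cov, diff) + np.log(prior)

# With identity covariances and equal priors, QDA reduces to nearest-mean.
mu0, mu1 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
cov = np.eye(2)
x = np.array([0.5, 0.5])
scores = [quadratic_discriminant(x, mu, cov, 0.5) for mu in (mu0, mu1)]
print(int(np.argmax(scores)))  # → 0: the point is nearer the class-0 mean
```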

predict_many(points)[source]

Predict the label of a matrix of test points given a trained model.

Parameters

points (numpy.ndarray) – The test datapoints we wish to classify.

Returns

label – The predicted classes of the points.

Return type

numpy.ndarray

predict_one(point)[source]

Predict the label of a test point given a trained model.

Parameters

point (numpy.ndarray) – The test datapoint we wish to classify.

Returns

label – The predicted class of the point.

Return type

int

static prior(output, k)[source]

Count the empirical proportion of labels of class k among output data.

Parameters
  • output (numpy.ndarray) – The labels corresponding to some dataset

  • k (int) – The class label we are interested in

Returns

proportion – The fraction of class k observations

Return type

float
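The empirical prior described above is simply the fraction of observations carrying label k; a one-line numpy sketch:

```python
import numpy as np

def prior(output, k):
    # Empirical proportion of class-k labels among the output data
    return float(np.mean(output == k))

print(prior(np.array([0, 1, 1, 2]), 1))  # → 0.5
```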

Linear Discriminant Analysis

This module builds a class for linear discriminant analysis classification.

class lda.LDA(features, output, split_proportion=0.75, number_labels=None, standardized=True)[source]

A class used to represent a linear discriminant analysis classifier. We only list non-inherited attributes.

Parameters
  • features (numpy.ndarray) – Design matrix of explanatory variables.

  • output (numpy.ndarray) – Labels of data corresponding to feature matrix.

  • split_proportion (float) – Proportion of data to use for training; between 0 and 1.

  • number_labels (int) – The number of labels present in the data.

  • standardized (bool) – Whether to center/scale the data (train/test done separately). True by default.

covariance_matrix

The pooled covariance matrix used in the discriminant function

Type

numpy.ndarray

train_prediction

The labels predicted for the given training data (for classification).

Type

numpy.ndarray

test_predictions

The labels predicted for the given test data (for classification).

Type

numpy.ndarray

train_accuracy

The accuracy of the classifier evaluated on training data (for classification).

Type

float

test_accuracy

The accuracy of the classifier evaluated on test data (for classification).

Type

float

train_confusion

A confusion matrix of the classifier evaluated on training data (for classification).

Type

numpy.ndarray

test_confusion

A confusion matrix of the classifier evaluated on test data (for classification).

Type

numpy.ndarray

See also

qda.QDA

Use the more flexible quadratic discriminant analysis

discriminant(point, k)[source]

Evaluate the kth linear discriminant function at a point.

Parameters
  • point (numpy.ndarray) – The point to evaluate at

  • k (int) – The class label of interest

Returns

discrim – The value of the discriminant function at this point.

Return type

float

static pooled_covariance(features, output, num_labels)[source]

Calculate the pooled covariance matrix (used for all classes).

Parameters
  • features (numpy.ndarray) – The design matrix of explanatory variables.

  • output (numpy.ndarray) – The output labels corresponding to features.

  • num_labels (int) – The number of labels present in the data

Returns

pooled_cov – The pooled covariance matrix for LDA.

Return type

numpy.ndarray
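A sketch of the pooled covariance, assuming the usual weighted average of per-class sample covariances with denominator n − K (i.e., Σ̂ = Σ_k (n_k − 1) S_k / (n − K)); the variable names are illustrative:

```python
import numpy as np

def pooled_covariance(features, output, num_labels):
    # Sum (n_k - 1) * S_k over classes, then divide by n - K
    n, d = features.shape
    total = np.zeros((d, d))
    for k in range(num_labels):
        subset = features[output == k]
        total += (len(subset) - 1) * np.cov(subset, rowvar=False)
    return total / (n - num_labels)

X = np.array([[0.0, 0.0], [2.0, 2.0], [5.0, 5.0], [7.0, 7.0]])
y = np.array([0, 0, 1, 1])
print(pooled_covariance(X, y, 2))  # each class contributes [[2, 2], [2, 2]]
```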

predict_one(point)[source]

Predict the label of a test point given a trained model.

Parameters

point (numpy.ndarray) – The test datapoint we wish to classify.

Returns

label – The predicted class of the point.

Return type

int