Classification Algorithms

This section covers a base classifier and four algorithms: logistic regression, k-nearest neighbors, quadratic discriminant analysis, and linear discriminant analysis.

The code is similar to the Regression code, except the output types are non-negative integers and we use different evaluation metrics. For example, we report a confusion matrix for each classifier, in addition to accuracy.

Base Classifier

This module builds a base class for classification problems, such as logistic regression or k-nearest neighbors classification. The preprocessing (if applicable) is done at this class level.

class classification.Classification(features, output, split_proportion=0.75, number_labels=None, standardized=True)[source]

A class used to represent a classification algorithm.

Parameters
  • features (numpy.ndarray) – Design matrix of explanatory variables.

  • output (numpy.ndarray) – Labels of data corresponding to feature matrix.

  • split_proportion (float) – Proportion of data to use for training; between 0 and 1.

  • number_labels (int) – The number of labels present in the data.

  • standardized (bool) – Whether to center/scale the data (train/test done separately). True by default.

number_labels

The number of labels present in existing and future data.

Type

int

sample_size

The sample size of all given data (train and test).

Type

int

train_size

The sample size of the training data.

Type

int

test_size

The sample size of the test data.

Type

int

train_rows

The list of indices for the train split.

Type

numpy.ndarray

test_rows

The list of indices for the test split.

Type

numpy.ndarray

train_features

The train design matrix.

Type

numpy.ndarray

test_features

The test design matrix.

Type

numpy.ndarray

train_output

The train output data.

Type

numpy.ndarray

test_output

The test output data.

Type

numpy.ndarray

dimension

The number of dimensions of the data, i.e., the number of columns of the design matrix. Does not include the output.

Type

int

standardize()[source]

Separately scale/center the train and test data so each feature (column of observations) has 0 mean and unit variance.
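A minimal sketch of this per-split standardization, assuming numpy and that train and test statistics are fit independently (the helper name and guard for constant columns are illustrative, not the module's internals):

```python
import numpy as np

def standardize_split(train_features, test_features):
    """Center/scale each feature column, fitting statistics separately per split."""
    def scale(matrix):
        mean = matrix.mean(axis=0)
        std = matrix.std(axis=0)
        std[std == 0] = 1.0  # guard against constant columns
        return (matrix - mean) / std
    return scale(train_features), scale(test_features)

train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
scaled_train, _ = standardize_split(train, train.copy())
print(scaled_train.mean(axis=0))  # → approximately [0. 0.]
```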

Logistic Classifier

This module builds a class for logistic regression problems. We compute the solution by directly maximizing the log-likelihood. To do so, we implement the Newton-Raphson method, following the treatment in [1].

References

[1] Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning. Springer New York Inc.

class logisticreg.Logistic(features, output, split_proportion=0.75, threshold=0.5, number_labels=None, standardized=True, tolerance=0.001, max_steps=100)[source]

A class used to represent a logistic regression classifier. We only list non-inherited attributes. We compute the maximum likelihood estimator manually via the Newton-Raphson method rather than using an off-the-shelf optimizer.

Parameters
  • features (numpy.ndarray) – Design matrix of explanatory variables.

  • output (numpy.ndarray) – Labels of data corresponding to feature matrix.

  • split_proportion (float) – Proportion of data to use for training; between 0 and 1.

  • threshold (float) – The minimum probability needed to classify a datapoint as a 1.

  • number_labels (int) – The number of labels present in the data.

  • standardized (bool) – Whether to center/scale the data (train/test done separately). True by default. This isn’t strictly necessary, but it may help the Newton-Raphson algorithm converge [2].

tolerance

One of two stopping criteria for the Newton-Raphson algorithm: iteration stops once the L_2 norm of the change between successive coefficient vectors falls below this value. Regardless, the algorithm terminates once max_steps is reached.

Type

float

max_steps

The maximum number of steps permitted for the Newton-Raphson algorithm. The algorithm terminates earlier, however, if the tolerance is achieved.

Type

int

coefficients

The coefficients in the logistic regression model.

Type

numpy.ndarray

threshold

The minimum probability needed to classify a datapoint as a 1.

Type

float

train_probs

The predicted probabilities each training observation has label 1.

Type

numpy.ndarray

train_predictions

The classified labels for the training data.

Type

numpy.ndarray

test_probs

The predicted probabilities each test observation has label 1.

Type

numpy.ndarray

test_predictions

The labels predicted for the given test data.

Type

numpy.ndarray

train_accuracy

The accuracy of the classifier evaluated on training data.

Type

float

test_accuracy

The accuracy of the classifier evaluated on test data.

Type

float

train_confusion

A confusion matrix of the classifier evaluated on training data.

Type

numpy.ndarray

test_confusion

A confusion matrix of the classifier evaluated on test data.

Type

numpy.ndarray

References

[2] https://stats.stackexchange.com/a/113027s

fit()[source]

Calculate the coefficient estimate on the training data.

Returns

beta_new – The estimated coefficients for logistic regression.

Return type

numpy.ndarray

static newton_raphson_update(features, beta_old, output)[source]

Compute a single step of the Newton-Raphson algorithm to compute the coefficient maximizing the likelihood for the logistic model.

This algorithm is a root finder, i.e., finds a solution to f(x)=0 for some function f. In this case, we set f to be the derivative of the log likelihood (i.e., the score function), to find the MLE.

Parameters
  • beta_old (numpy.ndarray) – The coefficients we would like to update

  • features (numpy.ndarray) – A design matrix of explanatory variables

  • output (numpy.ndarray) – The labels corresponding to our observed features

Returns

beta_new – The updated coefficients.

Return type

numpy.ndarray
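The update above, together with the tolerance and max_steps stopping rules, can be sketched end to end on synthetic data. This is a self-contained illustration of the standard Newton-Raphson step for logistic regression, beta_new = beta_old + (XᵀWX)⁻¹ Xᵀ(y − p), not the module's actual code; the variable names and toy data are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_raphson_update(features, beta_old, output):
    probs = sigmoid(features @ beta_old)      # p_i = P(y_i = 1 | x_i, beta_old)
    weights = np.diag(probs * (1 - probs))    # W in the IRLS formulation
    hessian = features.T @ weights @ features
    score = features.T @ (output - probs)     # gradient of the log-likelihood
    return beta_old + np.linalg.solve(hessian, score)

# Toy data: label is 1 with probability sigmoid(-0.5 + 2x).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
true_beta = np.array([-0.5, 2.0])
y = (rng.uniform(size=200) < sigmoid(X @ true_beta)).astype(float)

beta = np.zeros(2)
for _ in range(100):                                   # max_steps
    beta_next = newton_raphson_update(X, beta, y)
    if np.linalg.norm(beta_next - beta) < 1e-3:        # tolerance
        beta = beta_next
        break
    beta = beta_next
print(beta)  # estimate should recover a clearly positive slope
```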

static probability_estimate(features, coefficients)[source]

Compute P(y = 1|x, beta) = 1/(1+exp(-beta^Tx)). It can be used both for training and for prediction.

Parameters
  • features (numpy.ndarray) – A design matrix of observations to condition on.

  • coefficients (numpy.ndarray) – A vector of possible coefficients

Returns

estimate – The estimated probability of a label of 1 for each observation, conditional on the data and coefficients.

Return type

float
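The formula above is the standard logistic sigmoid applied row-wise to the design matrix; a minimal sketch (the function name mirrors the documented method, but the vectorized form is an assumption):

```python
import numpy as np

def probability_estimate(features, coefficients):
    # P(y = 1 | x, beta) = 1 / (1 + exp(-beta^T x)), applied row-wise
    return 1.0 / (1.0 + np.exp(-features @ coefficients))

X = np.array([[0.0, 0.0], [10.0, 10.0]])
beta = np.array([1.0, 1.0])
probs = probability_estimate(X, beta)
print(probs)  # → [0.5, ~1.0]: beta^T x = 0 gives probability 1/2
```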

static weighted_matrix(probabilities)[source]

Compute weight matrix for the weighted least squares problem used in the Newton-Raphson algorithm of solving logistic regression.

Parameters

probabilities (numpy.ndarray) – An array of probabilities from which we compute the weight matrix.

Returns

wt_matrix – Diagonal matrix with ith entries p(1-p), where p = P(y = 1|X = x_i, beta).

Return type

numpy.ndarray
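A sketch of the weight matrix described above, assuming a dense diagonal construction (the entries p(1 − p) are maximal, 0.25, at p = 0.5, where the classifier is least certain):

```python
import numpy as np

def weighted_matrix(probabilities):
    # Diagonal W with i-th entry p_i * (1 - p_i)
    return np.diag(probabilities * (1.0 - probabilities))

W = weighted_matrix(np.array([0.5, 0.9]))
print(np.diag(W))  # → [0.25 0.09]
```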

See also

Logistic.newton_raphson_update

Implement a single step of the Newton-Raphson algorithm.

K-Nearest Neighbor Classifier

This module builds a class for k-nearest neighbor classification.

class knn_classify.KNNClassify(features, output, split_proportion=0.75, number_labels=None, standardized=True, k=3, classify=True)[source]

A class used to represent a k-nearest neighbor classifier. We only list non-inherited attributes. We include regression functionality as well.

Parameters
  • features (numpy.ndarray) – Design matrix of explanatory variables.

  • output (numpy.ndarray) – Labels of data corresponding to feature matrix.

  • split_proportion (float) – Proportion of data to use for training; between 0 and 1.

  • number_labels (int) – The number of labels present in the data.

  • standardized (bool) – Whether to center/scale the data (train/test done separately). True by default.

  • k (int) – The number of neighbors to use in the algorithm.

  • classify (bool) – Whether we are using this class for classification or regression. True by default. We will use instances with classify == False for a KNNRegression class.

k

The number of neighbors to use in the algorithm.

Type

int

test_predictions

The labels predicted for the given test data (for classification).

Type

numpy.ndarray

test_accuracy

The accuracy of the classifier evaluated on test data (for classification).

Type

float

test_confusion

A confusion matrix of the classifier evaluated on test data (for classification).

Type

numpy.ndarray

test_predictions_reg

The predicted output on test data (for regression).

Type

numpy.ndarray

test_error

The test MSE of model fit using training data (for regression).

Type

float

See also

knnregression.KNNRegression

Class for a k-nearest neighbor regression model.

classify_point(current_location)[source]

Classify a new datapoint based on its k neighbors.

Parameters

current_location (numpy.ndarray) – Point we would like to classify, using its neighbors.

Returns

label_mode – The predicted label (mode of labels of the k-nearest neighbors).

Return type

int

Notes

In the case of ties, we choose the smallest label by default.
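One way to get this tie-breaking behavior with numpy (a sketch, not necessarily the module's implementation): np.bincount followed by np.argmax returns the smallest label among equally common ones, since argmax keeps the first maximum it encounters.

```python
import numpy as np

def label_mode(neighbor_labels):
    # bincount + argmax: ties broken toward the SMALLEST label
    counts = np.bincount(neighbor_labels)
    return int(np.argmax(counts))

print(label_mode(np.array([2, 1, 2])))     # → 2
print(label_mode(np.array([0, 1, 1, 0])))  # → 0 (tie broken toward the smaller label)
```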

See also

estimate_point

Find average output value among neighbors instead of most common label (for regression).

estimate_point(current_location)[source]

Estimate (for a regression context) a new datapoint based on its k neighbors.

Parameters

current_location (numpy.ndarray) – Point we would like to classify, using its neighbors.

Returns

output_estimate – The predicted output value of the current location.

Return type

int

See also

classify_point

Find most common label among neighbors instead of average output value (for classification).

k_neighbors_idx(current_location)[source]

Find row indices (in given data) of the k closest neighbors to a given data point.

Parameters

current_location (numpy.ndarray) – Point we would like to classify, using its neighbors.

Returns

k_nearest_idx – The k indices of the features observations closest to the current point.

Return type

numpy.ndarray

Notes

An efficient numpy procedure (using its broadcasting functionality) to compute all pairwise differences between two collections of data points is given in [1]. We use this as an alternative to a manual nested for-loop.

References

[1] https://sparrow.dev/pairwise-distance-in-numpy/
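The broadcasting trick referenced above can be sketched as follows: expand one collection to shape (n, 1, d) and the other to (1, m, d), so subtraction produces all pairwise differences at once (an illustration under assumed variable names, not the module's code):

```python
import numpy as np

train = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
query = np.array([[0.0, 0.0]])

diffs = train[:, None, :] - query[None, :, :]   # shape (3, 1, 2): all pairwise differences
dists = np.sqrt((diffs ** 2).sum(axis=-1))      # shape (3, 1): Euclidean distances

k = 2
k_nearest_idx = np.argsort(dists[:, 0])[:k]
print(k_nearest_idx)  # → [0 1]: the origin and (3, 4) are closest to the query
```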

predict_class()[source]

Classify many new datapoints based on their k neighbors.

Parameters

test_features (numpy.ndarray) – Points we would like to classify, using their neighbors.

Returns

test_labels – The predicted labels for each test datapoint.

Return type

numpy.ndarray

See also

predict_value

Predict output value instead of label (for regression).

predict_value(test_features, k)[source]

Predict output values for many new datapoints based on their k neighbors (for regression).

Parameters

test_features (numpy.ndarray) – Points we would like to classify, using their neighbors.

Returns

test_estimates – The predicted output for each test datapoint.

Return type

numpy.ndarray

See also

predict_class

Predict label instead of output value (for classification).

Quadratic Discriminant Analysis

This module builds a class for quadratic discriminant analysis classification.

class qda.QDA(features, output, split_proportion=0.75, number_labels=None, standardized=True)[source]

A class used to represent a quadratic discriminant analysis classifier. We only list non-inherited attributes.

Parameters
  • features (numpy.ndarray) – Design matrix of explanatory variables.

  • output (numpy.ndarray) – Labels of data corresponding to feature matrix.

  • split_proportion (float) – Proportion of data to use for training; between 0 and 1.

  • number_labels (int) – The number of labels present in the data.

  • standardized (bool) – Whether to center/scale the data (train/test done separately). True by default.

train_prediction

The labels predicted for the given training data (for classification).

Type

numpy.ndarray

test_predictions

The labels predicted for the given test data (for classification).

Type

numpy.ndarray

train_accuracy

The accuracy of the classifier evaluated on training data (for classification).

Type

float

test_accuracy

The accuracy of the classifier evaluated on test data (for classification).

Type

float

train_confusion

A confusion matrix of the classifier evaluated on training data (for classification).

Type

numpy.ndarray

test_confusion

A confusion matrix of the classifier evaluated on test data (for classification).

Type

numpy.ndarray

See also

lda.LDA

Use the more restrictive linear discriminant analysis

static class_covariance(features_subset)[source]

Calculate a covariance matrix for a mono-labeled feature array.

Parameters

features_subset (numpy.ndarray) – The design matrix of explanatory variables.

Returns

class_cov – The class-specific covariance matrix for QDA.

Return type

numpy.ndarray

discriminant(point, k)[source]

Evaluate the kth quadratic discriminant function at a point.

Parameters
  • point (numpy.ndarray) – The point to evaluate at

  • k (int) – The class label of interest

Returns

discrim – The value of the discriminant function at this point.

Return type

float
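For reference, the standard quadratic discriminant (as in Hastie et al.) is δ_k(x) = −½ log|Σ_k| − ½ (x − μ_k)ᵀ Σ_k⁻¹ (x − μ_k) + log π_k, and the predicted class is the one maximizing δ_k. A sketch assuming this is the form used here (the function name and parameters are illustrative):

```python
import numpy as np

def quadratic_discriminant(point, mean, cov, prior):
    # delta_k(x) = -1/2 log|Sigma_k| - 1/2 (x - mu_k)^T Sigma_k^{-1} (x - mu_k) + log pi_k
    diff = point - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(cov, diff) + np.log(prior)

# With identity covariances and equal priors, QDA reduces to nearest-mean.
mu0, mu1 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
cov = np.eye(2)
x = np.array([0.5, 0.5])
scores = [quadratic_discriminant(x, mu, cov, 0.5) for mu in (mu0, mu1)]
print(int(np.argmax(scores)))  # → 0: the point is nearer the class-0 mean
```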

predict_many(points)[source]

Predict the label of a matrix of test points given a trained model.

Parameters

points (numpy.ndarray) – The test datapoints we wish to classify.

Returns

label – The predicted classes of the points.

Return type

numpy.ndarray

predict_one(point)[source]

Predict the label of a test point given a trained model.

Parameters

point (numpy.ndarray) – The test datapoint we wish to classify.

Returns

label – The predicted class of the point.

Return type

int

static prior(output, k)[source]

Count the empirical proportion of labels of class k among output data.

Parameters
  • output (numpy.ndarray) – The labels corresponding to some dataset

  • k (int) – The class label we are interested in

Returns

proportion – The fraction of class k observations

Return type

float
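The empirical prior described above is simply the fraction of observations carrying label k; a one-line numpy sketch:

```python
import numpy as np

def prior(output, k):
    # Empirical proportion of class-k labels among the output data
    return float(np.mean(output == k))

print(prior(np.array([0, 1, 1, 2]), 1))  # → 0.5
```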

Linear Discriminant Analysis

This module builds a class for linear discriminant analysis classification.

class lda.LDA(features, output, split_proportion=0.75, number_labels=None, standardized=True)[source]

A class used to represent a linear discriminant analysis classifier. We only list non-inherited attributes.

Parameters
  • features (numpy.ndarray) – Design matrix of explanatory variables.

  • output (numpy.ndarray) – Labels of data corresponding to feature matrix.

  • split_proportion (float) – Proportion of data to use for training; between 0 and 1.

  • number_labels (int) – The number of labels present in the data.

  • standardized (bool) – Whether to center/scale the data (train/test done separately). True by default.

covariance_matrix

The pooled covariance matrix used in the discriminant function

Type

numpy.ndarray

train_prediction

The labels predicted for the given training data (for classification).

Type

numpy.ndarray

test_predictions

The labels predicted for the given test data (for classification).

Type

numpy.ndarray

train_accuracy

The accuracy of the classifier evaluated on training data (for classification).

Type

float

test_accuracy

The accuracy of the classifier evaluated on test data (for classification).

Type

float

train_confusion

A confusion matrix of the classifier evaluated on training data (for classification).

Type

numpy.ndarray

test_confusion

A confusion matrix of the classifier evaluated on test data (for classification).

Type

numpy.ndarray

See also

qda.QDA

Use the more flexible quadratic discriminant analysis

discriminant(point, k)[source]

Evaluate the kth linear discriminant function at a point.

Parameters
  • point (numpy.ndarray) – The point to evaluate at

  • k (int) – The class label of interest

Returns

discrim – The value of the discriminant function at this point.

Return type

float

static pooled_covariance(features, output, num_labels)[source]

Calculate the pooled covariance matrix (used for all classes).

Parameters
  • features (numpy.ndarray) – The design matrix of explanatory variables.

  • output (numpy.ndarray) – The output labels corresponding to features.

  • num_labels (int) – The number of labels present in the data

Returns

pooled_cov – The pooled covariance matrix for LDA.

Return type

numpy.ndarray
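A sketch of the pooled covariance, assuming the usual weighted average of per-class sample covariances with denominator n − K (i.e., Σ̂ = Σ_k (n_k − 1) S_k / (n − K)); the variable names are illustrative:

```python
import numpy as np

def pooled_covariance(features, output, num_labels):
    # Sum (n_k - 1) * S_k over classes, then divide by n - K
    n, d = features.shape
    total = np.zeros((d, d))
    for k in range(num_labels):
        subset = features[output == k]
        total += (len(subset) - 1) * np.cov(subset, rowvar=False)
    return total / (n - num_labels)

X = np.array([[0.0, 0.0], [2.0, 2.0], [5.0, 5.0], [7.0, 7.0]])
y = np.array([0, 0, 1, 1])
print(pooled_covariance(X, y, 2))  # each class contributes [[2, 2], [2, 2]]
```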

predict_one(point)[source]

Predict the label of a test point given a trained model.

Parameters

point (numpy.ndarray) – The test datapoint we wish to classify.

Returns

label – The predicted class of the point.

Return type

int