Classification Algorithms¶
This section includes:
Base Classifier (a barebones framework for classification)
The code is similar to the Regression code, except the output types are non-negative integers and we use different evaluation metrics. For example, we report a confusion matrix for each classifier, in addition to accuracy.
Base Classifier¶
This module builds a base class for classification problems, such as logistic regression or k-nearest neighbors classification. The preprocessing (if applicable) is done at this class level.
- class classification.Classification(features, output, split_proportion=0.75, number_labels=None, standardized=True)[source]¶
A class used to represent a classification algorithm.
- Parameters
features (numpy.ndarray) – Design matrix of explanatory variables.
output (numpy.ndarray) – Labels of data corresponding to feature matrix.
split_proportion (float) – Proportion of data to use for training; between 0 and 1.
number_labels (int) – The number of labels present in the data.
standardized (bool) – Whether to center/scale the data (train/test done separately). True by default.
- number_labels¶
The number of labels present in existing and future data.
- Type
int
- sample_size¶
The sample size of all given data (train and test).
- Type
int
- train_size¶
The sample size of the training data.
- Type
int
- test_size¶
The sample size of the test data.
- Type
int
- train_rows¶
The list of indices for the train split.
- Type
numpy.ndarray
- test_rows¶
The list of indices for the test split.
- Type
numpy.ndarray
- train_features¶
The train design matrix.
- Type
numpy.ndarray
- test_features¶
The test design matrix.
- Type
numpy.ndarray
- train_output¶
The train output data.
- Type
numpy.ndarray
- test_output¶
The test output data.
- Type
numpy.ndarray
- dimension¶
The number of dimensions of the data, or columns of design matrix. Does not include output.
- Type
int
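The attributes above describe a train/test split followed by optional standardization. A minimal sketch of that preprocessing follows, assuming a random split and reading "train/test done separately" as each split being scaled with its own column statistics. The helper name `split_and_standardize` and the `seed` argument are hypothetical and not part of the documented class.

```python
import numpy as np

def split_and_standardize(features, output, split_proportion=0.75, seed=0):
    """Hypothetical helper: random train/test split, then per-split scaling."""
    rng = np.random.default_rng(seed)
    sample_size = features.shape[0]
    train_size = int(split_proportion * sample_size)
    # Shuffle the row indices, then cut them into train and test parts.
    shuffled = rng.permutation(sample_size)
    train_rows, test_rows = shuffled[:train_size], shuffled[train_size:]

    def standardize(block):
        # Center and scale each column; guard against zero-variance columns.
        scale = block.std(axis=0)
        scale[scale == 0] = 1.0
        return (block - block.mean(axis=0)) / scale

    train_features = standardize(features[train_rows])
    test_features = standardize(features[test_rows])
    return train_features, test_features, output[train_rows], output[test_rows]

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=(40, 3))
y = rng.integers(0, 2, size=40)
X_train, X_test, y_train, y_test = split_and_standardize(X, y)
print(X_train.shape, X_test.shape)  # (30, 3) (10, 3)
```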
Logistic Classifier¶
This module builds a class for logistic regression problems. We compute the solution by directly maximizing the log-likelihood. To do so, we implement the Newton-Raphson method, following the treatment in [1].
References
[1] Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning. Springer New York Inc.
- class logisticreg.Logistic(features, output, split_proportion=0.75, threshold=0.5, number_labels=None, standardized=True, tolerance=0.001, max_steps=100)[source]¶
A class used to represent a logistic regression classifier. We only list non-inherited attributes. We use an optimizer from scipy to manually compute the maximum likelihood estimator.
- Parameters
features (numpy.ndarray) – Design matrix of explanatory variables.
output (numpy.ndarray) – Labels of data corresponding to feature matrix.
split_proportion (float) – Proportion of data to use for training; between 0 and 1.
threshold (float) – The minimum probability needed to classify a datapoint as a 1.
number_labels (int) – The number of labels present in the data.
standardized (bool) – Whether to center/scale the data (train/test done separately). True by default. This isn’t strictly necessary, but it may help the Newton-Raphson algorithm converge [2].
- tolerance¶
One of two possible stopping criteria for the Newton-Raphson algorithm. It sets the change in L_2 norm between successive coefficient vectors below which the algorithm stops. The algorithm terminates regardless once max_steps is reached.
- Type
float
- max_steps¶
The maximum number of steps permitted for the Newton-Raphson algorithm. The algorithm terminates earlier, however, if the tolerance is achieved.
- Type
int
- coefficients¶
The coefficients in the logistic regression model.
- Type
numpy.ndarray
- threshold¶
The minimum probability needed to classify a datapoint as a 1.
- Type
float
- train_probs¶
The predicted probability that each training observation has label 1.
- Type
numpy.ndarray
- train_predictions¶
The classified labels for the training data.
- Type
numpy.ndarray
- test_probs¶
The predicted probability that each test observation has label 1.
- Type
numpy.ndarray
- test_predictions¶
The labels predicted for the given test data.
- Type
numpy.ndarray
- train_accuracy¶
The accuracy of the classifier evaluated on training data.
- Type
float
- test_accuracy¶
The accuracy of the classifier evaluated on test data.
- Type
float
- train_confusion¶
A confusion matrix of the classifier evaluated on training data.
- Type
numpy.ndarray
- test_confusion¶
A confusion matrix of the classifier evaluated on test data.
- Type
numpy.ndarray
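The prediction, accuracy, and confusion attributes above can be illustrated with a small sketch. It assumes that "minimum probability" means a `>=` comparison against the threshold, and that the confusion matrix has true labels on rows and predicted labels on columns; neither convention is confirmed by the documentation. The helper `confusion_matrix` is hypothetical.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, number_labels):
    """Rows index the true label, columns the predicted label (assumed convention)."""
    matrix = np.zeros((number_labels, number_labels), dtype=int)
    # np.add.at accumulates correctly when the same (row, column) pair repeats.
    np.add.at(matrix, (true_labels, predicted_labels), 1)
    return matrix

probs = np.array([0.9, 0.2, 0.6, 0.4, 0.5])
truth = np.array([1, 0, 0, 1, 1])
threshold = 0.5

predictions = (probs >= threshold).astype(int)   # [1, 0, 1, 0, 1]
accuracy = np.mean(predictions == truth)         # 3 of 5 correct
confusion = confusion_matrix(truth, predictions, number_labels=2)
print(accuracy)
print(confusion)
```

The accuracy equals the trace of the confusion matrix divided by the number of observations, which gives a quick consistency check between the two metrics.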
References
- fit()[source]¶
Calculate the coefficient estimate on the training data.
- Returns
beta_new – The estimated coefficients for logistic regression.
- Return type
numpy.ndarray
- static newton_raphson_update(features, beta_old, output)[source]¶
Compute a single step of the Newton-Raphson algorithm to compute the coefficient maximizing the likelihood for the logistic model.
This algorithm is a root finder, i.e., finds a solution to f(x)=0 for some function f. In this case, we set f to be the derivative of the log likelihood (i.e., the score function), to find the MLE.
- Parameters
beta_old (numpy.ndarray) – The coefficients we would like to update
features (numpy.ndarray) – A design matrix of explanatory variables
output (numpy.ndarray) – The labels corresponding to our observed features
- Returns
beta_new – The updated coefficients.
- Return type
numpy.ndarray
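As an illustration of the update described above, here is a sketch of one Newton-Raphson step, followed by a loop that uses the two stopping criteria (tolerance and max_steps). It follows the standard form beta_new = beta_old + (X^T W X)^{-1} X^T (y - p). The function name `newton_step`, the simulated data, and the explicit intercept column are assumptions for the example, not the module's actual code.

```python
import numpy as np

def newton_step(features, beta_old, output):
    """One Newton-Raphson update for the logistic log-likelihood."""
    probs = 1.0 / (1.0 + np.exp(-features @ beta_old))
    weights = probs * (1.0 - probs)
    score = features.T @ (output - probs)                     # gradient of log-likelihood
    information = features.T @ (features * weights[:, None])  # X^T W X
    return beta_old + np.linalg.solve(information, score)

# Simulated data (an assumption for this example), with an intercept column.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
true_beta = np.array([0.5, -1.0, 2.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

beta, tolerance, max_steps = np.zeros(3), 1e-3, 100
for step in range(max_steps):
    beta_new = newton_step(X, beta, y)
    # Stop once successive coefficient vectors are close in L_2 norm.
    converged = np.linalg.norm(beta_new - beta) < tolerance
    beta = beta_new
    if converged:
        break
print(step + 1, beta)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse of X^T W X explicitly, which is usually the more stable choice.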
- static probability_estimate(features, coefficients)[source]¶
Compute P(y = 1|x, beta) = 1/(1+exp(-beta^Tx)). It can be used for both training and for prediction.
- Parameters
features (numpy.ndarray) – A design matrix of observations to condition on.
coefficients (numpy.ndarray) – A vector of possible coefficients
- Returns
estimate – The estimated probability of a label of 1 for each observation, conditional on the data and coefficients.
- Return type
float
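The formula P(y = 1|x, beta) = 1/(1+exp(-beta^Tx)) can be sketched in vectorized form as below. The body is an illustration of the formula applied to every row of a design matrix, not necessarily the module's implementation.

```python
import numpy as np

def probability_estimate(features, coefficients):
    """Return P(y = 1 | x, beta) for every row x of `features`."""
    linear_predictor = features @ coefficients
    return 1.0 / (1.0 + np.exp(-linear_predictor))

features = np.array([[1.0, 0.0], [1.0, 2.0], [1.0, -2.0]])
coefficients = np.array([0.0, 1.0])
probs = probability_estimate(features, coefficients)
print(probs)
```

A linear predictor of zero gives probability 0.5, and predictors of opposite sign give probabilities that sum to one.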
- static weighted_matrix(probabilities)[source]¶
Compute weight matrix for the weighted least squares problem used in the Newton-Raphson algorithm of solving logistic regression.
- Parameters
probabilities (numpy.ndarray) – An array of probabilities from which we compute the weight matrix.
- Returns
wt_matrix – Diagonal matrix whose ith diagonal entry is p_i(1 - p_i), where p_i = P(y = 1|X = x_i, beta).
- Return type
numpy.ndarray
See also
Logistic.newton_raphson_update – Implement a single step of the Newton-Raphson algorithm.
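A minimal sketch of the weight matrix follows, together with an equivalent way to form X^T W X without building the dense diagonal matrix. The row-scaling shortcut is a suggestion for illustration only, and the small arrays are made-up inputs.

```python
import numpy as np

def weighted_matrix(probabilities):
    """Diagonal matrix with entries p_i * (1 - p_i)."""
    return np.diag(probabilities * (1.0 - probabilities))

probabilities = np.array([0.5, 0.9, 0.2])
wt_matrix = weighted_matrix(probabilities)

X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
dense_product = X.T @ wt_matrix @ X
# Scaling the rows of X by the weights gives the same product
# without allocating an n-by-n matrix.
row_scaled_product = X.T @ (X * np.diag(wt_matrix)[:, None])
```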
K-Nearest Neighbor Classifier¶
This module builds a class for k-nearest neighbor classification.
- class knn_classify.KNNClassify(features, output, split_proportion=0.75, number_labels=None, standardized=True, k=3, classify=True)[source]¶
A class used to represent a k-nearest neighbor classifier. We only list non-inherited attributes. We include regression functionality as well.
- Parameters
features (numpy.ndarray) – Design matrix of explanatory variables.
output (numpy.ndarray) – Labels of data corresponding to feature matrix.
split_proportion (float) – Proportion of data to use for training; between 0 and 1.
number_labels (int) – The number of labels present in the data.
standardized (bool) – Whether to center/scale the data (train/test done separately). True by default.
k (int) – The number of neighbors to use in the algorithm.
classify (bool) – Whether we are using this class for classification or regression. True by default. We will use instances with classify == False for a KNNRegression class.
- k¶
The number of neighbors to use in the algorithm.
- Type
int
- test_predictions¶
The labels predicted for the given test data (for classification).
- Type
numpy.ndarray
- test_accuracy¶
The accuracy of the classifier evaluated on test data (for classification).
- Type
float
- test_confusion¶
A confusion matrix of the classifier evaluated on test data (for classification).
- Type
numpy.ndarray
- test_predictions_reg¶
The predicted output on test data (for regression).
- Type
numpy.ndarray
- test_error¶
The test MSE of model fit using training data (for regression).
- Type
float
See also
knnregression.KNNRegression – Class for a k-nearest neighbor regression model.
- classify_point(current_location)[source]¶
Classify a new datapoint based on its k neighbors.
- Parameters
current_location (numpy.ndarray) – Point we would like to classify, using its neighbors.
- Returns
label_mode – The predicted label (mode of labels of the k-nearest neighbors).
- Return type
int
Notes
If several labels are tied for most common, we choose the smallest label by default.
See also
estimate_point – Find average output value among neighbors instead of most common label (for regression).
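One way to compute the mode with the smallest-label tie-break is sketched below, assuming labels are non-negative integers. The helper name `label_mode` and the use of `np.bincount` are assumptions; the module may compute the mode differently.

```python
import numpy as np

def label_mode(neighbor_labels):
    """Most common label among the neighbors; ties go to the smallest label."""
    counts = np.bincount(neighbor_labels)
    # np.argmax returns the first maximum, i.e. the smallest label among ties.
    return int(np.argmax(counts))

tied = label_mode(np.array([2, 0, 2, 0, 1]))   # labels 0 and 2 tie, so 0 is chosen
clear = label_mode(np.array([1, 1, 2]))        # label 1 is the clear winner
print(tied, clear)  # 0 1
```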
- estimate_point(current_location)[source]¶
Estimate (for a regression context) a new datapoint based on its k neighbors.
- Parameters
current_location (numpy.ndarray) – Point whose output value we would like to estimate, using its neighbors.
- Returns
output_estimate – The predicted output value of the current location.
- Return type
float
See also
classify_point – Find most common label among neighbors instead of average output value (for classification).
- k_neighbors_idx(current_location)[source]¶
Find row indices (in given data) of the k closest neighbors to a given data point.
- Parameters
current_location (numpy.ndarray) – Point we would like to classify, using its neighbors.
- Returns
k_nearest_idx – The indices of the k feature observations closest to the current point.
- Return type
numpy.ndarray
Notes
An efficient numpy procedure (using its broadcasting functionality) to compute all pairwise differences between two collections of data points is given in [1]. We use this as an alternative to a manual nested for-loop.
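The broadcasting idea mentioned in the note can be sketched as follows. This is a standalone illustration: the function takes the training features and k as explicit arguments (unlike the documented method, which reads them from the instance), and it assumes squared Euclidean distance.

```python
import numpy as np

def k_neighbors_idx(train_features, current_location, k):
    """Indices of the k training rows closest to one point (hypothetical signature)."""
    differences = train_features - current_location   # broadcasts (n, d) - (d,)
    squared_distances = np.sum(differences ** 2, axis=1)
    return np.argsort(squared_distances, kind="stable")[:k]

train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]])
single = k_neighbors_idx(train, np.array([0.2, 0.1]), k=2)

# All pairwise differences at once: (m, 1, d) - (1, n, d) -> (m, n, d).
test = np.array([[0.2, 0.1], [4.0, 4.0]])
pairwise = test[:, None, :] - train[None, :, :]
all_distances = np.sum(pairwise ** 2, axis=2)          # shape (m, n)
many = np.argsort(all_distances, axis=1, kind="stable")[:, :2]
print(single, many)
```

The pairwise version allocates an array of size m × n × d, so for large datasets it trades memory for the speed gained by avoiding Python loops.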
References
- predict_class()[source]¶
Classify many new datapoints based on their k neighbors.
- Parameters
test_features (numpy.ndarray) – Points we would like to classify, using their neighbors.
- Returns
test_labels – The predicted labels for each test datapoint.
- Return type
numpy.ndarray
See also
predict_value – Predict output value instead of label (for regression).
- predict_value(test_features, k)[source]¶
Estimate output values for many new datapoints based on their k neighbors.
- Parameters
test_features (numpy.ndarray) – Points whose output we would like to estimate, using their neighbors.
k (int) – The number of neighbors to use in the algorithm.
- Returns
test_estimates – The predicted output for each test datapoint.
- Return type
numpy.ndarray
See also
predict_class – Predict label instead of output value (for classification).
Quadratic Discriminant Analysis¶
This module builds a class for quadratic discriminant analysis classification.
- class qda.QDA(features, output, split_proportion=0.75, number_labels=None, standardized=True)[source]¶
A class used to represent a quadratic discriminant analysis classifier. We only list non-inherited attributes.
- Parameters
features (numpy.ndarray) – Design matrix of explanatory variables.
output (numpy.ndarray) – Labels of data corresponding to feature matrix.
split_proportion (float) – Proportion of data to use for training; between 0 and 1.
number_labels (int) – The number of labels present in the data.
standardized (bool) – Whether to center/scale the data (train/test done separately). True by default.
- train_prediction¶
The labels predicted for the given training data (for classification).
- Type
numpy.ndarray
- test_predictions¶
The labels predicted for the given test data (for classification).
- Type
numpy.ndarray
- train_accuracy¶
The accuracy of the classifier evaluated on training data (for classification).
- Type
float
- test_accuracy¶
The accuracy of the classifier evaluated on test data (for classification).
- Type
float
- train_confusion¶
A confusion matrix of the classifier evaluated on training data (for classification).
- Type
numpy.ndarray
- test_confusion¶
A confusion matrix of the classifier evaluated on test data (for classification).
- Type
numpy.ndarray
See also
lda.LDA – Use the more restrictive linear discriminant analysis.
- static class_covariance(features_subset)[source]¶
Calculate a covariance matrix for a mono-labeled feature array.
- Parameters
features_subset (numpy.ndarray) – The design matrix of explanatory variables.
- Returns
class_cov – The class-specific covariance matrix for QDA.
- Return type
numpy.ndarray
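A class-specific covariance for rows that all share one label can be sketched with `np.cov`. The unbiased n - 1 denominator (numpy's default) is an assumption here, since the documentation does not say which normalization the module uses.

```python
import numpy as np

def class_covariance(features_subset):
    """Sample covariance of the columns, treating rows as observations."""
    return np.cov(features_subset, rowvar=False)

features_subset = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 4.0]])
class_cov = class_covariance(features_subset)

# Equivalent manual computation: centered cross-product over (n - 1).
centered = features_subset - features_subset.mean(axis=0)
manual = centered.T @ centered / (features_subset.shape[0] - 1)
```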
- discriminant(point, k)[source]¶
Evaluate the kth quadratic discriminant function at a point.
- Parameters
point (numpy.ndarray) – The point to evaluate at
k (int) – The class label of interest
- Returns
discrim – The value of the discriminant function at this point.
- Return type
float
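For reference, the standard quadratic discriminant function is delta_k(x) = -0.5 log|Sigma_k| - 0.5 (x - mu_k)^T Sigma_k^{-1} (x - mu_k) + log pi_k, and the predicted label is the k that maximizes it. The sketch below evaluates it as a standalone function. The explicit `class_mean`, `class_cov`, and `class_prior` arguments are hypothetical, since the documented method takes only the point and the label k.

```python
import numpy as np

def quadratic_discriminant(point, class_mean, class_cov, class_prior):
    """Evaluate the quadratic discriminant for one class at one point."""
    centered = point - class_mean
    _, log_det = np.linalg.slogdet(class_cov)          # stable log-determinant
    mahalanobis = centered @ np.linalg.solve(class_cov, centered)
    return -0.5 * log_det - 0.5 * mahalanobis + np.log(class_prior)

means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
priors = [0.5, 0.5]
point = np.array([0.5, 0.2])

scores = [quadratic_discriminant(point, m, c, p)
          for m, c, p in zip(means, covs, priors)]
label = int(np.argmax(scores))   # the point lies near the first mean
print(label)  # 0
```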
- predict_many(points)[source]¶
Predict the label of a matrix of test points given a trained model.
- Parameters
points (numpy.ndarray) – The test datapoints we wish to classify.
- Returns
label – The predicted classes of the points.
- Return type
numpy.ndarray
- predict_one(point)[source]¶
Predict the label of a test point given a trained model.
- Parameters
point (numpy.ndarray) – The test datapoint we wish to classify.
- Returns
label – The predicted class of the point.
- Return type
int
- static prior(output, k)[source]¶
Count the empirical proportion of labels of class k among output data.
- Parameters
output (numpy.ndarray) – The labels corresponding to some dataset
k (int) – The class label we are interested in
- Returns
proportion – The fraction of class k observations
- Return type
float
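The empirical proportion is the mean of an indicator array, as this short sketch shows. The example labels are made up for illustration.

```python
import numpy as np

def prior(output, k):
    """Fraction of observations in `output` whose label equals k."""
    return float(np.mean(output == k))

output = np.array([0, 1, 1, 2, 1])
proportion = prior(output, 1)
print(proportion)  # 0.6
```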
Linear Discriminant Analysis¶
This module builds a class for linear discriminant analysis classification.
- class lda.LDA(features, output, split_proportion=0.75, number_labels=None, standardized=True)[source]¶
A class used to represent a linear discriminant analysis classifier. We only list non-inherited attributes.
- Parameters
features (numpy.ndarray) – Design matrix of explanatory variables.
output (numpy.ndarray) – Labels of data corresponding to feature matrix.
split_proportion (float) – Proportion of data to use for training; between 0 and 1.
number_labels (int) – The number of labels present in the data.
standardized (bool) – Whether to center/scale the data (train/test done separately). True by default.
- covariance_matrix¶
The pooled covariance matrix used in the discriminant function.
- Type
numpy.ndarray
- train_prediction¶
The labels predicted for the given training data (for classification).
- Type
numpy.ndarray
- test_predictions¶
The labels predicted for the given test data (for classification).
- Type
numpy.ndarray
- train_accuracy¶
The accuracy of the classifier evaluated on training data (for classification).
- Type
float
- test_accuracy¶
The accuracy of the classifier evaluated on test data (for classification).
- Type
float
- train_confusion¶
A confusion matrix of the classifier evaluated on training data (for classification).
- Type
numpy.ndarray
- test_confusion¶
A confusion matrix of the classifier evaluated on test data (for classification).
- Type
numpy.ndarray
See also
qda.QDA – Use the more flexible quadratic discriminant analysis.
- discriminant(point, k)[source]¶
Evaluate the kth linear discriminant function at a point.
- Parameters
point (numpy.ndarray) – The point to evaluate at
k (int) – The class label of interest
- Returns
discrim – The value of the discriminant function at this point.
- Return type
float
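The standard linear discriminant function is delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k, with a single covariance matrix Sigma shared by all classes. A standalone sketch follows. The explicit `class_mean`, `pooled_cov`, and `class_prior` arguments are hypothetical, since the documented method takes only the point and the label k.

```python
import numpy as np

def linear_discriminant(point, class_mean, pooled_cov, class_prior):
    """Evaluate the linear discriminant for one class at one point."""
    solved_mean = np.linalg.solve(pooled_cov, class_mean)   # Sigma^{-1} mu_k
    return point @ solved_mean - 0.5 * class_mean @ solved_mean + np.log(class_prior)

means = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
pooled_cov = np.eye(2)
priors = [0.5, 0.5]

def predict_one(point):
    scores = [linear_discriminant(point, m, pooled_cov, p)
              for m, p in zip(means, priors)]
    return int(np.argmax(scores))

left = predict_one(np.array([0.5, 0.0]))    # closer to the first mean
right = predict_one(np.array([1.5, 0.0]))   # closer to the second mean
print(left, right)  # 0 1
```

With equal priors and an identity covariance, the decision boundary is the perpendicular bisector between the two means, which is why the two example points fall on opposite sides.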
- static pooled_covariance(features, output, num_labels)[source]¶
Calculate the pooled covariance matrix (used for all classes).
- Parameters
features (numpy.ndarray) – The design matrix of explanatory variables.
output (numpy.ndarray) – The output labels corresponding to features.
num_labels (int) – The number of labels present in the data.
- Returns
pooled_cov – The pooled covariance matrix for LDA.
- Return type
numpy.ndarray
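A sketch of the pooled covariance follows, using the common estimator that sums the within-class scatter matrices and divides by N - K (sample size minus number of classes). Both the N - K denominator and the assumption that labels are the integers 0, ..., num_labels - 1 are choices made for this example, not confirmed details of the module.

```python
import numpy as np

def pooled_covariance(features, output, num_labels):
    """Within-class scatter summed over classes, divided by N - K."""
    sample_size, dimension = features.shape
    scatter = np.zeros((dimension, dimension))
    for label in range(num_labels):
        rows = features[output == label]
        centered = rows - rows.mean(axis=0)   # center around the class mean
        scatter += centered.T @ centered
    return scatter / (sample_size - num_labels)

rng = np.random.default_rng(0)
features = rng.normal(size=(12, 2))
output = np.array([0] * 5 + [1] * 7)
pooled_cov = pooled_covariance(features, output, num_labels=2)

# Cross-check: weighted average of the per-class sample covariances.
cov0 = np.cov(features[output == 0], rowvar=False)
cov1 = np.cov(features[output == 1], rowvar=False)
expected = (4 * cov0 + 6 * cov1) / 10
```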