Regression Implementation¶

This section includes:

Base Regression (a barebones framework for regression)
Linear Regression
K-Nearest Neighbor Regression
Poisson Regression

The code is similar to the Classification code, except the output types are arbitrary real numbers, and we use different evaluation metrics.

Base Regression¶

This module builds a base class for regression problems, such as least squares or k-nearest neighbor regression. Any preprocessing (if applicable) is done at this class level.

class regression.Regression(features, output, split_proportion=0.75, standardized=True)[source]¶

A class used to represent a regression algorithm.

Parameters

features (numpy.ndarray) – Design matrix of explanatory variables.
output (numpy.ndarray) – Labels of data corresponding to feature matrix.
split_proportion (float) – Proportion of data to use for training; between 0 and 1.
standardized (bool) – Whether to center/scale the data (train/test done separately). True by default.

sample_size¶

The sample size of all given data (train and test).

Type: int

train_size¶

The sample size of the training data.

Type: int

test_size¶

The sample size of the test data.

Type: int

train_rows¶

The list of indices for the train split.

Type: numpy.ndarray

test_rows¶

The list of indices for the test split.

Type: numpy.ndarray

train_features¶

The train design matrix.

Type: numpy.ndarray

test_features¶

The test design matrix.

Type: numpy.ndarray

train_output¶

The train output data.

Type: numpy.ndarray

test_output¶

The test output data.

Type: numpy.ndarray

dimension¶

The number of dimensions of the data, or columns of design matrix. Does not include output.

Type: int

standardize()[source]¶: Separately scale/center the train and test data so each feature (column of observations) has 0 mean and unit variance.
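
The split and standardization steps described above can be sketched as follows. This is a minimal illustration, not the class's actual code: the helper names `split_rows` and `standardize_columns`, the random shuffling, and the `seed` argument are assumptions made for the example.

```python
import numpy as np


def split_rows(sample_size, split_proportion=0.75, seed=0):
    """Randomly partition row indices into train and test sets (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    permuted = rng.permutation(sample_size)
    train_size = int(round(split_proportion * sample_size))
    return permuted[:train_size], permuted[train_size:]


def standardize_columns(matrix):
    """Center each column to mean 0 and scale it to unit variance (hypothetical helper)."""
    mean = matrix.mean(axis=0)
    std = matrix.std(axis=0)
    return (matrix - mean) / std


# Synthetic design matrix: 20 observations, 3 features.
features = np.random.default_rng(1).normal(loc=5.0, scale=2.0, size=(20, 3))

train_rows, test_rows = split_rows(features.shape[0], split_proportion=0.75)

# Train and test are standardized separately, each with its own mean and variance.
train_features = standardize_columns(features[train_rows])
test_features = standardize_columns(features[test_rows])
```

Standardizing the two splits separately, as the documentation states, means the test data is scaled with its own statistics rather than those of the training data.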

Linear Regression¶

This module is for performing linear regression.

class linearreg.Linear(features, output, split_proportion=0.75, standardized=True)[source]¶

A class used to represent a linear regression model.

Parameters

features (numpy.ndarray) – Design matrix of explanatory variables, not including a column of 1s for the intercept.
output (numpy.ndarray) – Labels of data corresponding to feature matrix.
split_proportion (float) – Proportion of data to use for training; between 0 and 1.
standardized (bool) – Whether to center/scale the data (train/test done separately). True by default.

Caution

Do not include an all-ones column in features: a constant column has zero variance, so standardization will result in a singular matrix.

coefficients¶

The coefficients in the linear regression model. The first coefficient is the intercept.

Type: numpy.ndarray

train_predictions¶

The predicted output values for the training data.

Type: numpy.ndarray

test_predictions¶

The predicted output values for the test data.

Type: numpy.ndarray

train_error¶

The error of model on training data (default is MSE).

Type: float

test_error¶

The error of model on test data (default is MSE).

Type: float

fit()[source]¶

Calculate the coefficients that solve the least squares problem on the training data.

Returns: coefficients – Vector of coefficients of length self.dimension. The first element is the intercept term.
Return type: numpy.ndarray
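
A least squares fit with an intercept can be sketched as below. This is an illustration under assumptions, not the class's implementation: the function name `fit_least_squares` is hypothetical, and solving with `numpy.linalg.lstsq` after prepending a column of ones is one common choice.

```python
import numpy as np


def fit_least_squares(features, output):
    """Return [intercept, slopes...] minimizing ||output - design @ beta||^2."""
    # Prepend the intercept column here, so the caller passes features without it.
    design = np.column_stack([np.ones(features.shape[0]), features])
    coefficients, *_ = np.linalg.lstsq(design, output, rcond=None)
    return coefficients


# Noiseless synthetic data: y = 1.5 + 2*x1 - 3*x2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.5 + X @ np.array([2.0, -3.0])

coefficients = fit_least_squares(X, y)
```

Because the data is noiseless, the recovered coefficients match the generating values up to floating-point error, with the intercept first.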

static predict(features, coefficients)[source]¶

Compute estimated output y = X*beta_hat of linear regression.

Parameters

features (numpy.ndarray) – Design matrix of explanatory variables.
coefficients (numpy.ndarray) – Vector of coefficients for least squares solution.

Returns

prediction – Predicted output for each observation.

Return type

numpy.ndarray
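
The prediction step and the default error metric (MSE) can be sketched as follows. Whether the real `predict` expects the intercept column to be present already is not stated here; this sketch assumes it is not and adds it. The helper `mean_squared_error` is a hypothetical name.

```python
import numpy as np


def predict(features, coefficients):
    """Compute y_hat = X @ beta_hat, with the intercept as the first coefficient."""
    # Assumption: features has no column of ones, so one is added here.
    design = np.column_stack([np.ones(features.shape[0]), features])
    return design @ coefficients


def mean_squared_error(actual, predicted):
    """Average squared difference between observed and predicted outputs."""
    return float(np.mean((actual - predicted) ** 2))


X = np.array([[0.0], [1.0], [2.0]])
beta_hat = np.array([1.0, 2.0])   # intercept 1, slope 2
y = np.array([1.0, 3.0, 6.0])

prediction = predict(X, beta_hat)           # 1 + 2*x for each row
error = mean_squared_error(y, prediction)   # only the last point is off, by 1
```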

K-Nearest Neighbor Regression¶

This module builds a class for k-nearest neighbor regression.

class knnreg.KNNRegression(features, output, split_proportion=0.75, standardized=True, k=3, classify=False)[source]¶

A class used to represent a k-nearest neighbor regressor. The regression methods and attributes can be found in the KNNClassify class.

Parameters

features (numpy.ndarray) – Design matrix of explanatory variables.
output (numpy.ndarray) – Labels of data corresponding to feature matrix.
split_proportion (float) – Proportion of data to use for training; between 0 and 1.
standardized (bool) – Whether to center/scale the data (train/test done separately). True by default.
k (int) – The number of neighbors to use in the algorithm.
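
For intuition, k-nearest neighbor regression predicts the average output of the k closest training points. The sketch below assumes Euclidean distance and an unweighted mean; the function name `knn_predict` is hypothetical, and the actual class may differ.

```python
import numpy as np


def knn_predict(train_features, train_output, query_features, k=3):
    """Predict each query's output as the mean output of its k nearest training rows."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    diffs = query_features[:, None, :] - train_features[None, :, :]
    distances = np.sum(diffs ** 2, axis=2)
    # Indices of the k closest training points for every query row.
    nearest = np.argsort(distances, axis=1)[:, :k]
    return train_output[nearest].mean(axis=1)


train_X = np.array([[0.0], [1.0], [2.0], [10.0]])
train_y = np.array([0.0, 1.0, 2.0, 10.0])
query = np.array([[1.1]])

# The three nearest neighbors of 1.1 are x = 1, 2 and 0, so the far point at 10 is ignored.
prediction = knn_predict(train_X, train_y, query, k=3)
```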

Poisson Regression¶

This module builds a class for Poisson regression problems. We compute the solution by directly maximizing the log-likelihood, using an existing implementation of the BFGS algorithm: scipy.optimize.minimize(method='BFGS'). BFGS is a local method; the maximum it finds is global when the log-likelihood is concave, as it is for Poisson regression with the standard log link.

class poissonreg.Poisson(features, output, split_proportion=0.75, standardized=True)[source]¶

A class used to represent a Poisson regression model.

Parameters

features (numpy.ndarray) – Design matrix of explanatory variables.
output (numpy.ndarray) – Count data output corresponding to feature matrix.
split_proportion (float) – Proportion of data to use for training; between 0 and 1.
standardized (bool) – Whether to center/scale the data (train/test done separately). True by default.

coefficients¶

The coefficients in the Poisson regression model.

Type: numpy.ndarray

train_predictions¶

The predicted output values for the training data.

Type: numpy.ndarray

test_predictions¶

The predicted output values for the test data.

Type: numpy.ndarray

train_error¶

The error of model on training data (default is MSE).

Type: float

test_error¶

The error of model on test data (default is MSE).

Type: float

fit()[source]¶: Calculate the coefficient estimate on the training data.

static loglikelihood(features, counts, coefficient)[source]¶

Compute the empirical log-likelihood of the data for a given coefficient vector.

Parameters

features (numpy.ndarray) – Design matrix of explanatory variables.
counts (numpy.ndarray) – The given response values (integers).
coefficient (numpy.ndarray) – Vector of coefficients for Poisson regression.

Returns

loglikelihood – The value of the log likelihood with given data and coefficient.

Return type

float
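
As a worked illustration, assuming the standard log link (rate exp(x·β) for each observation), the Poisson log-likelihood is the sum over observations of y·(x·β) − exp(x·β) − log(y!). The sketch below follows that formula. The function name `poisson_loglikelihood` is hypothetical, and whether the real method keeps the constant log(y!) term is not stated in this documentation.

```python
import numpy as np
from scipy.special import gammaln


def poisson_loglikelihood(features, counts, coefficient):
    """Log-likelihood of counts under a Poisson model with rate exp(features @ coefficient)."""
    linear_predictor = features @ coefficient
    # gammaln(y + 1) equals log(y!), the term that does not depend on the coefficient.
    terms = counts * linear_predictor - np.exp(linear_predictor) - gammaln(counts + 1)
    return float(np.sum(terms))


X = np.array([[1.0, 0.0], [1.0, 1.0]])
y = np.array([1, 2])
beta = np.zeros(2)

# With beta = 0 every rate is 1, so the value is -2 - log(1!) - log(2!) = -2 - log(2).
value = poisson_loglikelihood(X, y, beta)
```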

static mle_finder(features, counts)[source]¶

Find the MLE for a Poisson regression model; return the coefficient.

Parameters

features (numpy.ndarray) – Design matrix of explanatory variables
counts (numpy.ndarray) – The given output counts

Returns

mle – The coefficient that solves the Poisson regression problem for the given data.

Return type

numpy.ndarray

Notes

We use the BFGS algorithm as implemented in scipy to maximize the log-likelihood. This requires negating the log likelihood as we use minimization.
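
The negate-and-minimize approach described in the note can be sketched as below. This is a minimal sketch under assumptions rather than the module's code: it uses the log link, drops the constant log(y!) term (which does not affect the maximizer), starts from a zero vector, and includes an explicit column of ones in the synthetic design matrix. The helper `negative_loglikelihood` is a hypothetical name.

```python
import numpy as np
from scipy.optimize import minimize


def negative_loglikelihood(coefficient, features, counts):
    """Negated Poisson log-likelihood, omitting the constant log(y!) term."""
    linear_predictor = features @ coefficient
    return np.sum(np.exp(linear_predictor) - counts * linear_predictor)


def mle_finder(features, counts):
    """Maximize the log-likelihood by minimizing its negation with BFGS."""
    start = np.zeros(features.shape[1])  # assumed starting point
    result = minimize(negative_loglikelihood, start,
                      args=(features, counts), method="BFGS")
    return result.x


# Synthetic count data generated from known coefficients.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
true_beta = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ true_beta))

mle = mle_finder(X, y)
```

With 500 observations the estimate should land close to the generating coefficients, though not exactly on them, since the counts are random.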