Regression Implementation

The code is similar to the Classification code, except that the outputs are arbitrary real numbers and we use different evaluation metrics.

Base Regression

This module builds a base class for regression problems, such as least squares or k-nearest neighbors regression. Any preprocessing is done at this class level.

class regression.Regression(features, output, split_proportion=0.75, standardized=True)[source]

A class used to represent a regression algorithm.

Parameters
  • features (numpy.ndarray) – Design matrix of explanatory variables.

  • output (numpy.ndarray) – Labels of data corresponding to feature matrix.

  • split_proportion (float) – Proportion of data to use for training; between 0 and 1.

  • standardized (bool) – Whether to center/scale the data (train/test done separately). True by default.

sample_size

The sample size of all given data (train and test).

Type

int

train_size

The sample size of the training data.

Type

int

test_size

The sample size of the test data.

Type

int

train_rows

The list of indices for the train split.

Type

numpy.ndarray

test_rows

The list of indices for the test split.

Type

numpy.ndarray

train_features

The train design matrix.

Type

numpy.ndarray

test_features

The test design matrix.

Type

numpy.ndarray

train_output

The train output data.

Type

numpy.ndarray

test_output

The test output data.

Type

numpy.ndarray

dimension

The number of dimensions of the data, or columns of design matrix. Does not include output.

Type

int
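The split attributes above can be illustrated with a short sketch. It is not the library's code: the helper name `split_rows`, the random shuffle, the `seed` argument, and the rounding rule are all assumptions made for illustration.

```python
import numpy as np

def split_rows(sample_size, split_proportion=0.75, seed=0):
    """Return (train_rows, test_rows) index arrays for a random split.

    Hypothetical helper: shuffling and rounding are assumptions.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(sample_size)
    # Number of rows assigned to the training split.
    train_size = int(round(split_proportion * sample_size))
    return np.sort(shuffled[:train_size]), np.sort(shuffled[train_size:])

train_rows, test_rows = split_rows(sample_size=8)
train_size, test_size = len(train_rows), len(test_rows)
```

With these index arrays, `features[train_rows]` and `output[train_rows]` would give the train design matrix and train output, and likewise for the test split.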

standardize()[source]

Separately scale/center the train and test data so each feature (column of observations) has 0 mean and unit variance.
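A minimal sketch of this kind of column-wise standardization follows, applied to each split on its own. The function name `standardize_columns` and the handling of constant columns are assumptions, not the library's implementation.

```python
import numpy as np

def standardize_columns(matrix):
    """Center each column to mean 0 and scale it to unit variance."""
    mean = matrix.mean(axis=0)
    std = matrix.std(axis=0)
    # Assumption: a constant column is only centered, to avoid dividing by zero.
    std = np.where(std == 0, 1.0, std)
    return (matrix - mean) / std

train_features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test_features = np.array([[4.0, 5.0], [6.0, 9.0]])

# Each split uses its own column means and standard deviations.
train_std = standardize_columns(train_features)
test_std = standardize_columns(test_features)
```

Note that in this sketch a constant column becomes all zeros after centering, so it carries no information for a model.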

Linear Regression

This module is for performing linear regression.

class linearreg.Linear(features, output, split_proportion=0.75, standardized=True)[source]

A class used to represent a linear regression model.

Parameters
  • features (numpy.ndarray) – Design matrix of explanatory variables, not including a column of 1s for the intercept.

  • output (numpy.ndarray) – Labels of data corresponding to feature matrix.

  • split_proportion (float) – Proportion of data to use for training; between 0 and 1.

  • standardized (bool) –

    Whether to center/scale the data (train/test done separately). True by default.

    Caution

    Don’t include the all-ones column in features; standardizing it makes the design matrix singular.

coefficients

The coefficients in the linear regression model. The first coefficient is the intercept.

Type

numpy.ndarray

train_predictions

The predicted output values for the training data.

Type

numpy.ndarray

test_predictions

The predicted output values for the test data.

Type

numpy.ndarray

train_error

The error of model on training data (default is MSE).

Type

float

test_error

The error of model on test data (default is MSE).

Type

float

fit()[source]

Calculate the coefficients that solve the least squares problem on the training data.

Returns

coefficients – Vector of coefficients of length self.dimension. The first element is the intercept term.

Return type

numpy.ndarray
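As a rough illustration of such a fit, the sketch below prepends a column of ones for the intercept and solves the least squares problem with `numpy.linalg.lstsq`. The function name `fit_least_squares` is hypothetical, and the library may solve the problem differently (for example, through the normal equations). In this sketch the returned vector has one more entry than the number of feature columns.

```python
import numpy as np

def fit_least_squares(train_features, train_output):
    """Return [intercept, slope_1, ..., slope_p] minimizing squared error."""
    # Prepend the intercept column; the caller passes features without it.
    ones = np.ones((train_features.shape[0], 1))
    design = np.hstack([ones, train_features])
    coefficients, *_ = np.linalg.lstsq(design, train_output, rcond=None)
    return coefficients

# Noise-free data generated from y = 1 + 2x.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 1.0 + 2.0 * X[:, 0]
beta_hat = fit_least_squares(X, y)
```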

static predict(features, coefficients)[source]

Compute estimated output y = X*beta_hat of linear regression.

Parameters
  • features (numpy.ndarray) – Design matrix of explanatory variables.

  • coefficients (numpy.ndarray) – Vector of coefficients for least squares solution.

Returns

prediction – Predicted output for each observation.

Return type

numpy.ndarray
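The prediction y = X*beta_hat and the MSE used for train_error and test_error can be sketched as below. This assumes the features are passed without the intercept column and that the first coefficient is the intercept; the function names are hypothetical.

```python
import numpy as np

def predict_linear(features, coefficients):
    """Compute X * beta_hat after prepending the intercept column."""
    ones = np.ones((features.shape[0], 1))
    return np.hstack([ones, features]) @ coefficients

def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted outputs."""
    return float(np.mean((actual - predicted) ** 2))

X_test = np.array([[1.0], [2.0]])
beta_hat = np.array([1.0, 2.0])  # intercept 1, slope 2
predictions = predict_linear(X_test, beta_hat)
error = mean_squared_error(np.array([3.0, 6.0]), predictions)
```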

K-Nearest Neighbor Regression

This module builds a class for k-nearest neighbor regression.

class knnreg.KNNRegression(features, output, split_proportion=0.75, standardized=True, k=3, classify=False)[source]

A class used to represent a k-nearest neighbor regressor. The regression methods and attributes can be found in the KNNClassify class.

Parameters
  • features (numpy.ndarray) – Design matrix of explanatory variables.

  • output (numpy.ndarray) – Labels of data corresponding to feature matrix.

  • split_proportion (float) – Proportion of data to use for training; between 0 and 1.

  • standardized (bool) – Whether to center/scale the data (train/test done separately). True by default.

  • k (int) – The number of neighbors to use in the algorithm.
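To make the role of k concrete, here is a minimal sketch of k-nearest neighbor regression. It assumes Euclidean distance and an unweighted average of the neighbors' outputs; the function name `knn_regress` is hypothetical, and the library's distance and averaging choices may differ.

```python
import numpy as np

def knn_regress(train_features, train_output, query_features, k=3):
    """Predict each query's output as the mean output of its k nearest rows."""
    # Pairwise Euclidean distances, shape (n_query, n_train).
    diffs = query_features[:, None, :] - train_features[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    # Indices of the k closest training rows for each query.
    nearest = np.argsort(distances, axis=1)[:, :k]
    return train_output[nearest].mean(axis=1)

train_X = np.array([[0.0], [1.0], [2.0], [10.0]])
train_y = np.array([0.0, 1.0, 2.0, 10.0])
preds = knn_regress(train_X, train_y, np.array([[1.1]]), k=3)
```

Here the three nearest training points to 1.1 are 1.0, 2.0, and 0.0, so the prediction is the mean of their outputs.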

Poisson Regression

This module builds a class for Poisson regression problems. We compute the solution by directly maximizing the log-likelihood with an existing implementation of BFGS, available as scipy.optimize.minimize(method='BFGS'). BFGS is a local optimizer, but the Poisson log-likelihood with the log link is concave, so a local maximum is also a global one.

class poissonreg.Poisson(features, output, split_proportion=0.75, standardized=True)[source]

A class used to represent a Poisson regression model.

Parameters
  • features (numpy.ndarray) – Design matrix of explanatory variables.

  • output (numpy.ndarray) – Count data output corresponding to feature matrix.

  • split_proportion (float) – Proportion of data to use for training; between 0 and 1.

  • standardized (bool) – Whether to center/scale the data (train/test done separately). True by default.

coefficients

The coefficients in the Poisson regression model.

Type

numpy.ndarray

train_predictions

The predicted output values for the training data.

Type

numpy.ndarray

test_predictions

The predicted output values for the test data.

Type

numpy.ndarray

train_error

The error of model on training data (default is MSE).

Type

float

test_error

The error of model on test data (default is MSE).

Type

float

fit()[source]

Calculate the coefficient estimate on the training data.

static loglikelihood(features, counts, coefficient)[source]

Compute the empirical log likelihood of the data at the given coefficient vector.

Parameters
  • features (numpy.ndarray) – Design matrix of explanatory variables.

  • counts (numpy.ndarray) – The given response values (integers).

  • coefficient (numpy.ndarray) – Vector of coefficients for Poisson regression.

Returns

loglikelihood – The value of the log likelihood with given data and coefficient.

Return type

float
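For reference, the Poisson log likelihood with a log link is the sum over observations of y_i * (x_i^T beta) - exp(x_i^T beta) - log(y_i!). The sketch below evaluates it; the function name is hypothetical, and whether the library keeps the constant log(y_i!) term is an assumption.

```python
import numpy as np
from scipy.special import gammaln

def poisson_loglikelihood(features, counts, coefficient):
    """Poisson log likelihood under the model log E[Y|x] = beta^T x."""
    linear_predictor = features @ coefficient
    # gammaln(y + 1) equals log(y!); this term does not depend on beta.
    terms = counts * linear_predictor - np.exp(linear_predictor) - gammaln(counts + 1)
    return float(np.sum(terms))

X = np.array([[1.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, 2.0])
value = poisson_loglikelihood(X, y, np.zeros(2))
```

At beta = 0 every mean is 1, so the value is -2 - log(2).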

static mle_finder(features, counts)[source]

Find the MLE for a Poisson regression model; return the coefficient.

Parameters
  • features (numpy.ndarray) – Design matrix of explanatory variables.

  • counts (numpy.ndarray) – The given output counts.

Returns

mle – The coefficient that solves the Poisson regression problem for the given data.

Return type

numpy.ndarray

Notes

We use the BFGS algorithm as implemented in scipy to maximize the log-likelihood. Because scipy.optimize.minimize performs minimization, we pass it the negated log-likelihood.

See also

scipy.optimize.minimize
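The negate-and-minimize approach described in the notes can be sketched as follows. The function names, the zero starting point, and dropping the constant log(y!) term are assumptions for illustration; only `scipy.optimize.minimize` with `method="BFGS"` comes from the text above.

```python
import numpy as np
from scipy.optimize import minimize

def negative_loglikelihood(coefficient, features, counts):
    """Negated Poisson log likelihood, without the constant log(y!) term."""
    linear_predictor = features @ coefficient
    return np.sum(np.exp(linear_predictor) - counts * linear_predictor)

def find_mle(features, counts):
    """Minimize the negated log likelihood with BFGS, starting from zeros."""
    start = np.zeros(features.shape[1])  # assumed starting point
    result = minimize(negative_loglikelihood, start,
                      args=(features, counts), method="BFGS")
    return result.x

# Intercept-only model: the fitted mean exp(beta) should equal the sample mean.
X = np.ones((4, 1))
y = np.array([1.0, 2.0, 3.0, 6.0])
mle = find_mle(X, y)
```

In this intercept-only example the sample mean is 3, so the estimate should be close to log(3).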

static predict(features, coefficients)[source]

Compute estimated means of Poisson regression.

Parameters
  • features (numpy.ndarray) – Design matrix of explanatory variables.

  • coefficients (numpy.ndarray) – Vector of coefficients for Poisson regression solution.

Returns

exp_dot_prods – Predicted mean of Poisson distribution

Return type

numpy.ndarray

Notes

In a Poisson model, we assume Y is Poisson, and that log E[Y|x] = beta^T x.

Here we return our estimate of E[Y|x] for a test data point x. Because the mean and variance of a Poisson distribution are equal, this quantity is both the expected count and the variance of the count, conditional on the observed features.
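Under the model log E[Y|x] = beta^T x, the predicted mean is exp(beta^T x). A minimal sketch, with a hypothetical function name and assuming the features already contain whatever columns the coefficients refer to:

```python
import numpy as np

def predict_poisson_mean(features, coefficients):
    """Return the estimated E[Y|x] = exp(beta^T x) for each row of features."""
    return np.exp(features @ coefficients)

X_new = np.array([[1.0, 0.0], [1.0, 2.0]])
beta_hat = np.array([0.0, np.log(2.0)])
means = predict_poisson_mean(X_new, beta_hat)
```

The exponential keeps every predicted mean strictly positive, as a Poisson mean must be.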