Regression Implementation¶
This section includes:
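Base Regression (a barebones framework for regression)
Linear Regression (least squares)
K-Nearest Neighbor Regression
Poisson Regression (maximum likelihood for count data)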
The code is similar to the Classification code, except the output types are arbitrary real numbers, and we use different evaluation metrics.
Base Regression¶
This module builds a base class for regression problems, such as least squares or k-nearest neighbor regression. Any preprocessing (if applicable) is done at this class level.
- class regression.Regression(features, output, split_proportion=0.75, standardized=True)[source]¶
A class used to represent a regression algorithm.
- Parameters
features (numpy.ndarray) – Design matrix of explanatory variables.
output (numpy.ndarray) – Labels of data corresponding to feature matrix.
split_proportion (float) – Proportion of data to use for training; between 0 and 1.
standardized (bool) – Whether to center/scale the data (train/test done separately). True by default.
- sample_size¶
The sample size of all given data (train and test).
- Type
int
- train_size¶
The sample size of the training data.
- Type
int
- test_size¶
The sample size of the test data.
- Type
int
- train_rows¶
The list of indices for the train split.
- Type
numpy.ndarray
- test_rows¶
The list of indices for the test split.
- Type
numpy.ndarray
- train_features¶
The train design matrix.
- Type
numpy.ndarray
- test_features¶
The test design matrix.
- Type
numpy.ndarray
- train_output¶
The train output data.
- Type
numpy.ndarray
- test_output¶
The test output data.
- Type
numpy.ndarray
- dimension¶
The number of dimensions of the data, or columns of design matrix. Does not include output.
- Type
int
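The attributes above can be illustrated with a minimal sketch of the split and standardization step. This is not the library's source: the function names `standardize` and `split_data`, the `seed` argument, and the random shuffling are assumptions made for illustration, and "train/test done separately" is read here as each split being centered and scaled with its own column means and standard deviations.

```python
import numpy as np


def standardize(matrix):
    """Center each column to mean 0 and scale it to unit standard deviation."""
    mean = matrix.mean(axis=0)
    std = matrix.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled to avoid dividing by zero
    return (matrix - mean) / std


def split_data(features, output, split_proportion=0.75, standardized=True, seed=0):
    """Hypothetical helper that returns the documented attributes as a dict."""
    rng = np.random.default_rng(seed)  # seed is an assumption, not a documented parameter
    sample_size = features.shape[0]
    train_size = int(round(split_proportion * sample_size))

    # Shuffle the row indices, then cut them into train and test index arrays.
    shuffled = rng.permutation(sample_size)
    train_rows, test_rows = shuffled[:train_size], shuffled[train_size:]

    train_features = features[train_rows].astype(float)
    test_features = features[test_rows].astype(float)
    if standardized:
        # Each split uses its own statistics (assumed reading of the docs).
        train_features = standardize(train_features)
        test_features = standardize(test_features)

    return {
        "sample_size": sample_size,
        "train_size": train_size,
        "test_size": sample_size - train_size,
        "train_rows": train_rows,
        "test_rows": test_rows,
        "train_features": train_features,
        "test_features": test_features,
        "train_output": output[train_rows],
        "test_output": output[test_rows],
        "dimension": features.shape[1],
    }
```

With 20 observations and the default split_proportion of 0.75, this sketch yields a train_size of 15 and a test_size of 5.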
Linear Regression¶
This module is for performing linear regression.
- class linearreg.Linear(features, output, split_proportion=0.75, standardized=True)[source]¶
A class used to represent a linear regression model.
- Parameters
features (numpy.ndarray) – Design matrix of explanatory variables, not including a column of 1s for the intercept.
output (numpy.ndarray) – Labels of data corresponding to feature matrix.
split_proportion (float) – Proportion of data to use for training; between 0 and 1.
standardized (bool) –
Whether to center/scale the data (train/test done separately). True by default.
Caution
Don’t include the all ones column, as standardization will result in a singular matrix.
- coefficients¶
The coefficients in the linear regression model. The first coefficient is the intercept.
- Type
numpy.ndarray
- train_predictions¶
The predicted output values for the training data.
- Type
numpy.ndarray
- test_predictions¶
The predicted output values for the test data.
- Type
numpy.ndarray
- train_error¶
The error of model on training data (default is MSE).
- Type
float
- test_error¶
The error of model on test data (default is MSE).
- Type
float
- fit()[source]¶
Calculate the coefficient solving the least squares problem using training data.
- Returns
coefficients – Vector of coefficients of length self.dimension. The first element is the intercept term.
- Return type
numpy.ndarray
- static predict(features, coefficients)[source]¶
Compute estimated output y = X*beta_hat of linear regression.
- Parameters
features (numpy.ndarray) – Design matrix of explanatory variables.
coefficients (numpy.ndarray) – Vector of coefficients for least squares solution.
- Returns
prediction – Predicted output for each observation.
- Return type
numpy.ndarray
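As a rough illustration of the fit and predict steps described above, the following sketch solves the least squares problem with numpy.linalg.lstsq and evaluates it with MSE. The function names are hypothetical, and prepending the column of 1s inside each function is an assumption; the documentation does not say where the library adds the intercept column.

```python
import numpy as np


def fit_least_squares(train_features, train_output):
    """Return coefficients with the intercept first (hypothetical helper)."""
    # Prepend the intercept column here, so callers pass features without it.
    design = np.column_stack([np.ones(train_features.shape[0]), train_features])
    coefficients, *_ = np.linalg.lstsq(design, train_output, rcond=None)
    return coefficients


def predict_linear(features, coefficients):
    """Compute y_hat = X * beta_hat, adding the intercept column first."""
    design = np.column_stack([np.ones(features.shape[0]), features])
    return design @ coefficients


def mean_squared_error(observed, predicted):
    """Average squared difference between observed and predicted outputs."""
    return float(np.mean((observed - predicted) ** 2))
```

Using lstsq rather than inverting the normal equations directly is a design choice of this sketch: it remains usable when the design matrix is rank deficient or poorly conditioned.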
K-Nearest Neighbor Regression¶
This module builds a class for k-nearest neighbor regression.
- class knnreg.KNNRegression(features, output, split_proportion=0.75, standardized=True, k=3, classify=False)[source]¶
A class used to represent a k-nearest neighbor regressor. The regression methods and attributes can be found in the KNNClassify class.
- Parameters
features (numpy.ndarray) – Design matrix of explanatory variables.
output (numpy.ndarray) – Labels of data corresponding to feature matrix.
split_proportion (float) – Proportion of data to use for training; between 0 and 1.
standardized (bool) – Whether to center/scale the data (train/test done separately). True by default.
k (int) – The number of neighbors to use in the algorithm.
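A minimal sketch of k-nearest neighbor regression follows, assuming Euclidean distance and an unweighted average of the k nearest training outputs. The function name `knn_regress` is hypothetical, and the library's actual distance metric and tie handling are not stated in this documentation.

```python
import numpy as np


def knn_regress(train_features, train_output, query_features, k=3):
    """Predict each query point as the mean output of its k nearest training points."""
    # Pairwise Euclidean distances, shape (n_query, n_train).
    differences = query_features[:, None, :] - train_features[None, :, :]
    distances = np.sqrt((differences ** 2).sum(axis=2))

    # Indices of the k closest training points for every query row.
    nearest = np.argsort(distances, axis=1)[:, :k]

    # Average the outputs of those neighbors.
    return train_output[nearest].mean(axis=1)
```

Because distances depend on the scale of each feature, standardizing the features first (as the standardized parameter does) usually matters more for this method than for least squares.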
Poisson Regression¶
This module builds a class for Poisson regression problems. We compute the solution by directly maximizing the log-likelihood, using an existing implementation of the BFGS algorithm: scipy.optimize.minimize(method='BFGS'). BFGS is a local optimizer, but the Poisson log-likelihood is concave in the coefficients, so a maximum it finds is also a global one.
- class poissonreg.Poisson(features, output, split_proportion=0.75, standardized=True)[source]¶
A class used to represent a Poisson regression model.
- Parameters
features (numpy.ndarray) – Design matrix of explanatory variables.
output (numpy.ndarray) – Count data output corresponding to feature matrix.
split_proportion (float) – Proportion of data to use for training; between 0 and 1.
standardized (bool) – Whether to center/scale the data (train/test done separately). True by default.
- coefficients¶
The coefficients in the Poisson regression model.
- Type
numpy.ndarray
- train_predictions¶
The predicted output values for the training data.
- Type
numpy.ndarray
- test_predictions¶
The predicted output values for the test data.
- Type
numpy.ndarray
- train_error¶
The error of model on training data (default is MSE).
- Type
float
- test_error¶
The error of model on test data (default is MSE).
- Type
float
- static loglikelihood(features, counts, coefficient)[source]¶
Compute the empirical log-likelihood of the data for the given coefficient vector.
- Parameters
features (numpy.ndarray) – Design matrix of explanatory variables
counts (numpy.ndarray) – The given response values (integers).
coefficients (numpy.ndarray) – Vector of coefficients for Poisson regression.
- Returns
loglikelihood – The value of the log likelihood with given data and coefficient.
- Return type
float
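For reference, the Poisson log-likelihood under log E[Y|x] = beta^T x can be sketched as below. The function name is hypothetical, and including the log(y!) term is an assumption: that term does not depend on the coefficients, so an implementation may drop it without changing the maximizer.

```python
import numpy as np
from scipy.special import gammaln


def poisson_loglikelihood(features, counts, coefficients):
    """Sum over observations of y * (x^T beta) - exp(x^T beta) - log(y!)."""
    linear_predictor = features @ coefficients
    # gammaln(y + 1) equals log(y!) and avoids overflow for large counts.
    terms = counts * linear_predictor - np.exp(linear_predictor) - gammaln(counts + 1)
    return float(np.sum(terms))
```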
- static mle_finder(features, counts)[source]¶
Find the MLE for a Poisson regression model; return the coefficient.
- Parameters
features (numpy.ndarray) – Design matrix of explanatory variables
counts (numpy.ndarray) – The given output counts
- Returns
mle – The coefficient that solves the Poisson regression problem for the given data.
- Return type
float
Notes
We use the BFGS algorithm as implemented in scipy to maximize the log-likelihood. This requires negating the log likelihood as we use minimization.
See also
scipy.optimize.minimize
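The negate-and-minimize approach in the note above might look like the following sketch. The function names, the zero starting point, and the omission of the constant log(y!) term are assumptions for illustration; only the use of scipy.optimize.minimize with method='BFGS' comes from the documentation.

```python
import numpy as np
from scipy.optimize import minimize


def negative_loglikelihood(coefficients, features, counts):
    """Negated Poisson log-likelihood, dropping the constant log(y!) term."""
    linear_predictor = features @ coefficients
    return float(np.sum(np.exp(linear_predictor) - counts * linear_predictor))


def find_poisson_mle(features, counts):
    """Minimize the negated log-likelihood with BFGS, starting from zeros."""
    start = np.zeros(features.shape[1])  # assumed starting point
    result = minimize(
        negative_loglikelihood,
        start,
        args=(features, counts),
        method="BFGS",
    )
    return result.x
```

For an intercept-only design matrix (a single column of 1s), the maximizer has a closed form, the log of the sample mean of the counts, which gives a simple way to sanity-check the optimizer.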
- static predict(features, coefficients)[source]¶
Compute estimated means of Poisson regression.
- Parameters
features (numpy.ndarray) – Design matrix of explanatory variables.
coefficients (numpy.ndarray) – Vector of coefficients for Poisson regression solution.
- Returns
exp_dot_prods – Predicted mean of Poisson distribution
- Return type
numpy.ndarray
Notes
In a Poisson model, we assume Y is Poisson, and that log E[Y|x] = beta^T x.
Here we return our estimate of E[Y|x] for a test data point x. Because the mean and variance of a Poisson distribution are equal, this quantity is both the expected count and the variance of the response, conditional on the observed features.
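Under the model in the notes, the prediction is the exponential of the linear predictor. A one-function sketch, with a hypothetical name:

```python
import numpy as np


def predict_poisson_mean(features, coefficients):
    """Return the estimated E[Y|x] = exp(beta^T x) for each row of features."""
    return np.exp(features @ coefficients)
```

The returned values are real-valued means rather than integer counts, so comparing them with observed counts via MSE is well defined.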