Data Preprocessing

This module provides tools for processing and preparing data:

Scaling and centering data
Generating train/test splits
Generating cross-validation folds (in progress)
Generating bootstrap resamples (in progress)

Standardize Data

preprocessing.scale_and_center(features)

Center and scale each column of the feature matrix to have zero mean and unit variance.

Parameters

features (numpy.ndarray) – Design matrix of explanatory variables.

Returns

features – Design matrix of scaled and centered explanatory variables.

Return type

numpy.ndarray
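The standardization described above can be illustrated with a minimal sketch. This is not the module's implementation: the function name `scale_and_center_sketch` is hypothetical, and the choice of the population standard deviation (`ddof=0`) is an assumption, since the documentation does not say which estimator is used.

```python
import numpy as np


def scale_and_center_sketch(features):
    """Hypothetical stand-in for preprocessing.scale_and_center."""
    features = np.asarray(features, dtype=float)
    # Column-wise statistics: axis=0 reduces over the rows (observations).
    col_means = features.mean(axis=0)
    # Assumption: population standard deviation (ddof=0).
    col_stds = features.std(axis=0)
    # Broadcasting subtracts and divides each column by its own statistic.
    return (features - col_means) / col_stds


X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_scaled = scale_and_center_sketch(X)
```

Note that a constant column has zero standard deviation, so this sketch would divide by zero for it; how the real function handles that case is not documented here.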

Train/Test Split

preprocessing.train_test_split(features, output, split_proportion)

Split the data into training and testing sets.

Parameters

features (numpy.ndarray) – Design matrix of explanatory variables.

output (numpy.ndarray) – The given response variables.

split_proportion (float) – The proportion of data used for training. Default is 25%. Must lie between 0 and 1.

Returns

split_values – Stores the following values: ['sample_size', 'train_size', 'test_size', 'train_rows', 'test_rows', 'train_features', 'test_features', 'train_output', 'test_output']

Return type

namedtuple

Notes

I used [2] to figure out how to randomly choose rows of an array.

References

[2] https://stackoverflow.com/a/14262743
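As an illustration of the split described above, here is a minimal sketch that draws training row indices at random without replacement and returns a namedtuple with the documented field names. It is not the module's code: the names `SplitValues` and `train_test_split_sketch`, the `seed` parameter, and the rule of truncating `sample_size * split_proportion` to an integer are all assumptions made for this example.

```python
from collections import namedtuple

import numpy as np

# Field names follow the documented return value; the type name is hypothetical.
SplitValues = namedtuple(
    "SplitValues",
    [
        "sample_size", "train_size", "test_size",
        "train_rows", "test_rows",
        "train_features", "test_features",
        "train_output", "test_output",
    ],
)


def train_test_split_sketch(features, output, split_proportion, seed=None):
    """Hypothetical stand-in for preprocessing.train_test_split."""
    assert 0 < split_proportion < 1, "split_proportion must lie between 0 and 1"
    rng = np.random.default_rng(seed)  # `seed` is an assumption, for repeatability
    sample_size = features.shape[0]
    # Assumption: truncate to an integer number of training rows.
    train_size = int(sample_size * split_proportion)
    test_size = sample_size - train_size
    # Choose training rows at random, without replacement.
    train_rows = rng.choice(sample_size, size=train_size, replace=False)
    # Every remaining row goes to the test set.
    test_rows = np.setdiff1d(np.arange(sample_size), train_rows)
    return SplitValues(
        sample_size, train_size, test_size,
        train_rows, test_rows,
        features[train_rows], features[test_rows],
        output[train_rows], output[test_rows],
    )


features = np.arange(16.0).reshape(8, 2)
output = np.arange(8.0)
split = train_test_split_sketch(features, output, split_proportion=0.75, seed=0)
```

Because the result is a namedtuple, values are read by attribute, for example `split.train_features` or `split.test_rows`.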

Cross-Validation Folds

preprocessing.cross_validation_folds_idx(row_count, fold_count)

Partition the (training) dataset into folds.

Parameters

row_count (int) – Sample size of training data we form folds from.

fold_count (int) – The number of folds to produce; cannot exceed row_count.

Returns

folds – Each row stores the indices for one fold; the number of columns equals the fold size.

Return type

numpy.ndarray

Raises

AssertionError – If more folds are requested than there are observations.
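The fold layout described above (one row per fold, one column per index in the fold) can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: the function name `cross_validation_folds_idx_sketch` and the `seed` parameter are hypothetical, the indices are assumed to be shuffled before partitioning, and any leftover indices when `row_count` is not divisible by `fold_count` are simply dropped here.

```python
import numpy as np


def cross_validation_folds_idx_sketch(row_count, fold_count, seed=None):
    """Hypothetical stand-in for preprocessing.cross_validation_folds_idx."""
    # Documented behavior: requesting more folds than rows is an error.
    assert fold_count <= row_count, "more folds requested than observations"
    rng = np.random.default_rng(seed)  # `seed` is an assumption
    # Assumption: equal-sized folds; remainder indices are dropped.
    fold_size = row_count // fold_count
    shuffled = rng.permutation(row_count)
    # Reshape so that each row holds the indices of one fold.
    return shuffled[: fold_count * fold_size].reshape(fold_count, fold_size)


folds = cross_validation_folds_idx_sketch(row_count=12, fold_count=4, seed=0)
```

To use fold `k` as the validation set, `folds[k]` gives the validation indices and the remaining rows of `folds`, flattened, give the training indices.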