Data Preprocessing

This module provides data processing and preparation tools.

  • Scaling/centering data

  • Generating train/test splits

  • Generating cross-validation folds (In Progress)

  • Generating bootstrap resamples (In Progress)

Standardize Data

preprocessing.scale_and_center(features)[source]

Center and scale each column of the feature matrix to have zero mean and unit variance.

Parameters

features (numpy.ndarray) – Design matrix of explanatory variables.

Returns

features – Design matrix of scaled and centered explanatory variables.

Return type

numpy.ndarray
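The standardization described above can be sketched with plain NumPy. This is a minimal illustration, not the module's source: the name scale_and_center_sketch is hypothetical, and the use of the population standard deviation (ddof=0) is an assumption, since the documentation does not say which estimator is used.

```python
import numpy as np


def scale_and_center_sketch(features):
    """Hypothetical stand-in for preprocessing.scale_and_center."""
    features = np.asarray(features, dtype=float)
    # Column-wise statistics: axis=0 reduces over the rows.
    column_means = features.mean(axis=0)
    # Assumption: population standard deviation (ddof=0).
    column_stds = features.std(axis=0)
    return (features - column_means) / column_stds


X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_scaled = scale_and_center_sketch(X)
```

A column with zero variance would produce a division by zero in this sketch; how the module itself treats constant columns is not stated here.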

Train/Test Split

preprocessing.train_test_split(features, output, split_proportion)[source]

Split the data into training and testing sets.

Parameters
  • features (numpy.ndarray) – Design matrix of explanatory variables

  • output (numpy.ndarray) – The given response variables

  • split_proportion (float) – The proportion of data used for training, given as a fraction between 0 and 1. Default is 0.25.

Returns

split_values – Stores the following values: ['sample_size', 'train_size', 'test_size', 'train_rows', 'test_rows', 'train_features', 'test_features', 'train_output', 'test_output']

Return type

namedtuple

Notes

I used [2] to figure out how to randomly choose rows of an array.

References

[2] https://stackoverflow.com/a/14262743
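As a rough illustration of the split described in this section, the sketch below shuffles the row indices and packs the documented fields into a namedtuple. It is not the module's implementation: the name train_test_split_sketch, the seed parameter, the ValueError, and the truncating rule used for train_size are all assumptions made for this example.

```python
from collections import namedtuple

import numpy as np

# Field names are taken from the "Returns" description above.
SplitValues = namedtuple(
    "SplitValues",
    [
        "sample_size", "train_size", "test_size",
        "train_rows", "test_rows",
        "train_features", "test_features",
        "train_output", "test_output",
    ],
)


def train_test_split_sketch(features, output, split_proportion, seed=None):
    """Hypothetical stand-in for preprocessing.train_test_split."""
    if not 0 < split_proportion < 1:
        raise ValueError("split_proportion must lie between 0 and 1")
    sample_size = features.shape[0]
    # Assumption: the training size is truncated to an integer.
    train_size = int(sample_size * split_proportion)
    test_size = sample_size - train_size
    # Shuffle the row indices, then cut them into two disjoint groups.
    rng = np.random.default_rng(seed)
    shuffled_rows = rng.permutation(sample_size)
    train_rows = shuffled_rows[:train_size]
    test_rows = shuffled_rows[train_size:]
    return SplitValues(
        sample_size, train_size, test_size,
        train_rows, test_rows,
        features[train_rows], features[test_rows],
        output[train_rows], output[test_rows],
    )


X = np.arange(16).reshape(8, 2)
y = np.arange(8)
split = train_test_split_sketch(X, y, 0.75, seed=0)
```

Fields can then be read by name, for example split.train_features and split.test_output.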

Cross-Validation Folds

preprocessing.cross_validation_folds_idx(row_count, fold_count)[source]

Partition the (training) dataset into folds.

Parameters
  • row_count (int) – Sample size of training data we form folds from.

  • fold_count (int) – The number of folds to produce; cannot exceed row_count.

Returns

folds – Each row stores the indices for one fold; the number of columns equals the fold size.

Return type

numpy.ndarray

Raises

AssertionError – If more folds are requested than there are observations (fold_count exceeds row_count).
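One way to produce a fold index array of the shape described above is sketched below. This is an assumption-laden illustration rather than the module's code: the name cross_validation_folds_idx_sketch and the seed parameter are hypothetical, and both the shuffling of indices and the dropping of leftover rows (when row_count is not divisible by fold_count) are guesses made so that every fold has the same size.

```python
import numpy as np


def cross_validation_folds_idx_sketch(row_count, fold_count, seed=None):
    """Hypothetical stand-in for preprocessing.cross_validation_folds_idx."""
    assert fold_count <= row_count, "more folds requested than observations"
    # Assumption: equal-sized folds; leftover rows are left out.
    fold_size = row_count // fold_count
    rng = np.random.default_rng(seed)
    shuffled_rows = rng.permutation(row_count)
    used_rows = shuffled_rows[: fold_count * fold_size]
    # Each row of the result holds the indices belonging to one fold.
    return used_rows.reshape(fold_count, fold_size)


folds = cross_validation_folds_idx_sketch(10, 3, seed=0)
```

With 10 rows and 3 folds, this sketch yields a 3 by 3 array and leaves one row index unused.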