Data Preprocessing
This module provides data processing and preparation tools:
- Scaling/centering data
- Generating test/train splits
- Generating cross-validation folds (in progress)
- Generating bootstrap resamples (in progress)
Standardize Data
- preprocessing.scale_and_center(features)
Center and scale each individual column of the feature matrix to have 0 mean and unit variance.
- Parameters
features (numpy.ndarray) – Design matrix of explanatory variables.
- Returns
features – Design matrix of scaled and centered explanatory variables.
- Return type
numpy.ndarray
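The behavior described above can be sketched as follows. This is a minimal illustration, not the library's actual implementation; in particular, the use of the population standard deviation (ddof=0) is an assumption the docs do not confirm.

```python
import numpy as np

def scale_and_center(features):
    """Center each column to mean 0 and scale it to unit variance."""
    means = features.mean(axis=0)          # per-column means
    stds = features.std(axis=0)            # per-column standard deviations (ddof=0 assumed)
    return (features - means) / stds

# Usage: each column of the result has mean ~0 and standard deviation ~1.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
X_scaled = scale_and_center(X)
```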
Train/Test Split
- preprocessing.train_test_split(features, output, split_proportion)
Split the data into training and testing sets.
- Parameters
features (numpy.ndarray) – Design matrix of explanatory variables.
output (numpy.ndarray) – The given response variables.
split_proportion (float) – The proportion of data used for training. Must lie between 0 and 1; the default is 0.25.
- Returns
split_values – Stores the following values: ['sample_size', 'train_size', 'test_size', 'train_rows', 'test_rows', 'train_features', 'test_features', 'train_output', 'test_output']
- Return type
namedtuple
Notes
I used [2] to figure out how to randomly choose rows of an array.
References
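One way the split could work, given the documented return fields, is sketched below. The exact row-sampling and rounding behavior are assumptions; only the field names and the 0.25 default come from the docs.

```python
import numpy as np
from collections import namedtuple

SplitValues = namedtuple(
    "SplitValues",
    ["sample_size", "train_size", "test_size", "train_rows", "test_rows",
     "train_features", "test_features", "train_output", "test_output"],
)

def train_test_split(features, output, split_proportion=0.25):
    """Randomly split rows into training and testing sets."""
    assert 0 < split_proportion < 1
    sample_size = features.shape[0]
    train_size = int(round(sample_size * split_proportion))
    test_size = sample_size - train_size
    # Shuffle row indices, then take the first block for training.
    rows = np.random.permutation(sample_size)
    train_rows, test_rows = rows[:train_size], rows[train_size:]
    return SplitValues(
        sample_size, train_size, test_size, train_rows, test_rows,
        features[train_rows], features[test_rows],
        output[train_rows], output[test_rows],
    )

# Usage: with 20 rows and split_proportion=0.25, 5 rows go to training.
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.arange(20, dtype=float)
split = train_test_split(X, y, 0.25)
```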
Cross-Validation Folds
- preprocessing.cross_validation_folds_idx(row_count, fold_count)
Partition the (training) dataset into folds.
- Parameters
row_count (int) – Sample size of training data we form folds from.
fold_count (int) – The number of folds to produce; cannot exceed row_count.
- Returns
folds – Array in which each row stores the indices of one fold; the number of columns equals the fold size.
- Return type
numpy.ndarray
- Raises
AssertionError – If more folds are requested than there are observations.
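Since each fold occupies one row of a rectangular array, the folds must be of equal size. A sketch consistent with that shape is below; dropping leftover rows when row_count is not divisible by fold_count is an assumption, not documented behavior.

```python
import numpy as np

def cross_validation_folds_idx(row_count, fold_count):
    """Partition shuffled row indices into fold_count equal-size folds."""
    assert fold_count <= row_count, "more folds requested than observations"
    indices = np.random.permutation(row_count)
    fold_size = row_count // fold_count
    # Drop any remainder so every fold has the same size (assumed behavior),
    # then reshape so each row of the result holds one fold's indices.
    return indices[: fold_size * fold_count].reshape(fold_count, fold_size)

# Usage: 10 rows split into 5 folds of 2 indices each.
folds = cross_validation_folds_idx(10, 5)
```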