Clustering Algorithms¶
Base Clustering Class¶
This module provides a base class for clustering algorithms, such as k-means. Preprocessing (if applicable) is done at this class level.
- class clustering.Clustering(features, standardized=True)[source]¶
A class used to represent a clustering algorithm.
- Parameters
features (numpy.ndarray) – Design matrix of explanatory variables.
standardized (bool) – Whether to center/scale the data. True by default.
- sample_size¶
The sample size of all given data (train and test).
- Type
int
- dimension¶
The number of dimensions of the data, i.e. the number of columns of the design matrix. Does not include the output variable.
- Type
int
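A minimal sketch of what this base class might look like. Only the documented parameter and attribute names are taken from above; the standardization details are assumptions:

```python
import numpy as np

class Clustering:
    """Sketch of the base clustering class (internal details assumed)."""

    def __init__(self, features, standardized=True):
        features = np.asarray(features, dtype=float)
        if standardized:
            # Center each column and scale by its standard deviation.
            features = (features - features.mean(axis=0)) / features.std(axis=0)
        self.features = features
        # sample_size is n (rows); dimension is d (columns, no output column).
        self.sample_size, self.dimension = features.shape
```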
k-Means Clustering Algorithm¶
- class kmean.KCluster(features, standardized=True, k=3, threshold=0.01)[source]¶
A class to represent a k-means clustering model.
- k¶
The number of clusters to use in the algorithm.
- Type
int
- final_clusters_idx¶
Stores the index of the cluster each observation belongs to.
- Type
numpy.ndarray
- final_centers¶
A self.k by self.dimension matrix storing the positions of the cluster centers.
- Type
numpy.ndarray
- number_iterations¶
The number of iterations the algorithm took to converge.
- Type
int
- threshold¶
How small the error must be for the algorithm to be considered converged.
- Type
float
- feature_means¶
Stores the predicted center position that each observation is clustered around.
- Type
numpy.ndarray
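Taken together, the attributes above describe the usual k-means loop: assign each observation to its nearest center, recompute centers, and stop once the centers barely move. A standalone sketch of that loop (initializing from the first k observations and stopping on center movement are illustrative assumptions, not the documented behavior):

```python
import numpy as np

def kmeans(features, k=3, threshold=0.01):
    """Minimal k-means sketch mirroring the documented attributes."""
    # Simplistic deterministic initialization: the first k observations.
    centers = features[:k].copy()
    number_iterations = 0
    while True:
        number_iterations += 1
        # Assign each observation to its nearest center (l_2 distance).
        diffs = features[:, np.newaxis, :] - centers[np.newaxis, :, :]
        idx = np.sqrt((diffs ** 2).sum(axis=-1)).argmin(axis=1)
        # Recompute each center as the mean of its assigned observations.
        new_centers = np.array([features[idx == i].mean(axis=0) for i in range(k)])
        # Stop once the centers move less than the threshold.
        if np.linalg.norm(new_centers - centers) < threshold:
            return new_centers, idx, number_iterations
        centers = new_centers
```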
- static compute_distance_matrix(features, centers)[source]¶
Compute the distance from each of the n row vectors of a matrix to each of the k vectors representing centers.
- Parameters
features (numpy.ndarray) – n by d array, where d is the dimension of dataset, n is the sample size
centers (numpy.ndarray) – k by d array, where k is the number of clusters and d is the dimension
- Returns
distance_matrix – An n by k array. The (i, j) entry is the distance from the i-th row of features to the j-th row of centers, using the d-dimensional Euclidean (l_2) norm.
- Return type
numpy.ndarray
Notes
An efficient NumPy procedure (using its broadcasting functionality) to compute all pairwise differences between two collections of data points is given in [4]. We use this as an alternative to a manual nested for-loop.
References
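The broadcasting approach described in the Notes can be sketched as follows (a hedged reconstruction, not necessarily the module's actual implementation):

```python
import numpy as np

def compute_distance_matrix(features, centers):
    # (n, 1, d) - (1, k, d) broadcasts to an (n, k, d) array of all
    # pairwise differences, with no explicit nested for-loop.
    diffs = features[:, np.newaxis, :] - centers[np.newaxis, :, :]
    # The l_2 Euclidean norm over the last axis yields the (n, k) matrix.
    return np.sqrt((diffs ** 2).sum(axis=-1))
```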
- static update_clusters(features, centers)[source]¶
Given a current set of k centers, compute which cluster each row vector of the data matrix is closest to.
- Parameters
features (numpy.ndarray) – n by d array, where d is the dimension of dataset, n is the sample size
centers (numpy.ndarray) – k by d array, where k is the number of centers
- Returns
closest_center_idx – A vector of length n whose i-th entry is the index p of the center minimizing the d-dimensional Euclidean (l_2) distance from the i-th row of features to the p-th center.
- Return type
numpy.ndarray
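A possible implementation of this step, reusing the broadcast distance computation and taking an argmin over centers (a sketch, not the module's actual code):

```python
import numpy as np

def update_clusters(features, centers):
    # Pairwise differences via broadcasting: (n, 1, d) - (1, k, d).
    diffs = features[:, np.newaxis, :] - centers[np.newaxis, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=-1))
    # Index of the nearest center for each of the n observations.
    return distances.argmin(axis=1)
```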
- static update_means(features, closest_center_idx, k)[source]¶
Compute the updated cluster means given the current cluster assignments.
- Parameters
features (numpy.ndarray) – n by d array, where d is the dimension of dataset, n is the sample size
closest_center_idx (numpy.ndarray) – Vector of length n giving the cluster index of each observation
k (int) – The number of centers.
- Returns
cluster_means – A k by d array. The i-th row is the d-dimensional element-wise mean of all row vectors of features in cluster i.
- Return type
numpy.ndarray
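This update can be sketched as below; the empty-cluster guard is an assumption, since the documented behavior in that case is not specified:

```python
import numpy as np

def update_means(features, closest_center_idx, k):
    d = features.shape[1]
    cluster_means = np.zeros((k, d))
    for i in range(k):
        # Element-wise mean of all observations assigned to cluster i.
        members = features[closest_center_idx == i]
        if members.size:  # guard against an empty cluster (assumption)
            cluster_means[i] = members.mean(axis=0)
    return cluster_means
```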