Clustering Algorithms

Base Clustering Class

This module provides a base class for clustering algorithms, such as k-means. Any preprocessing (if applicable) is done at this class level.

class clustering.Clustering(features, standardized=True)[source]

A class used to represent a clustering algorithm.

Parameters
  • features (numpy.ndarray) – Design matrix of explanatory variables.

  • standardized (bool) – Whether to center/scale the data. True by default.

sample_size

The sample size of all given data (train and test).

Type

int

dimension

The number of dimensions of the data, i.e., the number of columns of the design matrix. Does not include the output.

Type

int

standardize()[source]

Separately scale/center the train and test data so each feature (column of observations) has 0 mean and unit variance.
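A minimal sketch of this standardization, written as a standalone NumPy function for illustration; the actual method presumably operates on the class's stored train and test arrays separately:

```python
import numpy as np

def standardize(features):
    """Center each column to mean 0 and scale it to unit variance."""
    # Hypothetical standalone version of Clustering.standardize();
    # column-wise mean and standard deviation broadcast over rows.
    return (features - features.mean(axis=0)) / features.std(axis=0)

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
Z = standardize(X)
# each column of Z now has mean 0 and unit variance
```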

k-Means Clustering Algorithm

class kmean.KCluster(features, standardized=True, k=3, threshold=0.01)[source]

A class to represent a k-means clustering model.

k

How many clusters to use in the algorithm.

Type

int

final_clusters_idx

The index of the cluster each observation belongs to.

Type

numpy.ndarray

final_centers

A self.k by self.dimension matrix storing the positions of the cluster centers.

Type

numpy.ndarray

number_iterations

How many steps the algorithm took to converge.

Type

int

threshold

The error below which the algorithm is considered to have converged.

Type

float

feature_means

The predicted center position associated with each observation.

Type

numpy.ndarray

static compute_distance_matrix(features, centers)[source]

Compute the distance from each of the n row vectors of a data matrix to each of the k center vectors.

Parameters
  • features (numpy.ndarray) – n by d array, where d is the dimension of dataset, n is the sample size

  • centers (numpy.ndarray) – k by d array, where k is the number of clusters and d is the dimension

Returns

distance_matrix – An n by k array. The (i, j) entry is the distance from the i-th row of features to the j-th row of centers, using the Euclidean (l_2) norm in d dimensions.

Return type

numpy.ndarray

Notes

An efficient NumPy procedure that uses broadcasting to compute all pairwise differences between two collections of data points is given in [4]. We use it as an alternative to a manual nested for-loop.

References

[4]

https://sparrow.dev/pairwise-distance-in-numpy/
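The broadcasting trick from [4] can be sketched as a standalone function (the class's static method presumably works the same way):

```python
import numpy as np

def compute_distance_matrix(features, centers):
    """Pairwise Euclidean distances between rows of two matrices."""
    # Broadcast (n, 1, d) - (1, k, d) -> (n, k, d) pairwise differences,
    # then take the l_2 norm along the last axis to get an (n, k) matrix.
    diffs = features[:, np.newaxis, :] - centers[np.newaxis, :, :]
    return np.linalg.norm(diffs, axis=-1)

X = np.array([[0.0, 0.0], [3.0, 4.0]])
C = np.array([[0.0, 0.0], [0.0, 4.0]])
D = compute_distance_matrix(X, C)
# D[1, 0] is the distance from [3, 4] to [0, 0], i.e. 5.0
```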

fit()[source]

Fit k-means to a dataset up to a predetermined threshold of convergence.
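A minimal sketch of the fitting loop, assuming random-row initialization and convergence once the centers move by less than `threshold` between iterations; the actual class may initialize and track state differently:

```python
import numpy as np

def kmeans_fit(features, k=3, threshold=0.01, seed=None):
    """Hypothetical standalone k-means loop mirroring KCluster.fit()."""
    rng = np.random.default_rng(seed)
    # Initialize centers as k distinct rows of the data (an assumption).
    centers = features[rng.choice(len(features), size=k, replace=False)]
    number_iterations = 0
    while True:
        number_iterations += 1
        # Assignment step: index of the nearest center for each row.
        dist = np.linalg.norm(
            features[:, None, :] - centers[None, :, :], axis=-1
        )
        idx = dist.argmin(axis=1)
        # Update step: mean of each cluster (empty clusters keep
        # their previous center -- an assumption).
        new_centers = np.array([
            features[idx == j].mean(axis=0) if np.any(idx == j) else centers[j]
            for j in range(k)
        ])
        # Stop when the total center movement falls below the threshold.
        if np.linalg.norm(new_centers - centers) < threshold:
            return new_centers, idx, number_iterations
        centers = new_centers
```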

static update_clusters(features, centers)[source]

Given a current set of k centers, compute which cluster each row vector of the data matrix is closest to.

Parameters
  • features (numpy.ndarray) – n by d array, where d is the dimension of dataset, n is the sample size

  • centers (numpy.ndarray) – k by d array, where k is the number of centers

Returns

closest_center_idx – A vector of length n whose i-th entry is the integer p minimizing the distance from the i-th row of features to the p-th center, using the d-dimensional Euclidean (l_2) norm.

Return type

numpy.ndarray
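A sketch of this assignment step, combining the pairwise distance matrix with an `argmin` over centers (a standalone version consistent with the description above):

```python
import numpy as np

def update_clusters(features, centers):
    """Index of the nearest center for each row of `features`."""
    # (n, k) matrix of pairwise distances via broadcasting, then the
    # column index of the minimum in each row.
    dist = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
    return dist.argmin(axis=1)

X = np.array([[0.0, 0.0], [9.0, 9.0]])
C = np.array([[1.0, 1.0], [10.0, 10.0]])
idx = update_clusters(X, C)
# row 0 is nearest to center 0, row 1 to center 1
```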

static update_means(features, closest_center_idx, k)[source]

Compute the updated cluster means given the current cluster assignments.

Parameters
  • features (numpy.ndarray) – n by d array, where d is the dimension of dataset, n is the sample size

  • closest_center_idx (numpy.ndarray) – n by 1 array

  • k (int) – The number of centers.

Returns

cluster_means – A k by d array. The i-th row is the d-dimensional element-wise mean of all rows of features assigned to cluster i.

Return type

numpy.ndarray
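A sketch of this mean-update step. Leaving an empty cluster's mean at zero is an assumption made here for simplicity; the actual implementation may handle empty clusters differently:

```python
import numpy as np

def update_means(features, closest_center_idx, k):
    """Element-wise mean of the rows assigned to each of the k clusters."""
    d = features.shape[1]
    cluster_means = np.zeros((k, d))
    for i in range(k):
        members = features[closest_center_idx == i]
        if len(members) > 0:  # empty-cluster handling is an assumption
            cluster_means[i] = members.mean(axis=0)
    return cluster_means

X = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]])
idx = np.array([0, 0, 1])
M = update_means(X, idx, 2)
# M[0] is the mean of the first two rows, M[1] the third row
```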