Clustering Algorithms

Base Clustering Class

This module provides a base class for clustering algorithms, such as k-means. Any preprocessing (if applicable) is done at this class level.

class clustering.Clustering(features, standardized=True)[source]

A class used to represent a clustering algorithm.

Parameters
  • features (numpy.ndarray) – Design matrix of explanatory variables.

  • standardized (bool) – Whether to center/scale the data. True by default.

sample_size

The sample size of all given data (train and test).

Type

int

dimension

The number of dimensions of the data, i.e., the number of columns of the design matrix. Does not include the output.

Type

int

standardize()[source]

Separately scale/center the train and test data so each feature (column of observations) has 0 mean and unit variance.
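A minimal sketch of this standardization, written as a standalone NumPy function for illustration; the actual method presumably operates on the class's stored train and test arrays separately:

```python
import numpy as np

def standardize(features):
    """Center each column to mean 0 and scale it to unit variance."""
    # Hypothetical standalone version of Clustering.standardize();
    # column-wise mean and standard deviation broadcast over rows.
    return (features - features.mean(axis=0)) / features.std(axis=0)

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
Z = standardize(X)
# each column of Z now has mean 0 and unit variance
```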

k-Means Clustering Algorithm

class kmean.KCluster(features, standardized=True, k=3, threshold=0.01)[source]

A class to represent a k-means clustering model.

k

How many clusters to use in the algorithm.

Type

int

final_clusters_idx

The index of the cluster each observation belongs to.

Type

numpy.ndarray

final_centers

A self.k by self.dimension matrix storing the positions of the cluster centers.

Type

numpy.ndarray

number_iterations

How many steps the algorithm took to converge.

Type

int

threshold

The error below which the algorithm is considered to have converged.

Type

float

feature_means

The predicted center position associated with each observation.

Type

numpy.ndarray

static compute_distance_matrix(features, centers)[source]

Compute the distance from each of the n row vectors of a data matrix to each of the k center vectors.

Parameters
  • features (numpy.ndarray) – n by d array, where d is the dimension of dataset, n is the sample size

  • centers (numpy.ndarray) – k by d array, where k is the number of clusters and d is the dimension

Returns

distance_matrix – An n by k array. The (i, j) entry is the distance from the i-th row of features to the j-th row of centers, using the Euclidean (l_2) norm in d dimensions.

Return type

numpy.ndarray

Notes

An efficient NumPy procedure that uses broadcasting to compute all pairwise differences between two collections of data points is given in [4]. We use it as an alternative to a manual nested for-loop.

References

[4]

https://sparrow.dev/pairwise-distance-in-numpy/
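The broadcasting trick from [4] can be sketched as a standalone function (the class's static method presumably works the same way):

```python
import numpy as np

def compute_distance_matrix(features, centers):
    """Pairwise Euclidean distances between rows of two matrices."""
    # Broadcast (n, 1, d) - (1, k, d) -> (n, k, d) pairwise differences,
    # then take the l_2 norm along the last axis to get an (n, k) matrix.
    diffs = features[:, np.newaxis, :] - centers[np.newaxis, :, :]
    return np.linalg.norm(diffs, axis=-1)

X = np.array([[0.0, 0.0], [3.0, 4.0]])
C = np.array([[0.0, 0.0], [0.0, 4.0]])
D = compute_distance_matrix(X, C)
# D[1, 0] is the distance from [3, 4] to [0, 0], i.e. 5.0
```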

fit()[source]

Fit k-means to a dataset up to a predetermined threshold of convergence.
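A minimal sketch of the fitting loop, assuming random-row initialization and convergence once the centers move by less than `threshold` between iterations; the actual class may initialize and track state differently:

```python
import numpy as np

def kmeans_fit(features, k=3, threshold=0.01, seed=None):
    """Hypothetical standalone k-means loop mirroring KCluster.fit()."""
    rng = np.random.default_rng(seed)
    # Initialize centers as k distinct rows of the data (an assumption).
    centers = features[rng.choice(len(features), size=k, replace=False)]
    number_iterations = 0
    while True:
        number_iterations += 1
        # Assignment step: index of the nearest center for each row.
        dist = np.linalg.norm(
            features[:, None, :] - centers[None, :, :], axis=-1
        )
        idx = dist.argmin(axis=1)
        # Update step: mean of each cluster (empty clusters keep
        # their previous center -- an assumption).
        new_centers = np.array([
            features[idx == j].mean(axis=0) if np.any(idx == j) else centers[j]
            for j in range(k)
        ])
        # Stop when the total center movement falls below the threshold.
        if np.linalg.norm(new_centers - centers) < threshold:
            return new_centers, idx, number_iterations
        centers = new_centers
```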

static update_clusters(features, centers)[source]

Given a current set of k centers, compute which cluster each row vector of the data matrix is closest to.

Parameters
  • features (numpy.ndarray) – n by d array, where d is the dimension of dataset, n is the sample size

  • centers (numpy.ndarray) – k by d array, where k is the number of centers

Returns

closest_center_idx – A vector of length n whose i-th entry is the integer p minimizing the distance from the i-th row of features to the p-th center, using the d-dimensional Euclidean (l_2) norm.

Return type

numpy.ndarray
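A sketch of this assignment step, combining the pairwise distance matrix with an `argmin` over centers (a standalone version consistent with the description above):

```python
import numpy as np

def update_clusters(features, centers):
    """Index of the nearest center for each row of `features`."""
    # (n, k) matrix of pairwise distances via broadcasting, then the
    # column index of the minimum in each row.
    dist = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
    return dist.argmin(axis=1)

X = np.array([[0.0, 0.0], [9.0, 9.0]])
C = np.array([[1.0, 1.0], [10.0, 10.0]])
idx = update_clusters(X, C)
# row 0 is nearest to center 0, row 1 to center 1
```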

static update_means(features, closest_center_idx, k)[source]

Compute the updated cluster means given the current cluster assignments.

Parameters
  • features (numpy.ndarray) – n by d array, where d is the dimension of dataset, n is the sample size

  • closest_center_idx (numpy.ndarray) – n by 1 array

  • k (int) – The number of centers.

Returns

cluster_means – A k by d array. The i-th row is the d-dimensional element-wise mean of all rows of features assigned to cluster i.

Return type

numpy.ndarray
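A sketch of this mean-update step. Leaving an empty cluster's mean at zero is an assumption made here for simplicity; the actual implementation may handle empty clusters differently:

```python
import numpy as np

def update_means(features, closest_center_idx, k):
    """Element-wise mean of the rows assigned to each of the k clusters."""
    d = features.shape[1]
    cluster_means = np.zeros((k, d))
    for i in range(k):
        members = features[closest_center_idx == i]
        if len(members) > 0:  # empty-cluster handling is an assumption
            cluster_means[i] = members.mean(axis=0)
    return cluster_means

X = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]])
idx = np.array([0, 0, 1])
M = update_means(X, idx, 2)
# M[0] is the mean of the first two rows, M[1] the third row
```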