Write a K-means clustering algorithm from scratch.

Hard
5 months ago

Can you write a K-means clustering algorithm from scratch, including initialization, assignment, and update steps? Elaborate on how you would choose 'K' and handle potential issues like empty clusters or convergence.

Sample Answer

K-Means Clustering Algorithm from Scratch

Here's an implementation of the K-means clustering algorithm from scratch in Python. It includes the initialization, assignment, and update steps, along with notes on how to choose 'K' and how to handle potential issues.

Code Implementation

import numpy as np

class KMeans:
    def __init__(self, k=3, max_iters=100, random_state=None):
        self.k = k
        self.max_iters = max_iters
        self.random_state = random_state
        self.centroids = None

    def fit(self, data):
        if self.random_state is not None:
            np.random.seed(self.random_state)
        
        # 1. Initialize centroids by picking k distinct data points at random.
        #    Cast to float so the means computed later are not truncated if the
        #    input data happens to be integer-typed.
        self.centroids = data[np.random.choice(data.shape[0], self.k, replace=False)].astype(float)
        
        for _ in range(self.max_iters):
            # 2. Assign data points to the nearest centroid
            clusters = self._assign_clusters(data)
            
            # 3. Update centroids
            new_centroids = self._update_centroids(data, clusters)
            
            # Check for convergence
            if np.allclose(self.centroids, new_centroids):
                break
            
            self.centroids = new_centroids

    def _assign_clusters(self, data):
        clusters = [[] for _ in range(self.k)]
        for i, point in enumerate(data):
            distances = [np.linalg.norm(point - centroid) for centroid in self.centroids]
            cluster_idx = np.argmin(distances)
            clusters[cluster_idx].append(i)
        return clusters

    def _update_centroids(self, data, clusters):
        new_centroids = np.zeros_like(self.centroids)
        for idx, cluster in enumerate(clusters):
            if cluster:
                new_centroids[idx] = np.mean(data[cluster], axis=0)
            else:
                # Handle empty cluster by reinitializing the centroid
                new_centroids[idx] = data[np.random.choice(data.shape[0])]
        return new_centroids

    def predict(self, data):
        predictions = []
        for point in data:
            distances = [np.linalg.norm(point - centroid) for centroid in self.centroids]
            cluster_idx = np.argmin(distances)
            predictions.append(cluster_idx)
        return np.array(predictions)

# Example Usage
data = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
kmeans = KMeans(k=2, max_iters=100, random_state=42)
kmeans.fit(data)
predictions = kmeans.predict(data)
print("Centroids:", kmeans.centroids)
print("Predictions:", predictions)

Explanation

  1. Initialization:

    • The __init__ method initializes the K-means class with the number of clusters k, the maximum number of iterations max_iters, and a random_state for reproducibility.
    • The centroids are initialized to None.
  2. Fit Method:

    • The fit method takes the data as input and performs the K-means clustering.
    • It initializes the centroids randomly by selecting k data points from the input data.
    • It then iterates at most max_iters times, performing the assignment and update steps and stopping early once the centroids stop changing.
  3. Assignment Step:

    • The _assign_clusters method assigns each data point to the nearest centroid.
    • It calculates the Euclidean distance between each data point and each centroid.
    • It assigns the data point to the cluster with the nearest centroid.
  4. Update Step:

    • The _update_centroids method updates the centroids by calculating the mean of the data points in each cluster.
    • If a cluster is empty, the centroid is reinitialized randomly.
  5. Convergence Check:

    • The algorithm checks for convergence by comparing the old centroids to the new centroids with np.allclose.
    • If the centroids have stopped changing (within a small numerical tolerance), the algorithm has converged and the iterations stop early.
  6. Prediction:

    • The predict method takes new data and predicts the cluster for each data point based on the trained centroids.

Choosing 'K'

  • Elbow Method: Plot the within-cluster sum of squares (WCSS) for different values of K and choose the K at the "elbow" point, where the WCSS stops decreasing rapidly (a short sketch using the class above follows this list).
  • Silhouette Score: Calculate the silhouette score for different values of K. Choose the K with the highest silhouette score, indicating better-defined clusters.
  • Domain Knowledge: Use prior knowledge about the data to select an appropriate value for K.
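
As a rough illustration of the elbow method, the sketch below reuses the KMeans class defined above. The helper compute_wcss is not part of that class; it is introduced here only for this example, and the toy data set is the same one used earlier.

import numpy as np

def compute_wcss(model, data):
    # Within-cluster sum of squares: squared distance from each point
    # to the centroid of the cluster it is assigned to.
    labels = model.predict(data)
    return sum(np.sum((data[labels == idx] - centroid) ** 2)
               for idx, centroid in enumerate(model.centroids))

data = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
for k in range(1, 6):
    model = KMeans(k=k, max_iters=100, random_state=42)
    model.fit(data)
    print(f"K={k}, WCSS={compute_wcss(model, data):.2f}")

# Plotting K against WCSS (e.g., with matplotlib) reveals the "elbow" where the
# curve flattens. The silhouette score (for example, via
# sklearn.metrics.silhouette_score) can be evaluated in the same loop for k >= 2.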

Handling Potential Issues

  • Empty Clusters:
    • Reinitialize the centroid of the empty cluster to a random data point.
    • Alternatively, move the empty cluster's centroid to a data point taken from the cluster with the highest sum of squared errors (SSE), effectively splitting that high-error cluster in two.
  • Convergence:
    • Set a maximum number of iterations to prevent the algorithm from running indefinitely.
    • Check for convergence by comparing the old centroids to the new centroids.
  • Sensitivity to Initial Centroids:
    • Run the algorithm multiple times with different initial centroids and choose the best result (e.g., the one with the lowest WCSS).
    • Use K-means++ initialization, which spreads the initial centroids out by choosing each new centroid with probability proportional to its squared distance from the nearest centroid already chosen (a short sketch follows this list).
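
Below is a minimal sketch of K-means++ seeding, under the assumption that its result would replace the purely random choice assigned to self.centroids in fit; the helper kmeans_pp_init is hypothetical and not part of the class above.

import numpy as np

def kmeans_pp_init(data, k, seed=None):
    # K-means++ seeding: pick the first centroid uniformly at random, then
    # pick each subsequent centroid with probability proportional to its
    # squared distance from the nearest centroid chosen so far.
    rng = np.random.default_rng(seed)
    centroids = [data[rng.integers(data.shape[0])]]
    for _ in range(1, k):
        # Squared distance from every point to its closest existing centroid.
        dist_sq = np.min([np.sum((data - c) ** 2, axis=1) for c in centroids], axis=0)
        if dist_sq.sum() == 0:
            # Degenerate case: all points coincide with chosen centroids.
            probs = np.full(data.shape[0], 1.0 / data.shape[0])
        else:
            probs = dist_sq / dist_sq.sum()
        centroids.append(data[rng.choice(data.shape[0], p=probs)])
    return np.array(centroids, dtype=float)

data = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
print(kmeans_pp_init(data, k=2, seed=42))

For multiple restarts, the same KMeans class can simply be fit several times with different random_state values, keeping the run with the lowest WCSS (for example, using the compute_wcss helper sketched earlier).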

Big-O Run-Time Analysis

  • Initialization: O(k * d) to select and copy the k initial centroids, each of dimensionality d.
  • Assignment Step: O(n * k * d), where n is the number of data points, k is the number of clusters, and d is the dimensionality. Each of the n points is compared against all k centroids, and each distance computation touches d dimensions.
  • Update Step: O(n * d) in total. Each data point belongs to exactly one cluster, so accumulating the cluster means touches every point once across all d dimensions.
  • Overall: O(i * n * k * d), where i is the number of iterations; the per-iteration cost is dominated by the assignment step. In practice i is capped by max_iters. Without such a cap, Lloyd's algorithm can require a very large number of iterations to converge in the worst case, so bounding i matters.

Big-O Space Usage Analysis

  • Data Storage: O(n * d) to store the input data, where n is the number of data points and d is the dimensionality of the data.
  • Centroids Storage: O(k * d) to store the centroids, where k is the number of clusters and d is the dimensionality of the data.
  • Cluster Assignment: O(n) to store the cluster assignments for each data point. In the _assign_clusters method, the list of clusters stores at most n elements because each element can be in at most 1 cluster.
  • Overall: O(n * d + k * d + n). Since k ≤ n and d ≥ 1, the n * d term dominates, so the space complexity simplifies to O(n * d).

Edge Cases and Handling

  1. All Data Points are the Same:

    • If all data points are identical, every centroid collapses onto that single point, most clusters stay empty, and the result is degenerate. Detect duplicate data up front, or add a small amount of random noise, if distinct clusters are required.
  2. k = 1:

    • If k is 1, all points fall into a single cluster and the centroid converges to the mean of the data; the implementation above handles this without any special-casing.
  3. k = n (Number of Data Points):

    • Each data point becomes its own cluster. The centroids are the data points themselves.
  4. Non-numeric Data:

    • The Euclidean distance calculation requires numeric data. Handle non-numeric data by encoding it into a numerical representation (e.g., one-hot encoding for categorical features) before applying K-means.
  5. Large Datasets:

    • For very large datasets, consider using mini-batch K-means, which updates the centroids using small random batches of data points, reducing the computational cost per iteration (a brief sketch follows this list).
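
As a rough sketch of the mini-batch idea (not part of the original implementation above), the standalone function below samples a batch each iteration, assigns the batch points to their nearest centroids, and nudges each centroid toward those points with a per-centroid step size that shrinks as more points are assigned to it.

import numpy as np

def minibatch_kmeans(data, k, batch_size=32, n_iters=100, random_state=None):
    rng = np.random.default_rng(random_state)
    # Initialize centroids from k distinct data points.
    centroids = data[rng.choice(data.shape[0], k, replace=False)].astype(float)
    counts = np.zeros(k)
    for _ in range(n_iters):
        # Sample a mini-batch of points (without replacement within the batch).
        batch = data[rng.choice(data.shape[0], min(batch_size, data.shape[0]), replace=False)]
        # Nearest centroid for each batch point.
        dists = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for point, label in zip(batch, labels):
            counts[label] += 1
            lr = 1.0 / counts[label]                 # per-centroid learning rate
            centroids[label] += lr * (point - centroids[label])
    return centroids

data = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
print(minibatch_kmeans(data, k=2, batch_size=4, random_state=42))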

This comprehensive implementation and explanation should provide a solid foundation for understanding and implementing the K-means clustering algorithm from scratch.