Can you write a K-means clustering algorithm from scratch, including initialization, assignment, and update steps? Elaborate on how you would choose 'K' and handle potential issues like empty clusters or convergence.
Here's an implementation of the K-means clustering algorithm from scratch in Python. It includes initialization, assignment, and update steps, along with explanations on how to choose 'K' and handle potential issues.
```python
import numpy as np

class KMeans:
    def __init__(self, k=3, max_iters=100, random_state=None):
        self.k = k
        self.max_iters = max_iters
        self.random_state = random_state
        self.centroids = None

    def fit(self, data):
        if self.random_state is not None:
            np.random.seed(self.random_state)
        # 1. Initialize centroids randomly (k distinct data points)
        self.centroids = data[np.random.choice(data.shape[0], self.k, replace=False)]
        for _ in range(self.max_iters):
            # 2. Assign data points to the nearest centroid
            clusters = self._assign_clusters(data)
            # 3. Update centroids
            new_centroids = self._update_centroids(data, clusters)
            # Check for convergence
            if np.allclose(self.centroids, new_centroids):
                break
            self.centroids = new_centroids

    def _assign_clusters(self, data):
        clusters = [[] for _ in range(self.k)]
        for i, point in enumerate(data):
            distances = [np.linalg.norm(point - centroid) for centroid in self.centroids]
            cluster_idx = np.argmin(distances)
            clusters[cluster_idx].append(i)
        return clusters

    def _update_centroids(self, data, clusters):
        new_centroids = np.zeros_like(self.centroids)
        for idx, cluster in enumerate(clusters):
            if cluster:
                new_centroids[idx] = np.mean(data[cluster], axis=0)
            else:
                # Handle empty cluster by reinitializing the centroid
                new_centroids[idx] = data[np.random.choice(data.shape[0])]
        return new_centroids

    def predict(self, data):
        predictions = []
        for point in data:
            distances = [np.linalg.norm(point - centroid) for centroid in self.centroids]
            cluster_idx = np.argmin(distances)
            predictions.append(cluster_idx)
        return np.array(predictions)

# Example usage
data = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
kmeans = KMeans(k=2, max_iters=100, random_state=42)
kmeans.fit(data)
predictions = kmeans.predict(data)
print("Centroids:", kmeans.centroids)
print("Predictions:", predictions)
```
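A common improvement to the purely random initialization in `fit` is k-means++ seeding, which spreads the initial centroids out and tends to reduce the chance of poor local minima. Here is a sketch of that idea as a standalone function (the function name and `rng` parameter are illustrative, not part of the class above):

```python
import numpy as np

def kmeans_pp_init(data, k, rng=None):
    # k-means++ seeding: pick the first centroid uniformly at random, then
    # pick each subsequent centroid with probability proportional to the
    # squared distance from each point to its nearest chosen centroid.
    rng = np.random.default_rng(rng)
    centroids = [data[rng.integers(data.shape[0])]]
    for _ in range(k - 1):
        # Squared distance of every point to its closest centroid so far
        d2 = np.min([np.sum((data - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()
        centroids.append(data[rng.choice(data.shape[0], p=probs)])
    return np.array(centroids)
```

The result could be assigned to `self.centroids` in place of the `np.random.choice` line in `fit`; the rest of the algorithm is unchanged.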
Initialization:
The `__init__` method initializes the KMeans class with the number of clusters `k`, the maximum number of iterations `max_iters`, and a `random_state` for reproducibility. `centroids` is initialized to `None`.

Fit Method:
The `fit` method takes the data as input and performs the K-means clustering. It initializes the centroids by randomly selecting `k` data points from the input data, then loops up to `max_iters` times, performing the assignment and update steps.

Assignment Step:
The `_assign_clusters` method assigns each data point to the nearest centroid, measured by Euclidean distance.

Update Step:
The `_update_centroids` method updates the centroids by calculating the mean of the data points in each cluster. If a cluster is empty, its centroid is reinitialized to a randomly chosen data point.

Convergence Check:
After each update, the new centroids are compared with the previous ones using `np.allclose`; if they are essentially unchanged, the algorithm has converged and the loop exits early.
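The per-point loop in `_assign_clusters` can also be written as a single vectorized NumPy computation. A sketch of that alternative (the function name is illustrative; note it returns one label per point rather than lists of indices):

```python
import numpy as np

def assign_clusters_vectorized(data, centroids):
    # Broadcasting: (n, 1, d) - (1, k, d) -> (n, k, d), then norm over axis 2
    # gives an (n, k) matrix of point-to-centroid distances.
    distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    # Index of the nearest centroid for each point, shape (n,)
    return np.argmin(distances, axis=1)
```

For large `n` this avoids the Python-level loop entirely and is typically much faster.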
Prediction:
The `predict` method takes new data and predicts the cluster for each data point based on the trained centroids.

Time Complexity:
The overall time complexity is O(i * n * k * d), where i is the number of iterations, n the number of data points, k the number of clusters, and d the number of dimensions: each iteration compares every point against every centroid, and the algorithm runs until the centroids converge or `max_iters` is reached. The multiplication by d comes from the need to calculate distances in d-dimensional space. In practice, i is limited by `max_iters`. However, it's worth noting that this is the average time complexity. In the worst case, the runtime can be much higher if the algorithm doesn't converge quickly.

Space Complexity:
The data occupies O(n * d) and the centroids O(k * d). In the `_assign_clusters` method, the list of `clusters` stores at most n elements, because each element can be in at most one cluster. If n * d is significantly larger than k * d and n, then the space complexity simplifies to O(n * d).

All Data Points are the Same:
Every initial centroid has the same value, so all points fall into the first cluster while the remaining clusters stay empty and get reinitialized to that same point; the algorithm converges immediately with k identical centroids.
k = 1:
Every point is assigned to the single cluster, and the centroid converges to the mean of the entire dataset in one iteration.

k = n (Number of Data Points):
Each point becomes its own centroid (the initialization selects n distinct points without replacement), which gives zero within-cluster variance but no useful grouping.

Non-numeric Data:
The algorithm relies on Euclidean distances and means, so categorical features must be encoded numerically first, or a variant such as k-modes used instead.

Large Datasets:
The pure-Python assignment loop becomes slow at scale; vectorizing the distance computation with NumPy, or switching to mini-batch K-means, helps the algorithm scale.
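On choosing `k`: a common heuristic is the elbow method — fit the model for a range of k values, compute the total within-cluster sum of squares (often called inertia), and pick the k where the improvement levels off. A sketch of the inertia computation (this helper is hypothetical, not part of the `KMeans` class above):

```python
import numpy as np

def inertia(data, centroids, labels):
    # Total within-cluster sum of squared distances to the assigned
    # centroid (hypothetical helper, not part of the KMeans class above).
    return sum(np.sum((data[labels == j] - c) ** 2)
               for j, c in enumerate(centroids))

# Usage idea with the KMeans class above:
#   for k in range(1, 11):
#       km = KMeans(k=k, random_state=42)
#       km.fit(data)
#       print(k, inertia(data, km.centroids, km.predict(data)))
# Plot inertia against k and pick the k at the "elbow" of the curve.
```

Silhouette scores are a common alternative when the elbow is ambiguous.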
This comprehensive implementation and explanation should provide a solid foundation for understanding and implementing the K-means clustering algorithm from scratch.