Mastering K-means Clustering: Selection of Clusters and Centroid Initialization

Lesson 4

Introduction

Greetings! Our journey into K-means clustering deepens as we explore two crucial elements: the selection of the number of clusters and the initialization of centroids. Our aim is to comprehend these aspects and put them into action using Python. Let's move forward!

Choosing Clusters and Initializing Centroids in K-means

The K in K-means signifies the number of clusters. Centroids, the centers of each cluster, are equally significant. Their initial placement in K-means is crucial. Poorly initialized centroids can lead to sub-optimal clustering — a reason why multiple runs with different initial placements are essential. This highlights the importance of choosing both the number of clusters and their initial centroids.

Revising K-means Algorithm

Sklearn's KMeans not only allows us to specify the number of clusters and maximum iterations but also provides an important parameter, init, where we can set the initial centroids to be used.

Let's import the KMeans class from scikit-learn library and see how we can initialize our centroids there.

Python
1from sklearn.cluster import KMeans
2import numpy as np
3from sklearn.datasets import load_iris
4import matplotlib.pyplot as plt
5
6data = load_iris().data
7
8# Initialize the number of clusters and centroids
9num_clusters = 3
10initial_centroids = data[np.random.choice(range(data.shape[0]), num_clusters, replace=False)]
11
12kmeans = KMeans(n_clusters=num_clusters, init=initial_centroids, n_init=1)

In the code above,n_clusters sets the total number of clusters, init is an optional parameter that accepts the initial centroid positions. By using n_init=1, we disable sklearn's built-in multiple runs with different centroid seeds because we want to use our manually initiated centroids.

After defining the KMeans object with our specified parameters, we fit the model to our Iris dataset. kmeans.fit(data) computes K-means clustering using our data and initial centroid positions:

Python
1kmeans.fit(data)
2
3labels = kmeans.labels_
4centroids = kmeans.cluster_centers_
5
6# Visualization
7plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='viridis')
8plt.scatter(centroids[:, 0], centroids[:, 1], c='red')
9plt.show()

Here, kmeans.labels_ gives us the labels of each point, and kmeans.cluster_centers_ provides the coordinates of cluster centers. Like before, we represent the data points in different clusters by colors, and the centroids are marked in red. See how easy it is to use sklearn's KMeans once we understand the underlying theory! Using Python libraries like this enhances our efficiency and saves time, especially when working on more complex projects.

Understand the Implications

As we've seen, varied initial centroids and different choices for the number of clusters can lead to different results. Sklearn's KMeans with init='k-means++' (default) can be quite helpful in preventing poor initialization leading to inferior clustering; it initializes the centroids to be (generally) distant from each other, leading to better results on average.

Selection of the Number of Clusters

Let's delve deeper into the topic of selecting the number of clusters. The decision of how many clusters to use can drastically influence the outcomes of your K-means clustering algorithm.

Consider two scenarios where we use our earlier introduced dataset but decide to go with two clusters (k=2) in one case and four clusters (k=3) in the other.

Let's run the algorithm for k=2:

Python
1from sklearn.cluster import KMeans
2import matplotlib.pyplot as plt
3import numpy as np
4np.random.seed(42)
5
6# We're using a simple 2D dataset for ease of comparing
7data = np.array([[3, 1], [5, 1], [2, 3], 
8                 [8, 2], [9, 3], [7, 1], 
9                 [15, 15], [13, 16], [14, 14]])
10
11# Initialize the number of clusters and centroids and fit KMeans model
12num_clusters = 2
13kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
14kmeans.fit(data)
15
16# Retrieve labels and centroids
17labels2 = kmeans.labels_
18centroids2 = kmeans.cluster_centers_
19print(labels2) # Prints [0 0 0 0 0 0 1 1 1], Might be different depending on the versions of libraries

In this case, the algorithm will group the first six into one cluster and the last three into another.

Now let's see how the clusters change when we run the algorithm for k=3:

Python
1from sklearn.cluster import KMeans
2import matplotlib.pyplot as plt
3import numpy as np
4np.random.seed(42)
5
6# We're using a simple 2D dataset for ease of comparing
7data = np.array([[3, 1], [5, 1], [2, 3], 
8                 [8, 2], [9, 3], [7, 1], 
9                 [15, 15], [13, 16], [14, 14]])
10
11# Initialize the number of clusters and centroids and fit KMeans model
12num_clusters = 3
13kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
14kmeans.fit(data)
15
16# Retrieve labels and centroids
17labels3 = kmeans.labels_
18centroids3 = kmeans.cluster_centers_
19print(labels3) # Prints [2 2 2 0 0 0 1 1 1], Might be different depending on the versions of libraries

In this case, the algorithm will group the first three points into one cluster, the next three into a second cluster, and the last three into a third cluster.

Now lets illustrate two clustering side by side:

Python
1# Plot both k=2 and k=3 on the same plot using subplots
2fig, axs = plt.subplots(1, 2, figsize=(10, 5))
3
4axs[0].scatter(data[:, 0], data[:, 1], c=labels2, cmap='viridis')
5axs[0].scatter(centroids2[:, 0], centroids2[:, 1], c='red')
6axs[0].set_title("K-means Clustering with k = 2")
7
8axs[1].scatter(data[:, 0], data[:, 1], c=labels3, cmap='viridis')
9axs[1].scatter(centroids3[:, 0], centroids3[:, 1], c='red')
10axs[1].set_title("K-means Clustering with k = 3")
11
12plt.show()

We will see the following plot:

These examples illustrate the significant role the number of clusters plays in forming the final clusters. We must carefully choose this number to accurately represent the underlying structure of our data. An incorrect number of clusters could lead to overfitting or underfitting, both of which could misrepresent your data.

Initial Centroid Initialization: Potential Pitfalls and Solutions

You may wonder, "Why can the initial centroid placement result in different clustering results?" Well, the K-means algorithm is an iterative procedure that minimizes the within-cluster sum of squares. However, it only guarantees finding a local minimum, not a global one. This implies that different starting positions can lead to distinct clustering outcomes.

To visualize this, imagine you're blindfolded in a hilly region where you're tasked to find the lowest point. By feeling the ground slope, you move downwards. But when there are many valleys (local minima), your starting position influences which valley (local minimum) you'll end up in - and not all valleys are equally deep. Initial centroids in K-means are akin to starting positions.

Fortunately, real-world applications rarely suffer from the infamous K-means' local minima issue. Plus, Python libraries such as scikit-learn go a long way in handling these concerns proficiently. In particular, scikit's KMeans function uses an intelligent initialization technique called "K-Means++" by default. This approach systematically finds a good set of initial centroids, reducing the likelihood of poor clustering due to unlucky centroid initialization. It's worth mentioning that creating an example that demonstrates sensitivity to initial centroid location is not straightforward due to KMeans' clever centroid initialization.

Nonetheless, it’s still good to be aware of the importance of initial centroids, as in more intricate clustering methods, centroid initialization may significantly impact results. We can illustrate this using a custom implementation of KMeans, since it's very basic and, therefore, it's very sensitive to the choice of initial centroids.

To do that, let's first prepare the data for our illustration:

Python
1import numpy as np
2import matplotlib.pyplot as plt
3import pandas as pd
4
5np.random.seed(0)
6
7# Data preparation
8x1 = np.random.normal(loc=5, scale=1, size=(100, 2))
9x2 = np.random.normal(loc=10, scale=2, size=(100, 2))
10x = np.concatenate([x1, x2])

Now, let's revisit our custom implementation from the beginning of this lesson:

Python
1# K-means clustering preparation
2k = 3
3
4def calc_distance(x1, x2):
5    return (sum((x1 - x2)**2))**0.5
6
7def find_closest_centroids(centroids, x):
8    assigned_centroid = []
9    for i in x:
10        distance = []
11        for j in centroids:
12            distance.append(calc_distance(i, j))
13        assigned_centroid.append(np.argmin(distance))
14    return assigned_centroid
15
16def calc_centroids(clusters, x):
17    new_centroids = []
18    new_df = pd.concat([pd.DataFrame(x), pd.DataFrame(clusters, columns=['cluster'])], axis=1)
19    for c in set(new_df['cluster']):
20        current_cluster = new_df[new_df['cluster'] == c][new_df.columns[:-1]]
21        cluster_mean = current_cluster.mean(axis=0)
22        new_centroids.append(cluster_mean)
23    return new_centroids
24
25# K-means clustering function
26def kmeans_clustering(x, initial_centroids):
27    centroids = initial_centroids
28    
29    for i in range(10):
30        get_centroids = find_closest_centroids(centroids, x)
31        centroids = calc_centroids(get_centroids, x)
32
33    return centroids, get_centroids

The above code is the K-Means Clustering implementation, which we used in the previous units with a small difference — we now pass initial centroids as a parameter to our kmeans_clustering function. Now, let's perform clustering with different initial centroids:

Python
1# Initial centroids for two different initializations
2np.random.seed(42)  # Keeping the seed constant for reproducibility
3initial_centroids1 = x[np.random.choice(range(x.shape[0]), size=k, replace=False), :]
4initial_centroids2 = x[np.random.choice(range(x.shape[0]), size=k, replace=False), :]
5
6# Applying K-means clustering with different initial centroids will affect the final clustering
7centroids1, get_centroids1 = kmeans_clustering(x, initial_centroids1)
8centroids2, get_centroids2 = kmeans_clustering(x, initial_centroids2)
9
10# Visualization of the final clustering for both sets of initial centroids
11fig, axs = plt.subplots(1, 2, figsize=(12, 6))
12
13axs[0].scatter(x[:,0], x[:,1], c=get_centroids1)
14axs[0].scatter(np.array(centroids1)[:,0], np.array(centroids1)[:,1], c='red', marker='X')
15axs[0].set_title("First Initialization")
16
17axs[1].scatter(x[:,0], x[:,1], c=get_centroids2)
18axs[1].scatter(np.array(centroids2)[:,0], np.array(centroids2)[:,1], c='red', marker='X')
19axs[1].set_title("Second Initialization")
20
21plt.show()

The visualization below illustrates the significance of initial centroids:

Lesson Summary and Practice

We have journeyed through the principles of choosing clusters and initializing centroids in K-means while traversing the Python universe. Without a doubt, your understanding has deepened. Armed with this new knowledge, you're ready to tackle the exciting exercises that lie ahead. Practice will help you consolidate and master what you've learned in this lesson. Onwards, to more learning!

Enjoy this lesson? Now it's time to practice with Cosmo!

Practice is how you turn knowledge into actual skills.