Lesson 3
Mastering K-means Clustering and the Rand Index with Python
Introduction and Overview

Welcome back! In this lesson, we'll build a more in-depth understanding of the K-means clustering algorithm using a straightforward 2D dataset. We'll explore its implementation and evaluate its performance with a well-known measure of clustering accuracy: the Rand Index.

Understanding the Rand Index

The Rand Index is an external cluster validation measure that quantifies the similarity between two clusterings. It considers every pair of samples and counts the pairs that both clusterings treat the same way: either placed in the same cluster in both the predicted and true clusterings, or placed in different clusters in both.

$$RI = \frac{TP + TN}{TP + FP + FN + TN}$$

Where:

  • TP (True Positive) is the number of data pairs that are in the same group for both true and predicted labels.
  • FP (False Positive) is the number of data pairs that are in the same group for predicted labels but not the true labels.
  • FN (False Negative) is the number of data pairs that are in the same group for the true labels but not in the predicted labels.
  • TN (True Negative) is the number of data pairs that are in different groups for both true and predicted labels.

The Rand Index ranges from 0 (the two clusterings disagree on every pair of samples) to 1 (the clusterings are identical). As mentioned earlier, the Rand Index can be overly optimistic: even randomly assigned labels can earn a fairly high score. Despite this, it remains a valuable tool for providing an objective evaluation of our K-means algorithm's performance.
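
To make the pair counting concrete, here is a minimal sketch in plain Python. The helper names `pair_counts` and `rand_index` and the two small label lists are illustrative assumptions, not part of the lesson's code, and the sketch assumes at least two samples so that the denominator is non-zero.

```python
from itertools import combinations


def pair_counts(true_labels, pred_labels):
    """Count TP, FP, FN, TN over all unordered pairs of samples."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1  # together in both labelings
        elif same_pred:
            fp += 1  # together only in the predicted labels
        elif same_true:
            fn += 1  # together only in the true labels
        else:
            tn += 1  # apart in both labelings
    return tp, fp, fn, tn


def rand_index(true_labels, pred_labels):
    """Rand Index = (TP + TN) / (TP + FP + FN + TN)."""
    tp, fp, fn, tn = pair_counts(true_labels, pred_labels)
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical toy labels: 6 samples, so 15 pairs in total
true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 2, 2]

print(pair_counts(true_labels, pred_labels))  # (2, 1, 4, 8)
print(rand_index(true_labels, pred_labels))   # 10 / 15, about 0.667
```

Looping over every pair costs O(n²) time, which is fine for a toy dataset; library implementations typically work from a contingency table instead.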

Rand Index vs Adjusted Rand Score

Now, let's discuss an important distinction: the difference between the Rand Index and the Adjusted Rand Score. While the Rand Index gives an absolute measure of the similarity between two clusterings, it doesn't take into account the agreement that would occur by chance. In other words, the Rand Index may yield a high value even when the labels were assigned at random, which is certainly not how we want to evaluate the performance of our algorithm.

The Adjusted Rand Score corrects the Rand Index by taking into account the expected similarity of two random clusterings. The Adjusted Rand Score is given by:

$$ARI = \frac{RI - Expected\_RI}{Max\_RI - Expected\_RI}$$

Where:

  • RI is the Rand Index of the two clusterings being compared.
  • Expected_RI is the expected RI on a set of random clusters.
  • Max_RI is the maximum possible value of the RI, which is 1.

A high Adjusted Rand Score shows that the clustering is not due to randomness, but due to a meaningful grouping in the dataset. The Adjusted Rand Score, therefore, provides a more robust measure for comparing different clustering algorithms.
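
The correction can be traced by hand on a small example. The sketch below is a plain-Python illustration, not sklearn's implementation: the function name `rand_and_adjusted_rand` and the toy labels are assumptions made for this example. It takes Max_RI = 1 and computes Expected_RI under the usual assumption that labels are shuffled at random while cluster sizes stay fixed. It does not handle the degenerate case where Expected_RI equals 1 (for instance, when both labelings put every sample in a single cluster).

```python
from collections import Counter
from math import comb


def rand_and_adjusted_rand(true_labels, pred_labels):
    """Return (RI, ARI) computed from pair counts."""
    total_pairs = comb(len(true_labels), 2)

    # Pairs grouped together in the true labels (TP + FN)
    together_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    # Pairs grouped together in the predicted labels (TP + FP)
    together_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    # Pairs grouped together in both labelings (TP)
    together_both = sum(
        comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values()
    )

    # RI = (TP + TN) / all pairs, with TN = all - (TP + FN) - (TP + FP) + TP
    ri = (total_pairs + 2 * together_both - together_true - together_pred) / total_pairs

    # Expected TP if labels were shuffled at random with cluster sizes fixed
    expected_both = together_true * together_pred / total_pairs
    expected_ri = (
        total_pairs + 2 * expected_both - together_true - together_pred
    ) / total_pairs

    max_ri = 1.0
    ari = (ri - expected_ri) / (max_ri - expected_ri)
    return ri, ari


# Hypothetical labels: the predicted grouping never pairs samples the way the truth does
true_labels = [0, 0, 1, 1, 2, 2, 3, 3]
pred_labels = [0, 1, 2, 3, 0, 1, 2, 3]

ri, ari = rand_and_adjusted_rand(true_labels, pred_labels)
print(round(ri, 3), round(ari, 3))  # 0.714 -0.167
```

Here the two labelings share no co-clustered pair, yet the Rand Index is still about 0.71 because most pairs are separated in both. The adjusted score is slightly negative, signalling agreement below what chance alone would give.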

While both metrics serve the purpose of comparing two clusterings, always remember:

  • Rand Index may give a high score due to chance groupings.
  • Adjusted Rand Score accounts for the chance groupings, providing a score that truly reflects the similarity between the two clusterings.

Evaluating K-means with the Adjusted Rand Score using sklearn

Now that we have learned about the Rand Index and the Adjusted Rand Score, it's useful to get familiar with a library that provides this functionality. sklearn, short for scikit-learn, is a free machine learning library for Python. It features algorithms such as support vector machines, random forests, and k-nearest neighbors, and it interoperates with Python's numerical and scientific libraries NumPy and SciPy.

In sklearn, the function adjusted_rand_score computes the Adjusted Rand Score of a clustering result. Let's modify our code to use this function:

Python
from sklearn import metrics

# Calculate the Adjusted Rand Score using sklearn's function
ari = metrics.adjusted_rand_score(true_labels, labels)

print("Adjusted Rand Score using sklearn function: ", ari)

In the above snippet, we import metrics from sklearn and use adjusted_rand_score to compute the Adjusted Rand Score. The inputs to the function are the true labels and the labels predicted by K-means. The function returns a floating-point number representing the Adjusted Rand Score of the predicted clusters.

As with the Rand Index, a higher Adjusted Rand Score means the predicted clusters agree more closely with the true labels: a score of 1 indicates identical groupings, while a score near 0 indicates agreement no better than chance.
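
One practical detail worth checking is that the score compares groupings, not label values, so the cluster IDs K-means happens to assign do not need to match the IDs in the true labels. The short sketch below illustrates this with hypothetical label lists chosen for this example; it assumes scikit-learn is installed.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels: the same three groups, but with the cluster IDs renamed
true_labels = [0, 0, 1, 1, 2, 2]
renamed_labels = [2, 2, 0, 0, 1, 1]

# A prediction that only partly agrees with the true grouping
partial_labels = [0, 0, 1, 2, 2, 2]

score_renamed = adjusted_rand_score(true_labels, renamed_labels)
score_partial = adjusted_rand_score(true_labels, partial_labels)
score_swapped = adjusted_rand_score(partial_labels, true_labels)

print("Renamed clusters:", score_renamed)     # identical groupings score 1.0
print("Partial agreement:", score_partial)    # lower than 1.0
print("Arguments swapped:", score_swapped)    # same as the partial score
```

The last line also shows that the score is symmetric: swapping the two arguments does not change the result.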

Full Implementation: K-means and Evaluating with the Adjusted Rand Score using sklearn

With all the pieces at hand, let's put everything together. We'll perform K-means clustering on our toy dataset using sklearn's KMeans class, then evaluate the results using the Adjusted Rand Score from sklearn. Here's how to do that:

First, we initialize the data and perform clustering:

Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

np.random.seed(42)

# Define a 2D dataset and true labels for assessment
features = np.array([[1, 1], [1, 2], [2, 1], [2, 2],
                     [5, 5], [5, 6], [6, 5], [6, 6],
                     [9, 9], [9, 10], [10, 9], [10, 10],
                     [10, 2], [10, 3], [11, 2], [11, 3],
                     [4, 8], [4, 9], [5, 8], [5, 9],
                     [3, 5], [3, 6], [3, 5], [3, 6]])
true_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
                        0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])

# Define the K-means algorithm and fit the model to the data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(features)

Now let's calculate the Adjusted Rand Score using sklearn:

Python
# Obtain the cluster label assigned to each sample
labels = kmeans.labels_

# Calculate the Adjusted Rand Score using sklearn
ari = adjusted_rand_score(true_labels, labels)

# Output the resulting cluster labels, centroids and Adjusted Rand Score
print("Cluster labels: ", labels)
print("Centroids: ", kmeans.cluster_centers_)  # Prints [[3.0 4.0] [10.5 2.5] [6.8 8.6]]. Might be different depending on the lib versions.
print("Adjusted Rand Score: ", ari)  # Prints ~ 0.12. Might be different depending on the lib versions.

# Visualizing the clusters
plt.scatter(features[:, 0], features[:, 1], c=labels)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red')
plt.title("Clusters with Centroids (Red)")
plt.show()

Here, we have brought together our prior discussions on implementing K-means clustering, applying the Adjusted Rand Score, and visualizing the results. Sklearn's KMeans class reduces the K-means process to three steps: defining the model, fitting it to the data, and evaluating the result. Even with such a streamlined workflow, it remains important to understand the underlying concepts and to know which library each function comes from.

Lesson Summary and Practice

Exploring the K-means algorithm and the proper use of the Rand Index and Adjusted Rand Score has given us significant insights into evaluating clustering in machine learning. The next phase involves practical applications that will cement these crucial concepts. Like the K-means algorithm itself, your understanding will improve through multiple iterations. Happy practicing!
