Mastering K-means Clustering and the Rand Index with Python

Lesson 3

Introduction and Overview

Welcome back! In this lesson, we're seeking a more in-depth understanding of the K-means clustering algorithm by using a straightforward 2D dataset. We'll explore its implementation and evaluate its performance using a well-known measure of clustering accuracy: the Rand Index.

Understanding the Rand Index

Before evaluating anything, let's look at the Rand Index, an external cluster validation measure that quantifies the similarity between two clusterings of the same data. The Rand Index considers every pair of samples and counts the pairs on which the predicted and true clusterings agree: pairs placed in the same cluster by both, or in different clusters by both.

$RI = \frac{TP + TN} {TP+FP+FN + TN}$

Where:

$TP$ (True Positive) is the number of data pairs that are in the same group for both true and predicted labels.
$FP$ (False Positive) is the number of data pairs that are in the same group for predicted labels but not the true labels.
$FN$ (False Negative) is the number of data pairs that are in the same group for the true labels but not in the predicted labels.
$TN$ (True Negative) is the number of data pairs that are in different groups for both true and predicted labels.
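To make the four pair counts concrete, here is a minimal sketch that tallies them by brute force over all pairs and plugs them into the formula above. The helper names `pair_counts` and `rand_index`, and the toy labels, are illustrative assumptions rather than part of the lesson's code.

```python
from itertools import combinations

def pair_counts(true_labels, pred_labels):
    """Count TP, FP, FN, TN over all unordered pairs of samples."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1  # together in both labelings
        elif same_pred:
            fp += 1  # together only in the predicted labels
        elif same_true:
            fn += 1  # together only in the true labels
        else:
            tn += 1  # apart in both labelings
    return tp, fp, fn, tn

def rand_index(true_labels, pred_labels):
    """Rand Index: fraction of pairs on which the two labelings agree."""
    tp, fp, fn, tn = pair_counts(true_labels, pred_labels)
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical toy labels: the second true group is split in the prediction
true_toy = [0, 0, 1, 1]
pred_toy = [0, 0, 1, 2]

print(pair_counts(true_toy, pred_toy))  # (1, 0, 1, 4)
print(rand_index(true_toy, pred_toy))   # 5/6, about 0.833
```

The double loop over pairs is quadratic in the number of samples, which is fine for a toy dataset but slow for large ones.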

The Rand Index value will be between 0 (the two clusterings disagree on every pair) and 1 (the clusterings are identical). As mentioned earlier, the Rand Index can sometimes be overly optimistic: even randomly assigned labels can earn a fairly high score, because many pairs land in different clusters under both labelings purely by chance and are counted as agreements. Despite this, it remains a valuable tool for providing an objective evaluation of our K-means algorithm's performance.

Rand Index vs Adjusted Rand Score

Now, let's discuss an important distinction: the difference between the Rand Index and the Adjusted Rand Score. While the Rand Index gives an absolute measure of the similarity between two clusterings, it doesn't take into account the agreement that occurs by chance. In other words, the Rand Index may yield a high value even when the labels were assigned at random, which is certainly not how we want to evaluate the performance of our algorithm.
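The effect of chance agreement is easy to observe. The sketch below scores two labelings that are drawn independently at random, so any agreement between them is accidental. It assumes scikit-learn 0.24 or newer (for `rand_score`); the sample size, the three-cluster setup, and the variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics import rand_score, adjusted_rand_score

# Two unrelated labelings: each sample gets one of 3 labels at random
rng = np.random.default_rng(0)
n_samples = 1000
random_true = rng.integers(0, 3, size=n_samples)
random_pred = rng.integers(0, 3, size=n_samples)

ri_random = rand_score(random_true, random_pred)
ari_random = adjusted_rand_score(random_true, random_pred)

print("Rand Index on random labels:         ", ri_random)
print("Adjusted Rand Score on random labels:", ari_random)
```

With three roughly equal-sized clusters, about 5/9 of all pairs agree by chance, so the Rand Index stays well above 0.5 while the Adjusted Rand Score stays close to 0.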

The Adjusted Rand Score corrects the Rand Index by subtracting the similarity we would expect between two random clusterings and rescaling the result. The Adjusted Rand Score is given by:

$ARI = \frac {RI - Expected\_RI} {Max\_RI - Expected\_RI}$

Where:

$RI$ is the Rand Index of the two clusterings being compared.
$Expected\_RI$ is the expected value of the RI when the labels are assigned at random.
$Max\_RI$ is the maximum possible value of the RI, which is 1.
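As a rough check of the formula above, the sketch below computes $Expected\_RI$ from the cluster sizes and compares the result with scikit-learn's `adjusted_rand_score`. It assumes the usual permutation model for chance (cluster sizes held fixed, members shuffled) and scikit-learn 0.24 or newer for `rand_score`. The helpers `pairs` and `ari_from_ri` and the toy labels are hypothetical names introduced only for this illustration.

```python
import numpy as np
from sklearn.metrics import rand_score, adjusted_rand_score

def pairs(k):
    """Number of unordered pairs among k items."""
    return k * (k - 1) / 2

def ari_from_ri(true_labels, pred_labels):
    """ARI = (RI - Expected_RI) / (Max_RI - Expected_RI), with Max_RI = 1."""
    n_pairs = pairs(len(true_labels))
    # Pairs grouped together by each labeling, derived from cluster sizes
    same_true = sum(pairs(c) for c in np.unique(true_labels, return_counts=True)[1])
    same_pred = sum(pairs(c) for c in np.unique(pred_labels, return_counts=True)[1])
    # Expected agreement: together in both, or apart in both, by chance
    expected_ri = (same_true * same_pred
                   + (n_pairs - same_true) * (n_pairs - same_pred)) / n_pairs**2
    max_ri = 1.0
    ri = rand_score(true_labels, pred_labels)
    # Note: undefined (division by zero) if both labelings are a single cluster
    return (ri - expected_ri) / (max_ri - expected_ri)

# Hypothetical toy labels
toy_true = [0, 0, 0, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 2, 2]

print("ARI from the formula:", ari_from_ri(toy_true, toy_pred))
print("ARI from sklearn:    ", adjusted_rand_score(toy_true, toy_pred))
```

The two printed values should match up to floating-point error.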

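A high Adjusted Rand Score shows that the agreement is not due to randomness, but due to a meaningful grouping in the dataset. A score of 1 means a perfect match, a score near 0 means the agreement is about what random labeling would produce, and negative values are possible when the agreement is worse than chance. The Adjusted Rand Score, therefore, provides a more robust measure for comparing different clustering algorithms.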

While both metrics serve the purpose of comparing two clusterings, always remember:

Rand Index may give a high score due to chance groupings.
Adjusted Rand Score accounts for the chance groupings, providing a score that truly reflects the similarity between the two clusterings.

Evaluating K-means with the Adjusted Rand Score using sklearn

Now that we have learned about the Rand Index and the Adjusted Rand Score, it's worth getting familiar with a library that provides this functionality out of the box. sklearn, short for Scikit-learn, is a free machine learning library for Python. It features various algorithms like support vector machines, random forests, and k-nearest neighbors, and it is built to work with Python numerical and scientific libraries like NumPy and SciPy.

In sklearn, the function adjusted_rand_score computes the Adjusted Rand Score of a clustering result. Let's modify our code to use this function:

Python
from sklearn import metrics

# Calculate the Adjusted Rand Score using sklearn's function
# (true_labels and labels are the true and predicted cluster labels)
ari = metrics.adjusted_rand_score(true_labels, labels)

print("Adjusted Rand Score using sklearn function: ", ari)

In the above snippet, we import metrics from sklearn and use adjusted_rand_score to compute the Adjusted Rand Score. The inputs to the function are the true labels and the labels predicted by K-means. The function returns a floating-point number representing the Adjusted Rand Score of the predicted clusters.

Just like with the Rand Index, a higher Adjusted Rand Score means that the clusters found by our K-means algorithm match the true labels more closely.
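One property worth knowing when reading the score: K-means numbers its clusters arbitrarily, and `adjusted_rand_score` only looks at which samples are grouped together, not at the label values themselves. A minimal, self-contained sketch with made-up labels:

```python
from sklearn.metrics import adjusted_rand_score

reference = [0, 0, 1, 1]
renamed = [1, 1, 0, 0]     # same grouping, cluster IDs swapped
partly_off = [0, 0, 1, 2]  # second group split into two clusters

score_renamed = adjusted_rand_score(reference, renamed)
score_partly_off = adjusted_rand_score(reference, partly_off)

print(score_renamed)     # 1.0
print(score_partly_off)  # 4/7, about 0.57
```

This is why there is no need to match predicted cluster IDs to true class IDs before calling the function.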

Full Implementation: K-means and Evaluating with the Adjusted Rand Score using sklearn

With all the pieces at hand, let's put everything together. We'll perform K-means clustering on our toy dataset using sklearn's KMeans class, then evaluate the results using the Adjusted Rand Score from sklearn. Here's how to do that:

First, we initialize the data and perform clustering:

Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
np.random.seed(42)

# Define a 2D dataset and true labels for assessment
features = np.array([[1, 1], [1, 2], [2, 1], [2, 2],
                     [5, 5], [5, 6], [6, 5], [6, 6],
                     [9, 9], [9, 10], [10, 9], [10, 10],
                     [10, 2], [10, 3], [11, 2], [11, 3],
                     [4, 8], [4, 9], [5, 8], [5, 9],
                     [3, 5], [3, 6], [3, 5], [3, 6]])
true_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])

# Define the K-means algorithm and fit the model to the data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(features)

Now let's calculate the Adjusted Rand Score using sklearn:

Python
# Obtain the cluster label assigned to each point
labels = kmeans.labels_

# Calculate the Adjusted Rand Score using sklearn
ari = adjusted_rand_score(true_labels, labels)

# Output the resulting cluster labels, centroids and Adjusted Rand Score
print("Cluster labels: ", labels)
print("Centroids: ", kmeans.cluster_centers_)  # Prints [[3.0 4.0] [10.5 2.5] [6.8 8.6]]; may differ depending on library versions
print("Adjusted Rand Score: ", ari)  # Prints ~0.12; may differ depending on library versions

# Visualize the clusters, with the centroids drawn in red
plt.scatter(features[:, 0], features[:, 1], c=labels)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red')
plt.title("Clusters with Centroids (Red)")
plt.show()

Here, we have brought together the pieces discussed so far: running K-means clustering, scoring the result with the Adjusted Rand Score, and visualizing the clusters with their centroids. Sklearn's KMeans class reduces the K-means workflow to three steps: define the model, fit it to the data, and evaluate the result. The convenience does not replace understanding, though: knowing what the Rand Index and the Adjusted Rand Score actually measure is what lets you interpret the numbers the library returns.
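To see both metrics side by side on the same clustering, the sketch below repeats the setup above and reports the plain Rand Index next to the Adjusted Rand Score. It assumes scikit-learn 0.24 or newer, where `rand_score` is available, and uses `fit_predict` as a shortcut for fitting and reading `labels_`. The exact values may vary with library versions, so none are quoted here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import rand_score, adjusted_rand_score

# Same toy dataset and true labels as in the lesson
features = np.array([[1, 1], [1, 2], [2, 1], [2, 2],
                     [5, 5], [5, 6], [6, 5], [6, 6],
                     [9, 9], [9, 10], [10, 9], [10, 10],
                     [10, 2], [10, 3], [11, 2], [11, 3],
                     [4, 8], [4, 9], [5, 8], [5, 9],
                     [3, 5], [3, 6], [3, 5], [3, 6]])
true_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
                        0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])

# Fit K-means and get one cluster label per point in a single call
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(features)

ri = rand_score(true_labels, labels)
ari = adjusted_rand_score(true_labels, labels)

print("Rand Index:         ", ri)
print("Adjusted Rand Score:", ari)
```

Because the adjustment subtracts the agreement expected by chance, the Adjusted Rand Score never exceeds the Rand Index for the same pair of labelings.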

Lesson Summary and Practice

In this lesson, we explored the K-means algorithm and saw how to use the Rand Index and the Adjusted Rand Score to evaluate its results. The next phase will involve practical applications that cement these concepts. Your understanding, like the K-means algorithm that we discussed, will improve through multiple iterations. Happy practicing!
