Mastering Cluster Validation with Silhouette Scores and Visualization in Python

Lesson 1

Introduction

Welcome! In today's lesson, we'll delve into cluster validation. We will interpret and implement the Silhouette Score and learn how to visualize clusters for validation in Python. Together, these give us a practical way to judge how good a clustering result actually is.

Understanding Cluster Validation and Decoding the Silhouette Score

Cluster validation, a key step in Cluster Analysis, involves evaluating the quality of the outcomes of the clustering process. Proper validation helps avoid common issues such as overfitting or misjudging the optimal number of clusters.

One metric that plays a crucial role in cluster validation is the Silhouette Score. This measure quantifies the quality of clustering, providing an indication of how well each data point resides within its cluster. The Silhouette Score $s(i)$ for a sample $i$ is formulated as:

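$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$

Here, $a(i)$ is the mean distance from sample $i$ to the other points in its own cluster (the average intra-cluster distance), and $b(i)$ is the mean distance from $i$ to the points of the nearest cluster it does not belong to (the mean nearest-cluster distance).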
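To make the formula concrete, here is a minimal worked example on made-up one-dimensional data. The two toy clusters and the variable names are assumptions chosen so the arithmetic is easy to follow by hand; they are not part of the lesson's dataset.

```python
import numpy as np

# Hypothetical toy data: two small 1-D clusters.
cluster_a = np.array([[0.0], [1.0], [2.0]])
cluster_b = np.array([[10.0], [11.0]])

point = cluster_a[0]  # the sample i whose score we compute

# a(i): mean distance to the *other* points in the same cluster -> (1 + 2) / 2
a_i = np.mean([np.linalg.norm(point - other) for other in cluster_a[1:]])

# b(i): mean distance to the points of the nearest other cluster -> (10 + 11) / 2
# (there is only one other cluster here, so it is the nearest by default)
b_i = np.mean([np.linalg.norm(point - other) for other in cluster_b])

# s(i) = (b(i) - a(i)) / max(a(i), b(i)) -> 9 / 10.5
s_i = (b_i - a_i) / max(a_i, b_i)
print(f"a(i) = {a_i}, b(i) = {b_i}, s(i) = {s_i:.3f}")
```

Because the point is much closer to its own cluster than to the other one, the score comes out well above 0.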

Interpreting the Silhouette Score

Knowing how to interpret the Silhouette Score is essential. It ranges from -1 to 1, and its value can be read as follows:

Score close to 1: The item is well-matched to its own cluster and poorly matched to neighboring clusters. This indicates strong clustering.
Score close to 0: The item lies on or very close to the boundary between two neighboring clusters, so it does not clearly belong to either one and its assignment is uncertain.
Score close to -1: The item is mismatched to its own cluster and better matched to a neighboring one. It has likely been assigned to the wrong cluster, since it is closer on average to the neighboring cluster than to its own.

Ideally, every point would have a Silhouette Score of 1, but in practice that is almost never achieved.
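The three regimes above follow directly from how a(i) and b(i) compare. The sketch below plugs three illustrative (a, b) pairs into the formula; the helper name silhouette_from_distances and the distance values are hypothetical, picked only to show each case.

```python
def silhouette_from_distances(a, b):
    """Return (b - a) / max(a, b), or 0.0 when both distances are 0."""
    denom = max(a, b)
    return (b - a) / denom if denom > 0 else 0.0

# a much smaller than b: the point sits comfortably inside its own cluster.
well_clustered = silhouette_from_distances(a=1.0, b=10.0)

# a equal to b: the point is equally far from both clusters (on the boundary).
on_boundary = silhouette_from_distances(a=5.0, b=5.0)

# a much larger than b: the point is closer to the neighboring cluster.
misassigned = silhouette_from_distances(a=10.0, b=1.0)

print(well_clustered, on_boundary, misassigned)
```

Note that when a(i) < b(i) the score simplifies to 1 - a(i)/b(i), and when a(i) > b(i) it becomes b(i)/a(i) - 1, which is why the sign tells you which cluster the point is closer to on average.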

Python Implementation of the Silhouette Score and Visualization of Clusters for Validation

First, the function euclidean_distance(a, b) calculates the Euclidean distance between two points a and b.

Python
import numpy as np

def euclidean_distance(a, b):
    # Calculate the Euclidean distance between points a and b.
    return np.sqrt(np.sum((np.array(a) - np.array(b)) ** 2))

The function calculate_a(point, cluster) calculates a(i) for a point, that is, its average distance to the other points in its own cluster:

Python
import numpy as np

def calculate_a(point, cluster):
    # Calculate the average distance from 'point' to other points in the same cluster.
    if len(cluster) <= 1:
        return 0
    # Entries identical to 'point' are skipped; their distance is 0, so the sum is unchanged.
    distances = [euclidean_distance(point, other) for other in cluster if not np.array_equal(point, other)]
    return sum(distances) / (len(cluster) - 1)

The function calculate_b(point, clusters) calculates b(i) for a point, that is, the lowest average distance from the point to any cluster it does not belong to:

Python
def calculate_b(point, clusters):
    # Calculate the lowest average distance from 'point' to points in other clusters.
    min_average_distance = float('inf')
    for cluster in clusters:
        # Skip the point's own cluster, detected by finding an identical point in it.
        # This assumes identical points are never split across different clusters.
        if any(np.array_equal(point, other) for other in cluster):
            continue
        distances = [euclidean_distance(point, other) for other in cluster]
        average_distance = sum(distances) / len(cluster)
        if average_distance < min_average_distance:
            min_average_distance = average_distance
    return min_average_distance

Finally, custom_silhouette_score(points, labels) computes the silhouette score for each data point and returns the average over all points.

Python
from collections import defaultdict

def custom_silhouette_score(points, labels):
    # Group points by cluster label.
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    # Convert clusters to a list for easier access.
    cluster_list = list(clusters.values())

    # Calculate silhouette score for each point.
    scores = []
    for point, label in zip(points, labels):
        if len(clusters[label]) == 1:
            # A point alone in its cluster scores 0 by convention (scikit-learn does the same).
            scores.append(0)
            continue
        a = calculate_a(point, clusters[label])
        b = calculate_b(point, cluster_list)
        score = (b - a) / max(a, b) if max(a, b) > 0 else 0
        scores.append(score)

    # Return the average silhouette score.
    return sum(scores) / len(scores)
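The loop-based functions above recompute many distances, which gets slow as the dataset grows. As an alternative sketch (not the lesson's implementation), the same per-point scores can be derived from a single pairwise distance matrix in NumPy. The function name silhouette_values_vectorized is hypothetical, and the code assumes at least two clusters and that a point alone in its cluster scores 0, the convention scikit-learn follows.

```python
import numpy as np

def silhouette_values_vectorized(points, labels):
    """Per-point silhouette scores from a full pairwise distance matrix (illustrative helper)."""
    X = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    n = len(X)

    # Pairwise Euclidean distances, shape (n, n). Needs O(n^2) memory.
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))

    a = np.zeros(n)
    b = np.full(n, np.inf)
    singleton = np.zeros(n, dtype=bool)
    for label in np.unique(labels):
        members = labels == label
        size = members.sum()
        # Sum of distances from every point to the members of this cluster.
        sums = dist[:, members].sum(axis=1)
        if size > 1:
            # Own cluster: the self-distance is 0, so divide by size - 1.
            a[members] = sums[members] / (size - 1)
        else:
            singleton |= members
        # Other clusters: keep the smallest mean distance seen so far.
        b[~members] = np.minimum(b[~members], sums[~members] / size)

    scores = np.zeros(n)
    denom = np.maximum(a, b)
    valid = ~singleton & np.isfinite(denom) & (denom > 0)
    scores[valid] = (b[valid] - a[valid]) / denom[valid]
    return scores

# Toy usage on made-up 1-D data: two well-separated clusters.
points = [[0.0], [1.0], [2.0], [10.0], [11.0]]
labels = [0, 0, 0, 1, 1]
scores = silhouette_values_vectorized(points, labels)
print(scores, scores.mean())
```

The trade-off is memory: the distance matrix has n × n entries, so this approach suits small and medium datasets rather than very large ones.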

Practical Examples

Now, let's apply our functions to the Iris dataset. We'll calculate the Silhouette Score for a KMeans clustering model. First, let's run the clustering and visualize the clusters:

Python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn import datasets

X = datasets.load_iris().data

# Fit the KMeans model
kmeans_model = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)

# Plot the clusters
plt.scatter(X[:, 0], X[:, 1], c=kmeans_model.labels_)
plt.show()

The plot shows the clusters formed by the KMeans model, using the first two features (sepal length and sepal width) as axes and coloring each point by its cluster label. Note that your plot might differ slightly due to randomness and library versions.
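As a small extension of the plot above, it can help to overlay the cluster centers and label the axes, since the scatter only shows two of the four Iris features. The following is a sketch under a few assumptions: the colormap, marker styling, and output file name iris_kmeans_clusters.png are illustrative choices, and the figure is saved to a file so the snippet also runs without a display (use plt.show() in an interactive session instead).

```python
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
X = iris.data
kmeans_model = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)

fig, ax = plt.subplots()

# Data points, colored by assigned cluster (first two features only).
ax.scatter(X[:, 0], X[:, 1], c=kmeans_model.labels_, cmap="viridis", s=20)

# Cluster centers live in all four dimensions; project them onto the same two features.
centers = kmeans_model.cluster_centers_
ax.scatter(centers[:, 0], centers[:, 1], c="red", marker="X", s=150, label="Cluster centers")

ax.set_xlabel(iris.feature_names[0])
ax.set_ylabel(iris.feature_names[1])
ax.set_title("KMeans clusters on Iris (k=3)")
ax.legend()

# Hypothetical output file name; replace with plt.show() when working interactively.
fig.savefig("iris_kmeans_clusters.png")
```

Clusters that look overlapping in this view may still be separated along the two petal features that are not plotted, which is one reason a numeric measure such as the Silhouette Score is a useful complement to the picture.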

Calculating the Silhouette Score with the Custom Implementation

Now, let's calculate the Silhouette Score using our custom implementation:

Python
# Calculate and print the average silhouette score.
average_score = custom_silhouette_score(X, kmeans_model.labels_)
print(f"Silhouette Score (Custom): {average_score}") # ~0.55

Silhouette Score Calculation Using sklearn

Now, let's explore how we can calculate the Silhouette score using the Scikit-learn library, commonly known as sklearn.

To compute the Silhouette score in sklearn, use the silhouette_score function from the sklearn.metrics module. It takes the data points and their predicted cluster labels, plus an optional metric argument that sets how distances are calculated (Euclidean by default). Here's how to use it:

Python
from sklearn.metrics import silhouette_score

# Calculate Silhouette Score using sklearn
score = silhouette_score(X, kmeans_model.labels_, metric='euclidean')

# Print the score
print("Silhouette score (sklearn): ", score) # ~0.55

Here, the Euclidean metric is used to measure the distance between points. You can replace 'euclidean' with other supported metrics like 'manhattan', 'cosine', etc., based on your needs.
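To see the effect of the metric argument, the sketch below scores the same KMeans labels on Iris under three distance metrics. The dictionary name scores_by_metric is just an illustrative choice; no specific output values are claimed here, since they depend on the library version.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = datasets.load_iris().data
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)

# Score the same partition under different notions of distance.
scores_by_metric = {
    metric: silhouette_score(X, labels, metric=metric)
    for metric in ("euclidean", "manhattan", "cosine")
}

for metric, score in scores_by_metric.items():
    print(f"{metric:>10}: {score:.3f}")
```

Keep in mind that KMeans itself minimizes squared Euclidean distances, so a non-Euclidean Silhouette score evaluates the same clusters from a different point of view rather than the objective the model optimized.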

Use the code above as a template for computing the Silhouette score in your own clustering tasks. Because silhouette_score only needs the data and the cluster labels, it works with the output of any clustering algorithm, not just KMeans.
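One common use of this template is guarding against a misjudged number of clusters: fit the model for several candidate values of k and compare the resulting scores. The following is a minimal sketch of that idea on Iris; the candidate range 2 to 6 and the names scores_by_k and best_k are assumptions made for illustration.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = datasets.load_iris().data

# Fit KMeans for each candidate k and record the average silhouette score.
scores_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    scores_by_k[k] = silhouette_score(X, labels)

# The candidate with the highest average score.
best_k = max(scores_by_k, key=scores_by_k.get)

for k, score in scores_by_k.items():
    print(f"k={k}: {score:.3f}")
print(f"Highest silhouette score at k={best_k}")
```

The highest score does not necessarily match the number of true classes in the data, so treat it as one signal alongside visual inspection and domain knowledge.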

Lesson Summary and Practice

Great job! We've successfully covered the theory of cluster validation, the mathematics and practical application of the Silhouette score, and delved into visualizing clusters. Now, prepare for some practical exercises to solidify your understanding and boost your confidence. Happy learning!
