Welcome! In today's lesson, we'll delve into cluster validation. We will interpret and implement the Silhouette Score and learn how to visualize clusters for validation in Python. Together, these concepts give us a unified picture of how to judge the quality of a clustering.
Cluster validation, a key step in Cluster Analysis, involves evaluating the quality of the outcomes of the clustering process. Proper validation helps avoid common issues such as overfitting or misjudging the optimal number of clusters.
One metric that plays a crucial role in cluster validation is the Silhouette Score. This measure quantifies the quality of clustering, providing an indication of how well each data point resides within its cluster. The Silhouette Score for a sample $i$ is formulated as:

$$s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}$$

Here, $a(i)$ represents the average intra-cluster distance (the average distance from point $i$ to the other points in its own cluster), and $b(i)$ signifies the mean nearest-cluster distance (the lowest average distance from point $i$ to the points of any other cluster).
Knowing how to interpret the Silhouette Score is essential. The score ranges between -1 and 1, with the following interpretation:

- Score close to 1: The item is well-matched to its own cluster and poorly matched to neighboring clusters. This indicates strong clustering.
- Score close to 0: The item is on or very close to the decision boundary between two neighboring clusters. It is not distinctly in one cluster or another, so the clustering model is uncertain about its assignment.
- Score close to -1: The item is mismatched to its own cluster and matched to a neighboring cluster. This indicates that the point has likely been assigned to the wrong cluster, as it is closer to the neighboring cluster than to its own.
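For instance, suppose a point has $a(i) = 2.0$ and $b(i) = 8.0$ (hypothetical values): then $s(i) = (8.0 - 2.0) / \max(2.0,\, 8.0) = 0.75$, indicating strong clustering. If instead $a(i) = b(i)$, the score is 0, the boundary case described above.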
Ideally, all objects would have a Silhouette Score of 1, but in practice, this is almost impossible.
Let's now implement the Silhouette Score from scratch in Python. Firstly, the function `euclidean_distance(a, b)` calculates the Euclidean distance between two points `a` and `b`.
```python
import numpy as np

def euclidean_distance(a, b):
    # Calculate the Euclidean distance between points a and b.
    return np.sqrt(np.sum((np.array(a) - np.array(b)) ** 2))
```
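For example, `euclidean_distance([0, 0], [3, 4])` returns `5.0`, the hypotenuse of a 3-4-5 right triangle.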
The function `calculate_a(point, cluster)` calculates the a(i) for a point:
```python
import numpy as np

def calculate_a(point, cluster):
    # Calculate the average distance from 'point' to other points in the same cluster.
    if len(cluster) <= 1:
        return 0
    distances = [euclidean_distance(point, other) for other in cluster if not np.array_equal(point, other)]
    return sum(distances) / (len(cluster) - 1)
```
The function `calculate_b(point, clusters)` calculates the b(i) for a point, taking the list of all clusters as its second argument:
```python
def calculate_b(point, clusters):
    # Calculate the lowest average distance from 'point' to points in other clusters.
    min_average_distance = float('inf')
    for cluster in clusters:
        # Check if point is in the current cluster by comparing all elements
        if any(np.array_equal(point, other) for other in cluster):
            continue
        distances = [euclidean_distance(point, other) for other in cluster]
        average_distance = sum(distances) / len(cluster)
        if average_distance < min_average_distance:
            min_average_distance = average_distance
    return min_average_distance
```
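To see these helpers in action, here is a small sanity check on hypothetical toy points (the clusters and expected values below are illustrative, not part of the lesson's dataset):

```python
# Hypothetical toy clusters to sanity-check the helper functions.
cluster_0 = [[0, 0], [0, 2]]
cluster_1 = [[5, 5], [5, 7]]

print(calculate_a([0, 0], cluster_0))               # 2.0: distance to the only other point in its cluster
print(calculate_b([0, 0], [cluster_0, cluster_1]))  # ~7.84: average distance to cluster_1
```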
Finally, `custom_silhouette_score(points, labels)` determines the silhouette score for each data point and returns their average:
```python
from collections import defaultdict

def custom_silhouette_score(points, labels):
    # Group points by cluster label.
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    # Convert clusters to a list for easier access.
    cluster_list = list(clusters.values())

    # Calculate silhouette score for each point.
    scores = []
    for point, label in zip(points, labels):
        a = calculate_a(point, clusters[label])
        b = calculate_b(point, cluster_list)
        score = (b - a) / max(a, b) if max(a, b) > 0 else 0
        scores.append(score)

    # Return the average silhouette score.
    return sum(scores) / len(scores)
```
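As a quick sanity check (again on hypothetical toy data), two tight, well-separated clusters should produce a score close to 1:

```python
# Two well-separated toy clusters: expect a score near 1.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels = [0, 0, 1, 1]
print(custom_silhouette_score(points, labels))  # ~0.93
```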
Now, let's observe our functions in action using the Iris dataset. We'll calculate the Silhouette Score for a KMeans clustering model. First, let's perform the clustering and visualize the clusters:
```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn import datasets

X = datasets.load_iris().data

# Fit the KMeans model
kmeans_model = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)

# Plot the clusters using the first two features
plt.scatter(X[:, 0], X[:, 1], c=kmeans_model.labels_)
plt.show()
```
The plot shows the clusters formed by the KMeans model (note that your plot might differ slightly due to randomness and library versions).
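As an optional tweak (not part of the original snippet), you can label the axes with the Iris feature names, which sklearn exposes via `load_iris().feature_names`:

```python
# Optional: relabel the axes with the Iris feature names.
feature_names = datasets.load_iris().feature_names
plt.scatter(X[:, 0], X[:, 1], c=kmeans_model.labels_)
plt.xlabel(feature_names[0])  # 'sepal length (cm)'
plt.ylabel(feature_names[1])  # 'sepal width (cm)'
plt.show()
```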
Now, let's calculate the Silhouette Score using our custom implementation:
```python
# Calculate and print the average silhouette score.
average_score = custom_silhouette_score(X, kmeans_model.labels_)
print(f"Silhouette Score (Custom): {average_score}")  # ~0.55
```
Now, let's explore how we can calculate the Silhouette score using the Scikit-learn library, commonly known as sklearn.
To compute the Silhouette Score in sklearn, use the `silhouette_score` function from the `sklearn.metrics` module. It takes the data points, their predicted cluster labels, and the metric for calculating distances. Here's how to use it:
```python
from sklearn.metrics import silhouette_score

# Calculate Silhouette Score using sklearn
score = silhouette_score(X, kmeans_model.labels_, metric='euclidean')

# Print the score
print("Silhouette score (sklearn):", score)  # ~0.55
```
Here, the Euclidean metric is used to measure the distance between points. You can replace `'euclidean'` with other supported metrics like `'manhattan'`, `'cosine'`, etc., based on your needs.
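If you also want per-point scores, like the ones our custom implementation computes before averaging, `sklearn.metrics` provides a companion function, `silhouette_samples`:

```python
from sklearn.metrics import silhouette_samples

# Per-point silhouette values; low or negative entries flag poorly clustered samples.
sample_scores = silhouette_samples(X, kmeans_model.labels_, metric='euclidean')
print(sample_scores[:5])  # scores for the first five points
```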
Use the above code as a template to compute the Silhouette Score for your own clustering tasks. Scikit-learn's convenience extends further, as it provides extensive utilities for most clustering algorithms.
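For instance, here is a minimal sketch (the range of candidate cluster counts is an arbitrary choice for illustration) of using the Silhouette Score to compare different numbers of clusters, addressing the issue of misjudging the optimal cluster count mentioned at the start of this lesson:

```python
# Compare average silhouette scores across candidate cluster counts; higher is better.
for k in range(2, 7):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    print(k, silhouette_score(X, model.labels_, metric='euclidean'))
```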
Great job! We've successfully covered the theory of cluster validation, the mathematics and practical application of the Silhouette score, and delved into visualizing clusters. Now, prepare for some practical exercises to solidify your understanding and boost your confidence. Happy learning!