Welcome to today's discussion on Hierarchical Clustering. We will be studying its effectiveness using the Silhouette Score, the Davies-Bouldin Index, and Cross-Tabulation Analysis. We will utilize Python's powerful libraries, scikit-learn and pandas, to equip you with practical skills for evaluating clustering models.
Scikit-learn is a widely used Python library for machine learning. In this lesson, we will be using its built-in evaluation functions, including silhouette_score and davies_bouldin_score. Additionally, we will apply scikit-learn's Hierarchical Clustering implementation to some data:
Python

from sklearn.cluster import AgglomerativeClustering

data = [(1.5, 1.7), (1.9, 2.4), (2.0, 1.9), (3.2, 3.2), (3.5, 3.9), (6.0, 6.5)]

clustering = AgglomerativeClustering().fit(data)
This code applies Hierarchical Clustering to our dataset; with the default settings, AgglomerativeClustering forms two clusters. The assigned cluster labels can be accessed via clustering.labels_.
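As a quick check, you can print the labels and the number of clusters the model formed (a minimal sketch; the exact label values depend on the fitted model):

Python

# Inspect the cluster assignments: one integer label per data point.
print(clustering.labels_)

# With the default settings, two clusters are formed.
print(clustering.n_clusters_)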
The Silhouette Score offers a measure to evaluate the effectiveness of our clustering. This score gauges how similar a point is to its own cluster compared to other clusters. Higher scores indicate better clustering.
We will apply the silhouette_score function from the sklearn library to our data:
Python

from sklearn.metrics import silhouette_score

s_score = silhouette_score(data, clustering.labels_)
print(f"Silhouette Score is: {s_score}")  # the higher, the better
The output is a single score summarizing the effectiveness of our clustering; values closer to 1 indicate better-separated clusters.
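To see where this single number comes from, scikit-learn also provides silhouette_samples, which returns one value per data point; the overall Silhouette Score is simply the mean of these per-point values. A minimal optional sketch:

Python

import numpy as np
from sklearn.metrics import silhouette_samples

# For each point i: s(i) = (b(i) - a(i)) / max(a(i), b(i)),
# where a(i) is the mean distance to the other points in its own cluster
# and b(i) is the mean distance to the points in the nearest other cluster.
per_point_scores = silhouette_samples(data, clustering.labels_)
print(per_point_scores)  # each value lies between -1 and 1

# The overall Silhouette Score is the mean of the per-point values.
print(np.isclose(per_point_scores.mean(), s_score))  # True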
The Davies-Bouldin Index evaluates the average similarity between each cluster and its most similar counterpart, comparing within-cluster scatter to between-cluster separation. It bears an inverse relationship to clustering quality, meaning that a lower index value indicates a better model.
We will use the davies_bouldin_score function in sklearn as follows:
Python

from sklearn.metrics import davies_bouldin_score

db_index = davies_bouldin_score(data, clustering.labels_)
print(f"Davies-Bouldin index is: {db_index}")
The Davies-Bouldin Index thus obtained serves as another measure of our clustering effectiveness.
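Because lower values are better, one practical use of the index is comparing candidate numbers of clusters. The sketch below is illustrative; the range of cluster counts tried is an assumption, not part of the original example:

Python

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import davies_bouldin_score

# Try a few candidate cluster counts (chosen here just for illustration);
# the count with the lowest index is the most promising.
for n_clusters in (2, 3, 4):
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(data)
    score = davies_bouldin_score(data, labels)
    print(f"n_clusters={n_clusters}: Davies-Bouldin index = {score:.3f}")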
Visualizing the clustered data points provides an intuitive understanding of our clusters. For this, we will use matplotlib, and we will complement the plot with Cross-Tabulation Analysis using pandas' crosstab method.
Cross-Tabulation Analysis provides an overview of how the data points have been distributed across the clusters.
Python

import pandas as pd

cross_tabulation_counts = pd.crosstab(index=clustering.labels_, columns="count")
print(f"Cross-tabulation counts are: \n{cross_tabulation_counts}")
The resulting table shows the distribution of data points across our clusters, while the scatter plot drawn with matplotlib colors the data points according to their respective clusters.
Python

import matplotlib.pyplot as plt

plt.scatter(*zip(*data), c=[{0: 'r', 1: 'b'}[i] for i in clustering.labels_])
plt.show()
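The dictionary lookup above only covers the two labels produced by the default settings. As a design note, passing the label array directly to the c argument together with a colormap works for any number of clusters; a minimal sketch:

Python

import matplotlib.pyplot as plt

# Coloring by the label array itself generalizes to any number of clusters.
plt.scatter(*zip(*data), c=clustering.labels_, cmap="viridis")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Hierarchical Clustering assignments")
plt.show()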
Taken together, these representations provide a clear and direct view of the clusters formed from our data.
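Cross-tabulation becomes even more informative when known categories are available for the points. The true_labels list below is purely hypothetical, added only to illustrate how crosstab can compare known categories with the cluster assignments:

Python

import pandas as pd

# Hypothetical ground-truth categories, one per data point (illustrative only).
true_labels = ["small", "small", "small", "medium", "medium", "large"]

# Rows: cluster labels from the model; columns: the known categories.
comparison = pd.crosstab(index=clustering.labels_, columns=true_labels,
                         rownames=["cluster"], colnames=["category"])
print(comparison)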
You are now equipped with the skills to apply the Silhouette Score, the Davies-Bouldin Index, and Cross-Tabulation Analysis when assessing Hierarchical Clustering results. These tools enable you to confidently interpret and evaluate clustering models. Remember, these skills are applicable beyond Hierarchical Clustering. So, let's continue refining these capabilities through practice. Keep learning!