Lesson 5

Assessing Hierarchical Clustering Models with Scikit-learn Metrics

Introduction

Welcome to today's discussion on Hierarchical Clustering. We will be studying its effectiveness using the Silhouette Score, the Davies-Bouldin Index, and Cross-Tabulation Analysis. We will utilize Python's powerful libraries, scikit-learn and pandas, to equip you with practical and useful skills for evaluating clustering models.

Hierarchical Clustering and Scikit-learn Introduction

Scikit-learn is a widely used Python library for machine learning. In this lesson, we will be using its powerful built-in methods, including the silhouette_score and davies_bouldin_score. Additionally, we will implement Hierarchical Clustering from scikit-learn on some data:

Python
from sklearn.cluster import AgglomerativeClustering

data = [(1.5, 1.7), (1.9, 2.4), (2.0, 1.9), (3.2, 3.2), (3.5, 3.9), (6.0, 6.5)]

clustering = AgglomerativeClustering().fit(data)

This code applies Hierarchical Clustering (agglomerative, bottom-up merging) to our dataset. The resulting cluster labels can be accessed via clustering.labels_.
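To make this concrete, here is a minimal sketch (reusing the same small dataset) that inspects the labels and tries the n_clusters parameter of AgglomerativeClustering. The exact label assignments may vary across scikit-learn versions, so treat the printed values as illustrative:

```python
from sklearn.cluster import AgglomerativeClustering

data = [(1.5, 1.7), (1.9, 2.4), (2.0, 1.9), (3.2, 3.2), (3.5, 3.9), (6.0, 6.5)]

# Default: two clusters, Ward linkage
clustering = AgglomerativeClustering().fit(data)
print(clustering.labels_)  # one integer label (0 or 1) per data point

# n_clusters controls how many flat clusters are cut from the hierarchy
clustering3 = AgglomerativeClustering(n_clusters=3).fit(data)
print(clustering3.labels_)  # labels now range over 0, 1, 2
```

Because agglomerative clustering builds a full merge tree, changing n_clusters simply cuts that tree at a different level.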

Silhouette Score

The Silhouette Score offers a measure to evaluate the effectiveness of our clustering. It gauges how similar a point is to its own cluster compared to other clusters, and ranges from -1 to 1. Higher scores indicate better-separated clusters.

We will implement the silhouette_score function from the sklearn library on our data:

Python
from sklearn.metrics import silhouette_score

s_score = silhouette_score(data, clustering.labels_)
print(f"Silhouette Score is: {s_score}")  # the higher, the better

The output provides a single score showing the effectiveness of our clustering.
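A common use of this score is model selection: compare several candidate values of n_clusters and keep the one with the highest Silhouette Score. Below is a sketch of that idea on our toy dataset; the range of k values tried is an arbitrary choice for illustration:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

data = [(1.5, 1.7), (1.9, 2.4), (2.0, 1.9), (3.2, 3.2), (3.5, 3.9), (6.0, 6.5)]

# Silhouette requires at least 2 clusters and at most n_samples - 1
scores = {}
for k in range(2, 5):
    labels = AgglomerativeClustering(n_clusters=k).fit(data).labels_
    scores[k] = silhouette_score(data, labels)
    print(f"n_clusters={k}: silhouette={scores[k]:.3f}")

# Higher is better, so keep the maximum
best_k = max(scores, key=scores.get)
print(f"Best n_clusters by silhouette: {best_k}")
```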

Davies-Bouldin Index

The Davies-Bouldin index evaluates the average similarity between clusters. It bears an inverse relationship to model performance, meaning that a lower index value indicates a better model.

We will use the davies_bouldin_score function in sklearn as follows:

Python
from sklearn.metrics import davies_bouldin_score

db_index = davies_bouldin_score(data, clustering.labels_)
print(f"Davies-Bouldin index is: {db_index}")

The Davies-Bouldin Index thus obtained serves as another measure of our clustering effectiveness.
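The Davies-Bouldin Index supports the same kind of comparison across cluster counts, but in the opposite direction: we look for the lowest value. A sketch under the same assumptions as before (the range of k values is illustrative):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import davies_bouldin_score

data = [(1.5, 1.7), (1.9, 2.4), (2.0, 1.9), (3.2, 3.2), (3.5, 3.9), (6.0, 6.5)]

db_scores = {}
for k in range(2, 5):
    labels = AgglomerativeClustering(n_clusters=k).fit(data).labels_
    db_scores[k] = davies_bouldin_score(data, labels)
    print(f"n_clusters={k}: Davies-Bouldin={db_scores[k]:.3f}")

# Lower is better, so keep the minimum
best_k = min(db_scores, key=db_scores.get)
print(f"Best n_clusters by Davies-Bouldin: {best_k}")
```

Because the two metrics reward different aspects of cluster geometry, they will not always agree on the best k; checking both gives a more robust picture.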

Visualizing and Assessing Clustered Data

Visualizing the clustered data points provides an intuitive understanding of our clusters. For this, we will use matplotlib for plotting, together with Cross-Tabulation Analysis via pandas' crosstab method.

Cross-Tabulation Analysis provides an overview of how labels have been clustered together.

Python
import pandas as pd

cross_tabulation_counts = pd.crosstab(index=clustering.labels_, columns="count")
print(f"Cross-tabulation counts are: \n{cross_tabulation_counts}")

The resulting table shows the distribution of data points across our clusters, while the matplotlib scatter plot colors each data point according to its cluster.
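Cross-tabulation becomes even more informative when a reference grouping is available. The sketch below assumes a set of hypothetical ground-truth labels (true_groups is invented purely for illustration) and shows how pd.crosstab reveals the agreement between clusters and known classes:

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

data = [(1.5, 1.7), (1.9, 2.4), (2.0, 1.9), (3.2, 3.2), (3.5, 3.9), (6.0, 6.5)]
clustering = AgglomerativeClustering().fit(data)

# Hypothetical ground-truth groups, invented for illustration only
true_groups = ["A", "A", "A", "B", "B", "B"]

# Rows: cluster labels; columns: (assumed) true groups
agreement = pd.crosstab(index=clustering.labels_,
                        columns=pd.Series(true_groups, name="group"))
print(agreement)
```

A table whose counts concentrate on one column per row indicates that the clusters line up well with the reference grouping.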

Python
import matplotlib.pyplot as plt

# Color each point by its cluster label (assumes two clusters: 0 and 1)
plt.scatter(*zip(*data), c=[{0: 'r', 1: 'b'}[i] for i in clustering.labels_])
plt.show()

Taken together, these representations provide a clear and direct view of the various clusters formed based on our data:

[Image: scatter plot of the data points, colored by cluster]

Summary and Practice

You are now equipped with the skills to apply Silhouette Score, the Davies-Bouldin Index, and Cross-Tabulation Analysis in assessing Hierarchical Clustering results. These tools enable you to confidently interpret and evaluate clustering models. Remember, these skills are applicable beyond Hierarchical Clustering. So, let's continue refining these capabilities through practice. Keep learning!
