Lesson 5

Welcome to today's discussion on **Hierarchical Clustering**. We will study its effectiveness using the **Silhouette Score**, the **Davies-Bouldin Index**, and **Cross-Tabulation Analysis**. We will utilize Python's powerful libraries, `scikit-learn` and `pandas`, to equip you with practical skills for evaluating clustering models.

`Scikit-learn` is a widely used Python library for machine learning. In this lesson, we will be using its powerful built-in methods, including `silhouette_score` and `davies_bouldin_score`. Additionally, we will implement Hierarchical Clustering from `scikit-learn` on some data:

```python
from sklearn.cluster import AgglomerativeClustering

data = [(1.5, 1.7), (1.9, 2.4), (2.0, 1.9), (3.2, 3.2), (3.5, 3.9), (6.0, 6.5)]

clustering = AgglomerativeClustering().fit(data)
```

This code applies Hierarchical Clustering to our dataset. The resulting cluster labels can be accessed via `clustering.labels_`.
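As a side note, `AgglomerativeClustering` accepts parameters that control how the hierarchy is built. A minimal sketch on the same sample data, showing the `n_clusters` and `linkage` options (the specific values chosen here are just for illustration):

```python
from sklearn.cluster import AgglomerativeClustering

data = [(1.5, 1.7), (1.9, 2.4), (2.0, 1.9), (3.2, 3.2), (3.5, 3.9), (6.0, 6.5)]

# By default, AgglomerativeClustering forms 2 clusters with Ward linkage.
# Both settings can be changed, e.g. 3 clusters with average linkage:
clustering3 = AgglomerativeClustering(n_clusters=3, linkage="average").fit(data)
print(clustering3.labels_)  # one cluster label per data point
```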

The Silhouette Score offers a measure to evaluate the effectiveness of our clustering. This score gauges how similar a point is to its own cluster compared to other clusters. Higher scores indicate better clustering.

We will apply the `silhouette_score` function from the `sklearn` library to our data:

```python
from sklearn.metrics import silhouette_score

s_score = silhouette_score(data, clustering.labels_)
print(f"Silhouette Score is: {s_score}")  # higher is better
```

The output provides a single score showing the effectiveness of our clustering.
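Because the silhouette score rewards compact, well-separated clusters, it is commonly compared across several cluster counts to pick a good one. A minimal sketch of that idea, reusing the sample data from above:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

data = [(1.5, 1.7), (1.9, 2.4), (2.0, 1.9), (3.2, 3.2), (3.5, 3.9), (6.0, 6.5)]

# Compare silhouette scores for different cluster counts;
# the highest score suggests the most natural grouping.
for k in range(2, 5):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(data)
    print(f"k={k}: silhouette = {silhouette_score(data, labels):.3f}")
```

The score always lies between -1 and 1, so the values printed for different `k` are directly comparable.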

The Davies-Bouldin index evaluates the average similarity between clusters. It bears an inverse relationship to model performance, meaning that a lower index value indicates a better model.

We will use the `davies_bouldin_score` function from `sklearn` as follows:

```python
from sklearn.metrics import davies_bouldin_score

db_index = davies_bouldin_score(data, clustering.labels_)
print(f"Davies-Bouldin index is: {db_index}")  # lower is better
```

The Davies-Bouldin Index thus obtained serves as another measure of our clustering effectiveness.
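Since lower Davies-Bouldin values are better, the index can also guide the choice of cluster count by looking for the minimum. A minimal sketch, again reusing the sample data:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import davies_bouldin_score

data = [(1.5, 1.7), (1.9, 2.4), (2.0, 1.9), (3.2, 3.2), (3.5, 3.9), (6.0, 6.5)]

# Lower Davies-Bouldin values indicate more compact, better-separated
# clusters, so we look for the cluster count that minimizes the index.
scores = {}
for k in range(2, 5):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(data)
    scores[k] = davies_bouldin_score(data, labels)
    print(f"k={k}: Davies-Bouldin = {scores[k]:.3f}")

best_k = min(scores, key=scores.get)
print(f"Best k by Davies-Bouldin: {best_k}")
```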

Visualizing the clustered data points provides an intuitive understanding of our clusters. For this, we will use `matplotlib`, along with Cross-Tabulation Analysis using the `crosstab` method from `pandas`.

Cross-Tabulation Analysis provides an overview of how labels have been clustered together.

```python
import pandas as pd

cross_tabulation_counts = pd.crosstab(index=clustering.labels_, columns="count")
print(f"Cross-tabulation counts are: \n{cross_tabulation_counts}")
```

The resulting table showcases the distribution of data points across our clusters, while the scatter plot drawn with `matplotlib` colors each data point according to its cluster.

```python
import matplotlib.pyplot as plt

# Color each point by its cluster label: cluster 0 red, cluster 1 blue
plt.scatter(*zip(*data), c=[{0: 'r', 1: 'b'}[i] for i in clustering.labels_])
plt.show()
```

Taken together, these representations provide a clear and direct view of the clusters formed from our data.
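When ground-truth labels are available, `pd.crosstab` can also compare them against the cluster assignments. A minimal sketch with hypothetical labels (the `true_labels` values below are an assumption for illustration, not part of the lesson's dataset):

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

data = [(1.5, 1.7), (1.9, 2.4), (2.0, 1.9), (3.2, 3.2), (3.5, 3.9), (6.0, 6.5)]
clustering = AgglomerativeClustering().fit(data)

# Hypothetical ground-truth labels, for illustration only
true_labels = ["A", "A", "A", "B", "B", "B"]

# Rows: cluster labels; columns: true labels. If most counts concentrate
# in a few cells, the clusters align well with the known grouping.
ct = pd.crosstab(index=clustering.labels_, columns=true_labels,
                 rownames=["cluster"], colnames=["true"])
print(ct)
```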

You are now equipped to apply the Silhouette Score, the Davies-Bouldin Index, and Cross-Tabulation Analysis when assessing Hierarchical Clustering results. These tools enable you to interpret and evaluate clustering models with confidence. Remember, these skills apply beyond Hierarchical Clustering, so let's continue refining them through practice. Keep learning!