Lesson 1

Hierarchical Clustering: An In-Depth Guide with Python Implementation

Introduction

Hello and welcome! This lesson serves as your introduction to Hierarchical Clustering, a crucial part of unsupervised machine learning. Hierarchical Clustering is a powerful tool for grouping data based on inherent patterns and shared characteristics. By visually representing the hierarchy of clusters, it provides deep insight into the nested structure of our data. Let's commence this journey into Hierarchical Clustering!

Agglomerative vs. Divisive Approaches

Hierarchical Clustering operates broadly through two approaches — Agglomerative and Divisive. The Agglomerative technique, also known as 'bottom-up,' begins by treating every data point as a distinct cluster and then merges them until only one cluster remains. Conversely, the Divisive methodology, termed 'top-down,' begins with all data points in a single cluster and splits them progressively until each point forms its own cluster.

Agglomerative Clustering Algorithm

One of the most common forms of Hierarchical Clustering is Agglomerative Hierarchical Clustering. It starts with each object in its own singleton cluster. Then, in each successive iteration, it merges the closest pair of clusters and updates the similarity (or distance) between the newly formed cluster and each old cluster. The algorithm repeats this process until all objects end up in a single remaining cluster.

The Agglomerative Hierarchical Clustering algorithm involves the following major steps:

  1. Represent each data object as a singleton cluster.
  2. Compute the similarity (or distance) matrix containing the distance between each pair of objects in the dataset.
  3. Merge the two closest clusters based on the distances from the distance matrix.
  4. Update the similarity (or distance) matrix to reflect the distance between the newly formed cluster and each of the remaining clusters.
  5. Repeat steps 3 and 4 until only one cluster remains.

The output is a tree-like diagram called a dendrogram, which records the order and the distances (similarities) at which the merges occur during the algorithm's execution; a minimal plotting sketch follows below.
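Since the dendrogram is the central output of the algorithm, here is a minimal sketch of how one could be produced with SciPy on a tiny made-up dataset. The data and variable names are purely illustrative and not part of the implementation we build later in this lesson:

Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Tiny illustrative dataset of five 2-D points
toy = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.0], [3.2, 2.9], [6.0, 0.5]])

# The linkage matrix records the order and distances of the merges
merges = linkage(toy, method='average')

# Plot the resulting dendrogram
dendrogram(merges)
plt.show()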

Understanding the Distance Matrix

The Distance Matrix is a foundational element in hierarchical clustering that plays a critical role in understanding the relations and proximities between data points. Essentially, it is a square matrix where each element (i, j) represents the distance between the i-th and the j-th points in the dataset. The choice of distance metric (Euclidean, Manhattan, Cosine, etc.) can significantly influence the clustering outcome and is crucial to the coherence of the resulting clusters.

Distance matrix representation in linear algebra is as follows:

D = \begin{bmatrix} 0 & d_{12} & d_{13} & \cdots & d_{1n} \\ d_{21} & 0 & d_{23} & \cdots & d_{2n} \\ d_{31} & d_{32} & 0 & \cdots & d_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ d_{n1} & d_{n2} & d_{n3} & \cdots & 0 \end{bmatrix}

So, for example, D[2, 3] represents the distance between the second and third data points, and D[3, 2] would be the same value since the distance matrix is symmetric.
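To make this concrete, here is a minimal sketch of how such a matrix could be computed with NumPy for a handful of points, assuming the Euclidean distance metric; the example data and variable names are made up for illustration:

Python
import numpy as np

# A tiny illustrative dataset of four 2-D points
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])

# Build the n x n distance matrix: entry (i, j) holds the Euclidean
# distance between point i and point j; the diagonal stays zero
n = len(points)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = np.sqrt(np.sum((points[i] - points[j]) ** 2))

print(D)  # symmetric matrix with zeros on the diagonal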

Here's why the Distance Matrix is indispensable in hierarchical clustering:

  1. Initial Cluster Formation: It provides the initial groundwork for forming clusters by quantifying the closeness or similarity between individual data points or clusters.

  2. Cluster Merging Strategy: It is essential to decide which clusters to merge at each step of the agglomerative clustering process. Clusters with the smallest distances between them are merged, promoting more natural groupings in the data.

  3. Efficiency in Recalculation: After merging clusters, updating the distance matrix (instead of recalculating all distances from scratch) speeds up the clustering process considerably. Different linkage criteria (single, complete, average, etc.) provide various strategies for updating these distances; a small comparison of these linkage rules is sketched just after this list.

  4. Visual Representation and Analysis: The final dendrogram, which visualizes the process of hierarchical clustering, is fundamentally reliant on the distances computed and stored in this matrix. This visualization aids in determining the appropriate number of clusters by identifying significant gaps in the distances at which clusters merge.

Understanding and appropriately calculating the distance matrix is, therefore, a precursor to effectively implementing hierarchical clustering algorithms and extracting meaningful insights from complex datasets.
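To give a rough feel for how the linkage criterion changes the cluster-to-cluster distance, the sketch below computes single, complete, and average linkage distances between two small clusters. The clusters here are invented purely for illustration:

Python
import numpy as np
from scipy.spatial.distance import cdist

# Two illustrative clusters of 2-D points
cluster_a = np.array([[0.0, 0.0], [0.0, 1.0]])
cluster_b = np.array([[3.0, 0.0], [4.0, 1.0]])

# All pairwise distances between points of the two clusters
pairwise = cdist(cluster_a, cluster_b)

print("single (min):  ", pairwise.min())   # distance of the closest pair
print("complete (max):", pairwise.max())   # distance of the farthest pair
print("average (mean):", pairwise.mean())  # mean over all pairs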

Custom Agglomerative Clustering implementation in Python with the Iris dataset

Let's proceed to the hands-on implementation of Agglomerative Clustering in Python, using the Iris dataset for our example. We'll use some standard Python libraries: numpy for numerical computations, matplotlib and seaborn for data visualization, and sklearn to load the dataset.

We first set up our environment and load the Iris dataset:

Python
# Utilizing the Iris dataset
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data

Next, we plot our dataset with Seaborn, a library built on matplotlib, which aids in drawing more attractive and informative statistical graphics:

Python
# Pairplot for the Iris dataset
sns.pairplot(sns.load_dataset("iris"), hue="species")
plt.show()

The resulting plot will look like this:

[Image: Seaborn pairplot of the Iris dataset, colored by species]

Let's understand the plot:

  • The diagonal plots show the distribution of each feature in the dataset — for example, the plot at index (0, 0) shows the distribution of sepal length, meaning the x-axis represents the sepal length values and the y-axis shows how those values are distributed.
  • The scatter plots show the relationship between each pair of features, with different colors representing different species of Iris flowers — for example, the plot at index (0, 1) shows the relationship between sepal length and sepal width, and so on.

With the data in place, we can now build our Agglomerative Hierarchical Clustering function. It will start by placing each data point in its own cluster, then repeatedly compute the distances between clusters and merge the nearest pair until the specified number of clusters remains.

For that, let's define a function to calculate the Euclidean distance between two points:

Python
import numpy as np

# Function to compute Euclidean Distance
def euc_dist(a, b):
    return np.sqrt(np.sum((a - b)**2))
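As a quick sanity check, which is not part of the original walkthrough, the result can be compared against NumPy's built-in norm:

Python
# euc_dist should agree with np.linalg.norm on the difference vector
a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(euc_dist(a, b))         # 5.0
print(np.linalg.norm(a - b))  # 5.0, same result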

Now we can define a separate function that computes the distance matrix:

Python
# Function to calculate the distance matrix
def calculate_distance_matrix(X, clusters):
    # Initialize distance matrix
    dist_matrix = np.zeros((len(clusters), len(clusters)))
    # Compute the distance between each pair of clusters
    for i in range(len(clusters)):
        for j in range(i+1, len(clusters)):
            dists = []
            for k in clusters[i]:
                for l in clusters[j]:
                    dists.append(euc_dist(X[k], X[l]))
            dist_matrix[i, j] = dist_matrix[j, i] = np.mean(dists)
    return dist_matrix
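Averaging all pairwise distances between two clusters corresponds to the average linkage criterion. As an aside, the same cluster-to-cluster distance could be computed in vectorized form with SciPy; this is only a sketch of an alternative, and the helper name is made up, not part of the lesson's reference implementation:

Python
from scipy.spatial.distance import cdist

# Hypothetical helper: average-linkage distance between two clusters,
# where ci and cj are lists of row indices into X
def avg_linkage(X, ci, cj):
    return cdist(X[ci], X[cj]).mean()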

Finally, we define the main function for Agglomerative Clustering:

Python
def agglomerative_clustering(X, n_clusters):
    # Initialize each data point as its own cluster
    clusters = [[i] for i in range(len(X))]
    # Continue clustering until the desired number of clusters is reached
    while len(clusters) > n_clusters:
        dist_matrix = calculate_distance_matrix(X, clusters)  # Calculate the distance matrix
        # Find the pair of clusters with the minimum distance
        min_dist = float('inf')
        for i in range(len(clusters)):
            for j in range(i+1, len(clusters)):
                if dist_matrix[i, j] < min_dist:
                    min_dist = dist_matrix[i, j]
                    idx1, idx2 = i, j
        clusters[idx1].extend(clusters[idx2])  # Merge these two closest clusters
        clusters.pop(idx2)  # Remove the merged cluster from the list of clusters
    # Create an array to hold the cluster labels for each data point
    labels = np.empty(len(X), dtype=int)
    for label, cluster in enumerate(clusters):
        for i in cluster:
            labels[i] = label
    return labels

Notice that we perform the clustering until the number of clusters reaches the desired value. The function agglomerative_clustering returns an array of cluster labels for each data point.

Now, let's scale our data and apply the Agglomerative Clustering function to the Iris dataset:

Python
from sklearn.preprocessing import StandardScaler

# Standardizing the features
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

Python
# Perform agglomerative clustering
y_agg = agglomerative_clustering(X_std, n_clusters=3)

# Visualize the clustering
plt.scatter(X[:, 0], X[:, 1], c=y_agg, cmap='viridis')
plt.show()

We will see the following result:

[Image: scatter plot of sepal length vs. sepal width, colored by the three clusters found by our custom implementation]
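As an optional check, not part of the original walkthrough, you can also inspect how many points ended up in each cluster:

Python
# Count how many data points were assigned to each cluster label
labels, counts = np.unique(y_agg, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))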

Sklearn's Hierarchical Clustering

Scikit-learn (sklearn), Python's widely used machine learning library, provides an optimized, efficient implementation of Hierarchical Clustering through the AgglomerativeClustering class. Let's try it with our Iris dataset:

Python
from sklearn.cluster import AgglomerativeClustering

# Defining the agglomerative clustering
agg_cluster = AgglomerativeClustering(n_clusters=3)

# Fit model
y_agg_sklearn = agg_cluster.fit_predict(X_std)

# Visualizing clusters
plt.scatter(X_std[:, 0], X_std[:, 1], c=y_agg_sklearn)
plt.show()

The resulting plot is shown below. It differs somewhat from the previous one because AgglomerativeClustering uses Ward linkage by default, whereas our custom implementation used average linkage, and because this plot shows the standardized features on its axes:

[Image: scatter plot of the standardized Iris features, colored by the clusters found by AgglomerativeClustering]
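If you want sklearn's result to line up more closely with our custom implementation, you can switch the linkage criterion. This is an optional variation, not part of the original lesson code:

Python
# Use average linkage, the same criterion as our custom implementation
agg_avg = AgglomerativeClustering(n_clusters=3, linkage='average')
y_agg_avg = agg_avg.fit_predict(X_std)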

Lesson Summary and Practice

Congratulations! You've successfully broken down the nuances of Hierarchical Clustering, focusing on the Agglomerative approach, and brought it to life using Python. Practice tasks await next, offering the perfect platform to apply these concepts and deepen your understanding. Carry this momentum forward, and let's forge ahead into the captivating world of Hierarchical Clustering!
