Welcome to our lesson on 'Understanding Linkage Criteria in Hierarchical Clustering'. We will delve into specific linkage criteria and their role in hierarchical clustering.
Our main objective is to explain four types of linkage criteria: single linkage, complete linkage, average linkage, and Ward's method. We will also implement these criteria in Python using the AgglomerativeClustering class from the sklearn.cluster module. Without further ado, let's dive into the journey of hierarchical clustering.
Hierarchical clustering is a popular choice for cluster analysis in data mining because it lets us visualize how similar objects are grouped into clusters. At each step of agglomerative hierarchical clustering, two objects or clusters are merged, and these merges are made according to a rule known as the 'Linkage Criterion'.
A linkage criterion defines how the distance between two clusters is measured; at each step, the pair of clusters with the smallest linkage distance is merged. Different linkage methods use different definitions of cluster distance.
Therefore, choosing the right linkage method can significantly influence the shape and size of the clusters. Sounds interesting? Let's explore each of these linkage methods!
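Before we dissect each criterion, here is a minimal sketch of hierarchical clustering in action, using SciPy's linkage and dendrogram functions on a small made-up dataset (the dataset and the choice of average linkage here are purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# A tiny made-up dataset: six 2D points forming two loose groups
X = np.array([[0, 0], [0, 1], [1, 0],
              [5, 5], [5, 6], [6, 5]])

# Build the merge hierarchy; 'method' selects the linkage criterion
# ('single', 'complete', 'average', or 'ward')
Z = linkage(X, method='average')

# Each row of Z records one merge: the two clusters joined, the linkage
# distance between them, and the size of the new cluster
dendrogram(Z)
plt.title("Dendrogram with average linkage")
plt.show()
```

The `method` argument is the only thing that changes when switching linkage criteria; the rest of the procedure is identical.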
Single linkage, also known as the minimum or nearest-neighbor method, computes the distance between two clusters as the smallest distance from any point in one cluster to any point in the other. In simpler words, single linkage looks for the shortest link: if we picture the points of the two clusters as nodes in a graph, the single-linkage distance is the length of the shortest edge connecting the two clusters.
Although single linkage can produce fine, detailed dendrograms and handle non-elliptical cluster shapes, it is prone to 'chaining,' where two otherwise distant clusters are merged because of a single close pair of points.
On the other end of the spectrum from single linkage is Complete Linkage (or Maximum Intercluster Method). Complete Linkage computes the distance between two clusters as the largest distance from any point in one cluster to any point in another cluster.
In other words, it focuses on the farthest points, or farthest neighbors, in the two clusters. This method works well for compact clusters and tends to avoid chaining, producing more balanced, even clusters than single linkage.
The average linkage method, as the name suggests, takes a middle path. It computes the distance between two clusters as the average distance over all pairs of points, one point from each cluster, which reflects the overall separation between the two clusters.
Average linkage tends to be less susceptible to noise and outliers than single linkage and creates clusters of roughly equal diameter. However, it is biased toward globally compact, globular clusters and may overlook small, locally distinct groups.
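Since single, complete, and average linkage are all just different summaries of the same matrix of pairwise point distances, one short sketch can illustrate all three. This toy example uses SciPy's cdist on two made-up point sets purely for illustration; it is not how sklearn computes these distances internally:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two made-up clusters of 2D points
A = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
B = np.array([[4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])

# Matrix of all pairwise distances between points of A and points of B
d = cdist(A, B)

print("single (min):   ", d.min())   # nearest pair of points
print("complete (max): ", d.max())   # farthest pair of points
print("average (mean): ", d.mean())  # mean over all pairs
```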
Ward's method is distinctly different from the other three linkage methods. Instead of computing direct distances between points in the two clusters, it works with the sum of squared Euclidean distances from each point to its cluster's centroid, i.e., the within-cluster variance. It tends to create compact clusters of roughly equal size.
Ward's method's primary intention is to minimize the total within-cluster variance. That is, Ward's method seeks to choose the cluster pair to merge such that the increase in the total within-cluster variance is the smallest possible.
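To make the criterion concrete, here is a rough sketch of Ward's merge cost on two made-up clusters. It illustrates the idea rather than sklearn's optimized implementation:

```python
import numpy as np

def within_ss(points):
    """Sum of squared Euclidean distances from each point to the cluster centroid."""
    centroid = points.mean(axis=0)
    return np.sum((points - centroid) ** 2)

# Two made-up clusters
A = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
B = np.array([[4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])

# Ward's merge cost: the increase in total within-cluster variance
# that would result from merging A and B into one cluster
merged = np.vstack([A, B])
cost = within_ss(merged) - (within_ss(A) + within_ss(B))
print("Ward merge cost:", cost)
# At each step, Ward's method merges the pair of clusters with the smallest such cost.
```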
Finally, let's compare the different linkage results using the sample code below. We generate synthetic data with the make_blobs function from sklearn.datasets, then run agglomerative clustering with each linkage method and plot the results.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Generate synthetic data
X, _ = make_blobs(n_samples=150, centers=4, cluster_std=1.2, random_state=42)

# Define the different linkage methods to be compared
linkage_methods = ['single', 'complete', 'average', 'ward']

# Create subplots with 2 rows and 2 columns and flatten the axs array for easier indexing
fig, axs = plt.subplots(2, 2, figsize=(10, 10))
axs = axs.flatten()

# Loop through each linkage method and apply Agglomerative Clustering
for i, linkage in enumerate(linkage_methods):
    clustering = AgglomerativeClustering(n_clusters=4, linkage=linkage)
    labels = clustering.fit_predict(X)
    # Plot the clusters
    axs[i].scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', marker='o', edgecolor='k')
    axs[i].set_title(f'Linkage: {linkage}')
    axs[i].set_xlabel('Feature 1')
    axs[i].set_ylabel('Feature 2')
    axs[i].grid(True)
plt.tight_layout(pad=3.0)
plt.show()
```
Comparing the four resulting scatter plots side by side gives you a clear understanding of how different linkage methods can lead to entirely different clustering results.
The choice of linkage method in hierarchical clustering depends mainly on the problem domain, the shape of the data, and the specific requirements of our clustering task.
Here are some factors to consider when choosing the right linkage method:
- Single linkage: a good option when we want to capture fine-grained structure in the data. It works well when the clusters naturally have a chain-like or non-compact shape. However, its sensitivity to noise and outliers means it may produce elongated, unbalanced clusters.
- Complete linkage: a good choice when we want more balanced, compact, and well-separated clusters. It is far less prone to the chaining effect but may break up large clusters, sacrificing globally better groupings.
- Average linkage: strikes a balance between single and complete linkage. It is less sensitive to noise and outliers and usually produces well-defined, balanced clusters. It works well when the clusters have roughly similar diameters.
- Ward's method: minimizes the total within-cluster variance, so it usually generates compact, uniformly sized clusters. It is a good choice when we expect clusters to be similar in size or to be compact and well separated. However, it is biased toward globular clusters.
Remember, there is no universally best choice for all datasets and scenarios. The choice of linkage method should be based on the nature of the data, problem-specific requirements, and clustering results validated with domain knowledge. Experimenting with different linkage methods and evaluating their performance can also help you find the right one.
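For example, a common quantitative way to compare linkage methods is the silhouette score. The sketch below reuses the synthetic data from the earlier example; any other cluster-validity index would work similarly:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Same kind of synthetic data as in the earlier example
X, _ = make_blobs(n_samples=150, centers=4, cluster_std=1.2, random_state=42)

# Score each linkage method; a higher silhouette indicates better-separated clusters
for linkage in ['single', 'complete', 'average', 'ward']:
    labels = AgglomerativeClustering(n_clusters=4, linkage=linkage).fit_predict(X)
    print(f"{linkage:>8}: silhouette = {silhouette_score(X, labels):.3f}")
```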
As you advance further into hierarchical clustering and conduct more practical experiments, you'll be able to make more informed decisions regarding the choice of linkage method.
In this lesson, we dove deep into the four fundamental types of linkage criteria: single linkage, complete linkage, average linkage, and Ward's method. We learned how each works, implemented them in Python, and understood their distinct roles in hierarchical clustering, observing their effects through a visual comparison.
Upcoming practice exercises will test and cement your newly gained wisdom about these crucial concepts in hierarchical clustering. Keep learning, and enjoy the journey!