Welcome to our lesson on distance metrics in hierarchical clustering! Today, we will delve into what distance metrics are and why they matter, particularly in hierarchical clustering. You will learn about several common distance metrics, namely Euclidean distance, Manhattan distance, and cosine distance, and how to implement them in Python. After that, we will examine how the choice of distance measure affects the resulting hierarchical clustering.
Distance metrics are mathematical measures used to calculate the 'distance' between two points. In the context of clustering, we are interested in the distance between data points in our dataset, or between entire clusters of points. We often use metrics like Euclidean distance, Manhattan distance, and cosine distance, each with its own characteristics and typical application scenarios.
The Euclidean distance, often referred to as the straight-line distance between two points in Euclidean space, is one of the most commonly used distance metrics in machine learning. For two points $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$, it is defined as:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
Python code for calculating Euclidean distance:
```python
import math

# Euclidean Distance Metric
def euclidean_distance(point1, point2):
    return math.sqrt(sum((point1 - point2) ** 2))
```
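As a quick sanity check, here is how the function might be called on two sample points (the coordinates below are made up purely for illustration). Note that it expects NumPy arrays rather than plain Python lists, since it relies on element-wise subtraction:

```python
import numpy as np

# Two hypothetical points in 2D space
p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])

print(euclidean_distance(p1, p2))  # 5.0, since sqrt(3**2 + 4**2) = 5
```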
The Manhattan distance gets its name from the grid-like street layout of the Manhattan borough of New York City. The Manhattan distance between two points is the sum of the absolute differences of their coordinates:

$$d(p, q) = \sum_{i=1}^{n} |p_i - q_i|$$
Python code for calculating Manhattan distance:
```python
# Manhattan Distance Metric
def manhattan_distance(point1, point2):
    return sum(abs(point1 - point2))
```
The third distance metric we will examine today is cosine distance, but first let's understand cosine similarity. Unlike the other two, cosine similarity measures the cosine of the angle between two vectors, which makes it useful in certain multi-dimensional settings such as text classification. From it, we obtain the cosine distance by subtracting the similarity from 1. Here are the formulas and the Python function for calculating cosine distance:

Cosine Similarity: $\text{similarity}(p, q) = \dfrac{p \cdot q}{\lVert p \rVert \, \lVert q \rVert}$

Cosine Distance: $\text{distance}(p, q) = 1 - \text{similarity}(p, q)$
Python code for calculating Cosine Distance:
```python
import numpy as np

# Cosine Distance Metric
def cosine_distance(point1, point2):
    return 1 - np.dot(point1, point2) / (np.linalg.norm(point1) * np.linalg.norm(point2))
```
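To get a feel for how the metrics differ, we can evaluate all three on the same pair of points (the values below are hypothetical and chosen only for illustration). Two orthogonal vectors make the contrast clear: their Euclidean and Manhattan distances are small, yet the cosine distance is 1 because the vectors point in completely different directions:

```python
import numpy as np

# Two hypothetical, orthogonal vectors
p1 = np.array([0.0, 1.0])
p2 = np.array([1.0, 0.0])

print(euclidean_distance(p1, p2))  # ~1.414 (sqrt(2))
print(manhattan_distance(p1, p2))  # 2.0
print(cosine_distance(p1, p2))     # 1.0 -- the angle between them is 90 degrees
```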
Next, we'll see how hierarchical clustering aims to separate the dataset into clusters. The distance metric plays a key role in this process, determining the 'distance' or dissimilarity between data points. Let's tweak the agglomerative hierarchical clustering algorithm to incorporate different distance metrics as a parameter.
For that purpose, we'll tweak the distance matrix calculation function to accept a distance metric as an argument. Here's the Python code for calculating the distance matrix between clusters:
```python
# Function to calculate the distance matrix
def calculate_distance_matrix(X, clusters, distance_metric):
    # Initialize distance matrix
    dist_matrix = np.zeros((len(clusters), len(clusters)))
    # Compute the distance between each pair of clusters
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            dists = []
            for k in clusters[i]:
                for l in clusters[j]:
                    dists.append(distance_metric(X[k], X[l]))
            # Use the average pairwise distance between the two clusters
            dist_matrix[i, j] = dist_matrix[j, i] = np.mean(dists)
    return dist_matrix
```
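For intuition, here is a small sketch of what this function returns when every point starts in its own cluster (the toy data below is hypothetical):

```python
import numpy as np

# Four hypothetical 2D points
X_toy = np.array([[0.0, 0.0],
                  [0.0, 1.0],
                  [5.0, 5.0],
                  [5.0, 6.0]])

# Each point starts as its own singleton cluster
toy_clusters = [[0], [1], [2], [3]]

dist = calculate_distance_matrix(X_toy, toy_clusters, euclidean_distance)
print(dist.shape)   # (4, 4): one row/column per cluster
print(dist[0, 1])   # 1.0: the distance between the first two points
```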
Similarly, we can tweak the agglomerative clustering function to accept a distance metric as an argument. Here's the Python code for the agglomerative clustering algorithm:
```python
def agglomerative_clustering(X, n_clusters, distance_metric):
    # Start with each point in its own cluster
    clusters = [[i] for i in range(len(X))]

    while len(clusters) > n_clusters:
        # Calculate the distance matrix using the method passed via distance_metric
        dist_matrix = calculate_distance_matrix(X, clusters, distance_metric)

        # Find the pair of clusters with the smallest distance
        min_dist = float('inf')
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if dist_matrix[i, j] < min_dist:
                    min_dist = dist_matrix[i, j]
                    idx1, idx2 = i, j

        # Merge the two closest clusters
        clusters[idx1].extend(clusters[idx2])
        clusters.pop(idx2)

    # Assign a label to each point based on its final cluster
    labels = np.empty(len(X), dtype=int)
    for label, cluster in enumerate(clusters):
        for i in cluster:
            labels[i] = label

    return labels
```
Here, we've written a Python function, agglomerative_clustering, which implements agglomerative hierarchical clustering on a given dataset.
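Before running it on a real dataset, a quick sanity check on a few hand-made points (the values are hypothetical) could look like this:

```python
import numpy as np

# Six hypothetical points forming two well-separated groups
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

labels = agglomerative_clustering(X_toy, n_clusters=2, distance_metric=euclidean_distance)
print(labels)  # e.g. [0 0 0 1 1 1] -- each group ends up in its own cluster
```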
Let's first define the dataset that we will use for the clustering:
```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
dataset = load_iris().data

# Scale the dataset with StandardScaler
scaler = StandardScaler()
dataset = scaler.fit_transform(dataset)
```
Next, we can perform clustering with the different distance metrics:
```python
# Perform Agglomerative Clustering
n_clusters = 3

# Euclidean Distance
labels_euc = agglomerative_clustering(dataset, n_clusters, euclidean_distance)

# Manhattan Distance
labels_man = agglomerative_clustering(dataset, n_clusters, manhattan_distance)

# Cosine Distance
labels_cos = agglomerative_clustering(dataset, n_clusters, cosine_distance)
```
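Before plotting, a quick way to check whether the metrics already disagree is to compare how many points each one assigns to every cluster (the exact counts depend on the data, so treat this as an illustration rather than expected output):

```python
import numpy as np

# Number of points assigned to each cluster, per distance metric
print(np.bincount(labels_euc))
print(np.bincount(labels_man))
print(np.bincount(labels_cos))
```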
Lastly, we will see how the different distance measures affect the result of hierarchical clustering. Let's visualize the clustering results:
```python
# Plot the results in 3 subplots, one for each distance metric
fig, axs = plt.subplots(1, 3, figsize=(15, 5))

# Euclidean Distance
axs[0].scatter(dataset[:, 0], dataset[:, 1], c=labels_euc, cmap='viridis')
axs[0].set_title('Euclidean Distance')
axs[0].set_xlabel('Sepal Length')
axs[0].set_ylabel('Sepal Width')

# Manhattan Distance
axs[1].scatter(dataset[:, 0], dataset[:, 1], c=labels_man, cmap='viridis')
axs[1].set_title('Manhattan Distance')
axs[1].set_xlabel('Sepal Length')
axs[1].set_ylabel('Sepal Width')

# Cosine Distance
axs[2].scatter(dataset[:, 0], dataset[:, 1], c=labels_cos, cmap='viridis')
axs[2].set_title('Cosine Distance')
axs[2].set_xlabel('Sepal Length')
axs[2].set_ylabel('Sepal Width')

plt.show()
```
Comparing the three plots shows how the choice of distance metric changes the clustering outcome, even though the underlying data is the same.
Similarly, we can set different distance metrics when using Sklearn's AgglomerativeClustering model. Let's try it out.
```python
from sklearn.cluster import AgglomerativeClustering

# Agglomerative Clustering using sklearn with euclidean distance
model_euc = AgglomerativeClustering(n_clusters=n_clusters, metric='euclidean')
labels_euc = model_euc.fit_predict(dataset)

# Agglomerative Clustering using sklearn with manhattan distance. Ignore the linkage parameter for now.
model_man = AgglomerativeClustering(n_clusters=n_clusters, metric='manhattan', linkage='average')
labels_man = model_man.fit_predict(dataset)

# Agglomerative Clustering using sklearn with cosine distance. Ignore the linkage parameter for now.
model_cos = AgglomerativeClustering(n_clusters=n_clusters, metric='cosine', linkage='average')
labels_cos = model_cos.fit_predict(dataset)
```
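If you want to quantify how much the metrics disagree with one another, one optional check is the adjusted Rand index from sklearn.metrics, which scores 1.0 when two clusterings group the points identically, regardless of the label numbers:

```python
from sklearn.metrics import adjusted_rand_score

# How similar are the partitions produced by different distance metrics?
print(adjusted_rand_score(labels_euc, labels_man))
print(adjusted_rand_score(labels_euc, labels_cos))
```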
If we plot these results the same way as for the custom implementation, we again get one scatter plot per distance metric and can compare how the assignments differ.
Excellent work! You've just learned the concepts and the importance of distance metrics in hierarchical clustering. You've implemented these metrics in Python and applied them in the agglomerative clustering algorithm. Finally, you studied the impact of these distance metrics on the clustering results. Next, get ready to solidify this knowledge through related exercises!