Welcome! Following our exploration of hierarchical clustering, today we turn to dendrograms. These tree diagrams are the standard way to visualize hierarchical clustering because they show, in a single picture, the order in which clusters merge and how far apart they were when they merged. We will learn to read, analyze, and interpret dendrograms using Python.
Before moving forward, let's familiarize ourselves with another common way to implement hierarchical clustering in Python: the linkage() function from the SciPy library. The linkage() function performs agglomerative hierarchical clustering. Its primary inputs are an array of data points and the clustering method. In our case, the clustering method will be 'ward', which at each step merges the two clusters whose union gives the smallest increase in total within-cluster variance.
Here is how we can perform Hierarchical Agglomerative clustering on our cities dataset:
```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Data provided
data = {
    'city': ['Mexico City', 'Los Angeles', 'Bogota', 'Lima', 'Madrid', 'Berlin', 'Kinshasa', 'Lagos', 'New Delhi', 'Beijing', 'Sydney'],
    'long': [-99.1332, -118.2437, -74.0721, -77.0428, -3.7038, 13.4050, 15.2663, 3.3792, 77.2090, 116.4074, 151.2093],
    'lat': [19.4326, 34.0522, 4.7110, -12.0464, 40.4168, 52.5200, -4.4419, 6.5244, 28.6139, 39.9042, -33.8688]
}

# Convert data to an array of coordinates
coordinates = np.column_stack((data['long'], data['lat']))

# Perform hierarchical clustering
Z = linkage(coordinates, method='ward')
```
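It can help to look at what linkage() actually returns before plotting anything. The following is a minimal sketch on the same city coordinates; the variable names n_points and merge_heights are introduced here for illustration only. Each row of the result describes one merge: the two cluster indices being joined, the distance at which they are joined, and the number of original points in the new cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Same (longitude, latitude) pairs as in the cities dataset
coordinates = np.array([
    [-99.1332, 19.4326], [-118.2437, 34.0522], [-74.0721, 4.7110],
    [-77.0428, -12.0464], [-3.7038, 40.4168], [13.4050, 52.5200],
    [15.2663, -4.4419], [3.3792, 6.5244], [77.2090, 28.6139],
    [116.4074, 39.9042], [151.2093, -33.8688],
])

Z = linkage(coordinates, method='ward')

n_points = len(coordinates)
print(Z.shape)  # one row per merge, so (n_points - 1, 4)

# Column 2 holds the merge distances, i.e. the heights drawn in a dendrogram
merge_heights = Z[:, 2]
print(np.round(merge_heights, 2))

# Column 3 holds the size of each newly formed cluster;
# the final merge contains every point
print(int(Z[-1, 3]))
```

Indices below n_points in the first two columns refer to original data points, while an index of n_points + i refers to the cluster created in row i.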
You may wonder how linkage() compares with AgglomerativeClustering from the sklearn library. While the underlying concept is the same, the two differ mainly in their use cases.
linkage() shines for smaller datasets when you want direct access to the full merge hierarchy for plotting a dendrogram, or when you want to supply your own distance metric. AgglomerativeClustering, on the other hand, fits into scikit-learn's estimator workflow, returns labels for a specified number of clusters, and can scale to larger datasets when given a connectivity constraint.
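To make the comparison concrete, here is a minimal sketch that clusters the same city coordinates both ways. The choice of three clusters is an assumption made for illustration, and adjusted_rand_score is used only to check that the two label sets describe the same grouping, since the label numbers themselves need not match.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Same (longitude, latitude) pairs as in the cities dataset
coordinates = np.array([
    [-99.1332, 19.4326], [-118.2437, 34.0522], [-74.0721, 4.7110],
    [-77.0428, -12.0464], [-3.7038, 40.4168], [13.4050, 52.5200],
    [15.2663, -4.4419], [3.3792, 6.5244], [77.2090, 28.6139],
    [116.4074, 39.9042], [151.2093, -33.8688],
])

n_clusters = 3  # illustrative choice, not prescribed by the lesson

# SciPy: build the full hierarchy, then cut it into n_clusters groups
Z = linkage(coordinates, method='ward')
scipy_labels = fcluster(Z, t=n_clusters, criterion='maxclust')

# scikit-learn: ask directly for n_clusters groups
model = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
sklearn_labels = model.fit_predict(coordinates)

# A score of 1.0 means the two partitions are identical up to relabeling
agreement = adjusted_rand_score(scipy_labels, sklearn_labels)
print(scipy_labels)
print(sklearn_labels)
print(agreement)
```

The SciPy route keeps the linkage matrix Z around for plotting, whereas the scikit-learn route goes straight to labels.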
Now that we've computed the hierarchical clustering, it's time to visualize the results. SciPy provides a function for this purpose named dendrogram().
```python
from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt

# Plotting the dendrogram
plt.figure(figsize=(10, 8))
dendrogram(Z, labels=data['city'], leaf_rotation=90, leaf_font_size=12)
plt.title('Dendrogram of Cities based on Geographic Coordinates')
plt.xlabel('Cities')
plt.ylabel('Distance')
plt.show()
```
The result is a tree diagram with the city names along the x-axis and the merge distance on the y-axis.
Now, let's interpret the dendrogram. It shows how the cities are clustered based on their geographic coordinates. Each horizontal line joins two clusters, and its height on the y-axis is the distance at which those two clusters were merged. The vertical lines connect each cluster to the merge above it, so a long vertical line means the cluster stayed separate over a wide range of distances before being merged. The dendrogram can therefore help identify which cities are closest to each other in longitude and latitude. For example, Mexico City and Los Angeles are joined low in the tree, indicating that they are closer to each other than either is to any other city in the dataset. Similarly, Madrid and Berlin are joined low in the tree, indicating that they are close to each other as well. Note that the distances here are Euclidean distances between (longitude, latitude) pairs measured in degrees, not travel distances on the globe.
Roughly put, the algorithm first pairs up the cities that lie closest together, then merges those pairs into larger regional groups, and finally joins all the cities into a single cluster, which gives the dendrogram its structure.
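The merge order described above can be read straight from the linkage matrix. The sketch below replays each merge with city names instead of indices; the members dictionary and merge_log list are helper names introduced for this illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

cities = ['Mexico City', 'Los Angeles', 'Bogota', 'Lima', 'Madrid', 'Berlin',
          'Kinshasa', 'Lagos', 'New Delhi', 'Beijing', 'Sydney']
long = [-99.1332, -118.2437, -74.0721, -77.0428, -3.7038, 13.4050,
        15.2663, 3.3792, 77.2090, 116.4074, 151.2093]
lat = [19.4326, 34.0522, 4.7110, -12.0464, 40.4168, 52.5200,
       -4.4419, 6.5244, 28.6139, 39.9042, -33.8688]

coordinates = np.column_stack((long, lat))
Z = linkage(coordinates, method='ward')

n = len(cities)
# Cluster index -> list of city names; indices 0..n-1 are single cities
members = {i: [name] for i, name in enumerate(cities)}
merge_log = []

for step, (a, b, dist, size) in enumerate(Z):
    left, right = members[int(a)], members[int(b)]
    # The cluster created at this step gets index n + step
    members[n + step] = left + right
    merge_log.append((left, right, dist))
    print(f"{dist:7.2f}  {' + '.join(left)}  <->  {' + '.join(right)}")
```

Reading the printed lines from top to bottom corresponds to reading the dendrogram from bottom to top.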
Interpreting a dendrogram involves understanding its axes. The x-axis contains the labels of the individual data points; in our case, these are the city names. The y-axis represents the distance or dissimilarity between clusters. Therefore, the higher the horizontal line that joins two branches, the larger the distance between the clusters it merges.
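Both axes can also be inspected without drawing anything. As a minimal sketch, calling dendrogram() with no_plot=True returns a dictionary describing the plot: the 'ivl' entry lists the leaf labels in left-to-right order along the x-axis, and 'dcoord' holds the y-coordinates of every link. The names info, leaf_order, and top_height are introduced here for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

cities = ['Mexico City', 'Los Angeles', 'Bogota', 'Lima', 'Madrid', 'Berlin',
          'Kinshasa', 'Lagos', 'New Delhi', 'Beijing', 'Sydney']
long = [-99.1332, -118.2437, -74.0721, -77.0428, -3.7038, 13.4050,
        15.2663, 3.3792, 77.2090, 116.4074, 151.2093]
lat = [19.4326, 34.0522, 4.7110, -12.0464, 40.4168, 52.5200,
       -4.4419, 6.5244, 28.6139, 39.9042, -33.8688]

coordinates = np.column_stack((long, lat))
Z = linkage(coordinates, method='ward')

# Compute the dendrogram layout without rendering a figure
info = dendrogram(Z, labels=cities, no_plot=True)

# x-axis: leaf labels in the order they appear from left to right
leaf_order = info['ivl']
print(leaf_order)

# y-axis: the tallest link is the final merge that joins all cities
top_height = max(max(link) for link in info['dcoord'])
print(round(top_height, 2))
```

Cities that merge early end up next to each other in leaf_order, which is why neighbouring labels on the x-axis tend to belong to the same branch.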
As we move from bottom to top, we move from individual points, through progressively larger clusters, to a single cluster containing every data point. If we cut the dendrogram with a horizontal line at a specific height, each branch the line crosses becomes one cluster, so choosing the height determines the number of clusters.
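Cutting the tree can be done in code with SciPy's fcluster() function. The sketch below shows two ways to cut: by a distance threshold and by a target number of clusters. The threshold of 150 and the target of 3 clusters are assumptions chosen for this dataset, not values prescribed by the lesson.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

cities = ['Mexico City', 'Los Angeles', 'Bogota', 'Lima', 'Madrid', 'Berlin',
          'Kinshasa', 'Lagos', 'New Delhi', 'Beijing', 'Sydney']
long = [-99.1332, -118.2437, -74.0721, -77.0428, -3.7038, 13.4050,
        15.2663, 3.3792, 77.2090, 116.4074, 151.2093]
lat = [19.4326, 34.0522, 4.7110, -12.0464, 40.4168, 52.5200,
       -4.4419, 6.5244, 28.6139, 39.9042, -33.8688]

coordinates = np.column_stack((long, lat))
Z = linkage(coordinates, method='ward')

# Cut at a fixed height: merges above this distance are undone
labels_by_height = fcluster(Z, t=150, criterion='distance')

# Cut so that at most 3 clusters remain
labels_by_count = fcluster(Z, t=3, criterion='maxclust')

# Group city names by their cluster label
groups = {}
for city, label in zip(cities, labels_by_count):
    groups.setdefault(int(label), []).append(city)

for label, names in sorted(groups.items()):
    print(label, names)
```

Lowering the threshold splits the tree into more, smaller clusters, while raising it merges them back together.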
After going through the steps above, you've created, analyzed, and interpreted a dendrogram! Now it's time to put your skills into practice. Try creating dendrograms for different datasets and with different linkage methods, and interpret each one. The more you practice, the more fluent you will become at reading dendrograms. Happy practicing!
And there we have it! You've successfully navigated the world of dendrograms and implemented Agglomerative Hierarchical Clustering in Python. A series of exercises is up next to solidify your newfound knowledge. Remember, practice underpins understanding, especially in programming. Keep forging ahead, and enjoy the process!