Welcome! In this lesson, we take a closer look at the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. We'll run DBSCAN using scikit-learn's sklearn.cluster module and use Matplotlib to visualize its output.
We start by importing the necessary Python libraries: numpy for matrix computation, sklearn.cluster for DBSCAN implementation, sklearn.datasets for synthetic data generation, and Matplotlib for visualizing our clusters.
```python
from sklearn.cluster import DBSCAN
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import make_blobs
```
Next, using the `make_blobs` function from sklearn.datasets, we generate synthetic clusters. This function generates isotropic Gaussian blobs for clustering.
```python
# Configuration options
num_samples_total = 180
cluster_centers = [(3, 3), (7, 7)]
num_features = len(cluster_centers[0])  # each point has two coordinates

# Generate clusters
data, y = make_blobs(n_samples=num_samples_total, centers=cluster_centers,
                     n_features=num_features, center_box=(0, 1),
                     cluster_std=0.5, random_state=42)
```
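In the above code, we generate a total of 180 samples, split evenly between two clusters centered at (3, 3) and (7, 7), each with a standard deviation of 0.5. Note that `center_box` only bounds the cluster centers when `make_blobs` draws them at random. Because we pass explicit centers, `center_box=(0, 1)` has no effect here.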
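As a quick check of the generated data, the sketch below inspects its shape, the number of samples per center, and the per-cluster means. It also compares a call with and without `center_box` to see whether the argument changes anything when explicit centers are given. The variable names `data_boxed` and `same` are illustrative and not part of the lesson's code.

```python
import numpy as np
from sklearn.datasets import make_blobs

cluster_centers = [(3, 3), (7, 7)]

# Same settings as in the lesson, without center_box
data, y = make_blobs(n_samples=180, centers=cluster_centers,
                     cluster_std=0.5, random_state=42)

# Same settings, with center_box added
data_boxed, _ = make_blobs(n_samples=180, centers=cluster_centers,
                           center_box=(0, 1), cluster_std=0.5, random_state=42)

print(data.shape)       # (180, 2): 180 points with two features each
print(np.bincount(y))   # [90 90]: samples are split evenly between the centers

# The per-cluster means should land close to the requested centers
for label, center in enumerate(cluster_centers):
    print(label, center, data[y == label].mean(axis=0).round(2))

# True if center_box made no difference to the generated points
same = np.array_equal(data, data_boxed)
print(same)
```

If `same` is `True`, the two calls produced identical points, which is what we would expect when the centers are supplied explicitly.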
With the data ready, we can now run the DBSCAN algorithm. DBSCAN is governed by two main parameters: `eps` and `min_samples`. `eps` is the maximum distance between two samples for one to be considered in the neighborhood of the other. `min_samples` is the number of samples a neighborhood must contain, counting the point itself, for that point to be considered a core point.
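To make the two parameters concrete, here is a minimal sketch that applies the core-point definition by hand with NumPy, using the same data settings and parameter values as this lesson. It illustrates the definition only and is not how scikit-learn implements the neighbor search internally. The names `distances`, `neighbor_counts`, and `is_core` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=180, centers=[(3, 3), (7, 7)],
                     cluster_std=0.5, random_state=42)

eps = 0.5
min_samples = 13

# Pairwise Euclidean distances between all points, shape (180, 180)
diffs = data[:, None, :] - data[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))

# Size of each point's eps-neighborhood; the point itself is included
# because its distance to itself is 0
neighbor_counts = (distances <= eps).sum(axis=1)

# A point is a core point when its neighborhood holds at least min_samples points
is_core = neighbor_counts >= min_samples
print("core points:", int(is_core.sum()), "of", len(data))
```

Raising `min_samples` or lowering `eps` makes the core-point condition harder to satisfy, so fewer points qualify and more of them end up as noise.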
```python
# Parameters
epsilon = 0.5
min_samples = 13

# Run DBSCAN
dbscan = DBSCAN(eps=epsilon, min_samples=min_samples)
dbscan.fit(data)
```
In the code above, we set `eps` to 0.5 and `min_samples` to 13. We then initialize the DBSCAN object and fit it to our data set.
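Once fitted, the estimator exposes its results as attributes. The sketch below uses `labels_`, `core_sample_indices_`, and `components_` to split the points into core, border, and noise, and shows `fit_predict` as a shortcut that fits and returns the labels in one call. The mask names are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=180, centers=[(3, 3), (7, 7)],
                     cluster_std=0.5, random_state=42)

dbscan = DBSCAN(eps=0.5, min_samples=13).fit(data)

labels = dbscan.labels_  # one label per sample; -1 marks noise

# Boolean mask of core points, built from their indices
core_mask = np.zeros(len(data), dtype=bool)
core_mask[dbscan.core_sample_indices_] = True

# Border points belong to a cluster but are not core points themselves
border_mask = (labels != -1) & ~core_mask
noise_mask = labels == -1

print("core:", int(core_mask.sum()),
      "border:", int(border_mask.sum()),
      "noise:", int(noise_mask.sum()))

# components_ holds the coordinates of the core points
print(dbscan.components_.shape)

# fit_predict fits the model and returns the labels directly
labels_again = DBSCAN(eps=0.5, min_samples=13).fit_predict(data)
```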
Now, let's count the number of clusters formed. Noise, or outliers, are assigned the label '-1' by the DBSCAN algorithm.
```python
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(dbscan.labels_)) - (1 if -1 in dbscan.labels_ else 0)
print('Estimated number of clusters: %d' % n_clusters_)  # Prints 2
```
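Beyond the number of clusters, it is often useful to know how many points fall into each cluster and how many were flagged as noise. A minimal sketch using `np.unique` with `return_counts=True`, under the same data and parameter settings as this lesson, might look like this. The `sizes` dictionary is an illustrative helper, not part of the lesson's code.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=180, centers=[(3, 3), (7, 7)],
                     cluster_std=0.5, random_state=42)
dbscan = DBSCAN(eps=0.5, min_samples=13).fit(data)

# Unique labels and the number of points carrying each one
unique_labels, counts = np.unique(dbscan.labels_, return_counts=True)
sizes = {int(label): int(count) for label, count in zip(unique_labels, counts)}

for label, count in sizes.items():
    name = "noise" if label == -1 else f"cluster {label}"
    print(f"{name}: {count} points")

# Same cluster count as before: every label except -1
n_clusters = len(sizes) - (1 if -1 in sizes else 0)
print("clusters:", n_clusters)
```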
Finally, let's visualize the output of our DBSCAN clustering. Matplotlib is an effective tool to color code each cluster differently and help us better understand the result of our DBSCAN algorithm.
```python
outliers = data[dbscan.labels_ == -1]

# Visualizing the clusters
colors = plt.cm.rainbow(np.linspace(0, 1, len(np.unique(dbscan.labels_))))

for i in range(len(data)):
    plt.plot(data[i][0], data[i][1], 'o', color=colors[int(dbscan.labels_[i] % len(colors))])

# Visualizing the noise points
plt.plot(outliers[:, 0], outliers[:, 1], 'o', color='black')

plt.show()
```
In the plot, each marker is a data point, and its color indicates the cluster it belongs to. The rainbow color map assigns a different color to each label produced by the DBSCAN algorithm, and any noise points are drawn over in black so they stand apart from the clusters.
Congratulations on successfully implementing the DBSCAN algorithm and visualizing the clusters! Get ready for practice exercises that will solidify these new concepts and skills. Practice is the key to mastering any skill, so plunge into the upcoming exercises. Good luck!