Exploring DBSCAN Clustering with Python and scikit-learn

Lesson 2

Introduction and Topic Overview

Welcome! In this lesson, we delve into the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. Using scikit-learn's sklearn.cluster module, we'll run DBSCAN on a synthetic dataset and use Matplotlib to visualize its output. Ready to take a closer look at DBSCAN?

Essential Python Libraries

We start by importing the necessary Python libraries: NumPy for array computation, sklearn.cluster for the DBSCAN implementation, sklearn.datasets for synthetic data generation, and Matplotlib for visualizing our clusters.

Python
from sklearn.cluster import DBSCAN
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import make_blobs

Creating a Synthetic Dataset

Next, using the make_blobs function from sklearn.datasets, we generate synthetic clusters. This function generates isotropic Gaussian blobs for clustering.

Python
# Configuration options
num_samples_total = 180
cluster_centers = [(3, 3), (7, 7)]
num_features = 2  # each sample is a 2D point

# Generate clusters
data, y = make_blobs(n_samples=num_samples_total, centers=cluster_centers, n_features=num_features, cluster_std=0.5, random_state=42)

In the above code, we generate 180 samples split evenly between two clusters centered at (3,3) and (7,7), each with a standard deviation of 0.5. Because the centers are passed explicitly, make_blobs uses them as given. Its center_box option only bounds centers that are drawn at random, so it is not needed here.
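As a quick sanity check on the generated data, the following minimal sketch regenerates the same blobs and inspects their shape, the number of samples per blob, and the per-blob mean. The variable names are illustrative and not part of the lesson's code.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same centers and spread as in the lesson
cluster_centers = [(3, 3), (7, 7)]
data, y = make_blobs(n_samples=180, centers=cluster_centers,
                     cluster_std=0.5, random_state=42)

print(data.shape)      # (n_samples, n_features)
print(np.bincount(y))  # number of samples generated for each blob

# The mean of each blob should land close to the center we asked for
for label, center in enumerate(cluster_centers):
    blob_mean = data[y == label].mean(axis=0)
    print(center, blob_mean.round(2))
```

Note that y holds the blob each sample was drawn from. DBSCAN never sees these labels; they are only useful for checking the result afterwards.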

Running DBSCAN

With the data ready, we can now run the DBSCAN algorithm. DBSCAN is controlled mainly by two parameters: eps and min_samples. eps is the maximum distance between two samples for one to be considered in the neighborhood of the other. min_samples is the minimum number of samples, counting the point itself, that must lie within a point's eps-neighborhood for that point to be considered a core point.
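To make the core-point rule concrete, here is a minimal NumPy sketch that counts how many samples fall within eps of each point and flags the points that reach min_samples. It is a hand-rolled illustration of the definition, not how scikit-learn implements it internally, and the parameter values are simply the ones this lesson uses.

```python
import numpy as np
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=180, centers=[(3, 3), (7, 7)],
                     cluster_std=0.5, random_state=42)
epsilon, min_samples = 0.5, 13

# Euclidean distance between every pair of points, shape (180, 180)
diffs = data[:, None, :] - data[None, :, :]
distances = np.linalg.norm(diffs, axis=-1)

# Size of each eps-neighborhood; the point itself counts (distance 0)
neighbor_counts = (distances <= epsilon).sum(axis=1)

# A point is a core point when its neighborhood holds at least min_samples points
is_core = neighbor_counts >= min_samples
print("Core points:", int(is_core.sum()), "of", len(data))
```

Points that are not core but lie within eps of a core point become border points of that cluster; everything else is treated as noise.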

Python
# Parameters
epsilon = 0.5
min_samples = 13

# Run DBSCAN
dbscan = DBSCAN(eps=epsilon, min_samples=min_samples)
dbscan.fit(data)

In the code above, we set eps to 0.5 and min_samples to 13. We then initialize the DBSCAN object and fit it to our data set.
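To get a feel for how much the result depends on eps, the sketch below reruns DBSCAN with a few illustrative eps values around 0.5 while keeping min_samples at 13. The values 0.2 and 1.0 are arbitrary choices for demonstration, not recommendations.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=180, centers=[(3, 3), (7, 7)],
                     cluster_std=0.5, random_state=42)

# Illustrative eps values around the lesson's 0.5; min_samples stays at 13
results = {}
for eps in (0.2, 0.5, 1.0):
    labels = DBSCAN(eps=eps, min_samples=13).fit_predict(data)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

With min_samples fixed, shrinking eps can only turn more points into noise, while growing it absorbs more points into clusters and may eventually merge clusters that sit close together.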

Now, let's count the number of clusters formed. Noise, or outliers, are assigned the label '-1' by the DBSCAN algorithm.

Python
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(dbscan.labels_)) - (1 if -1 in dbscan.labels_ else 0)
print('Estimated number of clusters: %d' % n_clusters_) # Prints 2
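Beyond the number of clusters, it is often useful to know how many points landed in each cluster and how many were marked as noise. A minimal sketch using np.unique, assuming the same data and parameters as in this lesson:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=180, centers=[(3, 3), (7, 7)],
                     cluster_std=0.5, random_state=42)
labels = DBSCAN(eps=0.5, min_samples=13).fit_predict(data)

# Count how many points received each label; -1 marks noise
unique_labels, counts = np.unique(labels, return_counts=True)
for label, count in zip(unique_labels, counts):
    name = "noise" if label == -1 else f"cluster {label}"
    print(f"{name}: {count} points")

n_noise = int(np.sum(labels == -1))
print("Noise points:", n_noise)
```

Here fit_predict is a shortcut for calling fit and then reading labels_.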

Visualizing DBSCAN Clusters with Matplotlib

Finally, let's visualize the output of our DBSCAN clustering. Matplotlib is an effective tool to color code each cluster differently and help us better understand the result of our DBSCAN algorithm.

Python
# Separate the cluster labels from the noise label (-1)
labels = dbscan.labels_
cluster_labels = [label for label in np.unique(labels) if label != -1]

# One rainbow color per cluster
colors = plt.cm.rainbow(np.linspace(0, 1, len(cluster_labels)))

# Plot each cluster in its own color
for label, color in zip(cluster_labels, colors):
    members = data[labels == label]
    plt.plot(members[:, 0], members[:, 1], 'o', color=color)

# Plot the noise points in black
outliers = data[labels == -1]
plt.plot(outliers[:, 0], outliers[:, 1], 'o', color='black')

plt.show()

In the plot, each marker is a data point, and its color indicates the cluster it belongs to. The rainbow color map assigns a different color to each cluster label produced by DBSCAN, and any noise points are drawn in black so they stand out from the clusters.

Lesson Summary and Practice Exercises

Congratulations on successfully running the DBSCAN algorithm and visualizing the clusters! Next come practice exercises to solidify these new concepts and skills. Practice is the key to mastering any skill, so dive into the upcoming exercises. Good luck!
