Welcome to our introduction to Density-Based Clustering and the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm! Clustering techniques group data based on given measures, with traditional techniques like K-Means focusing on distances between data points. However, Density-Based Clustering seeks areas of high density separated by areas of low density, making it adaptive to a variety of cluster shapes and sizes.
Today, we'll look into DBSCAN's workings and apply it to Python. By the end, you'll be proficient in Density-Based Clustering and be ready to use DBSCAN on a real-world clustering problem.
DBSCAN is a density-based clustering algorithm that groups together points based on their density. It has two main parameters: eps
(epsilon) and MinPts
. Here's a brief overview of the algorithm:
0
(unvisited) and create an empty list of clusters.eps
distance.MinPts
neighbors, it's a core point. Expand the cluster by adding all reachable points to the cluster.DBSCAN classifies points into three categories: core points, border points, and noise points. Core points have at least MinPts
neighbors within eps
distance, border points are within eps
distance of a core point, and noise points are neither core nor border points.
To demonstrate clustering with DBSCAN, we'll first generate some synthetic data using the make_blobs
function from sklearn.datasets
. This function generates isotropic Gaussian distributions, which are perfect for forming distinct 'blobs,' or clusters, of data.
Python1from matplotlib import pyplot as plt 2import numpy as np 3import matplotlib.pyplot as plt 4from sklearn.datasets import make_blobs 5 6# Generate synthetic data with two cluster centers and 180 samples 7num_samples_total = 180 8cluster_centers = [(3,3), (7,7)] 9num_classes = len(cluster_centers) 10epsilon = 1.0 11min_samples = 13 12 13data, y = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 0.5, random_state = 1)
Next, we'll implement the DBSCAN algorithm itself, along with some necessary helper functions. Specifically, we'll be creating a distance function, a function to perform a region query, and a function to grow a cluster.
To implement DBSCAN, we first need to calculate the distance between points. For this, we'll use the Euclidean distance, which forms the basis for many machine learning algorithms. The formula for Euclidean distance between two points x
and y
in a 2D space is:
We'll define a function calculate_distance
to compute this distance:
Python1# Define the distance function 2def calculate_distance(x, y): 3 return np.sqrt(np.sum((x - y) ** 2))
The region_query
function returns all points within a given radius eps
from a point P
. This helps DBSCAN identify densely populated areas.
Python1def region_query(D, P, eps): 2 neighbors = [] 3 for i in range(D.shape[0]): 4 if calculate_distance(D[i], P) < eps: 5 neighbors.append(i) 6 return neighbors
The grow_cluster
function expands discovered clusters by checking the eps
neighbors and including those that meet the MinPts
condition.
Python1def grow_cluster(D, labels, P, NeighborPts, C, eps, MinPts): 2 labels[P] = C 3 i = 0 4 while i < len(NeighborPts): 5 Pn = NeighborPts[i] 6 if labels[Pn] == -1 or labels[Pn] == 0: 7 labels[Pn] = C 8 PnNeighborPts = region_query(D, D[Pn], eps) 9 if len(PnNeighborPts) >= MinPts: 10 NeighborPts += [p for p in PnNeighborPts if labels[p] == 0] 11 i += 1
Next, we will use our helper functions as building blocks for DBSCAN implementation.
Python1def DBSCAN(D, eps, MinPts): 2 C = 0 3 labels = np.zeros(D.shape[0]) 4 for P in range(D.shape[0]): 5 if labels[P] != 0: 6 continue 7 NeighborPts = region_query(D, D[P], eps) 8 if len(NeighborPts) < MinPts: 9 labels[P] = -1 10 else: 11 C += 1 12 grow_cluster(D, labels, P, NeighborPts, C, eps, MinPts) 13 return labels
Finally, we apply DBSCAN to our dataset with defined eps
and MinPts
.
Python1labels = DBSCAN(D=data, eps=0.5, MinPts=5)
We represent grouped and non-grouped (Noise) data points using the matplotlib library. Here, '-1' marks Noise and is colored black
Python1# Separating noise points 2noise_points = data[labels == -1] 3 4# Visualizing the clusters 5colors = plt.cm.rainbow(np.linspace(0, 1, len(np.unique(labels)))) 6 7for i in range(len(data)): 8 plt.plot(data[i][0], data[i][1], 'o', color=colors[int(labels[i] % len(colors))]) 9 10# Visualizing the noise points 11plt.plot(noise_points[:, 0], noise_points[:, 1], 'o', color='black') 12plt.title('Noise points') 13plt.xlabel('Feature 1') 14plt.ylabel('Feature 2') 15 16plt.show()
Each color here represents a different cluster, helping us visualize the result of our DBSCAN run:
And we're done! You dove deep into Density-Based Clustering, implemented DBSCAN, and glimpsed the mathematical equations powering DBSCAN. As we move to the next set of exercises, apply DBSCAN to a variety of datasets and experiment with eps
and MinPts
parameters to observe their impact. Enjoy exploring clusters!