Understanding DBSCAN: A Guide to Density-Based Clustering in Python

Lesson 1

Introduction to Density-Based Clustering and DBSCAN

Welcome to our introduction to Density-Based Clustering and the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm! Clustering techniques group data based on given measures, with traditional techniques like K-Means focusing on distances between data points. However, Density-Based Clustering seeks areas of high density separated by areas of low density, making it adaptive to a variety of cluster shapes and sizes.

Today, we'll look into DBSCAN's workings and apply it to Python. By the end, you'll be proficient in Density-Based Clustering and be ready to use DBSCAN on a real-world clustering problem.

Understanding DBSAN Algorithm Steps

DBSCAN is a density-based clustering algorithm that groups together points based on their density. It has two main parameters: eps (epsilon) and MinPts. Here's a brief overview of the algorithm:

Initialize: Assign each point a label of 0 (unvisited) and create an empty list of clusters.
Iterate over points: For each point, check if it's visited. If not, mark it as visited and find its neighbors within eps distance.
Expand clusters: If a point has more than MinPts neighbors, it's a core point. Expand the cluster by adding all reachable points to the cluster.
Repeat: Continue until all points are visited.

DBSCAN classifies points into three categories: core points, border points, and noise points. Core points have at least MinPts neighbors within eps distance, border points are within eps distance of a core point, and noise points are neither core nor border points.

Simulating Clustered Data with `make_blobs`

To demonstrate clustering with DBSCAN, we'll first generate some synthetic data using the make_blobs function from sklearn.datasets. This function generates isotropic Gaussian distributions, which are perfect for forming distinct 'blobs,' or clusters, of data.

Python
1from matplotlib import pyplot as plt
2import numpy as np
3import matplotlib.pyplot as plt
4from sklearn.datasets import make_blobs
5
6# Generate synthetic data with two cluster centers and 180 samples
7num_samples_total = 180
8cluster_centers = [(3,3), (7,7)]
9num_classes = len(cluster_centers)
10epsilon = 1.0
11min_samples = 13
12
13data, y = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 0.5, random_state = 1)

DBSCAN Algorithm and Supporting Functions

Next, we'll implement the DBSCAN algorithm itself, along with some necessary helper functions. Specifically, we'll be creating a distance function, a function to perform a region query, and a function to grow a cluster.

Distance Calculation

To implement DBSCAN, we first need to calculate the distance between points. For this, we'll use the Euclidean distance, which forms the basis for many machine learning algorithms. The formula for Euclidean distance between two points x and y in a 2D space is:

Distance(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}

We'll define a function calculate_distance to compute this distance:

Python
1# Define the distance function
2def calculate_distance(x, y):
3    return np.sqrt(np.sum((x - y) ** 2))

Region Query

The region_query function returns all points within a given radius eps from a point P. This helps DBSCAN identify densely populated areas.

Python
1def region_query(D, P, eps):
2    neighbors = []
3    for i in range(D.shape[0]):
4        if calculate_distance(D[i], P) < eps:
5            neighbors.append(i)
6    return neighbors

Growing a Cluster

The grow_cluster function expands discovered clusters by checking the eps neighbors and including those that meet the MinPts condition.

Python
1def grow_cluster(D, labels, P, NeighborPts, C, eps, MinPts):
2    labels[P] = C
3    i = 0
4    while i < len(NeighborPts):
5        Pn = NeighborPts[i]
6        if labels[Pn] == -1 or labels[Pn] == 0:
7            labels[Pn] = C
8            PnNeighborPts = region_query(D, D[Pn], eps)
9            if len(PnNeighborPts) >= MinPts:
10                NeighborPts += [p for p in PnNeighborPts if labels[p] == 0]
11        i += 1

Implementing DBSCAN Algorithm

Next, we will use our helper functions as building blocks for DBSCAN implementation.

Python
1def DBSCAN(D, eps, MinPts):
2    C = 0
3    labels = np.zeros(D.shape[0])
4    for P in range(D.shape[0]):
5        if labels[P] != 0:
6            continue
7        NeighborPts = region_query(D, D[P], eps)
8        if len(NeighborPts) < MinPts:
9            labels[P] = -1
10        else:
11            C += 1
12            grow_cluster(D, labels, P, NeighborPts, C, eps, MinPts)
13    return labels

Applying DBSCAN

Finally, we apply DBSCAN to our dataset with defined eps and MinPts.

Python
1labels = DBSCAN(D=data, eps=0.5, MinPts=5)

Graphical Visualization of DBSCAN Result

We represent grouped and non-grouped (Noise) data points using the matplotlib library. Here, '-1' marks Noise and is colored black

Python
1# Separating noise points
2noise_points = data[labels == -1]
3
4# Visualizing the clusters
5colors = plt.cm.rainbow(np.linspace(0, 1, len(np.unique(labels))))
6
7for i in range(len(data)):
8    plt.plot(data[i][0], data[i][1], 'o', color=colors[int(labels[i] % len(colors))])
9    
10# Visualizing the noise points
11plt.plot(noise_points[:, 0], noise_points[:, 1], 'o', color='black')
12plt.title('Noise points')
13plt.xlabel('Feature 1')
14plt.ylabel('Feature 2')
15
16plt.show()

Each color here represents a different cluster, helping us visualize the result of our DBSCAN run:

Lesson Summary and Practice

And we're done! You dove deep into Density-Based Clustering, implemented DBSCAN, and glimpsed the mathematical equations powering DBSCAN. As we move to the next set of exercises, apply DBSCAN to a variety of datasets and experiment with eps and MinPts parameters to observe their impact. Enjoy exploring clusters!

Enjoy this lesson? Now it's time to practice with Cosmo!

Practice is how you turn knowledge into actual skills.