Lesson 1

Welcome to our introduction to Density-Based Clustering and the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm! Clustering techniques group data based on given measures, with traditional techniques like K-Means focusing on distances between data points. However, Density-Based Clustering seeks areas of high density separated by areas of low density, making it adaptive to a variety of cluster shapes and sizes.

Today, we'll look into DBSCAN's workings and apply it to Python. By the end, you'll be proficient in Density-Based Clustering and be ready to use DBSCAN on a real-world clustering problem.

DBSCAN is a density-based clustering algorithm that groups together points based on their density. It has two main parameters: `eps`

(epsilon) and `MinPts`

. Here's a brief overview of the algorithm:

**Initialize**: Assign each point a label of`0`

(unvisited) and create an empty list of clusters.**Iterate over points**: For each point, check if it's visited. If not, mark it as visited and find its neighbors within`eps`

distance.**Expand clusters**: If a point has more than`MinPts`

neighbors, it's a core point. Expand the cluster by adding all reachable points to the cluster.**Repeat**: Continue until all points are visited.

DBSCAN classifies points into three categories: core points, border points, and noise points. Core points have at least `MinPts`

neighbors within `eps`

distance, border points are within `eps`

distance of a core point, and noise points are neither core nor border points.

To demonstrate clustering with DBSCAN, we'll first generate some synthetic data using the `make_blobs`

function from `sklearn.datasets`

. This function generates isotropic Gaussian distributions, which are perfect for forming distinct 'blobs,' or clusters, of data.

Python`1from matplotlib import pyplot as plt 2import numpy as np 3import matplotlib.pyplot as plt 4from sklearn.datasets import make_blobs 5 6# Generate synthetic data with two cluster centers and 180 samples 7num_samples_total = 180 8cluster_centers = [(3,3), (7,7)] 9num_classes = len(cluster_centers) 10epsilon = 1.0 11min_samples = 13 12 13data, y = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 0.5, random_state = 1)`

Next, we'll implement the DBSCAN algorithm itself, along with some necessary helper functions. Specifically, we'll be creating a distance function, a function to perform a region query, and a function to grow a cluster.

To implement DBSCAN, we first need to calculate the distance between points. For this, we'll use the Euclidean distance, which forms the basis for many machine learning algorithms. The formula for Euclidean distance between two points `x`

and `y`

in a 2D space is:

We'll define a function `calculate_distance`

to compute this distance:

Python`1# Define the distance function 2def calculate_distance(x, y): 3 return np.sqrt(np.sum((x - y) ** 2))`

The `region_query`

function returns all points within a given radius `eps`

from a point `P`

. This helps DBSCAN identify densely populated areas.

Python`1def region_query(D, P, eps): 2 neighbors = [] 3 for i in range(D.shape[0]): 4 if calculate_distance(D[i], P) < eps: 5 neighbors.append(i) 6 return neighbors`

The `grow_cluster`

function expands discovered clusters by checking the `eps`

neighbors and including those that meet the `MinPts`

condition.

Python`1def grow_cluster(D, labels, P, NeighborPts, C, eps, MinPts): 2 labels[P] = C 3 i = 0 4 while i < len(NeighborPts): 5 Pn = NeighborPts[i] 6 if labels[Pn] == -1 or labels[Pn] == 0: 7 labels[Pn] = C 8 PnNeighborPts = region_query(D, D[Pn], eps) 9 if len(PnNeighborPts) >= MinPts: 10 NeighborPts += [p for p in PnNeighborPts if labels[p] == 0] 11 i += 1`

Next, we will use our helper functions as building blocks for DBSCAN implementation.

Python`1def DBSCAN(D, eps, MinPts): 2 C = 0 3 labels = np.zeros(D.shape[0]) 4 for P in range(D.shape[0]): 5 if labels[P] != 0: 6 continue 7 NeighborPts = region_query(D, D[P], eps) 8 if len(NeighborPts) < MinPts: 9 labels[P] = -1 10 else: 11 C += 1 12 grow_cluster(D, labels, P, NeighborPts, C, eps, MinPts) 13 return labels`

Finally, we apply DBSCAN to our dataset with defined `eps`

and `MinPts`

.

Python`1labels = DBSCAN(D=data, eps=0.5, MinPts=5)`

We represent grouped and non-grouped (Noise) data points using the matplotlib library. Here, '-1' marks Noise and is colored black

Python`1# Separating noise points 2noise_points = data[labels == -1] 3 4# Visualizing the clusters 5colors = plt.cm.rainbow(np.linspace(0, 1, len(np.unique(labels)))) 6 7for i in range(len(data)): 8 plt.plot(data[i][0], data[i][1], 'o', color=colors[int(labels[i] % len(colors))]) 9 10# Visualizing the noise points 11plt.plot(noise_points[:, 0], noise_points[:, 1], 'o', color='black') 12plt.title('Noise points') 13plt.xlabel('Feature 1') 14plt.ylabel('Feature 2') 15 16plt.show()`

Each color here represents a different cluster, helping us visualize the result of our DBSCAN run:

And we're done! You dove deep into Density-Based Clustering, implemented DBSCAN, and glimpsed the mathematical equations powering DBSCAN. As we move to the next set of exercises, apply DBSCAN to a variety of datasets and experiment with `eps`

and `MinPts`

parameters to observe their impact. Enjoy exploring clusters!