Lesson 4

Mastering DBSCAN: From Basics to Implementation

Introduction and Overview of DBSCAN

Greetings to aspiring data scientists! Today, we'll unlock the curious world of the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm. Standing out in the clustering landscape, DBSCAN is famous for its resilience to outliers and for eliminating the need for pre-set cluster numbers. This lesson will demystify DBSCAN through a Python-based implementation from scratch.

Let's start by peeling off the layers of DBSCAN. At its core, DBSCAN operates on concepts of density and noise. It identifies clusters as regions of high density separated by lower-density regions. Concurrently, it classifies low-density entities as noise, enhancing its robustness towards outliers. The secret recipe behind DBSCAN? A pair of parameters: Epsilon (Eps) and Minimum Points (MinPts), which guide the classification of points into categories of 'core', 'border', or 'outlier'.

With a foundational understanding, it's time to roll up our sleeves and implement DBSCAN from scratch.

Creating a Toy Dataset

We'll create a simple toy dataset using numpy arrays for the first hands-on task. This dataset represents a collection of points on a map that we'll be clustering.

Python
1data_points = np.array([ 2 [1.2, 1.9], [2.1, 2], [2, 3.5], [3.3, 3.9], [3.2, 5.1], 3 [8.5, 7.9], [8.1, 7.8], [9.5, 6.5], [9.5, 7.2], [7.7, 8.6], 4 [6.0, 6.0] 5])
Distance Function

Next, we'll devise a function to calculate the Euclidean distance between the data points. The function uses numpy's linalg.norm to evaluate this distance, which reflects the shortest possible distance between two points.

Python
1def euclidean_distance(a, b): 2 return np.linalg.norm(a - b, axis=-1)

This function evaluates the Euclidean distance d(a,b)d(a, b) between points aa and bb using the formula:

d(a,b)=(a1b1)2+(a2b2)2++(anbn)2d(a, b) = \sqrt{{(a_1 - b_1)}^2 + {(a_2 - b_2)}^2 + \dots + {(a_n - b_n)}^2}

where a=(a1,a2,,an)a = (a_1, a_2, \dots, a_n) and b=(b1,b2,,bn)b = (b_1, b_2, \dots, b_n).

Setting Initial Point Labels

Armed with a dataset and the Euclidean distance function, we are prepared to implement DBSCAN.

We will use the following labels:

  • 0 is noise or outlier data points, not belonging to any cluster.
  • 1 is the data points in the first identified cluster.
  • 2 is the data points in the second identified cluster.

Our function initially labels each point as an outlier. It then verifies if each point has at least MinPts within an Eps radius. If this condition is satisfied, the point qualifies as a core point. The code block below demonstrates these steps.

Python
1def dbscan(data, Eps, MinPt): 2 point_label = [0] * len(data) 3 # Initialize list to maintain count of surrounding points within radius Eps for each point. 4 point_count = [] 5 core = [] 6 noncore = [] 7 8 # Check for each point if it falls within the Eps radius of point at index i 9 for i in range(len(data)): 10 point_count.append([]) 11 for j in range(len(data)): 12 if euclidean_distance(data[i], data[j]) <= Eps and i != j: 13 point_count[i].append(j) 14 15 # If a point has atleast MinPt points within its Eps radius (excluding itself), classify it as a core point, and vice versa 16 if len(point_count[i]) >= MinPt: 17 core.append(i) 18 else: 19 noncore.append(i) 20 ...
Mapping Points to Clusters

With the mechanism to classify points into 'core', 'non-core', and 'outliers' in place, we can now map each unvisited core point and its reachable neighbors to unique clusters identified by an ID.

Python
1 ... 2 ID = 1 3 for point in core: 4 # If the point has not been assigned to a cluster yet 5 if point_label[point] == 0: 6 point_label[point] = ID 7 # Create an empty list to hold 'neighbour points' 8 queue = [] 9 for x in point_count[point]: 10 if point_label[x] == 0: 11 point_label[x] = ID 12 # If neighbor point is also a core point, add it to the queue 13 if x in core: 14 queue.append(x) 15 16 # Check points from the queue 17 while queue: 18 neighbours = point_count[queue.pop(0)] 19 for y in neighbours: 20 if point_label[y] == 0: 21 point_label[y] = ID 22 if y in core: 23 queue.append(y) 24 ID += 1 25 26 return point_label

This code iterates over core points and assesses its neighbors for each. If a neighbor has not been assigned to a cluster, it's assigned to the current core point's cluster. Core points among these neighbors are put in a queue to repeat the same process. Once all points in a cluster are labeled, they go to the next cluster. The final output lists all points with their respective cluster IDs.

Visualization and Results Interpretation

With the DBSCAN function ready, let's test it with our toy dataset and visualize the result using matplotlib.

Python
1labels = dbscan(data_points, 2, 2) 2 3for i in range(len(labels)): 4 if labels[i] == 1: 5 plt.scatter(data_points[i][0], data_points[i][1], s=100, c='r') 6 elif labels[i] == 2: 7 plt.scatter(data_points[i][0], data_points[i][1], s=100, c='g') 8 else: 9 plt.scatter(data_points[i][0], data_points[i][1], s=100, c='b') 10 11plt.show()

Here is the resulting plot. Red and green dots represent two separate clusters. Blue dots are outliers.

Colors in the plot represent different clusters, showcasing how changing Eps and MinPts influences the clustering. The strength of DBSCAN lies in its ability to group densities, as vividly demonstrated in the plot.

Lesson Summary and Practice

Well done! You've braved the challenging world of DBSCAN, mastering its theory and Python-based implementation. What's next? A set of practice exercises is designed to reinforce your newly acquired knowledge. Remember, practice makes perfect! So, don your coding hat and brace yourself for a fascinating ride into Python practice!

Enjoy this lesson? Now it's time to practice with Cosmo!

Practice is how you turn knowledge into actual skills.