Greetings to aspiring data scientists! Today, we'll unlock the curious world of the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm. Standing out in the clustering landscape, DBSCAN is famous for its resilience to outliers and for eliminating the need for pre-set cluster numbers. This lesson will demystify DBSCAN through a Python-based implementation from scratch.
Let's start by peeling off the layers of DBSCAN. At its core, DBSCAN operates on concepts of density and noise. It identifies clusters as regions of high density separated by lower-density regions. Concurrently, it classifies low-density entities as noise, enhancing its robustness towards outliers. The secret recipe behind DBSCAN? A pair of parameters: Epsilon (Eps
) and Minimum Points (MinPts
), which guide the classification of points into categories of 'core', 'border', or 'outlier'.
With a foundational understanding, it's time to roll up our sleeves and implement DBSCAN from scratch.
We'll create a simple toy dataset using numpy
arrays for the first hands-on task. This dataset represents a collection of points on a map that we'll be clustering.
Python1data_points = np.array([ 2 [1.2, 1.9], [2.1, 2], [2, 3.5], [3.3, 3.9], [3.2, 5.1], 3 [8.5, 7.9], [8.1, 7.8], [9.5, 6.5], [9.5, 7.2], [7.7, 8.6], 4 [6.0, 6.0] 5])
Next, we'll devise a function to calculate the Euclidean distance between the data points. The function uses numpy's linalg.norm
to evaluate this distance, which reflects the shortest possible distance between two points.
Python1def euclidean_distance(a, b): 2 return np.linalg.norm(a - b, axis=-1)
This function evaluates the Euclidean distance between points and using the formula:
where and .
Armed with a dataset and the Euclidean distance function, we are prepared to implement DBSCAN.
We will use the following labels:
0
is noise or outlier data points, not belonging to any cluster.1
is the data points in the first identified cluster.2
is the data points in the second identified cluster.
Our function initially labels each point as an outlier. It then verifies if each point has at least MinPts
within an Eps
radius. If this condition is satisfied, the point qualifies as a core point. The code block below demonstrates these steps.
Python1def dbscan(data, Eps, MinPt): 2 point_label = [0] * len(data) 3 # Initialize list to maintain count of surrounding points within radius Eps for each point. 4 point_count = [] 5 core = [] 6 noncore = [] 7 8 # Check for each point if it falls within the Eps radius of point at index i 9 for i in range(len(data)): 10 point_count.append([]) 11 for j in range(len(data)): 12 if euclidean_distance(data[i], data[j]) <= Eps and i != j: 13 point_count[i].append(j) 14 15 # If a point has atleast MinPt points within its Eps radius (excluding itself), classify it as a core point, and vice versa 16 if len(point_count[i]) >= MinPt: 17 core.append(i) 18 else: 19 noncore.append(i) 20 ...
With the mechanism to classify points into 'core', 'non-core', and 'outliers' in place, we can now map each unvisited core point and its reachable neighbors to unique clusters identified by an ID.
Python1 ... 2 ID = 1 3 for point in core: 4 # If the point has not been assigned to a cluster yet 5 if point_label[point] == 0: 6 point_label[point] = ID 7 # Create an empty list to hold 'neighbour points' 8 queue = [] 9 for x in point_count[point]: 10 if point_label[x] == 0: 11 point_label[x] = ID 12 # If neighbor point is also a core point, add it to the queue 13 if x in core: 14 queue.append(x) 15 16 # Check points from the queue 17 while queue: 18 neighbours = point_count[queue.pop(0)] 19 for y in neighbours: 20 if point_label[y] == 0: 21 point_label[y] = ID 22 if y in core: 23 queue.append(y) 24 ID += 1 25 26 return point_label
This code iterates over core points and assesses its neighbors for each. If a neighbor has not been assigned to a cluster, it's assigned to the current core point's cluster. Core points among these neighbors are put in a queue to repeat the same process. Once all points in a cluster are labeled, they go to the next cluster. The final output lists all points with their respective cluster IDs.
With the DBSCAN function ready, let's test it with our toy dataset and visualize the result using matplotlib.
Python1labels = dbscan(data_points, 2, 2) 2 3for i in range(len(labels)): 4 if labels[i] == 1: 5 plt.scatter(data_points[i][0], data_points[i][1], s=100, c='r') 6 elif labels[i] == 2: 7 plt.scatter(data_points[i][0], data_points[i][1], s=100, c='g') 8 else: 9 plt.scatter(data_points[i][0], data_points[i][1], s=100, c='b') 10 11plt.show()
Here is the resulting plot. Red and green dots represent two separate clusters. Blue dots are outliers.
Colors in the plot represent different clusters, showcasing how changing Eps and MinPts influences the clustering. The strength of DBSCAN lies in its ability to group densities, as vividly demonstrated in the plot.
Well done! You've braved the challenging world of DBSCAN, mastering its theory and Python-based implementation. What's next? A set of practice exercises is designed to reinforce your newly acquired knowledge. Remember, practice makes perfect! So, don your coding hat and brace yourself for a fascinating ride into Python practice!