Lesson 3
Tuning DBSCAN Parameters
Introduction

Welcome! Today, we'll delve deeper into Density-Based Spatial Clustering of Applications with Noise (DBSCAN) parameters. In particular, we'll explore one crucial yet often overlooked parameter - the 'metric' parameter, which defines the distance function used for clustering.

Generation of Data

Let's start by generating some sample data for our experiment. We'll use the make_moons function from sklearn.datasets to generate some non-linearly separable data:

Python
from sklearn.datasets import make_moons
X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)

This code generates 200 data points which form two crescent-shaped groups. This type of data is often used to illustrate the capabilities of algorithms like DBSCAN that can capture complex cluster shapes.
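Before clustering anything, it can help to glance at the data itself. Here is a small optional sketch that plots the generated points with matplotlib (no cluster colors yet, since we haven't run DBSCAN):

Python
import matplotlib.pyplot as plt

# Plot the raw moons to see the non-linear shapes DBSCAN will need to capture
plt.scatter(X[:, 0], X[:, 1], s=15)
plt.title('Raw make_moons data (no clustering yet)')
plt.show()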

Revisiting Distance Metrics

Before continuing, let's quickly revisit what we mean by distance in the context of the DBSCAN algorithm. The definition of what makes two data points "neighbors" is fundamental to how DBSCAN works, and that definition is rooted in our concept of distance.

We mostly use the Euclidean distance for finding neighbors, but other distance metrics can be employed depending on the problem context. Below is a brief reminder of the common distance metrics:

Euclidean Distance: The Euclidean distance between points $P_1(p_1, q_1)$ and $P_2(p_2, q_2)$ in 2D space is $\sqrt{(p_2 - p_1)^2 + (q_2 - q_1)^2}$, which is based on the Pythagorean theorem.

Manhattan Distance: Used mainly for high-dimensional vectors, the Manhattan distance between two points is the sum of the absolute differences of their coordinates. For example, the Manhattan distance between $P_1(p_1, q_1)$ and $P_2(p_2, q_2)$ is $|p_2 - p_1| + |q_2 - q_1|$.

Cosine Distance: Based on the angle between two vectors, the cosine distance is derived from the cosine similarity, which is the dot product of the vectors divided by the product of their magnitudes. The cosine distance is $1$ minus this similarity, so vectors pointing in the same direction have a distance close to $0$.
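To make these formulas concrete, here is a minimal sketch that computes all three distances with NumPy for two arbitrary example points (chosen purely for illustration):

Python
import numpy as np

# Two arbitrary example points (illustrative only)
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Euclidean: straight-line distance
euclidean = np.sqrt(np.sum((q - p) ** 2))   # 5.0

# Manhattan: sum of absolute coordinate differences
manhattan = np.sum(np.abs(q - p))           # 7.0

# Cosine: 1 minus the cosine of the angle between the two vectors
cosine = 1 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

print(euclidean, manhattan, cosine)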

Let's now proceed with configuring the DBSCAN with different distance metrics.

DBSCAN with Different Distance Metrics

DBSCAN uses a distance function to discover neighbors around a given data point. By changing this function, we can significantly influence the results of our clustering. Let's fit DBSCAN to our data using three popular distances: Manhattan, Euclidean, and Cosine.

Firstly, we'll import the DBSCAN method from sklearn.cluster. Then, we set up and fit DBSCAN models using different metrics:

Python
from sklearn.cluster import DBSCAN

dbscan1 = DBSCAN(eps=0.2, min_samples=5, metric='manhattan')
dbscan1.fit(X)

dbscan2 = DBSCAN(eps=0.2, min_samples=5, metric='euclidean')
dbscan2.fit(X)

dbscan3 = DBSCAN(eps=0.2, min_samples=5, metric='cosine')
dbscan3.fit(X)

The Euclidean metric calculates the straight-line distance between two points, the Manhattan metric measures the sum of the absolute differences of their coordinates, and the Cosine metric measures dissimilarity based on the angle between the points treated as vectors.
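Even before plotting, you can compare the fitted models numerically. The following optional sketch (assuming the three fitted models from above) counts the clusters and noise points each metric produces; DBSCAN labels noise points with -1:

Python
import numpy as np

# Compare how many clusters and noise points each metric yields
for name, model in [('manhattan', dbscan1), ('euclidean', dbscan2), ('cosine', dbscan3)]:
    labels = model.labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # exclude the noise label
    n_noise = np.sum(labels == -1)
    print(f'{name}: {n_clusters} clusters, {n_noise} noise points')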

Visualizing the Results

Now that we have our models, let's visualize how different distance metrics affect the DBSCAN results:

Python
import matplotlib.pyplot as plt

plt.figure(figsize=(15, 5))

plt.subplot(131)
plt.scatter(X[:, 0], X[:, 1], c=dbscan1.labels_)
plt.title('DBSCAN with Manhattan distance')

plt.subplot(132)
plt.scatter(X[:, 0], X[:, 1], c=dbscan2.labels_)
plt.title('DBSCAN with Euclidean distance')

plt.subplot(133)
plt.scatter(X[:, 0], X[:, 1], c=dbscan3.labels_)
plt.title('DBSCAN with Cosine distance')

plt.show()

In these scatterplots, each point represents a sample, with the color indicating the cluster assigned by DBSCAN:

[Figure: three scatter plots showing the DBSCAN results with Manhattan, Euclidean, and Cosine distances]

The 'algorithm' Parameter in DBSCAN

Another important parameter in DBSCAN is the 'algorithm' parameter. This parameter determines which algorithm is used to compute the nearest neighbors:

  • 'auto' lets scikit-learn choose the most suitable algorithm automatically.
  • 'ball_tree' uses a BallTree spatial index for the neighbor search.
  • 'kd_tree' uses a KDTree for the neighbor search.
  • 'brute' uses a brute-force search.

Each algorithm has its strengths. 'ball_tree' and 'kd_tree' are efficient for large datasets, while 'brute' works well for smaller sets. Setting the algorithm to 'auto' lets scikit-learn decide the best option. Note that some algorithms and metrics are incompatible; for instance, the cosine metric cannot be used with the kd_tree algorithm.
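As a small sketch (reusing the X data from earlier), the snippet below sets the 'algorithm' parameter explicitly and then tries the incompatible cosine/kd_tree combination, which we expect to fail with a ValueError when fitting:

Python
from sklearn.cluster import DBSCAN

# Explicitly choose the neighbor-search algorithm
dbscan_kd = DBSCAN(eps=0.2, min_samples=5, metric='euclidean', algorithm='kd_tree')
dbscan_kd.fit(X)

# Incompatible combination: KDTree does not support the cosine metric
try:
    DBSCAN(eps=0.2, min_samples=5, metric='cosine', algorithm='kd_tree').fit(X)
except ValueError as err:
    print('Incompatible metric/algorithm combination:', err)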

Lesson Summary and Practice

Great job! You've learned how different parameters influence the DBSCAN algorithm, exploring both the 'metric' and 'algorithm' parameters. Keep experimenting with these parameters and observing the results to fully grasp their roles. Happy learning!

Enjoy this lesson? Now it's time to practice with Cosmo!
Practice is how you turn knowledge into actual skills.