Welcome to our exploration of the k-Nearest Neighbors (k-NN) algorithm! This essential machine learning classifier is widely appreciated for its simplicity and effectiveness. This lesson will equip you with a clear understanding of the k-NN algorithm and its elements, including the concept and selection of 'k' as well as distance calculation using the Euclidean metric. We'll proceed to implement a k-NN classifier in Python. Intriguing, isn't it? Let's delve into k-NN!
The k-NN algorithm classifies data based on a data point's 'k' nearest neighbors from the training dataset. Consider a fruit classification scenario: if a new data point, or fruit, appears and 'k' is set to 3, the new fruit is classified according to the majority class among its three nearest neighbors. Essentially, k-NN relies on simple voting to make decisions: the class that receives the most votes wins!
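To make the voting idea concrete, here is a tiny sketch using Python's built-in `Counter`; the neighbor labels are hypothetical placeholders:

```python
from collections import Counter

# Hypothetical class labels of the 3 nearest neighbors (k = 3)
neighbor_labels = ["apple", "apple", "orange"]

# The class with the most votes wins
winner, votes = Counter(neighbor_labels).most_common(1)[0]
print(winner, votes)  # apple 2
```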
Let's see this in action. Consider this dataset, where we have three fruits of Class 0 and three fruits of Class 1. We also have a query point, which is a fruit we aim to assign a class label to.
The k-NN algorithm works on a basic principle: a data point is likely to belong to the same category as the data points it is closest to. The model identifies the 'k' points nearest to our query point, and these 'k' points vote on which class the query point should belong to. The class label with the most votes is assigned to the query point. In this case, the query point will be assigned the Class 0 label.
Note that the choice of 'k' significantly impacts our model. A low 'k' makes the model sensitive to noise in the data, whereas a high 'k' smooths over local patterns and is more computationally expensive. Therefore, running tests to identify the optimal 'k' is crucial.
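One practical way to pick 'k' is to score several candidate values on held-out data and keep the best performer. Here is a minimal sketch of that idea; it assumes the `k_nearest_neighbors` and `euclidean_distance` functions we build later in this lesson, and the train/validation split shown is hypothetical:

```python
# Hypothetical split: (features, label) pairs held out of training
train = [((2, 3), 0), ((5, 4), 0), ((9, 6), 1), ((8, 1), 1)]
validation = [((4, 7), 0), ((7, 2), 1)]

# Assumes k_nearest_neighbors and euclidean_distance, defined later in this lesson
def accuracy_for_k(train, validation, k):
    correct = sum(
        k_nearest_neighbors(train, features, k, euclidean_distance) == label
        for features, label in validation
    )
    return correct / len(validation)

# Score a few candidate values of k and keep the best one
for k in (1, 3):
    print(k, accuracy_for_k(train, validation, k))
```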
In k-NN, classification is determined by measuring the distances between data points. Euclidean distance is a frequently used metric that calculates the shortest straight-line distance between two points in Euclidean space. For two points p = (p₁, …, pₙ) and q = (q₁, …, qₙ), it is defined as d(p, q) = √((p₁ − q₁)² + … + (pₙ − qₙ)²). This formula, rooted in the Pythagorean theorem, is what we will implement next in Python:
```python
import math

# The 'euclidean_distance' function computes the Euclidean distance between two points
def euclidean_distance(point1, point2):
    squares = [(p - q) ** 2 for p, q in zip(point1, point2)]  # Squared difference for each dimension
    return math.sqrt(sum(squares))  # Return the square root of the sum of squares

# Test it
point1 = (1, 2)  # The coordinates of the first point
point2 = (4, 6)  # The coordinates of the second point
print(euclidean_distance(point1, point2))  # 5.0
```
This code calculates and outputs the Euclidean distance between `point1` and `point2`: √((4 − 1)² + (6 − 2)²) = √(9 + 16) = √25 = 5.0.
Next, we will construct our k-NN algorithm. It must compute the distance between the query point and every training point, select the 'k' closest points, and assign the class label by majority vote.
```python
from collections import Counter

def k_nearest_neighbors(data, query, k, distance_fn):
    neighbor_distances_and_indices = []

    # Compute the distance from the query to each training example
    for idx, (features, label) in enumerate(data):
        distance = distance_fn(features, query)
        neighbor_distances_and_indices.append((distance, idx))

    # Sort by distance, closest first
    sorted_neighbor_distances_and_indices = sorted(neighbor_distances_and_indices)

    # Select the k closest data points
    k_nearest_distances_and_indices = sorted_neighbor_distances_and_indices[:k]

    # Obtain the class labels of those k data points
    k_nearest_labels = [data[i][1] for distance, i in k_nearest_distances_and_indices]

    # Majority vote
    most_common = Counter(k_nearest_labels).most_common(1)
    return most_common[0][0]  # Return the label of the class that receives the majority vote
```
This function takes the training data, a query point, 'k', and a distance function as input, and returns the class label assigned by the majority vote.
Note that we can pass different distance functions to the algorithm. Euclidean distance, the most common choice, works well for points in continuous dimensions (like height), but in some cases we might want a different metric. For example, the Manhattan distance, which sums the absolute differences along each dimension, can be a better fit when the dimensions are not directly comparable or not continuous.
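As a sketch, a Manhattan distance function could look like this; it follows the same `(point1, point2)` interface as `euclidean_distance`, so it can be passed straight into the algorithm:

```python
# Manhattan distance: the sum of absolute differences along each dimension
def manhattan_distance(point1, point2):
    return sum(abs(p - q) for p, q in zip(point1, point2))

# Test it
print(manhattan_distance((1, 2), (4, 6)))  # |4 - 1| + |6 - 2| = 7
```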
Here is how we can assign a class to a test data point using our algorithm:
```python
# Define the dataset (training set)
# Each element of the dataset is a tuple (features, label)
data = [
    ((2, 3), 0),
    ((5, 4), 0),
    ((9, 6), 1),
    ((4, 7), 0),
    ((8, 1), 1),
    ((7, 2), 1)
]
query = (5, 3)  # test point

# Perform the classification
predicted_label = k_nearest_neighbors(data, query, k=3, distance_fn=euclidean_distance)
print(predicted_label)  # Expected class label is 0
```
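Because the distance function is just a parameter, switching metrics requires no change to the algorithm itself. As a quick illustration, reusing the `manhattan_distance` helper sketched earlier:

```python
# Same data and query, but measuring distance with the Manhattan metric
predicted_label = k_nearest_neighbors(data, query, k=3, distance_fn=manhattan_distance)
print(predicted_label)  # 0 for this dataset
```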
You've successfully navigated the learning curve of the k-NN algorithm, grasping how it works, its distance functions, and its Python implementation! Up next, practice exercises will solidify these newly acquired concepts. Keep going and enjoy delving deeper into your Python learning journey!