Welcome to our exploration of the k-Nearest Neighbors (k-NN) algorithm! This essential machine learning classifier is widely appreciated for its simplicity and effectiveness. This lesson will equip you with a clear understanding of the k-NN algorithm and its elements, including the concept and selection of 'k' as well as distance calculation using the Euclidean metric. We'll proceed to implement a k-NN classifier in Python. Intriguing, isn't it? Let's delve into k-NN!
The k-NN algorithm classifies data based on a data point's 'k' nearest neighbors from the training dataset. Consider a fruit classification scenario: if a new data point, or fruit, appears and 'k' is set to 3, the new fruit is classified according to the majority class among its three nearest neighbors. Essentially, k-NN relies on simple voting to make decisions: the class that receives the most votes wins!
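To make the voting idea concrete, here is a tiny sketch using Python's built-in `Counter`; the neighbor labels are hypothetical placeholders:

```python
from collections import Counter

# Hypothetical class labels of the 3 nearest neighbors (k = 3)
neighbor_labels = ["apple", "apple", "orange"]

# The class with the most votes wins
winner, votes = Counter(neighbor_labels).most_common(1)[0]
print(winner, votes)  # apple 2
```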
Let's see this in action. Consider this dataset, where we have three fruits of Class 0 and three fruits of Class 1. We also have a query point, which is a fruit we aim to assign a class label to.
The k-NN algorithm works on a basic principle: a data point is likely to belong to the same category as the data points it is closest to. The model identifies the 'k' points nearest to our query point, and these 'k' points vote on which class the query point should belong to. The class label with the most votes is assigned to the query point. In this case, the query point will be assigned the Class 0 label.
Note that the choice of 'k' significantly impacts our model. A low 'k' makes the model sensitive to noise in the data, whereas a high 'k' smooths over local patterns and is more computationally expensive. Therefore, running tests to identify the optimal 'k' is crucial.
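One practical way to pick 'k' is to score several candidate values on held-out data and keep the best performer. Here is a minimal sketch of that idea; it assumes the `k_nearest_neighbors` and `euclidean_distance` functions we build later in this lesson, and the train/validation split shown is hypothetical:

```python
# Hypothetical split: (features, label) pairs held out of training
train = [((2, 3), 0), ((5, 4), 0), ((9, 6), 1), ((8, 1), 1)]
validation = [((4, 7), 0), ((7, 2), 1)]

# Assumes k_nearest_neighbors and euclidean_distance, defined later in this lesson
def accuracy_for_k(train, validation, k):
    correct = sum(
        k_nearest_neighbors(train, features, k, euclidean_distance) == label
        for features, label in validation
    )
    return correct / len(validation)

# Score a few candidate values of k and keep the best one
for k in (1, 3):
    print(k, accuracy_for_k(train, validation, k))
```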
In k-NN, classification is determined by measuring the distances between data points. Euclidean distance is a frequently used metric that calculates the shortest straight-line distance between two points in Euclidean space. For two points p = (p₁, …, pₙ) and q = (q₁, …, qₙ), it is defined as d(p, q) = √((p₁ − q₁)² + … + (pₙ − qₙ)²). This formula, rooted in the Pythagorean theorem, is what we will implement next in Python:
```python
import math

# The 'euclidean_distance' function computes the Euclidean distance between two points
def euclidean_distance(point1, point2):
    squares = [(p - q) ** 2 for p, q in zip(point1, point2)]  # Squared difference for each dimension
    return math.sqrt(sum(squares))  # Return the square root of the sum of squares

# Test it
point1 = (1, 2)  # The coordinates of the first point
point2 = (4, 6)  # The coordinates of the second point
print(euclidean_distance(point1, point2))  # 5.0
```
This code calculates and outputs the Euclidean distance between `point1` and `point2`: √((4 − 1)² + (6 − 2)²) = √(9 + 16) = √25 = 5.0.
Next, we will construct our k-NN algorithm. It must compute the distance between the query point and every training point, select the 'k' closest points, and assign the class label by majority vote.
```python
from collections import Counter

def k_nearest_neighbors(data, query, k, distance_fn):
    neighbor_distances_and_indices = []

    # Compute the distance from the query to each training example
    for idx, (features, label) in enumerate(data):
        distance = distance_fn(features, query)
        neighbor_distances_and_indices.append((distance, idx))

    # Sort by distance, closest first
    sorted_neighbor_distances_and_indices = sorted(neighbor_distances_and_indices)

    # Select the k closest data points
    k_nearest_distances_and_indices = sorted_neighbor_distances_and_indices[:k]

    # Obtain the class labels of those k data points
    k_nearest_labels = [data[i][1] for distance, i in k_nearest_distances_and_indices]

    # Majority vote
    most_common = Counter(k_nearest_labels).most_common(1)
    return most_common[0][0]  # Return the label of the class that receives the majority vote
```
This function takes the training data, a query point, 'k', and a distance function as input, and returns the class label assigned by the majority vote.
Note that we can pass different distance functions to the algorithm. Euclidean distance, the most common choice, works well for points in continuous dimensions (like height), but in some cases we might want a different metric. For example, the Manhattan distance, which sums the absolute differences along each dimension, can be a better fit when the dimensions are not directly comparable or not continuous.
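As a sketch, a Manhattan distance function could look like this; it follows the same `(point1, point2)` interface as `euclidean_distance`, so it can be passed straight into the algorithm:

```python
# Manhattan distance: the sum of absolute differences along each dimension
def manhattan_distance(point1, point2):
    return sum(abs(p - q) for p, q in zip(point1, point2))

# Test it
print(manhattan_distance((1, 2), (4, 6)))  # |4 - 1| + |6 - 2| = 7
```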
Here is how we can assign a class to a test data point using our algorithm:
```python
# Define the dataset (training set)
# Each element of the dataset is a tuple (features, label)
data = [
    ((2, 3), 0),
    ((5, 4), 0),
    ((9, 6), 1),
    ((4, 7), 0),
    ((8, 1), 1),
    ((7, 2), 1)
]
query = (5, 3)  # test point

# Perform the classification
predicted_label = k_nearest_neighbors(data, query, k=3, distance_fn=euclidean_distance)
print(predicted_label)  # Expected class label is 0
```
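Because the distance function is just a parameter, switching metrics requires no change to the algorithm itself. As a quick illustration, reusing the `manhattan_distance` helper sketched earlier:

```python
# Same data and query, but measuring distance with the Manhattan metric
predicted_label = k_nearest_neighbors(data, query, k=3, distance_fn=manhattan_distance)
print(predicted_label)  # 0 for this dataset
```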
You've successfully navigated the learning curve of the k-NN algorithm, grasping how it works, its distance functions, and its Python implementation! Up next, practice exercises will solidify these newly acquired concepts. Keep going and enjoy delving deeper into your Python learning journey!