Welcome to the lesson on K-Nearest Neighbors (KNN)! Today, we'll explore this simple yet intuitive algorithm. KNN is used for classification and regression tasks. Our goal is to understand how KNN works and implement it in Python using Scikit-Learn. By the end, you'll be able to classify data points based on their features.
What is KNN? Imagine identifying a fruit as an apple or an orange. Instead of consulting a reference, you ask the people nearest to you for their opinions, and the majority wins. This is the idea behind KNN: classify a data point based on the classes of its nearest neighbors.
Let's take a look at an example:
In this image, we see a target point (black cross) whose class we want to predict. Its three nearest neighbors are two red points and one green point. Since the majority of the neighbors are red, the target point is also classified as a red point.
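To make the majority-vote idea concrete, here is a minimal from-scratch sketch. The 2-D points, labels, and target below are made up purely for illustration; they are not part of the lesson's dataset.

```python
import numpy as np
from collections import Counter

# Toy 2-D points with class labels ("red" / "green"), made up for illustration
points = np.array([[1.0, 1.2], [1.5, 0.9], [1.3, 1.4], [4.0, 0.5], [3.2, 2.8]])
labels = ["red", "red", "green", "red", "green"]
target = np.array([1.2, 1.1])  # the point we want to classify

# Compute distances to every point, take the 3 nearest, and let them vote
distances = np.linalg.norm(points - target, axis=1)
nearest = np.argsort(distances)[:3]
votes = [labels[i] for i in nearest]
print(Counter(votes).most_common(1)[0][0])  # "red" — the majority class wins
```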
Why use KNN? It's easy to understand and implement, and it is useful in tasks such as recommending products and recognizing patterns in medical data.
Let's load the Iris dataset, which contains information about different flowers. Here's how we do it using Scikit-Learn:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
```
This code loads the Iris dataset into features `X` and labels `y`, then splits it into training and testing sets. The dataset describes each flower with four measurements: Sepal Length, Sepal Width, Petal Length, and Petal Width. Our goal is to predict the type of flower, which is one of three species: Setosa, Versicolour, or Virginica.
Note that since we predict three classes instead of two, this is not binary classification; we are now working with multiclass classification. Luckily for us, the KNN classifier is perfectly suited to this type of task. Decision trees can also be used for it, as we saw in the previous lesson.
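If you'd like to see the three classes for yourself, a quick optional check (not part of the original code) is to inspect the unique labels:

```python
import numpy as np
from sklearn.datasets import load_iris

# The Iris target contains three distinct class labels
print(np.unique(y))               # [0 1 2]
print(load_iris().target_names)   # ['setosa' 'versicolor' 'virginica']
```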
One important thing to know is that logistic regression is not suitable for multiclass classification in its original form. However, it can be adapted to this type of task using techniques like One-vs-Rest (OvR) or Softmax Regression (Multinomial Logistic Regression).
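As a rough illustration of what such an adaptation looks like in Scikit-Learn (not something we need for the rest of this lesson), logistic regression can be wrapped in a One-vs-Rest scheme or trained as a multinomial model:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# One-vs-Rest: fits one binary logistic regression per class
ovr_clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_clf.fit(X_train, y_train)

# Softmax (multinomial) logistic regression: recent Scikit-Learn versions
# handle the multiclass case this way by default
softmax_clf = LogisticRegression(max_iter=1000)
softmax_clf.fit(X_train, y_train)
```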
Now, let's initialize and fit our KNN classifier. KNN classifies a data point based on the majority class among its nearest neighbors. We'll start with `k=3`, meaning we will check the three nearest neighbors when making a prediction.
Here’s how to fit a KNN classifier:
```python
from sklearn.neighbors import KNeighborsClassifier

# Initialize the KNN classifier with 3 neighbors
knn_clf = KNeighborsClassifier(n_neighbors=3)

# Fit the KNN classifier
knn_clf.fit(X_train, y_train)
```
This initializes the `KNeighborsClassifier` with 3 neighbors. Unlike the models we studied earlier, KNN doesn't perform any computations during the fitting phase; it simply prepares the data for future comparisons during the prediction phase. Essentially, it does not require any training at all; it just uses the data's structure to make predictions.
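If you're curious, one optional way to peek at what happens at prediction time is to ask the fitted classifier which training points it would consult for a given test sample:

```python
# Ask for the 3 nearest training points to the first test sample
distances, indices = knn_clf.kneighbors(X_test[:1])
print(indices)           # positions of the neighbors within X_train
print(y_train[indices])  # their classes; the majority becomes the prediction
```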
We can evaluate our model's performance by calculating accuracy to see how often it predicts correctly.
```python
from sklearn.metrics import accuracy_score

# Predict using the KNN classifier
y_pred = knn_clf.predict(X_test)

# Calculate accuracy using accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy * 100:.2f}%")  # Model accuracy: 98.33%
```
Here, the `predict` method generates predictions for the test set, and `accuracy_score` compares them with the true labels, giving us the accuracy, which we print as a percentage.
Let's also train a decision tree model on the same data and see how it performs. This time we will use the `.score` method instead of `.predict`. The `score` method combines two steps:
- Calculate the predictions
- Calculate the accuracy
We use the `.score` method to make the code shorter and easier to maintain when we don't need the predictions themselves but only care about the model's accuracy, as is the case here.
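For example, on the KNN model we fitted above, `.score` should return the same accuracy as the two-step `predict` + `accuracy_score` approach:

```python
# One call instead of predict + accuracy_score
knn_accuracy = knn_clf.score(X_test, y_test)
print(f"KNN accuracy via .score: {knn_accuracy * 100:.2f}%")  # matches the value above
```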
```python
from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(random_state=42)
dt_clf.fit(X_train, y_train)
dt_accuracy = dt_clf.score(X_test, y_test)
print(f"Decision Tree model accuracy: {dt_accuracy * 100:.2f}%")
# Decision Tree model accuracy: 96.67%
```
In this case, KNN outperforms the Decision Tree. However, if we tune the decision tree a bit by limiting its depth, we can achieve the same result:
```python
dt_clf = DecisionTreeClassifier(random_state=42, max_depth=3)
dt_clf.fit(X_train, y_train)
dt_accuracy = dt_clf.score(X_test, y_test)
print(f"Decision Tree model accuracy: {dt_accuracy * 100:.2f}%")
# Decision Tree model accuracy: 98.33%
```
Note that we added the `max_depth=3` parameter to the model initialization and improved the model's performance. This shows the importance of tuning your models by choosing the best parameters.
In this case, the `max_depth` value was chosen more or less arbitrarily. In the last course of this course path, we will learn how to find the best possible parameters using a more controllable approach.
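As a small preview of one such approach, here is a sketch using Scikit-Learn's `GridSearchCV`, which tries several candidate values and keeps the one with the best cross-validated accuracy; the candidate depths below are just an illustration:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Try a few candidate depths and keep the best one found by cross-validation
param_grid = {"max_depth": [2, 3, 4, 5, None]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)  # the chosen max_depth
print(f"Test accuracy: {grid.score(X_test, y_test) * 100:.2f}%")
```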
Great job! You've learned the basics of K-Nearest Neighbors (KNN) and how to implement it using Python and Scikit-Learn. We covered:
- The concept of KNN
- Loading and understanding the Iris dataset
- Splitting the dataset into training and testing sets
- Fitting a KNN classifier
- Brief model evaluation
Now it’s time to practice. You'll engage in hands-on activities to solidify your understanding of KNN and see how it works in different scenarios. Get ready to classify data points and measure your model’s performance!
Dive into the practice exercises, and good luck!