Welcome to the lesson on K-Nearest Neighbors (KNN)! Today, we'll explore this simple yet intuitive algorithm. KNN is used for classification and regression tasks. Our goal is to understand how KNN works and implement it in Python using Scikit-Learn. By the end, you'll be able to classify data points based on their features.
What is KNN? Imagine identifying a fruit as an apple or an orange. Instead of consulting a reference, you ask the people nearest to you for their opinions, and the majority wins. This is the idea behind KNN: classify a data point based on the classes of its nearest neighbors.
Let's take a look at an example:
In this image, we see a target point (black cross) whose class we want to predict. Its three nearest neighbors are two red points and one green point. Since the majority of the neighbors are red, the target point is also classified as a red point.
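To make the majority-vote idea concrete, here is a minimal from-scratch sketch. The 2-D points, labels, and target below are made up purely for illustration; they are not part of the lesson's dataset.

```python
import numpy as np
from collections import Counter

# Toy 2-D points with class labels ("red" / "green"), made up for illustration
points = np.array([[1.0, 1.2], [1.5, 0.9], [1.3, 1.4], [4.0, 0.5], [3.2, 2.8]])
labels = ["red", "red", "green", "red", "green"]
target = np.array([1.2, 1.1])  # the point we want to classify

# Compute distances to every point, take the 3 nearest, and let them vote
distances = np.linalg.norm(points - target, axis=1)
nearest = np.argsort(distances)[:3]
votes = [labels[i] for i in nearest]
print(Counter(votes).most_common(1)[0][0])  # "red" — the majority class wins
```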
Why use KNN? It's easy to understand and implement, and it is useful in tasks such as recommending products and recognizing patterns in medical data.
Let's load the Iris dataset, which contains information about different flowers. Here's how we do it using Scikit-Learn:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
```
This code loads the Iris dataset into features `X` and labels `y`, then splits it into training and testing sets. The dataset describes each flower with four measurements: Sepal Length, Sepal Width, Petal Length, and Petal Width. Our goal is to predict the type of flower, which is one of three species: Setosa, Versicolour, or Virginica.
Note that since we predict three classes instead of two, this is not binary classification; we are now working with multiclass classification. Luckily for us, the KNN classifier is perfectly suited to this type of task. Decision trees can also be used for it, as we saw in the previous lesson.
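If you'd like to see the three classes for yourself, a quick optional check (not part of the original code) is to inspect the unique labels:

```python
import numpy as np
from sklearn.datasets import load_iris

# The Iris target contains three distinct class labels
print(np.unique(y))               # [0 1 2]
print(load_iris().target_names)   # ['setosa' 'versicolor' 'virginica']
```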
One important thing to know is that logistic regression is not suitable for multiclass classification in its original form. However, it can be adapted to this type of task using techniques like One-vs-Rest (OvR) or Softmax Regression (Multinomial Logistic Regression).
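As a rough illustration of what such an adaptation looks like in Scikit-Learn (not something we need for the rest of this lesson), logistic regression can be wrapped in a One-vs-Rest scheme or trained as a multinomial model:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# One-vs-Rest: fits one binary logistic regression per class
ovr_clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_clf.fit(X_train, y_train)

# Softmax (multinomial) logistic regression: recent Scikit-Learn versions
# handle the multiclass case this way by default
softmax_clf = LogisticRegression(max_iter=1000)
softmax_clf.fit(X_train, y_train)
```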
Now, let's initialize and fit our KNN classifier. KNN classifies a data point based on the majority class among its nearest neighbors. We'll start with `k=3`, meaning we will check the three nearest neighbors when making a prediction.
Here’s how to fit a KNN classifier:
```python
from sklearn.neighbors import KNeighborsClassifier

# Initialize the KNN classifier with 3 neighbors
knn_clf = KNeighborsClassifier(n_neighbors=3)

# Fit the KNN classifier
knn_clf.fit(X_train, y_train)
```
This initializes the `KNeighborsClassifier` with 3 neighbors. Unlike the models we studied earlier, KNN doesn't perform any computations during the fitting phase; it simply prepares the data for future comparisons during the prediction phase. Essentially, it does not require any training at all; it just uses the data's structure to make predictions.
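If you're curious, one optional way to peek at what happens at prediction time is to ask the fitted classifier which training points it would consult for a given test sample:

```python
# Ask for the 3 nearest training points to the first test sample
distances, indices = knn_clf.kneighbors(X_test[:1])
print(indices)           # positions of the neighbors within X_train
print(y_train[indices])  # their classes; the majority becomes the prediction
```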
We can evaluate our model's performance by calculating accuracy to see how often it predicts correctly.
```python
from sklearn.metrics import accuracy_score

# Predict using the KNN classifier
y_pred = knn_clf.predict(X_test)

# Calculate accuracy using accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy * 100:.2f}%")  # Model accuracy: 98.33%
```
Here, the `predict` method generates predictions for the test set, and `accuracy_score` compares them with the true labels, giving us the accuracy, which we print as a percentage.
Let's also train a decision tree model on the same data and see how it performs. This time we will use the `.score` method instead of `.predict`. The `score` method combines two steps:
- Calculate the predictions
- Calculate the accuracy
We use the `.score` method to make the code shorter and easier to maintain when we don't need the predictions themselves but only care about the model's accuracy, as is the case here.
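For example, on the KNN model we fitted above, `.score` should return the same accuracy as the two-step `predict` + `accuracy_score` approach:

```python
# One call instead of predict + accuracy_score
knn_accuracy = knn_clf.score(X_test, y_test)
print(f"KNN accuracy via .score: {knn_accuracy * 100:.2f}%")  # matches the value above
```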
```python
from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(random_state=42)
dt_clf.fit(X_train, y_train)
dt_accuracy = dt_clf.score(X_test, y_test)
print(f"Decision Tree model accuracy: {dt_accuracy * 100:.2f}%")
# Decision Tree model accuracy: 96.67%
```
In this case, KNN outperforms the Decision Tree. However, if we tune the decision tree a bit by limiting its depth, we can achieve the same result:
```python
dt_clf = DecisionTreeClassifier(random_state=42, max_depth=3)
dt_clf.fit(X_train, y_train)
dt_accuracy = dt_clf.score(X_test, y_test)
print(f"Decision Tree model accuracy: {dt_accuracy * 100:.2f}%")
# Decision Tree model accuracy: 98.33%
```
Note that we added the `max_depth=3` parameter to the model initialization and improved the model's performance. This shows the importance of tuning your models by choosing the best parameters.
In this case, the `max_depth` value was chosen more or less arbitrarily. In the last course of this course path, we will learn how to find the best possible parameters using a more controllable approach.
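As a small preview of one such approach, here is a sketch using Scikit-Learn's `GridSearchCV`, which tries several candidate values and keeps the one with the best cross-validated accuracy; the candidate depths below are just an illustration:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Try a few candidate depths and keep the best one found by cross-validation
param_grid = {"max_depth": [2, 3, 4, 5, None]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)  # the chosen max_depth
print(f"Test accuracy: {grid.score(X_test, y_test) * 100:.2f}%")
```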
Great job! You've learned the basics of K-Nearest Neighbors (KNN) and how to implement it using Python and Scikit-Learn. We covered:
- The concept of KNN
- Loading and understanding the Iris dataset
- Splitting the dataset into training and testing sets
- Fitting a KNN classifier
- Brief model evaluation
Now it’s time to practice. You'll engage in hands-on activities to solidify your understanding of KNN and see how it works in different scenarios. Get ready to classify data points and measure your model’s performance!
Dive into the practice exercises, and good luck!