Lesson 5
Support Vector Machine (SVM) Basics
Lesson Introduction

Welcome to our lesson on Support Vector Machines (SVM)! Today, we will learn about a powerful tool used for classifying data in machine learning. Have you ever thought about how a computer could tell the difference between pictures of cats and dogs? SVM is one way to make that possible by drawing a line, or more accurately, a hyperplane, to separate different categories. By the end of this lesson, you’ll be able to load a dataset, split it into training and testing sets, and train an SVM model using Python and Scikit-Learn. Let’s get started!

Introduction to SVM

First, let’s talk about what an SVM is. Imagine you have a bagful of apples and oranges spread out on a table, and you want to separate them into two groups using a straight line. In two dimensions that separator is just a line; in general, the boundary that divides the classes is called a hyperplane. SVM is a classification algorithm that finds the best hyperplane separating the different classes in the data.

But wait, what if the data can’t be separated by a straight line? That’s when SVM can use something called the kernel trick to transform the data into a higher dimension where a separating hyperplane exists. The data points closest to the hyperplane are called support vectors because they “support” the hyperplane. Essentially, SVM seeks to maximize the margin, the distance between the hyperplane and the closest points of each class, which leads to better generalization in classification.

Here is an example image:

Dots of different classes are separated by the hyperplane (which is just a line in this 2D case), which we call the "decision boundary". The samples closest to the boundary are the support vectors; the dashed lines drawn through them mark the margin. Together they hold the decision boundary in its optimal, equally-distanced position.

In Python, we use the SVC class from the Scikit-Learn library to create an SVM.
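To see why kernels matter, here is a small standalone sketch. It uses Scikit-Learn's make_circles helper to generate synthetic ring-shaped data (an illustration only, not part of this lesson's wine example) and compares a linear kernel against an RBF kernel on data that no straight line can separate:

Python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings of points: no straight line can separate these classes
X_c, y_c = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=42)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_c, y_c, test_size=0.4, random_state=42)

# Compare a linear kernel with the RBF kernel on the same data
for kernel in ["linear", "rbf"]:
    clf = SVC(kernel=kernel)
    clf.fit(Xc_train, yc_train)
    print(f"{kernel} kernel accuracy: {clf.score(Xc_test, yc_test):.2f}")

You should see the RBF kernel score far higher here: it implicitly maps the rings into a space where they become separable, which is exactly the kernel trick at work.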

Quick Reminder on Dataset Loading and Splitting

Remember, to work with any dataset, we need to load it and then split it into training and testing sets. Here’s a quick reminder using the wine dataset:

Python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Load the wine dataset
X, y = load_wine(return_X_y=True)

# Split the dataset into training and testing sets (60% training, 40% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
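If you'd like to verify the split, a quick shape check works. The wine dataset has 178 samples with 13 features, so a 60/40 split gives roughly 106 training rows and 72 testing rows:

Python
# Sanity check: confirm the 60/40 split of the 178-sample wine dataset
print(X_train.shape, X_test.shape)  # expect about (106, 13) and (72, 13)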
Training the SVM Classifier

Now that we have our training and testing data ready, it’s time to create and train the SVM classifier. Think of this as teaching the computer to draw the best line that separates the data points into their correct categories.

We will use the SVC class with a linear kernel:

Python
from sklearn.svm import SVC

# Create an SVM classifier with a linear kernel
svm_clf = SVC(kernel='linear')

# Train the classifier
svm_clf.fit(X_train, y_train)

In this code, kernel='linear' specifies that we want to use a linear kernel. We then train the classifier using fit(X_train, y_train). The other common option for the kernel is 'rbf', which stands for radial basis function.
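Once the classifier is fitted, you can peek at the support vectors it found using attributes built into SVC:

Python
# The training points closest to the decision boundary are the support vectors
print(svm_clf.n_support_)               # number of support vectors per class
print(svm_clf.support_vectors_.shape)   # the support vectors themselves

And if you want to try the RBF kernel instead, the swap is a one-line change. A quick sketch (note that RBF kernels are sensitive to feature scaling, so accuracy on the unscaled wine data may differ from the linear kernel's):

Python
# Same workflow, but with the radial basis function kernel
rbf_clf = SVC(kernel='rbf')
rbf_clf.fit(X_train, y_train)
print(f"RBF SVM Accuracy: {rbf_clf.score(X_test, y_test):.2f}")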

Comparing with Other Models

Finally, let's compare how different models perform on this dataset. We will train and evaluate Logistic Regression, Decision Tree, Naive Bayes, and k-Nearest Neighbors (kNN) models in addition to our SVM model.

Python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Dictionary of models to compare
models = {
    "Logistic Regression": LogisticRegression(max_iter=10000),
    "Decision Tree": DecisionTreeClassifier(),
    "Naive Bayes": GaussianNB(),
    "kNN": KNeighborsClassifier(),
    "SVM": SVC(kernel='linear')
}

# Train each model and print their accuracy on the test set
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    print(f"{name} Accuracy: {accuracy:.2f}")

The output is:

Plain text
Logistic Regression Accuracy: 0.96
Decision Tree Accuracy: 0.94
Naive Bayes Accuracy: 1.00
kNN Accuracy: 0.69
SVM Accuracy: 0.94

Here we can see all the models compared. Most show similar results, though kNN lags noticeably, largely because distance-based methods are sensitive to unscaled features. Remember that you can always improve a model by tuning it, which we will discuss in the final course of this path. In this case, we might choose the Naive Bayes classifier as the best model based on its current performance, or opt to tune the Decision Tree classifier for potential improvements. Choosing the right model involves experimenting with the data and comparing results.
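As a small preview of tuning, here is a sketch using Scikit-Learn's GridSearchCV to search over the SVM's regularization parameter C. The grid values below are just illustrative starting points, not prescriptions:

Python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Try several values of the regularization parameter C with
# 5-fold cross-validation on the training set
param_grid = {"C": [0.1, 1, 10, 100]}
grid = GridSearchCV(SVC(kernel='linear'), param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)                                # best C found on the training folds
print(f"Tuned SVM Accuracy: {grid.score(X_test, y_test):.2f}")  # refit best model on the test set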

Lesson Summary

Great job! Let’s recap what we’ve learned today (the full workflow is collected into one snippet after this list):

  • SVM (Support Vector Machine) is used to classify data by finding the best hyperplane that separates different classes.
  • We used the Wine dataset to get some data to work with.
  • We split the dataset into training and testing sets using train_test_split.
  • We created and trained an SVM classifier using the SVC class from Scikit-Learn.
  • We compared the performance of SVM with Logistic Regression, Decision Tree, Naive Bayes, and k-Nearest Neighbors classifiers.
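For reference, here is the whole pipeline from this lesson in a single runnable snippet:

Python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the data and make a 60/40 train/test split
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Train a linear-kernel SVM and evaluate it on the test set
svm_clf = SVC(kernel='linear')
svm_clf.fit(X_train, y_train)
print(f"SVM Accuracy: {svm_clf.score(X_test, y_test):.2f}")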

Now it’s your turn to put this knowledge into practice! Up next, you'll get hands-on experience to solidify what you’ve learned by loading a dataset, splitting it, and training an SVM model just like we did here. Dive in and start coding!

Enjoy this lesson? Now it's time to practice with Cosmo!
Practice is how you turn knowledge into actual skills.