In machine learning, Support Vector Machines (SVMs) are supervised classification algorithms that you can use to label data into different classes. In its basic form, the SVM algorithm separates data into two groups by finding a hyperplane in the feature space (a line with two features, a plane with three, and a higher-dimensional surface beyond that) that distinctly separates the data points. Among all such hyperplanes, the algorithm chooses the one with the largest separation, or margin, between the classes.
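To make the margin idea concrete, here is a minimal sketch on four hand-picked 2D points. The points, the linear kernel, and the very large `C` (chosen to approximate a hard margin) are assumptions for illustration and are not part of the lesson's dataset.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (assumed for illustration): two classes separated along the x-axis
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# A linear kernel with a very large C approximates a hard-margin SVM
toy_model = SVC(kernel="linear", C=1e6)
toy_model.fit(X_toy, y_toy)

# The separating hyperplane is the set of points where w . x + b = 0
w = toy_model.coef_[0]
b = toy_model.intercept_[0]

# The margin width is 2 / ||w||
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin width =", margin_width)
```

For these points the widest separating band runs between x = 0 and x = 2, so the reported margin width should be close to 2.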
SVM is extremely useful for solving non-linear text classification problems. It can efficiently perform non-linear classification using the "kernel trick," which implicitly maps the inputs into a high-dimensional feature space without computing the mapped coordinates explicitly.
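As a small illustration of the kernel trick, the sketch below checks that a degree-2 polynomial kernel, computed directly on 2D inputs, gives the same value as an ordinary dot product in an explicitly mapped 3D feature space. The helper names `poly_kernel`, `explicit_map`, and `dot` are hypothetical and exist only for this example.

```python
import math

def poly_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel: (x . y) ** 2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def explicit_map(x):
    """Explicit feature map for the same kernel: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    """Plain dot product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

x = (1.0, 2.0)
y = (3.0, 4.0)

# Computed in the original 2D space
kernel_value = poly_kernel(x, y)
# Computed in the explicitly mapped 3D space
mapped_value = dot(explicit_map(x), explicit_map(y))

print(kernel_value, mapped_value)
```

Both values agree (up to floating-point rounding), which is why a kernel lets the algorithm work in the larger space while only ever touching the original inputs.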
In summary, SVM's distinguishing factors are:
- Hyperplanes: SVM uses hyperplanes to separate data into different classes.
- Support vectors: SVM relies on support vectors, the training points closest to the hyperplane, because they help maximize the margin of the classifier.
- Kernels: SVM uses kernels to deal with non-linear input spaces by using a higher-dimensional space.
- Soft margin: SVM allows some misclassifications in its model for better performance. This flexibility is introduced through a concept called Soft Margin.

This section is a quick revisit of the code you are already familiar with. We are just loading and preprocessing the SMS Spam Collection dataset.
```python
# Import the necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
import datasets

# Load the dataset
spam_dataset = datasets.load_dataset('codesignal/sms-spam-collection', split='train')
spam_dataset = pd.DataFrame(spam_dataset)

# Define X (input features) and Y (output labels)
X = spam_dataset["message"]
Y = spam_dataset["label"]

# Perform a stratified train-test split (80% train, 20% test)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)

# Initialize the CountVectorizer
count_vectorizer = CountVectorizer()

# Fit and transform the training data
X_train_count = count_vectorizer.fit_transform(X_train)

# Transform the test data
X_test_count = count_vectorizer.transform(X_test)
```
Let's delve into the practical implementation of SVM for text classification using the Scikit-learn library. We are going to introduce a new Scikit-learn class, `SVC`. Calling `SVC()` creates the model, and its `fit()` method fits the SVM model according to the given training data.
In the following Python code, we initialize the `SVC` model, fit it with our training data, and then make predictions on the test dataset.
```python
# Initialize the SVC model
svm_model = SVC()

# Fit the model on the training data
svm_model.fit(X_train_count, Y_train)

# Make predictions on the test data
y_pred = svm_model.predict(X_test_count)
```
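The same fit-then-predict flow can be tried end to end on a handful of made-up messages, which may help when experimenting without downloading the dataset. The messages, labels, and the choice of a linear kernel (picked here because the data is tiny) are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Made-up training messages and labels (assumed for illustration)
toy_messages = [
    "win a free prize now",
    "free cash claim your prize",
    "are we still meeting for lunch",
    "see you at home tonight",
]
toy_labels = ["spam", "spam", "ham", "ham"]

# Turn the text into word-count features
demo_vectorizer = CountVectorizer()
toy_counts = demo_vectorizer.fit_transform(toy_messages)

# Fit an SVM on the count features
toy_svm = SVC(kernel="linear")
toy_svm.fit(toy_counts, toy_labels)

# Vectorize unseen messages with the same vocabulary, then predict
unseen = ["claim your free prize", "lunch at home"]
unseen_counts = demo_vectorizer.transform(unseen)
toy_pred = toy_svm.predict(unseen_counts)

print(list(zip(unseen, toy_pred)))
```

With only four training messages the predictions are not meaningful as a measure of quality; the point is the order of the calls.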
The `SVC` constructor takes several parameters, with the key ones being:
- `C`: The regularization parameter, which sets the penalty for misclassified training points. It controls the trade-off between a smooth decision boundary (small `C`) and classifying training points correctly (large `C`). The default is 1.0.
- `kernel`: Specifies the kernel type to be used in the algorithm. It can be 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed', or a callable. The default is 'rbf'.
- `degree`: Degree of the polynomial kernel function ('poly'). Ignored by all other kernels. The default is 3.
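As a sketch of how these parameters are passed, the snippet below builds three differently configured models and fits each on synthetic data. The synthetic dataset from `make_classification`, the specific parameter values, and the dictionary keys are assumptions chosen for illustration, not tuned settings for the spam task.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data (assumed for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Three example configurations of the same class
configs = {
    "linear, C=0.1": SVC(kernel="linear", C=0.1),
    "rbf, C=1.0 (defaults)": SVC(),
    "poly, degree=2": SVC(kernel="poly", degree=2),
}

for name, model in configs.items():
    model.fit(X_demo, y_demo)
    # n_support_ holds the number of support vectors found for each class
    print(name, "-> support vectors per class:", model.n_support_)
```

Comparing the support vector counts across configurations is one quick way to see that `C` and `kernel` change the fitted boundary, though choosing values for a real task would normally involve validation data.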
After building the model, the next step is to use it on unseen data and evaluate its performance. The Python code for this step is shown below:
```python
# Make predictions on the test data
y_pred = svm_model.predict(X_test_count)

# Calculate the accuracy of the model
accuracy = metrics.accuracy_score(Y_test, y_pred)

# Print the accuracy
print(f"Accuracy of Support Vector Machines Classifier: {accuracy:.2f}")
```
The output of the above code will be:
```text
Accuracy of Support Vector Machines Classifier: 0.98
```
This output shows that our SVM model classified about 98% of the test messages correctly as spam or ham, highlighting its effectiveness in text classification tasks.
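Accuracy is a single number, so it can hide how a model does on the less frequent class. The sketch below uses made-up labels (not results from the lesson's model) to show how a confusion matrix and per-class recall from `sklearn.metrics` break an accuracy score down.

```python
from sklearn import metrics

# Made-up true and predicted labels (assumed for illustration)
y_true = ["ham"] * 8 + ["spam"] * 2
y_hat = ["ham"] * 8 + ["spam", "ham"]

# Overall fraction of correct predictions
toy_accuracy = metrics.accuracy_score(y_true, y_hat)

# Rows are true classes, columns are predicted classes, in the order given by labels
toy_confusion = metrics.confusion_matrix(y_true, y_hat, labels=["ham", "spam"])

# Fraction of true spam messages that were predicted as spam
spam_recall = metrics.recall_score(y_true, y_hat, pos_label="spam")

print(toy_accuracy)
print(toy_confusion)
print(spam_recall)
```

In this made-up case the accuracy is 0.9 even though only half of the spam messages were caught, which is the kind of detail a confusion matrix makes visible.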
Congratulations on making it to the end of this lesson! You have now learned the theory behind Support Vector Machines (SVMs) and how to use them to perform text classification in Python. You've also learned to load and preprocess the data, build the SVM model, and evaluate its accuracy.
Remember, like any other skill, programming requires practice. The upcoming practice exercises will allow you to reinforce the knowledge you've acquired in this lesson. They have been carefully designed to give you further expertise in SVM and text classification. Good luck! You're doing a great job, and I'm excited to see you in the next lesson on Decision Trees for text classification.