Lesson 3

Mastering Support Vector Machines for Effective Text Classification

A Brief Introduction to Support Vector Machines (SVM)

In machine learning, Support Vector Machines (SVMs) are supervised classification algorithms that you can use to assign data points to classes. In its basic form, an SVM separates data into two groups by finding a hyperplane that divides the data points by class: a line when there are two features, a plane when there are three, and a higher-dimensional analogue beyond that. Among all the hyperplanes that separate the classes, the algorithm chooses the one with the largest separation, or margin, between them.
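To make the idea of a maximum-margin hyperplane concrete, here is a minimal sketch on a made-up 2D dataset. The points, the large value of C (used to approximate a hard margin), and the variable names are assumptions chosen for illustration and are not part of this lesson's dataset.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D data (made up for illustration): two well-separated groups
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel with a large C approximates a hard-margin SVM
clf = SVC(kernel="linear", C=1000.0)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin lines

print("w =", w, "b =", b)
print("margin width =", round(margin_width, 3))
print("support vectors:")
print(clf.support_vectors_)
```

For a linear kernel, the hyperplane is the set of points where w·x + b = 0, and the margin width is 2 divided by the length of w, so a shorter w means a wider margin.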

SVMs are also useful for text classification problems in which the classes are not linearly separable. Using the "kernel trick," an SVM can efficiently perform non-linear classification by implicitly mapping the inputs into a high-dimensional feature space.
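The effect of the kernel trick is easiest to see on data that no straight line can separate. The sketch below uses scikit-learn's synthetic make_circles data (an assumption for illustration, not the SMS dataset) and compares a linear kernel with the RBF kernel.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: one class forms a ring around the other
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Same data, two kernels: a straight-line boundary vs. a non-linear one
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"rbf kernel accuracy:    {rbf_acc:.2f}")
```

On ring-shaped data like this, the linear kernel should do little better than guessing, while the RBF kernel can separate the inner and outer groups.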

In summary, SVM's distinguishing factors are:

  • Hyperplanes: These are decision boundaries that help SVM separate data into different classes.
  • Support Vectors: These are the data points that lie closest to the decision surface (or hyperplane). They are critical elements of SVM because they alone determine the position of the hyperplane and the width of the margin.
  • Kernel Trick: A kernel function lets SVM handle data that is not linearly separable by implicitly working in a higher-dimensional space, without computing coordinates in that space explicitly.
  • Soft Margin: SVM tolerates some misclassified training points in exchange for a wider margin and better performance on unseen data. This flexibility is called the soft margin.
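Support vectors and the soft margin can be inspected directly on a fitted model. The following is a minimal sketch on synthetic, overlapping clusters; the data, the values of C, and the sv_counts dictionary are assumptions made for this illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, overlapping 2D clusters (illustrative data, not the SMS dataset)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

sv_counts = {}
for C in (0.01, 1.0, 10.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    sv_counts[C] = int(clf.n_support_.sum())  # support vectors across both classes
    print(f"C={C}: {sv_counts[C]} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```

A smaller C makes the margin softer and wider, so more training points usually end up on or inside it and count as support vectors.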

Loading and Preprocessing the Data

This section quickly revisits code you are already familiar with: loading the SMS Spam Collection dataset, splitting it into training and test sets, and converting the messages into word-count features.

    # Import the necessary libraries
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn import metrics
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split
    import datasets

    # Load the dataset
    spam_dataset = datasets.load_dataset('codesignal/sms-spam-collection', split='train')
    spam_dataset = pd.DataFrame(spam_dataset)

    # Define X (input features) and Y (output labels)
    X = spam_dataset["message"]
    Y = spam_dataset["label"]

    # Perform a stratified train/test split so both sets keep the same spam/ham ratio
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)

    # Initialize the CountVectorizer
    count_vectorizer = CountVectorizer()

    # Fit and transform the training data
    X_train_count = count_vectorizer.fit_transform(X_train)

    # Transform the test data
    X_test_count = count_vectorizer.transform(X_test)
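If you would like a reminder of what the vectorizer does at this step, the sketch below runs CountVectorizer on a tiny made-up corpus. The example messages and variable names are hypothetical stand-ins for the SMS data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus standing in for the SMS messages
train_messages = [
    "Win a free prize now",
    "Are we meeting for lunch",
    "Free entry win cash",
]
test_messages = ["Free lunch prize tomorrow"]

vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_messages)  # learns the vocabulary
test_counts = vectorizer.transform(test_messages)        # reuses that vocabulary

print(sorted(vectorizer.vocabulary_))
print(train_counts.toarray())
print(test_counts.toarray())
```

The vocabulary is learned from the training messages only, so a word that appears only in the test message ("tomorrow" here) is simply ignored. This is why the test data is passed to transform rather than fit_transform.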

Implementing Support Vector Machines for Text Classification

Let's delve into the practical implementation of SVM for text classification using the Scikit-learn library. We are going to introduce a new Scikit-learn class, SVC (Support Vector Classifier). Calling SVC() creates an SVM model, which is then fitted to the training data with its fit() method.

In the following Python code, we initialize the SVC model, fit it with our training data, and then make predictions on the test dataset.

    # Initialize the SVC model
    svm_model = SVC()

    # Fit the model on the training data
    svm_model.fit(X_train_count, Y_train)

    # Make predictions on the test data
    y_pred = svm_model.predict(X_test_count)

The SVC constructor takes several parameters, with the key ones being:

  • C: The regularization parameter, which penalizes misclassified training points (1.0 by default). It controls the trade-off between a smooth decision boundary with a wide margin (small C) and classifying the training points correctly (large C).
  • kernel: Specifies the kernel type to be used in the algorithm. It can be 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or a callable. The default is 'rbf'.
  • degree: Degree of the polynomial kernel function ('poly'), 3 by default. Ignored by all other kernels.
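As a quick illustration of these parameters, the sketch below reads the default values back from an SVC instance and creates a few differently configured models. The specific values of C and degree are arbitrary choices for illustration, not recommendations for the spam dataset.

```python
from sklearn.svm import SVC

# Default settings, read back from the estimator itself
defaults = SVC().get_params()
print({name: defaults[name] for name in ("C", "kernel", "degree")})

# A few explicit configurations (values chosen only for illustration)
linear_svm = SVC(kernel="linear", C=0.5)
poly_svm = SVC(kernel="poly", degree=2, C=1.0)  # degree only matters for 'poly'
rbf_svm = SVC(kernel="rbf", C=10.0)

for model in (linear_svm, poly_svm, rbf_svm):
    params = model.get_params()
    print(params["kernel"], params["C"], params["degree"])
```

Each of these models is trained and used in the same way as before, with fit() and predict().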

Making Predictions and Evaluating the SVM Model

After building the model, the next step is to use it on unseen data and evaluate its performance. The Python code for this step is shown below:

    # Make predictions on the test data
    y_pred = svm_model.predict(X_test_count)

    # Calculate the accuracy of the model
    accuracy = metrics.accuracy_score(Y_test, y_pred)

    # Print the accuracy
    print(f"Accuracy of Support Vector Machines Classifier: {accuracy:.2f}")

The output of the above code will be:

    Accuracy of Support Vector Machines Classifier: 0.98

This output shows that our SVM model classified about 98% of the test messages correctly as spam or ham, a strong result for this dataset.
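Accuracy can look flattering when one class is much more common than the other, which is typical for spam data. As a minimal sketch of how to look beyond it, the example below uses hand-made labels (not real model output) to compare accuracy with the precision and recall for the spam class.

```python
from sklearn import metrics

# Hand-made labels (not real model output): 18 ham and 2 spam messages
y_true = ["ham"] * 18 + ["spam"] * 2
# Hypothetical predictions that catch only one of the two spam messages
y_pred = ["ham"] * 18 + ["spam", "ham"]

accuracy = metrics.accuracy_score(y_true, y_pred)
spam_recall = metrics.recall_score(y_true, y_pred, pos_label="spam")
spam_precision = metrics.precision_score(y_true, y_pred, pos_label="spam")

print(f"accuracy: {accuracy:.2f}")
print(f"spam recall: {spam_recall:.2f}")
print(f"spam precision: {spam_precision:.2f}")
print(metrics.confusion_matrix(y_true, y_pred, labels=["ham", "spam"]))
```

Here the accuracy is 95% even though half of the spam messages were missed. The same functions can be applied to Y_test and y_pred from this lesson.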

Lesson Summary and Upcoming Practice

Congratulations on making it to the end of this lesson! You have now learned the theory behind Support Vector Machines (SVMs) and how to use them to perform text classification in Python. You've also learned to load and preprocess the data, build the SVM model, and evaluate its accuracy.

Remember, like any other skill, programming requires practice. The upcoming practice exercises will allow you to reinforce the knowledge you've acquired in this lesson. They have been carefully designed to give you further expertise in SVM and text classification. Good luck! You're doing a great job, and I'm excited to see you in the next lesson on Decision Trees for text classification.
