Hello and welcome! Today, we will explore the world of text classification using the Naive Bayes algorithm, specifically in Python using the library Scikit-learn. By the end of this lesson, you will understand how Naive Bayes works, how to implement a Naive Bayes model in Python, and how to evaluate its performance. Let's get started!
The Naive Bayes algorithm is a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features. It provides a way to calculate the probability that a certain event will occur given that another event has already occurred. In text classification, the event we're interested in is a specific class label, such as spam or ham (not spam); the given event is the text input we have, a particular SMS message in our case.
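To make this concrete, here is a tiny, hand-computed illustration of Bayes' theorem. The counts below are purely hypothetical, not taken from our dataset:

```python
# Hypothetical counts: out of 100 messages, 20 are spam.
# The word "free" appears in 12 of the spam messages and 4 of the ham messages.
p_spam = 20 / 100            # P(spam)
p_free_given_spam = 12 / 20  # P("free" | spam)
p_free = (12 + 4) / 100      # P("free") across all messages

# Bayes' theorem: P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(p_spam_given_free)  # 0.75, so a message containing "free" is likely spam
```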
The 'naive' in Naive Bayes comes from the assumption that each feature contributes independently to the probability of a particular outcome. This assumption often doesn't hold in the real world (the words in an SMS are far from independent), yet Naive Bayes still tends to perform very well in text classification, especially for such a simple and fast method.
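Under that independence assumption, the likelihood of a whole message factorizes into a product of per-word probabilities, and the class with the highest score wins. Here is a minimal sketch of that scoring step, using hypothetical per-word probabilities rather than values learned from our data:

```python
import math

# Hypothetical values: the prior P(spam) and per-word probabilities P(word | spam)
p_spam = 0.2
p_word_given_spam = {"win": 0.05, "free": 0.06, "prize": 0.04}

# Naive Bayes scores a message as P(spam) * product of P(word | spam).
# In practice the logs are summed instead, to avoid numerical underflow.
message = ["win", "free", "prize"]
log_score = math.log(p_spam) + sum(math.log(p_word_given_spam[w]) for w in message)
print(log_score)  # higher (less negative) scores mean a better fit to the spam class
```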
Before we start building our Naive Bayes model, let's load our dataset and perform the necessary preparations:
```python
# Import the necessary libraries
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import datasets

# Load the dataset
spam_dataset = datasets.load_dataset('codesignal/sms-spam-collection', split='train')
spam_dataset = pd.DataFrame(spam_dataset)

# Define X (input features) and Y (output labels)
X = spam_dataset["message"]
Y = spam_dataset["label"]

# Perform a stratified train-test split so both sets keep the same spam/ham ratio
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42, stratify=Y
)
```
In the above block of code, we load our SMS dataset and perform a train-test split. These are the preliminary stages of preparing the dataset for modeling: by separating the data into a training set and a test set, we ensure that our model can learn from one portion of the data (the training set) and then have its performance evaluated on unseen data (the test set).
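Because we passed `stratify=Y`, the spam/ham ratio is preserved in both sets. If you'd like to verify this yourself, a quick check (not part of the lesson's original code) is:

```python
# Compare class proportions in the training and test sets; they should be nearly identical
print(Y_train.value_counts(normalize=True))
print(Y_test.value_counts(normalize=True))
```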
Before we dive into building the Naive Bayes model, it's essential to prepare our data. Because machine learning algorithms operate on numeric data, we must first convert our SMS text data into numerical features:
```python
# Initialize the CountVectorizer
count_vectorizer = CountVectorizer()

# Fit and transform the training data
X_train_count = count_vectorizer.fit_transform(X_train)

# Transform the test data (using the vocabulary learned from the training data)
X_test_count = count_vectorizer.transform(X_test)
```
In the above block of code, we apply the `CountVectorizer`, a crucial step in text classification. `CountVectorizer` performs two important tasks: first, it tokenizes each message, breaking the text down into individual words; second, it counts the frequency of each word in each message. It then uses this information to transform each message into a numerical vector that our machine learning model can understand and process. The result is a matrix of token counts, stored in `X_train_count` and `X_test_count`.
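If you want to peek at what `CountVectorizer` produced, you can inspect the matrix shape and the learned vocabulary. Note that `get_feature_names_out` is the method name in recent scikit-learn versions (1.0+); older versions used `get_feature_names`:

```python
# Each row is a message; each column is a unique token from the training vocabulary
print(X_train_count.shape)

# A few of the learned tokens (stored in alphabetical order)
print(count_vectorizer.get_feature_names_out()[:10])
```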
Now that we've transformed our text data into numerical vectors, we are in a position to create our Naive Bayes classifier:
```python
# Initialize the MultinomialNB model
naive_bayes_model = MultinomialNB()

# Fit the model on the training data
naive_bayes_model.fit(X_train_count, Y_train)

# Make predictions on the test data
y_pred = naive_bayes_model.predict(X_test_count)
```
Here we initialize a Naive Bayes classifier using the `MultinomialNB` class from Scikit-learn. The `fit` method trains our model on the training data, learning the probabilities of each label (spam or ham) given the input features (token counts). Once the model is trained, we use the `predict` method to make predictions on our test data.
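Beyond hard labels, the trained model can also report how confident it is in each prediction via `predict_proba`. For example, to inspect the class probabilities for the first few test messages:

```python
# Probability estimates for each class; columns follow naive_bayes_model.classes_
print(naive_bayes_model.classes_)
print(naive_bayes_model.predict_proba(X_test_count[:3]))
```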
Accuracy is a common metric for classification. We calculate it as the ratio of the number of correct predictions to the total number of input samples:
```python
# Calculate the accuracy of the model
accuracy = metrics.accuracy_score(Y_test, y_pred)

# Print the accuracy
print(f"Accuracy of Naive Bayes Classifier: {accuracy:.2f}")
```
The output will be:
```
Accuracy of Naive Bayes Classifier: 0.98
```
An accuracy of 0.98 means our classifier labels about 98% of the test messages correctly, only rarely misclassifying an SMS. This high level of accuracy demonstrates the effectiveness of the Naive Bayes classifier for text classification.
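As a quick sanity check, you can also compute the same number directly from the definition above, the fraction of predictions that match the true labels:

```python
# Accuracy by hand: correct predictions divided by total samples
manual_accuracy = (y_pred == Y_test.to_numpy()).mean()
print(f"{manual_accuracy:.2f}")  # should match metrics.accuracy_score
```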
Well done on reaching the end of this lesson! We built an understanding of the Naive Bayes algorithm, implemented it in Python for text classification, and evaluated its performance. The Naive Bayes classifier is a powerful and fast classification tool that is well suited to text data, even though its independence assumption largely ignores the semantics of the text.
In the upcoming exercises, you will get the chance to implement a Naive Bayes classifier and gain valuable hands-on experience. Remember that practicing what you've learned is an essential step in your learning journey. So, get your hands dirty with our exercises and improve your problem-solving abilities and understanding of the Naive Bayes classifier. Let's go! Happy coding!