Hello and welcome! Today, we will explore the world of text classification using the Naive Bayes algorithm, specifically in Python using the library Scikit-learn. By the end of this lesson, you will understand how Naive Bayes works, how to implement a Naive Bayes model in Python, and how to evaluate its performance. Let's get started!
The Naive Bayes algorithm is a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features. It provides a way to calculate the probability that a certain event will occur given that another event has already occurred. In text classification, the event we're interested in is a specific class label, such as spam or ham (not spam); the given event is the text input we have, a particular SMS message in our case.
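To make this concrete, here is a tiny, hand-computed illustration of Bayes' theorem. The counts below are purely hypothetical, not taken from our dataset:

```python
# Hypothetical counts: out of 100 messages, 20 are spam.
# The word "free" appears in 12 of the spam messages and 4 of the ham messages.
p_spam = 20 / 100            # P(spam)
p_free_given_spam = 12 / 20  # P("free" | spam)
p_free = (12 + 4) / 100      # P("free") across all messages

# Bayes' theorem: P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(p_spam_given_free)  # 0.75, so a message containing "free" is likely spam
```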
The 'naive' in Naive Bayes comes from the assumption that each feature contributes independently to the probability of a particular outcome. This assumption often doesn't hold in the real world (the words in an SMS are far from independent), yet Naive Bayes still tends to perform very well in text classification, especially for such a simple and fast method.
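Under that independence assumption, the likelihood of a whole message factorizes into a product of per-word probabilities, and the class with the highest score wins. Here is a minimal sketch of that scoring step, using hypothetical per-word probabilities rather than values learned from our data:

```python
import math

# Hypothetical values: the prior P(spam) and per-word probabilities P(word | spam)
p_spam = 0.2
p_word_given_spam = {"win": 0.05, "free": 0.06, "prize": 0.04}

# Naive Bayes scores a message as P(spam) * product of P(word | spam).
# In practice the logs are summed instead, to avoid numerical underflow.
message = ["win", "free", "prize"]
log_score = math.log(p_spam) + sum(math.log(p_word_given_spam[w]) for w in message)
print(log_score)  # higher (less negative) scores mean a better fit to the spam class
```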
Before we start building our Naive Bayes model, let's load our dataset and perform the necessary preparations:
```python
# Import the necessary libraries
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import datasets

# Load the dataset
spam_dataset = datasets.load_dataset('codesignal/sms-spam-collection', split='train')
spam_dataset = pd.DataFrame(spam_dataset)

# Define X (input features) and Y (output labels)
X = spam_dataset["message"]
Y = spam_dataset["label"]

# Perform a stratified train-test split so both sets keep the same spam/ham ratio
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42, stratify=Y
)
```
In the above block of code, we load our SMS dataset and perform a train-test split. These are the preliminary stages of preparing the dataset for modeling: by separating the data into a training set and a test set, we ensure that our model can learn from one portion of the data (the training set) and then have its performance evaluated on unseen data (the test set).
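Because we passed `stratify=Y`, the spam/ham ratio is preserved in both sets. If you'd like to verify this yourself, a quick check (not part of the lesson's original code) is:

```python
# Compare class proportions in the training and test sets; they should be nearly identical
print(Y_train.value_counts(normalize=True))
print(Y_test.value_counts(normalize=True))
```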
Before we dive into building the Naive Bayes model, it's essential to prepare our data. Because machine learning algorithms operate on numeric data, we must first convert our SMS text data into numerical features:
```python
# Initialize the CountVectorizer
count_vectorizer = CountVectorizer()

# Fit and transform the training data
X_train_count = count_vectorizer.fit_transform(X_train)

# Transform the test data (using the vocabulary learned from the training data)
X_test_count = count_vectorizer.transform(X_test)
```
In the above block of code, we apply the `CountVectorizer`, a crucial step in text classification. `CountVectorizer` performs two important tasks: first, it tokenizes each message, breaking the text down into individual words; second, it counts the frequency of each word in each message. It then uses this information to transform each message into a numerical vector that our machine learning model can understand and process. The result is a matrix of token counts, stored in `X_train_count` and `X_test_count`.
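If you want to peek at what `CountVectorizer` produced, you can inspect the matrix shape and the learned vocabulary. Note that `get_feature_names_out` is the method name in recent scikit-learn versions (1.0+); older versions used `get_feature_names`:

```python
# Each row is a message; each column is a unique token from the training vocabulary
print(X_train_count.shape)

# A few of the learned tokens (stored in alphabetical order)
print(count_vectorizer.get_feature_names_out()[:10])
```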
Now that we've transformed our text data into numerical vectors, we are in a position to create our Naive Bayes classifier:
```python
# Initialize the MultinomialNB model
naive_bayes_model = MultinomialNB()

# Fit the model on the training data
naive_bayes_model.fit(X_train_count, Y_train)

# Make predictions on the test data
y_pred = naive_bayes_model.predict(X_test_count)
```
Here we initialize a Naive Bayes classifier using the `MultinomialNB` class from Scikit-learn. The `fit` method trains our model on the training data, learning the probabilities of each label (spam or ham) given the input features (token counts). Once the model is trained, we use the `predict` method to make predictions on our test data.
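Beyond hard labels, the trained model can also report how confident it is in each prediction via `predict_proba`. For example, to inspect the class probabilities for the first few test messages:

```python
# Probability estimates for each class; columns follow naive_bayes_model.classes_
print(naive_bayes_model.classes_)
print(naive_bayes_model.predict_proba(X_test_count[:3]))
```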
Accuracy is a common metric for classification. We calculate it as the ratio of the number of correct predictions to the total number of input samples:
```python
# Calculate the accuracy of the model
accuracy = metrics.accuracy_score(Y_test, y_pred)

# Print the accuracy
print(f"Accuracy of Naive Bayes Classifier: {accuracy:.2f}")
```
The output will be:
```
Accuracy of Naive Bayes Classifier: 0.98
```
An accuracy of 0.98 means our classifier labels about 98% of the test messages correctly, only rarely misclassifying an SMS. This high level of accuracy demonstrates the effectiveness of the Naive Bayes classifier for text classification.
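As a quick sanity check, you can also compute the same number directly from the definition above, the fraction of predictions that match the true labels:

```python
# Accuracy by hand: correct predictions divided by total samples
manual_accuracy = (y_pred == Y_test.to_numpy()).mean()
print(f"{manual_accuracy:.2f}")  # should match metrics.accuracy_score
```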
Well done on reaching the end of this lesson! We built an understanding of the Naive Bayes algorithm, implemented it in Python for text classification, and evaluated its performance. The Naive Bayes classifier is a powerful and fast classification tool that is well suited to text data, even though its independence assumption largely ignores the semantics of the text.
In the upcoming exercises, you will get the chance to implement a Naive Bayes classifier and gain valuable hands-on experience. Remember that practicing what you've learned is an essential step in your learning journey. So, get your hands dirty with our exercises and improve your problem-solving abilities and understanding of the Naive Bayes classifier. Let's go! Happy coding!