Lesson 4
Naive Bayes Basics
Lesson Introduction

Hey there! Today we are going to explore an exciting topic in machine learning called Naive Bayes. By the end of this lesson, you'll understand what Naive Bayes is and how to implement it using Python's Scikit-Learn library. Let’s dive in!

Understanding Naive Bayes

Naive Bayes is a classification algorithm based on Bayes' Theorem. Imagine you’re a detective using clues (features) to decide who the culprit is (class). Naive Bayes helps by calculating probabilities.

Bayes' Theorem is stated as:

P(C|X) = \frac{P(X|C) \cdot P(C)}{P(X)}

Where:

  • P(C|X) is the posterior probability of class C given predictor X.
  • P(X|C) is the likelihood: the probability of predictor X given class C.
  • P(C) is the prior probability of class C.
  • P(X) is the prior probability of predictor X.
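
To make the theorem concrete, here is a minimal sketch with made-up numbers (purely illustrative, not from any dataset): suppose 10% of emails are spam, the word "free" appears in 60% of spam emails, and "free" appears in 12% of all emails.

Python
# Worked Bayes' Theorem example with illustrative, made-up numbers.
# C = "spam", X = "email contains the word 'free'"
p_c = 0.10           # P(C): prior probability that an email is spam
p_x_given_c = 0.60   # P(X|C): likelihood of seeing "free" in a spam email
p_x = 0.12           # P(X): probability of seeing "free" in any email

# P(C|X) = P(X|C) * P(C) / P(X)
p_c_given_x = p_x_given_c * p_c / p_x
print(f"P(spam | contains 'free') = {p_c_given_x:.2f}")  # 0.50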

How Naive Bayes Works

  1. Prior Probability: The algorithm starts by calculating the prior probability of each class from the training data. This is simply the probability that a sample belongs to class C before we know anything else about the sample. For example, suppose we want to predict whether an email is spam: if 93% of the emails in the data are not spam, it is reasonable to assume that a given email is not spam with probability 93%. That is the prior probability.
  2. Likelihood: For each feature, the likelihood (the probability of the feature value given the class) is calculated. It tells us how likely we are to observe that feature value among samples of the class.
  3. Independent Features Assumption (Naive Assumption): Assumes the features are independent of one another given the class, which lets the joint likelihood be computed as a simple product of per-feature likelihoods.
  4. Posterior Probability: Using Bayes' Theorem, the posterior probability of each class is computed from the feature values. The class with the highest posterior probability is chosen as the prediction (see the sketch after this list).
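
To see how the naive assumption simplifies the posterior computation, here is a small sketch with invented probabilities that scores two classes by multiplying the prior with per-feature likelihoods. Since P(X) is the same for every class, it can be dropped when comparing classes.

Python
# Illustrative posterior scoring under the naive independence assumption.
# All probabilities below are invented for demonstration.
priors = {"spam": 0.10, "not_spam": 0.90}

# P(feature | class) for two binary features observed in an email
likelihoods = {
    "spam":     {"contains_free": 0.60, "has_attachment": 0.30},
    "not_spam": {"contains_free": 0.05, "has_attachment": 0.20},
}

# Score each class as P(C) * P(x1|C) * P(x2|C); the shared P(X) cancels out
scores = {
    c: priors[c] * likelihoods[c]["contains_free"] * likelihoods[c]["has_attachment"]
    for c in priors
}
print(scores)                       # {'spam': 0.018, 'not_spam': 0.009}
print(max(scores, key=scores.get))  # spam
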
How Naive Bayes Learns

Naive Bayes estimates its priors and likelihoods from the training data. When the model encounters a new sample, it applies Bayes' Theorem to the sample's feature values to calculate the probability of each class. The class with the highest probability is the predicted class.

We will focus on GaussianNB, commonly used when features are continuous and assumed to follow a normal (Gaussian) distribution.
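
Under this assumption, each per-feature likelihood P(x|C) is modeled with the normal density, using the mean and variance of that feature within the class. Here is a minimal sketch of that density, computed by hand with example numbers:

Python
import math

def gaussian_pdf(x, mean, var):
    """Normal density N(x; mean, var), used as the per-feature likelihood."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Example: feature value 5.0 for a class with feature mean 5.5 and variance 0.4
print(f"{gaussian_pdf(5.0, 5.5, 0.4):.3f}")  # 0.461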

Loading the Dataset

Before training our Naive Bayes classifier, we need data. Think of it as needing mystery cases before you can solve them! We'll use the Iris dataset, which includes features of iris flowers used to classify them into species.

Let’s quickly remind ourselves how to load the dataset using Scikit-Learn:

Python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
Training the Naive Bayes Classifier

Now that we have our data split, it’s time to train our Naive Bayes classifier using GaussianNB:

Python
from sklearn.naive_bayes import GaussianNB

# Initialize the Naive Bayes classifier
nb_clf = GaussianNB()

# Train the classifier with training data
nb_clf.fit(X_train, y_train)

Here, fit trains the model using the training data, much like a student learning from textbooks.
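
After fitting, you can inspect the learned priors and the per-class Gaussian parameters, which ties the code back to the theory above. (The attribute names below assume a recent Scikit-Learn version; var_ was called sigma_ in older releases.)

Python
# Class priors P(C), estimated from class frequencies in y_train
print(nb_clf.class_prior_)

# Per-class feature means and variances used for the Gaussian likelihoods
print(nb_clf.theta_)  # shape: (n_classes, n_features)
print(nb_clf.var_)    # shape: (n_classes, n_features)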

Making Predictions and Calculating Accuracy

After training the model, let’s make predictions on the test data and calculate the accuracy:

Python
from sklearn.metrics import accuracy_score

# Make predictions on the testing set
y_pred = nb_clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Bayes model accuracy: {accuracy * 100:.2f}%")
# Bayes model accuracy: 96.67%

Here, y_pred contains the predicted class labels for the test set, and accuracy_score compares these predictions to the true labels (y_test) to calculate the model's accuracy.
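
Beyond hard labels, the classifier can also return the posterior probabilities it computed for each class via predict_proba, which is useful for checking how confident a prediction is:

Python
# Posterior probabilities P(C|X) for the first three test samples
probabilities = nb_clf.predict_proba(X_test[:3])
print(probabilities.round(3))  # one row per sample, one column per class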

Lesson Summary

Great job! You've learned how to use the Naive Bayes classifier for machine learning tasks. Here’s a quick recap:

  • Naive Bayes: A probabilistic classifier based on Bayes' Theorem.
  • Dataset Loading: Used Scikit-Learn's load_iris to load the Iris dataset.
  • Train-Test Split: Used train_test_split to split data into training and testing sets.
  • Model Training: Used GaussianNB to train the Naive Bayes classifier.
  • Making Predictions and Calculating Accuracy: Predicted test set labels and calculated the model's accuracy.

Now it’s time to roll up your sleeves and get hands-on practice! In the next section, you'll implement what you’ve learned and see the Naive Bayes classifier in action on your own. Excited? Let’s get started!
