Lesson 4

Hey there! Today we are going to explore an exciting topic in machine learning called *Naive Bayes*. By the end of this lesson, you'll understand what `Naive Bayes` is and how to implement it using Python's `Scikit-Learn` library. Let’s dive in!

`Naive Bayes` is a classification algorithm based on *Bayes' Theorem*. Imagine you’re a detective using clues (features) to decide who the culprit is (class). `Naive Bayes` helps by calculating probabilities.

Bayes' Theorem is stated as:

$P(C|X) = \frac{P(X|C) \cdot P(C)}{P(X)}$

Where:

- $P(C|X)$ is the posterior probability of class $C$ given predictor $X$.
- $P(X|C)$ is the likelihood, which is the probability of predictor $X$ given class $C$.
- $P(C)$ is the prior probability of class $C$.
- $P(X)$ is the prior probability of predictor $X$ (also called the evidence).
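To make the theorem concrete, here is a tiny worked example in Python. The numbers are made up purely for illustration (a 7% spam rate and hypothetical word likelihoods): we compute the posterior probability that an email is spam given that it contains the word "free".

```python
# Hypothetical numbers, chosen only to illustrate Bayes' Theorem
p_spam = 0.07              # prior P(C): 7% of all emails are spam
p_free_given_spam = 0.50   # likelihood P(X|C): "free" appears in half of spam emails
p_free_given_ham = 0.01    # likelihood P(X|not C): "free" is rare in normal emails

# Evidence P(X) via the law of total probability
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Posterior P(C|X) from Bayes' Theorem
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(f"P(spam | 'free') = {p_spam_given_free:.3f}")  # ≈ 0.790
```

Even though only 7% of emails are spam, seeing the word "free" pushes the posterior up to roughly 79%.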

How Naive Bayes Works

- **Prior Probability**: The algorithm starts by calculating the prior probability of each class from the training data. This is simply the probability that a sample belongs to class $C$ before we know anything else about it. For example, imagine we are predicting whether an email is spam. If 93% of the emails in the data are not spam, then it is reasonable to assume that a given email is not spam with a probability of 93%. This is the prior probability.
- **Likelihood**: For each feature, the likelihood (the probability of the feature given the class) is calculated. This is the probability of observing a particular feature value among samples of a given class.
- **Independent Features Assumption (Naive Assumption)**: The algorithm assumes that the features are independent of one another, which greatly simplifies the calculations.
- **Posterior Probability**: Using Bayes' Theorem, the posterior probability of each class is computed given the feature values. The class with the highest posterior probability is chosen as the prediction (see the sketch after this list).
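Here is a minimal sketch of how these steps combine, again with hypothetical priors and likelihoods: under the naive assumption, each class's unnormalized posterior is its prior multiplied by the per-feature likelihoods, and we predict the class with the highest score.

```python
import math

# Hypothetical priors and per-word likelihoods for a two-word email
priors = {"spam": 0.07, "ham": 0.93}
likelihoods = {
    "spam": {"free": 0.50, "meeting": 0.05},
    "ham":  {"free": 0.01, "meeting": 0.20},
}
features = ["free", "meeting"]

# Naive assumption: treat features as independent and multiply their likelihoods
scores = {c: priors[c] * math.prod(likelihoods[c][f] for f in features) for c in priors}
prediction = max(scores, key=scores.get)
print(scores, "->", prediction)  # ham wins narrowly (≈0.00186 vs ≈0.00175)
```

Notice how the word "meeting" outweighs "free" here: both the prior and every likelihood pull the final score up or down.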

`Naive Bayes` learns its likelihoods and priors from the training data. When the model encounters new data, it breaks each sample into its constituent features and applies Bayes' Theorem to calculate the class probabilities. The class with the highest probability is the predicted class.

We will focus on `GaussianNB`, commonly used when features are continuous and assumed to follow a normal (Gaussian) distribution.
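Concretely, `GaussianNB` estimates a mean $\mu_C$ and variance $\sigma_C^2$ for each feature in each class, and evaluates the likelihood with the normal density $P(x|C) = \frac{1}{\sqrt{2\pi\sigma_C^2}} \exp\left(-\frac{(x-\mu_C)^2}{2\sigma_C^2}\right)$. Here is a minimal sketch of that density, using made-up means and variances:

```python
import math

def gaussian_pdf(x, mean, var):
    """Normal density: the likelihood of a continuous feature value."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical per-class estimates for one feature (values are illustrative only)
print(gaussian_pdf(1.4, mean=1.5, var=0.03))  # ≈ 1.95: high likelihood near this class's mean
print(gaussian_pdf(1.4, mean=4.3, var=0.22))  # ≈ 4e-9: tiny likelihood far from the mean
```

A feature value close to a class's mean yields a high likelihood for that class; a value far away yields a vanishingly small one.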

Before training our `Naive Bayes` classifier, we need data. Consider it like needing mystery stories before solving them! We'll use the **Iris dataset**, which includes features of iris flowers to classify them into species.

Let’s quickly remind ourselves how to load the dataset using `Scikit-Learn`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
```
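As a quick optional sanity check, you can confirm the split: the Iris dataset has 150 samples with 4 features each, so `test_size=0.4` leaves 90 samples for training and 60 for testing.

```python
# Quick sanity check on the 60/40 split of the 150 Iris samples
print(X_train.shape, X_test.shape)  # (90, 4) (60, 4)
print(y_train.shape, y_test.shape)  # (90,) (60,)
```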

Now that we have our data split, it’s time to train our `Naive Bayes` classifier using `GaussianNB`:

```python
from sklearn.naive_bayes import GaussianNB

# Initialize the Naive Bayes classifier
nb_clf = GaussianNB()

# Train the classifier with training data
nb_clf.fit(X_train, y_train)
```

Here, `fit` trains the model using the training data, much like a student learning from textbooks.
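After fitting, you can inspect the quantities the theory section described, since the model stores them as attributes (in recent scikit-learn versions; `var_` was called `sigma_` before version 1.0):

```python
# The fitted model stores exactly the quantities from the theory above
print(nb_clf.class_prior_)  # priors P(C): the fraction of training samples per species
print(nb_clf.theta_)        # per-class mean of each feature
print(nb_clf.var_)          # per-class variance of each feature
```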

After training the model, let’s make predictions on the test data and calculate the accuracy:

```python
from sklearn.metrics import accuracy_score

# Make predictions on the testing set
y_pred = nb_clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Bayes model accuracy: {accuracy * 100:.2f}%")
# Bayes model accuracy: 96.67%
```

Here, `y_pred` contains the predicted class labels for the test set, and `accuracy_score` compares these predictions to the true labels (`y_test`) to calculate the model's accuracy.
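If you want the posterior probabilities themselves rather than only the winning class, `predict_proba` returns $P(C|X)$ for every class:

```python
# Posterior probabilities for the first three test samples
proba = nb_clf.predict_proba(X_test[:3])
print(proba.round(3))  # one row per sample, one column per species; each row sums to 1
```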

Great job! You've learned how to use the `Naive Bayes` classifier for machine learning tasks. Here’s a quick recap:

- **Naive Bayes**: A probabilistic classifier based on Bayes' Theorem.
- **Dataset Loading**: Used `Scikit-Learn`'s `load_iris` to load the **Iris dataset**.
- **Train-Test Split**: Used `train_test_split` to split data into training and testing sets.
- **Model Training**: Used `GaussianNB` to train the `Naive Bayes` classifier.
- **Making Predictions and Calculating Accuracy**: Predicted test set labels and calculated the model's accuracy.

Now it’s time to roll up your sleeves and get hands-on practice! In the next section, you'll implement what you’ve learned and see the `Naive Bayes` classifier in action on your own. Excited? Let’s get started!