Hello! In this lesson, we're diving into a powerful technique in machine learning called Bagging. Bagging stands for Bootstrap Aggregating. Imagine making important decisions by averaging the opinions of a large group rather than relying on just one individual. This collaborative approach generally leads to better and more stable decisions. The idea behind all ensemble methods is to combine predictions from multiple models to produce a single prediction. Our goal is to understand Bagging, how it works, and how to implement it using Python's scikit-learn library.
Bagging is an ensemble method. It improves the stability and accuracy of machine learning models by training multiple models on different bootstrap samples of the dataset and combining their results. Think of it as working with a panel of experts rather than a single adviser.
Let's break it down with a simple example:
Suppose you have a dataset of different types of flowers and you want to classify them. Instead of training just one decision tree, which might overfit to your training data, you can train multiple decision trees on different subsets of your data. Each subset is created by randomly selecting samples from the original dataset (with replacement) and has the same size as the original dataset. Then, you aggregate the predictions from all the trees. This process reduces overfitting and leads to a more robust model.
It is important to note that a decision tree is just an example. You can use any model with bagging.
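To make the resampling and aggregation ideas concrete before we touch scikit-learn, here is a minimal NumPy sketch. The numbers are made up purely for illustration and are not part of the wine example we'll build below:

```python
import numpy as np

rng = np.random.default_rng(42)

# Pretend we have a tiny dataset of 10 samples (indices 0 through 9)
n_samples = 10

# One bootstrap sample: indices drawn with replacement, same size as the original
bootstrap_indices = rng.integers(0, n_samples, size=n_samples)
print(bootstrap_indices)  # some indices repeat, others are left out entirely

# Aggregation step: suppose 5 mini-models predict these classes for one flower
predictions = np.array([0, 1, 1, 1, 0])
majority_vote = np.bincount(predictions).argmax()
print(majority_vote)  # 1, the class most mini-models agreed on
```

Each mini-model gets its own bootstrap sample, and for classification the ensemble's answer is the majority vote across them.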
Let's start by loading a dataset. Think of it as a table of data where each row is an example we're learning from, and each column is a feature or quality about the examples. For today, we'll use a dataset about wine. This dataset comes with scikit-learn, so it's easy to load.
Here's the code to load the dataset:
```python
from sklearn.datasets import load_wine

# Load dataset
X, y = load_wine(return_X_y=True)
# Note: The output is a tuple of feature matrix X and target vector y
```
In this code, `X` represents the features of the dataset, and `y` represents the labels (the class of wine).
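If you'd like to peek at what was just loaded, an optional check like this prints the dataset's shape and the class labels:

```python
import numpy as np

print(X.shape)       # (178, 13): 178 wine samples, 13 features each
print(np.unique(y))  # [0 1 2]: three classes of wine
```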
To test our model properly, we split our data into training and testing parts, like studying for a test and then taking it. Use `train_test_split` from scikit-learn to do this:
```python
from sklearn.model_selection import train_test_split

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# X_train, X_test, y_train, and y_test are arrays
# For instance, len(X_train) would be 142, which is 80% of 178 samples
```
- `test_size=0.2` uses 20% of the data for testing and 80% for training.
- `random_state=42` ensures the split is the same each time you run the code.
Before we dive into Bagging, let's first build a simple decision tree to see its performance:
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Train a single decision tree classifier
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred_tree = tree_clf.predict(X_test)
tree_accuracy = accuracy_score(y_test, y_pred_tree)

print(f"Accuracy of single Decision Tree: {tree_accuracy:.2f}")  # 0.94
```
Now let's create our Bagging classifier. We’ll start by defining the Bagging classifier and specifying its parameters:
```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Train a bagging classifier
bag_clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42)
```
In this code:
- We create a `BaggingClassifier`, our team captain organizing the mini-models.
- `estimator=DecisionTreeClassifier()` means each mini-model is a decision tree.
- `n_estimators=100` means we'll have 100 mini-models (or decision trees) on our team.
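`BaggingClassifier` also accepts a few other parameters you may want to experiment with. The configuration below is only an illustrative alternative (the values are arbitrary examples), and nothing else in this lesson depends on it:

```python
# Illustrative alternative configuration; values chosen only as an example
bag_clf_custom = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,  # each mini-model trains on a bootstrap sample 80% the size of the data
    bootstrap=True,   # sample with replacement (the "bootstrap" in Bagging); this is the default
    n_jobs=-1,        # train the mini-models in parallel on all available CPU cores
    random_state=42,
)
```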
Let's continue by training the classifier with our training data:
```python
bag_clf.fit(X_train, y_train)
```
Finally, let's make predictions with our Bagging classifier and evaluate its performance:
```python
# Predict and calculate accuracy
y_pred_bag = bag_clf.predict(X_test)
bag_accuracy = accuracy_score(y_test, y_pred_bag)

print(f"Accuracy of Bagging Classifier: {bag_accuracy:.2f}")  # 0.97
```
We see that the bagging technique improved the accuracy on the test set from 0.94 to 0.97.
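As an optional check on why averaging helps, the fitted classifier keeps its individual trees in `bag_clf.estimators_`, so you can compare their individual test accuracies with the ensemble's score (the exact numbers will vary):

```python
# Score each individual tree inside the fitted bagging ensemble
individual_scores = [tree.score(X_test, y_test) for tree in bag_clf.estimators_]
print(f"Individual trees range from {min(individual_scores):.2f} to {max(individual_scores):.2f}")
```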
While bagging offers numerous benefits, it also has some drawbacks:
Advantages:
- Reduced Overfitting: By aggregating the results from multiple models, bagging helps to minimize overfitting.
- Improved Accuracy: The overall performance of the ensemble method is generally better than that of a single model.
- Stability: Bagging provides more stable predictions by reducing the variance in the model's output (a short sketch after the lists below illustrates this).
Disadvantages:
- Increased Computational Cost: Training multiple models can be computationally expensive and time-consuming.
- Complexity: Combining multiple models can make the model more complex and harder to interpret.
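To illustrate the stability point from the advantages list, here is an optional sketch comparing 5-fold cross-validated scores of the single tree and the bagging ensemble. The exact numbers depend on the data and the random seed, but the bagged model's scores typically vary less across folds:

```python
from sklearn.model_selection import cross_val_score

tree_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
bag_scores = cross_val_score(bag_clf, X, y, cv=5)

print(f"Single tree: mean={tree_scores.mean():.2f}, std={tree_scores.std():.2f}")
print(f"Bagging:     mean={bag_scores.mean():.2f}, std={bag_scores.std():.2f}")
```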
Well done! You've learned what Bagging is, why it's useful, and how it works through an example. You also learned how to load a dataset, split it, and build both a single Decision Tree and a Bagging classifier using scikit-learn. We've shown that the Bagging classifier typically performs better by combining the results of multiple decision trees.
Now, it’s time for hands-on practice! Apply what you've learned by writing the code yourself. Ready? Let's get started!