Hello! In this lesson, we're diving into a powerful technique in machine learning called Bagging. Bagging stands for Bootstrap Aggregating. Imagine making important decisions by averaging the opinions of a large group rather than relying on just one individual. This collaborative approach generally leads to better and more stable decisions. The idea behind all ensemble methods is to combine predictions from multiple models to produce a single prediction. Our goal is to understand Bagging, how it works, and how to implement it using Python's scikit-learn library.
Bagging is an ensemble method. It improves the stability and accuracy of machine learning models by training multiple models on different bootstrap samples of the dataset and combining their results. Think of it as working with a panel of experts rather than a single adviser.
Let's break it down with a simple example:
Suppose you have a dataset of different types of flowers and you want to classify them. Instead of training just one decision tree, which might overfit to your training data, you can train multiple decision trees on different subsets of your data. Each subset is created by randomly selecting samples from the original dataset (with replacement) and has the same size as the original dataset. Then, you aggregate the predictions from all the trees. This process reduces overfitting and leads to a more robust model.
It is important to note that a decision tree is just an example. You can use any model with bagging.
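To make the resampling and aggregation ideas concrete before we touch scikit-learn, here is a minimal NumPy sketch. The numbers are made up purely for illustration and are not part of the wine example we'll build below:

```python
import numpy as np

rng = np.random.default_rng(42)

# Pretend we have a tiny dataset of 10 samples (indices 0 through 9)
n_samples = 10

# One bootstrap sample: indices drawn with replacement, same size as the original
bootstrap_indices = rng.integers(0, n_samples, size=n_samples)
print(bootstrap_indices)  # some indices repeat, others are left out entirely

# Aggregation step: suppose 5 mini-models predict these classes for one flower
predictions = np.array([0, 1, 1, 1, 0])
majority_vote = np.bincount(predictions).argmax()
print(majority_vote)  # 1, the class most mini-models agreed on
```

Each mini-model gets its own bootstrap sample, and for classification the ensemble's answer is the majority vote across them.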
Let's start by loading a dataset. Think of it as a table of data where each row is an example we're learning from, and each column is a feature or quality about the examples. For today, we'll use a dataset about wine. This dataset comes with scikit-learn, so it's easy to load.
Here's the code to load the dataset:
```python
from sklearn.datasets import load_wine

# Load dataset
X, y = load_wine(return_X_y=True)
# Note: The output is a tuple of feature matrix X and target vector y
```
In this code, `X` represents the features of the dataset, and `y` represents the labels (the class of wine).
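If you'd like to peek at what was just loaded, an optional check like this prints the dataset's shape and the class labels:

```python
import numpy as np

print(X.shape)       # (178, 13): 178 wine samples, 13 features each
print(np.unique(y))  # [0 1 2]: three classes of wine
```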
To test our model properly, we split our data into training and testing parts, like studying for a test and then taking it. Use `train_test_split` from scikit-learn to do this:
```python
from sklearn.model_selection import train_test_split

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# X_train, X_test, y_train, and y_test are arrays
# For instance, len(X_train) would be 142, which is 80% of 178 samples
```
- `test_size=0.2` uses 20% of the data for testing and 80% for training.
- `random_state=42` ensures the split is the same each time you run the code.
Before we dive into Bagging, let's first build a simple decision tree to see its performance:
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Train a single decision tree classifier
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred_tree = tree_clf.predict(X_test)
tree_accuracy = accuracy_score(y_test, y_pred_tree)

print(f"Accuracy of single Decision Tree: {tree_accuracy:.2f}")  # 0.94
```
Now let's create our Bagging classifier. We’ll start by defining the Bagging classifier and specifying its parameters:
```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Train a bagging classifier
bag_clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42)
```
In this code:
- We create a `BaggingClassifier`, our team captain organizing the mini-models.
- `estimator=DecisionTreeClassifier()` means each mini-model is a decision tree.
- `n_estimators=100` means we'll have 100 mini-models (or decision trees) on our team.
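`BaggingClassifier` also accepts a few other parameters you may want to experiment with. The configuration below is only an illustrative alternative (the values are arbitrary examples), and nothing else in this lesson depends on it:

```python
# Illustrative alternative configuration; values chosen only as an example
bag_clf_custom = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,  # each mini-model trains on a bootstrap sample 80% the size of the data
    bootstrap=True,   # sample with replacement (the "bootstrap" in Bagging); this is the default
    n_jobs=-1,        # train the mini-models in parallel on all available CPU cores
    random_state=42,
)
```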
Let's continue by training the classifier with our training data:
```python
bag_clf.fit(X_train, y_train)
```
Finally, let's make predictions with our Bagging classifier and evaluate its performance:
```python
# Predict and calculate accuracy
y_pred_bag = bag_clf.predict(X_test)
bag_accuracy = accuracy_score(y_test, y_pred_bag)

print(f"Accuracy of Bagging Classifier: {bag_accuracy:.2f}")  # 0.97
```
We see that the bagging technique improved the accuracy on the test set from 0.94 to 0.97.
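As an optional check on why averaging helps, the fitted classifier keeps its individual trees in `bag_clf.estimators_`, so you can compare their individual test accuracies with the ensemble's score (the exact numbers will vary):

```python
# Score each individual tree inside the fitted bagging ensemble
individual_scores = [tree.score(X_test, y_test) for tree in bag_clf.estimators_]
print(f"Individual trees range from {min(individual_scores):.2f} to {max(individual_scores):.2f}")
```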
While bagging offers numerous benefits, it also has some drawbacks:
Advantages:
- Reduced Overfitting: By aggregating the results from multiple models, bagging helps to minimize overfitting.
- Improved Accuracy: The overall performance of the ensemble method is generally better than that of a single model.
- Stability: Bagging provides more stable predictions by reducing the variance in the model's output (a short sketch after the lists below illustrates this).
Disadvantages:
- Increased Computational Cost: Training multiple models can be computationally expensive and time-consuming.
- Complexity: Combining multiple models can make the model more complex and harder to interpret.
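To illustrate the stability point from the advantages list, here is an optional sketch comparing 5-fold cross-validated scores of the single tree and the bagging ensemble. The exact numbers depend on the data and the random seed, but the bagged model's scores typically vary less across folds:

```python
from sklearn.model_selection import cross_val_score

tree_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
bag_scores = cross_val_score(bag_clf, X, y, cv=5)

print(f"Single tree: mean={tree_scores.mean():.2f}, std={tree_scores.std():.2f}")
print(f"Bagging:     mean={bag_scores.mean():.2f}, std={bag_scores.std():.2f}")
```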
Well done! You've learned what Bagging is, why it's useful, and how it works through an example. You also learned how to load a dataset, split it, and build both a single Decision Tree and a Bagging classifier using scikit-learn. We've shown that the Bagging classifier typically performs better by combining the results of multiple decision trees.
Now, it’s time for hands-on practice! Apply what you've learned by writing the code yourself. Ready? Let's get started!