Implementing Bagging with Decision Trees in Python

Lesson 1

Introduction

Welcome to this lesson on implementing bagging. It expands your machine learning toolkit by introducing the bagging technique and showing how to apply it with decision trees. You will also get hands-on experience through a Python implementation. So, let's get started!

Understanding Bagging

Bagging, or bootstrap aggregating, is an ensemble learning technique that aims to reduce the variance of a machine learning model. It draws multiple bootstrap samples from the original dataset and trains a separate model on each one. The samples are drawn with replacement, so the same data point can appear more than once in a sample while other points are left out of it. The final prediction aggregates the predictions of the individual models. For classification, this works as a vote: the final class prediction is the class chosen by the largest number of models.
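To see what sampling with replacement looks like, here is a minimal sketch on a toy array of ten items. The array, the seed, and the variable names are assumptions made for illustration, and "out-of-bag" is the usual name for items a given sample never draws; none of this is part of the lesson's implementation.

```python
import numpy as np

# Toy dataset of ten items, identified by the values 0..9.
data = np.arange(10)

# A seeded generator keeps the illustration repeatable (the seed is arbitrary).
rng = np.random.default_rng(0)

# Draw a bootstrap sample: same size as the data, sampled with replacement.
sample = rng.choice(data, size=data.size, replace=True)

# Items that were never drawn are "out-of-bag" for this sample.
out_of_bag = np.setdiff1d(data, sample)

print("bootstrap sample:", sample)
print("distinct items drawn:", np.unique(sample).size)
print("out-of-bag items:", out_of_bag)
```

Each item is missed by a sample of size n with probability (1 - 1/n)^n, which approaches 1/e for large n, so a bootstrap sample holds roughly 63% of the distinct data points on average.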

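We will use decision trees as our base models. A decision tree applies a sequence of hierarchical decision rules to its input features to arrive at a final prediction. Trees can in principle handle both categorical and continuous input variables, though scikit-learn's implementation expects numeric features, so categorical inputs must be encoded first.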
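To make the idea of hierarchical decision rules concrete, the sketch below fits one shallow tree to the iris data bundled with sklearn and prints the rules it learned. The depth limit and the seed are arbitrary choices for a short printout, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Fit one shallow tree; max_depth=2 is chosen only to keep the printout short.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the learned if/else rules as indented text.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one threshold test on a single feature, and each leaf line names the class predicted for samples that reach it.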

Python Implementation and Code Walkthrough: Initializations and Data Loading

Our Python implementation uses three libraries: numpy for array operations and random sampling, sklearn for the dataset, the decision tree model, and the evaluation tools, and scipy for the statistical mode function used in voting.

First, we load our dataset. For this lesson, we use the widely recognized iris dataset, which we split into training and test data. The iris dataset is popular in data science and machine learning. It contains measurements of 150 iris flowers from three different species - setosa, versicolor and virginica. The measurements include the lengths and the widths of the sepals and petals of the flowers.
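Before training anything, it can help to look at the data itself. This short sketch only inspects the dataset object that sklearn returns; it is an optional check, not a step of the bagging algorithm.

```python
from sklearn import datasets

iris = datasets.load_iris()

print(iris.data.shape)          # (150, 4): 150 flowers, 4 measurements each
print(iris.feature_names)       # sepal and petal length and width, in cm
print(list(iris.target_names))  # species names, indexed by the labels 0, 1, 2
```

The labels in iris.target are the integers 0, 1, and 2, which is why the voting step later can operate directly on numeric arrays.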

The variable n_models, set here to 100, determines the number of decision tree classifiers we plan to build. The random_states list gives each tree its own seed.

Python
import numpy as np
from scipy import stats
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the data
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the data into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Parameters
n_models = 100
random_states = list(range(n_models))

Python Implementation and Code Walkthrough: Bagging Algorithm

Next, we define two helper functions, bootstrapping and predict, which form the core of our bagging model.

We then train the decision trees one by one, make predictions with the whole ensemble, and calculate its accuracy using sklearn's accuracy_score() function.

Python
# Helper function for bootstrapping
def bootstrapping(X, y):
    n_samples = X.shape[0]
    # Draw n_samples row indices at random, with replacement
    idxs = np.random.choice(n_samples, n_samples, replace=True)
    return X[idxs], y[idxs]

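The bootstrapping function generates a bootstrapped dataset by drawing n_samples row indices at random with replacement, so the result has the same size as the input and may contain repeated rows. Because it uses NumPy's global random state without a seed, each run draws different samples.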
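If repeatable resamples are needed, one option is to pass in an explicit random generator. The sketch below is a hypothetical variant, not the lesson's function: the name bootstrapping_seeded, the demo arrays, and the seed are all assumptions made for illustration.

```python
import numpy as np

def bootstrapping_seeded(X, y, rng):
    """Hypothetical variant of bootstrapping() that draws from an explicit generator."""
    n_samples = X.shape[0]
    # integers() samples row indices uniformly, with replacement.
    idxs = rng.integers(0, n_samples, size=n_samples)
    return X[idxs], y[idxs]

# Small stand-in arrays; the values are made up for illustration.
X_demo = np.arange(12).reshape(6, 2)
y_demo = np.array([0, 0, 1, 1, 2, 2])

X_a, y_a = bootstrapping_seeded(X_demo, y_demo, np.random.default_rng(7))
X_b, y_b = bootstrapping_seeded(X_demo, y_demo, np.random.default_rng(7))

# The same seed reproduces the same resample, and its shape matches the input.
print(np.array_equal(X_a, X_b), X_a.shape)
```

Indexing X and y with the same idxs array is what keeps every resampled row paired with its own label.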

Python
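# Helper function for bagging prediction
def predict(X, models):
    # Stack predictions into an array of shape (n_models, n_samples)
    predictions = np.array([model.predict(X) for model in models])
    # Majority vote per sample: mode across models, flattened to 1-D
    return np.ravel(stats.mode(predictions, axis=0).mode)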

The predict function collects each trained model's predictions for X and returns, for every sample, the most frequent prediction (the mode) as the final answer.
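The voting step is easiest to follow on a small hand-written example. In this sketch the votes matrix is made up: each row stands for one model and each column for one test sample.

```python
import numpy as np
from scipy import stats

# Rows are models, columns are test samples; the labels are made up.
votes = np.array([
    [0, 1, 2, 1],   # model 1
    [0, 2, 2, 1],   # model 2
    [1, 1, 2, 0],   # model 3
])

# axis=0 takes the mode down each column, i.e. across models for each sample.
majority = np.ravel(stats.mode(votes, axis=0).mode)
print(majority)  # [0 1 2 1]
```

When several labels tie for a sample, scipy.stats.mode returns only one of them, which to my knowledge is the smallest, so it is worth keeping in mind that ties are not broken at random.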

Python
# Create a list to store all the tree models
tree_models = []

# Iteratively train decision trees on bootstrapped samples
for i in range(n_models):
    X_, y_ = bootstrapping(X_train, y_train)
    tree = DecisionTreeClassifier(max_depth=2, random_state=random_states[i])
    tree.fit(X_, y_)
    tree_models.append(tree)

# Predict on the test set
y_pred = predict(X_test, tree_models)

# Print the accuracy
print("Accuracy: ", accuracy_score(y_test, y_pred))
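As a cross-check on a hand-written ensemble, scikit-learn also ships a ready-made BaggingClassifier in sklearn.ensemble. The sketch below is a rough comparison under the same split and tree depth as the lesson, not a replacement for the code above; the seed is arbitrary.

```python
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# The base model is passed positionally because its keyword name has
# changed between scikit-learn releases.
bagging = BaggingClassifier(
    DecisionTreeClassifier(max_depth=2),
    n_estimators=100,   # same ensemble size as n_models in the lesson
    random_state=42,    # arbitrary seed for repeatability
)
bagging.fit(X_train, y_train)

sk_pred = bagging.predict(X_test)
print("Accuracy:", accuracy_score(y_test, sk_pred))
```

The two approaches need not agree exactly: BaggingClassifier averages predicted class probabilities when the base model provides them, whereas our predict function counts hard votes.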

Decision trees are only an example here; any other classifier with fit and predict methods could serve as the base model.
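Here is a minimal sketch of the same bagging loop with a different base model. KNeighborsClassifier, n_neighbors=5, the ensemble size of 25, and the seed are all arbitrary assumptions for illustration; the block is written to run on its own, so it repeats the data loading.

```python
import numpy as np
from scipy import stats
from sklearn import datasets
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Any classifier with fit() and predict() can serve as the base model.
base_model = KNeighborsClassifier(n_neighbors=5)

rng = np.random.default_rng(0)  # seeded so the run is repeatable
models = []
for _ in range(25):  # fewer models than the lesson's 100, to keep it quick
    idxs = rng.integers(0, X_train.shape[0], size=X_train.shape[0])
    # clone() gives a fresh, unfitted copy with the same hyperparameters.
    models.append(clone(base_model).fit(X_train[idxs], y_train[idxs]))

# Majority vote across models, one column per test sample.
all_preds = np.array([m.predict(X_test) for m in models])
y_pred = np.ravel(stats.mode(all_preds, axis=0).mode)
print("Accuracy:", accuracy_score(y_test, y_pred))
```

Only the base_model line changes when swapping classifiers; the resampling and voting steps stay the same.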

Model Evaluation, Lesson Summary, and Practice

After implementing a model, we must evaluate its performance. For our bagging model, accuracy serves as the performance metric: the ratio of correct predictions to the total number of predictions, measured here on the held-out test set. We use sklearn's accuracy_score() function to calculate it.
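The accuracy formula is simple enough to verify by hand. In this sketch the two label arrays are made up, with 5 of the 6 predictions correct, and the manual result is compared against accuracy_score().

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: 5 of the 6 predictions are correct.
y_true = np.array([0, 1, 2, 2, 1, 0])
y_hat = np.array([0, 1, 2, 1, 1, 0])

# Accuracy is the fraction of positions where the prediction equals the truth.
manual_accuracy = np.mean(y_true == y_hat)

print(manual_accuracy)
print(accuracy_score(y_true, y_hat))
```

Both lines print the same value, 5/6, which confirms that accuracy_score() computes correct predictions divided by total predictions.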

Congratulations! You've successfully navigated the basics of bagging with decision trees. You've learned about the fundamentals of bagging, implemented a bagging algorithm using decision trees in Python, and assessed the model's accuracy. Your understanding of these concepts will be further solidified through exercises in the next section. Have fun practicing!
