Lesson 1

Welcome to our exploration of Implementing Bagging. This lesson expands upon your machine learning toolkit by introducing you to the bagging technique and illustrating its use with decision trees. You will also gain hands-on experience with these concepts through a Python implementation. So, let's embark on our bagging adventure!

**Bagging**, or *bootstrap aggregating*, is an ensemble learning technique that aims to reduce the variance of a machine learning model. It generates multiple subsets of the original dataset and trains a separate model on each one. The subsets are drawn with replacement, so a single subset can contain the same data point more than once. The final prediction is made by aggregating the predictions of the individual models. For classification, this amounts to **voting**: the final class is the one predicted by the majority of the models.
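To build some intuition for why aggregation reduces variance, here is a minimal simulation. It is an idealized stand-in rather than bagging itself: each "model" is assumed to be an independent noisy estimate of the same true value, and the seed, trial count, and model count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)       # arbitrary seed for reproducibility
n_trials, n_estimators = 10_000, 25  # illustrative sizes, not from the lesson

# Each row is one trial: 25 independent noisy estimates of the true value 0.0
estimates = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_estimators))

var_single = estimates[:, 0].var()           # variance of one estimate alone
var_averaged = estimates.mean(axis=1).var()  # variance after averaging all 25

print(f"single: {var_single:.3f}, averaged: {var_averaged:.3f}")
```

In a real bagged ensemble the models are trained on overlapping bootstrap samples, so their errors are correlated and the reduction is smaller than in this idealized case.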

We will use *decision trees* as our base models. Capable of supporting both categorical and continuous input variables, decision trees follow sequential, hierarchical decision rules to output a final decision.

Our Python implementation calls upon several libraries, such as **numpy** for advanced mathematical computations on multi-dimensional arrays, **sklearn** for providing machine learning and statistical modeling tools, and **scipy** for statistical functions.

First, we load our dataset. For this lesson, we use the widely recognized iris dataset, which we split into training and test data. The iris dataset is popular in data science and machine learning. It contains measurements of 150 iris flowers from three different species - setosa, versicolor and virginica. The measurements include the lengths and the widths of the sepals and petals of the flowers.

The variable `n_models`, set here as 100, determines the number of decision tree classifiers we plan to build.

```python
import numpy as np
from scipy import stats
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the data
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Parameters
n_models = 100
random_states = [i for i in range(n_models)]
```

Next, we define our helper functions, `bootstrapping` and `predict`, which are pivotal to constructing our `bagging` model.

Subsequently, we iteratively train our decision tree models, make predictions, and calculate the model's accuracy using sklearn's `accuracy_score()` function.

```python
# Helper function for bootstrapping
def bootstrapping(X, y):
    n_samples = X.shape[0]
    idxs = np.random.choice(n_samples, n_samples, replace=True)
    return X[idxs], y[idxs]
```

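The `bootstrapping` function builds one bootstrapped dataset: it draws `n_samples` row indices at random with replacement, so the result has the same size as the original data but may repeat some rows and omit others.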
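The sketch below shows what a bootstrap sample looks like in practice. The function `bootstrapping_seeded` is a hypothetical variant introduced here for illustration: it takes a NumPy random generator so the draw is reproducible, and it also returns the sampled indices so they can be inspected. The demo data is made up.

```python
import numpy as np

def bootstrapping_seeded(X, y, rng):
    """Hypothetical variant of bootstrapping() that draws from a supplied Generator."""
    n_samples = X.shape[0]
    idxs = rng.integers(0, n_samples, size=n_samples)  # sampling with replacement
    return X[idxs], y[idxs], idxs

rng = np.random.default_rng(42)            # arbitrary seed
X_demo = np.arange(2000).reshape(1000, 2)  # made-up data: 1000 rows, 2 features
y_demo = np.arange(1000) % 3               # made-up labels in {0, 1, 2}

X_b, y_b, idxs = bootstrapping_seeded(X_demo, y_demo, rng)
unique_fraction = np.unique(idxs).size / len(idxs)

print(X_b.shape, y_b.shape)
print(f"fraction of distinct rows: {unique_fraction:.3f}")
```

For large datasets, a bootstrap sample contains roughly 63% of the distinct original rows on average (the limit is 1 - 1/e); the rest of the sample consists of repeats.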

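```python
# Helper function for bagging prediction
def predict(X, models):
    # Stack predictions into an array of shape (n_models, n_samples)
    predictions = np.array([model.predict(X) for model in models])
    # Majority vote per sample: take the mode across models (axis 0).
    # np.ravel gives shape (n_samples,) whether or not stats.mode keeps the reduced axis.
    predictions = np.ravel(stats.mode(predictions, axis=0)[0])
    return predictions
```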

The `predict` function consolidates predictions from various trained models to deliver the final decision. We use mode (the most frequent prediction) as the final answer.
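To make the vote concrete, here is a small worked example with made-up predictions from three models on four samples. It uses `np.bincount` instead of `stats.mode` purely as an illustration of the same idea; the values are hypothetical.

```python
import numpy as np

# Hypothetical predictions: rows are models, columns are samples (class labels 0-2)
votes = np.array([
    [0, 1, 2, 1],
    [0, 2, 2, 1],
    [1, 1, 2, 0],
])

# For each sample (column), count the votes per class and keep the most common one
majority = np.array([np.bincount(column).argmax() for column in votes.T])
print(majority)  # [0 1 2 1]
```

When two classes tie, `argmax` returns the first maximum, so this version resolves ties in favor of the smaller class label.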

```python
# Create a list to store all the tree models
tree_models = []

# Iteratively train decision trees on bootstrapped samples
for i in range(n_models):
    X_, y_ = bootstrapping(X_train, y_train)
    tree = DecisionTreeClassifier(max_depth=2, random_state=random_states[i])
    tree.fit(X_, y_)
    tree_models.append(tree)

# Predict on the test set
y_pred = predict(X_test, tree_models)

# Print the accuracy
print("Accuracy: ", accuracy_score(y_test, y_pred))
```
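As a cross-check on a hand-written implementation like this one, scikit-learn ships its own `BaggingClassifier`. The sketch below is not part of the lesson's code; it reuses the same split and tree depth, and the `random_state` value is an arbitrary choice.

```python
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# The base model is passed positionally because its keyword name
# differs between scikit-learn releases.
bagger = BaggingClassifier(
    DecisionTreeClassifier(max_depth=2),
    n_estimators=100,
    random_state=42,
)
bagger.fit(X_train, y_train)

sk_accuracy = accuracy_score(y_test, bagger.predict(X_test))
print("Accuracy:", sk_accuracy)
```

The result may differ slightly from the manual version: when the base model exposes class probabilities, `BaggingClassifier` averages them instead of counting hard votes, and its bootstrap draws are not the same.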

Decision trees serve only as an example here; any other classifier can be used as the base model in the same way.
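As a minimal sketch of swapping the base model, the example below bags k-nearest-neighbors classifiers instead of trees. The choice of `KNeighborsClassifier`, `n_neighbors=5`, the seed, and the reduced ensemble size of 25 are all assumptions made for illustration, not part of the lesson.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
n_knn_models = 25               # smaller than the lesson's 100 to keep the sketch quick

knn_models = []
for _ in range(n_knn_models):
    # Bootstrap indices: same size as the training set, drawn with replacement
    idxs = rng.integers(0, len(X_train), size=len(X_train))
    model = KNeighborsClassifier(n_neighbors=5)  # assumed base model
    model.fit(X_train[idxs], y_train[idxs])
    knn_models.append(model)

# Majority vote across models for each test sample
all_preds = np.array([m.predict(X_test) for m in knn_models])
y_pred_knn = np.array([np.bincount(col).argmax() for col in all_preds.T])

knn_accuracy = accuracy_score(y_test, y_pred_knn)
print("Accuracy:", knn_accuracy)
```

Only the line that constructs the model changes; the bootstrapping and voting steps stay the same. Bagging tends to help most with high-variance base models, so the gain from bagging a stable model such as k-nearest neighbors may be small.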

After implementing a model, we must evaluate its performance. In our bagging model, the *accuracy score* serves as a performance metric: the ratio of correct predictions to the total number of predictions. We utilize sklearn's `accuracy_score()` function to calculate this metric and gauge the performance of our model.
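The definition can be checked by hand. The labels below are made up for illustration: 8 of the 10 predictions match, so both the manual ratio and `accuracy_score()` give 0.8.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: positions 3 and 9 are wrong, the other 8 are correct
y_true = np.array([0, 1, 2, 2, 1, 0, 1, 2, 0, 1])
y_hat = np.array([0, 1, 2, 1, 1, 0, 1, 2, 0, 2])

manual_accuracy = np.mean(y_true == y_hat)  # correct predictions / total predictions
library_accuracy = accuracy_score(y_true, y_hat)

print(manual_accuracy)   # 0.8
print(library_accuracy)  # 0.8
```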

Congratulations! You've successfully navigated the basics of bagging with decision trees. You've learned about the fundamentals of bagging, implemented a bagging algorithm using decision trees in Python, and assessed the model's accuracy. Your understanding of these concepts will be further solidified through exercises in the next section. Have fun practicing!