Lesson 3

Boosting with AdaBoost in Machine Learning

Lesson Introduction

Welcome! In today's lesson, we'll explore Boosting, focusing on AdaBoost. Boosting improves model accuracy by combining weak models. By the end, you'll understand AdaBoost and how to use it to improve your machine learning models.

Introduction to Boosting and AdaBoost

Boosting increases model accuracy by combining weak models. Think of a group of not-so-great basketball players; individually, they may not win, but together they can be strong.

AdaBoost (Adaptive Boosting) combines several weak classifiers into a strong one. A weak classifier is one that performs only slightly better than random guessing. AdaBoost focuses on correcting the errors made by previous classifiers. Here's how it works (a minimal code sketch of these steps follows the list):

  1. Initialize Weights: Assign equal weights to all training samples.
  2. Train Weak Classifier: Train a weak classifier on the weighted data.
  3. Calculate Error: Compute the classification error of the weak classifier.
  4. Update Weights: Increase the weights of misclassified samples and decrease the weights of correctly classified samples. This ensures that subsequent classifiers focus more on the difficult samples.
  5. Combine Classifiers: Combine all the weak classifiers to form a strong classifier, with each classifier's vote weighted according to its accuracy.
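
To make these steps concrete, here is a minimal, self-contained sketch of the weight-update loop for the binary case. It assumes labels of +1/-1 and uses one-split decision stumps as the weak classifiers; scikit-learn's AdaBoostClassifier, which we use later in this lesson, implements the multi-class SAMME variant of the same idea.

Python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_sketch(X, y, n_rounds=10):
    # Sketch of binary AdaBoost; y is assumed to contain +1/-1 labels
    y = np.asarray(y)
    n_samples = len(X)
    weights = np.full(n_samples, 1.0 / n_samples)       # Step 1: equal weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)     # Step 2: weak classifier
        stump.fit(X, y, sample_weight=weights)
        pred = stump.predict(X)
        err = weights[pred != y].sum() / weights.sum()  # Step 3: weighted error
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10)) # more accurate -> bigger vote
        weights *= np.exp(-alpha * y * pred)            # Step 4: raise weights of misclassified
        weights /= weights.sum()                        #         samples, lower the rest
        stumps.append(stump)
        alphas.append(alpha)

    def predict(X_new):                                 # Step 5: weighted vote of all learners
        votes = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
        return np.sign(votes)

    return predict
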
Loading and Splitting the Dataset

Before training our model, we need data. We'll use the wine dataset, which contains chemical properties of wines. This data helps us train and test our model.

To load the dataset, use load_wine from sklearn.datasets, which returns features X and labels y. Features describe the properties, while labels indicate the type of wine.

Python
from sklearn.datasets import load_wine

# Load dataset
X, y = load_wine(return_X_y=True)

Next, we split the data into training and testing sets using train_test_split from sklearn.model_selection. We use 80% for training and 20% for testing.

Python
from sklearn.model_selection import train_test_split

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Training an AdaBoost Classifier

Now, let's train our AdaBoost model using AdaBoostClassifier from sklearn.ensemble. We'll use DecisionTreeClassifier from sklearn.tree as the weak classifier. Note that a DecisionTreeClassifier with default settings grows a full tree; to make each learner a truly weak one-split "stump", you could additionally pass max_depth=1.

Python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Train AdaBoost classifier
ada_clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, algorithm='SAMME')
ada_clf.fit(X_train, y_train)

# Make predictions
y_pred_ada = ada_clf.predict(X_test)

In the code:

  • estimator=DecisionTreeClassifier() specifies the weak classifier (in older versions of sklearn this parameter was named base_estimator).
  • n_estimators=100 combines 100 weak classifiers.
  • algorithm='SAMME' specifies which boosting algorithm to use. 'SAMME' is effectively the only supported option: the older default, 'SAMME.R', is deprecated and is being removed in newer versions of sklearn, so you shouldn't rely on it. Always specify algorithm='SAMME' when using the AdaBoost classifier.
  • fit(X_train, y_train) trains the model.
  • predict(X_test) makes predictions on the test set.
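
As an optional check of the "weighted vote" idea from step 5, the fitted model exposes each weak classifier's vote weight and its weighted training error (these are attributes of scikit-learn's fitted AdaBoostClassifier):

Python
# Each weak classifier's say in the final vote, and its weighted training error
print(ada_clf.estimator_weights_[:5])
print(ada_clf.estimator_errors_[:5])
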
Comparing AdaBoost with RandomForest

To understand the effectiveness of AdaBoost, let’s compare it with RandomForestClassifier from sklearn.ensemble.

Python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train RandomForest classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Make predictions with RandomForest
y_pred_rf = rf_clf.predict(X_test)

# Calculate and compare accuracies
accuracy_ada = accuracy_score(y_test, y_pred_ada)
accuracy_rf = accuracy_score(y_test, y_pred_rf)

print(f"AdaBoost accuracy: {accuracy_ada}")  # 0.94
print(f"RandomForest accuracy: {accuracy_rf}")  # 1.0

In this code, we initialize the RandomForestClassifier with 100 trees, which we already know performs perfectly on this dataset. We then make predictions with the random forest and compare the accuracies of the random forest and AdaBoost models.

In this case, AdaBoost shows slightly lower performance, but its accuracy is still very high, and it outperforms simple single-estimator models.
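
To see that comparison with a truly simple model yourself, you could score a single decision stump on the same split; here is a minimal sketch (the exact number will depend on your scikit-learn version and the split):

Python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# A single one-split tree as a "simple model" baseline
stump = DecisionTreeClassifier(max_depth=1, random_state=42)
stump.fit(X_train, y_train)
print(f"Single stump accuracy: {accuracy_score(y_test, stump.predict(X_test))}")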

Lesson Summary

Great job! You've learned about Boosting and how AdaBoost uses weak classifiers to create a strong model. We covered:

  • What Boosting and AdaBoost are.
  • Loading the wine dataset.
  • Splitting the dataset.
  • Training an AdaBoost classifier with decision trees.
  • Comparing the accuracies of AdaBoost and RandomForest.

Next, you'll practice by loading data, splitting it, and training your own AdaBoost classifier. Ready to boost your skills? Let's dive in!

Enjoy this lesson? Now it's time to practice with Cosmo!

Practice is how you turn knowledge into actual skills.