Welcome! In today's lesson, we'll explore Boosting, focusing on AdaBoost. Boosting improves model accuracy by combining weak models. By the end, you'll understand AdaBoost and how to use it to improve your machine learning models.
Boosting increases model accuracy by combining weak models. Think of a group of not-so-great basketball players; individually, they may not win, but together they can be strong.
AdaBoost (Adaptive Boosting) combines several weak classifiers into a strong one. A weak classifier is one that performs only slightly better than random guessing. AdaBoost focuses on correcting the errors made by previous classifiers. Here's how it works (a small code sketch of these steps follows the list):
- Initialize Weights: Assign equal weights to all training samples.
- Train Weak Classifier: Train a weak classifier on the weighted data.
- Calculate Error: Compute the classification error of the weak classifier.
- Update Weights: Increase the weights of misclassified samples and decrease the weights of correctly classified samples. This ensures that subsequent classifiers focus more on the difficult samples.
- Combine Classifiers: Combine all the weak classifiers to form a strong classifier, with each classifier's vote weighted according to its accuracy.
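To make these steps concrete, here is a minimal, illustrative sketch of the AdaBoost loop for a binary problem with labels encoded as +1/-1. The function name `adaboost_sketch` and the structure are our own simplification, not part of the lesson's code; scikit-learn's `AdaBoostClassifier`, which we use later, handles all of this (including multi-class problems) for you.

```python
# A minimal sketch of the AdaBoost loop, assuming binary labels encoded as +1/-1
# and one-split decision trees (stumps) as the weak classifiers.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_sketch(X, y, n_rounds=10):
    n_samples = X.shape[0]
    sample_weights = np.full(n_samples, 1 / n_samples)  # 1. initialize equal weights
    stumps, alphas = [], []

    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=sample_weights)    # 2. train a weak classifier
        pred = stump.predict(X)

        err = np.sum(sample_weights[pred != y])          # 3. weighted classification error
        err = np.clip(err, 1e-10, 1 - 1e-10)             # keep the log below well-defined
        alpha = 0.5 * np.log((1 - err) / err)            # this classifier's vote weight

        sample_weights *= np.exp(-alpha * y * pred)      # 4. boost weights of misclassified samples
        sample_weights /= sample_weights.sum()

        stumps.append(stump)
        alphas.append(alpha)

    def predict(X_new):
        # 5. combine classifiers: weighted vote, the sign gives the predicted class
        scores = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
        return np.sign(scores)

    return predict
```

In practice you won't implement this loop yourself; we include it only so you can see where each of the five steps lives in the algorithm.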
Before training our model, we need data. We'll use the wine dataset, which contains chemical properties of wines. This data helps us train and test our model.
To load the dataset, use `load_wine` from `sklearn.datasets`, which returns features `X` and labels `y`. Features describe the wines' chemical properties, while labels indicate the type of wine.
```python
from sklearn.datasets import load_wine

# Load dataset
X, y = load_wine(return_X_y=True)
```
Next, we split the data into training and testing sets using `train_test_split` from `sklearn.model_selection`. We use 80% of the data for training and 20% for testing.
```python
from sklearn.model_selection import train_test_split

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
Now, let's train our AdaBoost model using `AdaBoostClassifier` from `sklearn.ensemble`, with `DecisionTreeClassifier` from `sklearn.tree` as the base classifier. Note that a `DecisionTreeClassifier` with no depth limit, as used here, is a fairly strong learner on its own; if you want each tree to be a true weak learner (a one-split decision stump), pass `DecisionTreeClassifier(max_depth=1)`, which is also what `AdaBoostClassifier` uses by default when no estimator is given.
```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Train AdaBoost classifier
ada_clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, algorithm='SAMME')
ada_clf.fit(X_train, y_train)

# Make predictions
y_pred_ada = ada_clf.predict(X_test)
```
In the code:
- `estimator=DecisionTreeClassifier()` specifies the base classifier.
- `n_estimators=100` combines 100 classifiers.
- `algorithm='SAMME'` specifies which boosting algorithm to use. There is essentially only one option, `'SAMME'`. If it is not set, an older algorithm called `'SAMME.R'` is used by default; however, `'SAMME.R'` is deprecated and will be removed in future versions of sklearn, so you shouldn't rely on it. Always specify `algorithm='SAMME'` when using the AdaBoost classifier.
- `fit(X_train, y_train)` trains the model.
- `predict(X_test)` makes predictions on the test set.
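If you'd like to see the "weighted vote" idea from earlier in action, the fitted `ada_clf` exposes the individual classifiers it trained along with their vote weights and errors. This optional inspection step is our own addition, not part of the lesson's main flow:

```python
import numpy as np

# Each classifier in the ensemble gets a vote weight;
# classifiers with lower weighted error vote more strongly.
print(len(ada_clf.estimators_))                      # number of classifiers actually fitted
print(np.round(ada_clf.estimator_weights_[:5], 3))   # vote weights of the first five
print(np.round(ada_clf.estimator_errors_[:5], 3))    # their weighted training errors
```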
To understand the effectiveness of AdaBoost, let's compare it with `RandomForestClassifier` from `sklearn.ensemble`.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train RandomForest classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Make predictions with RandomForest
y_pred_rf = rf_clf.predict(X_test)

# Calculate and compare accuracies
accuracy_ada = accuracy_score(y_test, y_pred_ada)
accuracy_rf = accuracy_score(y_test, y_pred_rf)

print(f"AdaBoost accuracy: {accuracy_ada}")  # 0.94
print(f"RandomForest accuracy: {accuracy_rf}")  # 1.0
```
In this code, we initialize the `RandomForestClassifier` with 100 trees, which we already know performs perfectly on this dataset. Then we make predictions with the random forest and compare the accuracies of the Random Forest and AdaBoost models.
In this case, AdaBoost shows slightly lower performance, but its accuracy is still very high and outperforms simple models.
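As an optional extra, `AdaBoostClassifier` also provides `staged_predict`, which yields the ensemble's predictions after each boosting round. This lets you watch boosting's core idea at work: accuracy typically improves as more classifiers are combined. A quick sketch (the exact numbers depend on your data split and estimator settings):

```python
from sklearn.metrics import accuracy_score

# Test accuracy after each boosting round (1, 2, ..., up to n_estimators)
staged_accuracies = [
    accuracy_score(y_test, y_pred)
    for y_pred in ada_clf.staged_predict(X_test)
]
print(f"After the first round: {staged_accuracies[0]:.2f}")
print(f"After the final round: {staged_accuracies[-1]:.2f}")
```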
Great job! You've learned about Boosting and how AdaBoost uses weak classifiers to create a strong model. We covered:
- What Boosting and AdaBoost are.
- Loading the wine dataset.
- Splitting the dataset.
- Training an AdaBoost classifier with decision trees.
- Comparing the accuracies of AdaBoost and RandomForest.
Next, you'll practice by loading data, splitting it, and training your own AdaBoost classifier. Ready to boost your skills? Let's dive in!