Welcome! In today's lesson, we'll explore Boosting, focusing on AdaBoost. By the end, you'll understand how AdaBoost works and how to use it to improve your machine learning models.
Boosting increases model accuracy by combining weak models. Think of a group of not-so-great basketball players; individually, they may not win, but together they can be strong.
AdaBoost (Adaptive Boosting) combines several weak classifiers into a strong one. A weak classifier is one that performs only slightly better than random guessing. AdaBoost focuses on correcting the errors made by previous classifiers. Here's how it works: it trains weak classifiers one at a time, increases the weights of the samples the current classifier got wrong so the next classifier pays more attention to them, and finally combines all classifiers with a weighted vote. A sketch of the reweighting step is shown below.
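To make the reweighting idea concrete, here is a minimal sketch on made-up numbers: a hypothetical weak classifier on five samples, assuming binary labels (where the multi-class SAMME formula reduces to classic AdaBoost).

```python
import numpy as np

# Toy illustration of AdaBoost's reweighting step (not the full algorithm):
# five samples, all starting with equal weight.
weights = np.ones(5) / 5
y_true = np.array([1, 1, 0, 0, 1])
y_weak = np.array([1, 0, 0, 0, 1])  # this weak classifier gets sample index 1 wrong

# Weighted error of the weak classifier
err = np.sum(weights[y_weak != y_true])

# Classifier weight: with 2 classes, SAMME reduces to the classic AdaBoost formula
alpha = np.log((1 - err) / err)

# Misclassified samples get boosted weights; then all weights are renormalized
weights[y_weak != y_true] *= np.exp(alpha)
weights /= weights.sum()

print(weights)  # the misclassified sample now carries more weight
```

After normalization, the misclassified sample carries four times the weight of each correctly classified one, so the next weak classifier is pushed to get it right.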
Before training our model, we need data. We'll use the wine dataset, which contains chemical properties of wines. This data helps us train and test our model.
To load the dataset, use load_wine from sklearn.datasets, which returns features X and labels y. Features describe the properties of each wine, while labels indicate its type.
```python
from sklearn.datasets import load_wine

# Load dataset
X, y = load_wine(return_X_y=True)
```
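As an optional sanity check, you can confirm what load_wine returned; the wine dataset contains 178 samples with 13 numeric features each, spread over three classes:

```python
import numpy as np

# Quick sanity check on the loaded data
print(X.shape)       # (178, 13): 178 wines, 13 chemical features
print(np.unique(y))  # [0 1 2]: three wine classes
```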
Next, we split the data into training and testing sets using train_test_split from sklearn.model_selection. We use 80% for training and 20% for testing.
```python
from sklearn.model_selection import train_test_split

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
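Optionally, you can verify the split sizes; with 178 samples, an 80/20 split yields 142 training and 36 test rows, since sklearn rounds the test share up:

```python
# The 80/20 split of 178 samples gives 142 for training and 36 for testing
print(X_train.shape, X_test.shape)  # (142, 13) (36, 13)
```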
Now, let's train our AdaBoost model using AdaBoostClassifier from sklearn.ensemble. We'll use DecisionTreeClassifier from sklearn.tree as the weak classifier. In this case, each decision tree will have just one decision node (a single split), known as a decision stump, which we enforce with max_depth=1.
```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Train AdaBoost classifier built from decision stumps
ada_clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=100,
    algorithm='SAMME'
)
ada_clf.fit(X_train, y_train)

# Make predictions
y_pred_ada = ada_clf.predict(X_test)
```
In the code:

- estimator=DecisionTreeClassifier(max_depth=1) specifies the weak classifier.
- n_estimators=100 combines 100 weak classifiers.
- algorithm='SAMME' specifies which boosting algorithm to use. There is essentially only one option, 'SAMME'. If it is not set, the program will use another algorithm, 'SAMME.R'. However, that algorithm is deprecated and will be removed in future versions of sklearn, so you shouldn't use it. Always specify algorithm='SAMME' when using the AdaBoost classifier.
- fit(X_train, y_train) trains the model.
- predict(X_test) makes predictions on the test set.

You can also peek inside the fitted ensemble to see these pieces, as shown below.
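The attributes estimators_ and estimator_weights_ are standard on a fitted AdaBoostClassifier: the first holds the trained weak classifiers, the second their voting weights.

```python
import numpy as np

# Inspect the fitted ensemble: each weak classifier gets a voting weight,
# and more accurate classifiers get a bigger say in the final prediction.
print(len(ada_clf.estimators_))                     # up to 100 stumps (fitting can stop early)
print(np.round(ada_clf.estimator_weights_[:5], 3))  # weights of the first five stumps
```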
To understand the effectiveness of AdaBoost, let's compare it with RandomForestClassifier from sklearn.ensemble.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train RandomForest classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Make predictions with RandomForest
y_pred_rf = rf_clf.predict(X_test)

# Calculate and compare accuracies
accuracy_ada = accuracy_score(y_test, y_pred_ada)
accuracy_rf = accuracy_score(y_test, y_pred_rf)

print(f"AdaBoost accuracy: {accuracy_ada}")      # 0.94
print(f"RandomForest accuracy: {accuracy_rf}")   # 1.0
```
In this code, we initialize the RandomForestClassifier with 100 trees, which we already know performs perfectly on this dataset. Then we make predictions with the Random Forest and compare the accuracies of the Random Forest and AdaBoost models. In this case, AdaBoost shows slightly lower performance, but its accuracy is still very high and it outperforms simple models, as the quick baseline below illustrates.
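As a sketch of that last claim, you can train the weak learner on its own, reusing the train/test split from above; a single decision stump typically scores far below the boosted ensemble of 100 stumps.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Baseline: one decision stump by itself, on the same train/test split
stump = DecisionTreeClassifier(max_depth=1, random_state=42)
stump.fit(X_train, y_train)
print(f"Single stump accuracy: {accuracy_score(y_test, stump.predict(X_test))}")
```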
Great job! You've learned about Boosting and how AdaBoost uses weak classifiers to create a strong model. We covered:

- What Boosting and AdaBoost are.
- Training an AdaBoost classifier with decision trees.
- Comparing AdaBoost and RandomForest.

Next, you'll practice by loading data, splitting it, and training your own AdaBoost classifier. Ready to boost your skills? Let's dive in!