Hey there! ๐ In this lesson, we'll explore an exciting machine learning technique called "Stacking." You might wonder why weโre learning about stacking and how it helps us make better predictions. Well, imagine asking several experts for their opinions and combining them to make a decision. By the end of this lesson, you'll know how to implement and use stacking to boost model performance!
Let's dive into stacking. Stacking is an ensemble technique combining multiple models (base models) to produce a final prediction using another model (meta-model). Think of base models as chefs, and the meta-model as a food critic who tastes all the dishes and decides the final rating.
How Does Stacking Work?
Why stacking?
To get hands-on with stacking, we need data. We'll again use the digit dataset from scikit-learn
. This dataset contains images of digits used to predict what digit each image represents.
Python1from sklearn.datasets import load_digits 2from sklearn.model_selection import train_test_split 3 4# Load digit dataset 5X, y = load_digits(return_X_y=True) 6 7# Splitting the dataset 8X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
We define our base models (chefs) and meta-model (food critic). Base models do the heavy lifting, while the meta-model combines predictions for the final output. We use RandomForestClassifier
and GradientBoostingClassifier
as base models and LogisticRegression
as our meta-model:
Python1from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier 2from sklearn.linear_model import LogisticRegression 3from sklearn.ensemble import StackingClassifier 4 5# Defining the base and meta models 6estimators = [ 7 ('rf', RandomForestClassifier(n_estimators=20, random_state=42)), 8 ('gb', GradientBoostingClassifier(n_estimators=20, random_state=42)) 9] 10stack_clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression()) 11stack_clf.fit(X_train, y_train)
The estimators
list has tuples with each base model's name and the actual model. StackingClassifier
combines these estimators with the final_estimator
(meta-model).
For base models, avoid utilizing comparable models or models with similar hyperparameter settings, as this may not result in a significant performance gain.
After training, we can evaluate our model's performance by calculating the accuracy on the testing data:
Python1from sklearn.metrics import accuracy_score 2 3# Predict and calculate accuracy 4y_pred = stack_clf.predict(X_test) 5accuracy = accuracy_score(y_test, y_pred) 6print(f"Stacking Classifier Accuracy: {accuracy:.2f}") 7# Stacking Classifier Accuracy: 0.97
Finally, let's see how our stacking classifier stacks up against a single GradientBoostingClassifier
:
Python1# Training GradientBoostingClassifier 2gb_clf = GradientBoostingClassifier(n_estimators=50, random_state=42) 3gb_clf.fit(X_train, y_train) 4 5# Predict and calculate accuracy for GradientBoostingClassifier 6y_pred_gb = gb_clf.predict(X_test) 7accuracy_gb = accuracy_score(y_test, y_pred_gb) 8print(f"GradientBoosting Classifier Accuracy: {accuracy_gb:.2f}") 9# GradientBoosting Classifier Accuracy: 0.96
We see that the performance of the GradientBoostingClassifier is comparable to the performance of our new StackingMClassifier. But how do we choose one in this case? Well, this is what the next course of this path is all about!
However, let's answer here shortly:
train_test_split
.Great job! ๐ Let's recap:
RandomForestClassifier
and GradientBoostingClassifier
) and a meta-model (LogisticRegression
).StackingClassifier
and calculated its accuracy.GradientBoostingClassifier
.Awesome, you made it! Now it's time to practice. In the upcoming session, youโll build, train, and evaluate stacking models. Let's get started! ๐