Hey there! In this lesson, we'll explore an exciting machine learning technique called "Stacking." You might wonder why we're learning about stacking and how it helps us make better predictions. Well, imagine asking several experts for their opinions and combining them to make a decision. By the end of this lesson, you'll know how to implement and use stacking to boost model performance!
Let's dive in. Stacking is an ensemble technique that combines multiple models (base models) and uses another model (a meta-model) to produce the final prediction. Think of the base models as chefs and the meta-model as a food critic who tastes all the dishes and decides the final rating.
How Does Stacking Work?
- Training Base Models: We train multiple base models on the same dataset. Each model brings its own strengths and captures different aspects of the data.
- Generating Meta-Data: We use the base models' predictions to build a new dataset (meta-data), composed of the predictions of all base models.
- Training the Meta-Model: We train a meta-model on this meta-data. The meta-model learns how to best combine the base models' predictions into the final prediction (a short sketch of these three steps follows this list).
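To make these three steps concrete, here is a minimal sketch of stacking done by hand; scikit-learn's `StackingClassifier`, which we'll use below, automates essentially this. The synthetic dataset, model settings, and variable names here are purely illustrative, not part of the lesson's pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# A small synthetic dataset just for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Step 1: the base models (the "chefs")
base_models = [
    RandomForestClassifier(n_estimators=20, random_state=42),
    GradientBoostingClassifier(n_estimators=20, random_state=42),
]

# Step 2: meta-data = out-of-fold predictions of every base model
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5, method="predict_proba")
    for model in base_models
])

# Step 3: the meta-model (the "food critic") learns how to combine them
meta_model = LogisticRegression(max_iter=1000)
meta_model.fit(meta_features, y)
```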
Why Stacking?
- Improved Accuracy: Combining different models captures various patterns and reduces errors.
- Reduced Overfitting: Combining multiple models helps balance out their individual biases and variances.
To get hands-on with stacking, we need data. We'll again use the digits dataset from scikit-learn. This dataset contains images of digits, and the task is to predict which digit each image represents.
```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Load the digits dataset
X, y = load_digits(return_X_y=True)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
We define our base models (chefs) and meta-model (food critic). The base models do the heavy lifting, while the meta-model combines their predictions into the final output. We use `RandomForestClassifier` and `GradientBoostingClassifier` as base models and `LogisticRegression` as our meta-model:
```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

# Define the base models and the meta-model
estimators = [
    ('rf', RandomForestClassifier(n_estimators=20, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=20, random_state=42))
]
stack_clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack_clf.fit(X_train, y_train)
```
The `estimators` list holds tuples of each base model's name and the model itself. `StackingClassifier` combines these estimators with the `final_estimator` (the meta-model).
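If you're curious what the fitted ensemble contains, the trained base models and the meta-model are exposed as attributes after `fit`; this is just a quick illustrative check:

```python
# Inspect the fitted pieces of the stacking ensemble
print(stack_clf.named_estimators_['rf'])  # the fitted RandomForestClassifier
print(stack_clf.final_estimator_)         # the fitted LogisticRegression meta-model
```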
For the base models, avoid using very similar models or models with nearly identical hyperparameter settings, as this is unlikely to yield a significant performance gain. An example with base models from different families is sketched below.
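For instance, you might mix in base models from entirely different families. This is only a sketch: the extra estimators (`SVC`, `KNeighborsClassifier`) and their settings are illustrative choices, not part of the lesson's pipeline.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Base models from different families tend to make different kinds of errors,
# giving the meta-model more complementary signals to combine.
diverse_estimators = [
    ('rf', RandomForestClassifier(n_estimators=20, random_state=42)),
    ('svc', SVC(probability=True, random_state=42)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
]
diverse_stack = StackingClassifier(
    estimators=diverse_estimators,
    final_estimator=LogisticRegression()
)
diverse_stack.fit(X_train, y_train)  # reuses the train split from above
```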
After training, we can evaluate our model's performance by calculating the accuracy on the testing data:
```python
from sklearn.metrics import accuracy_score

# Predict and calculate accuracy
y_pred = stack_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Stacking Classifier Accuracy: {accuracy:.2f}")
# Stacking Classifier Accuracy: 0.97
```
Finally, let's see how our stacking classifier stacks up against a single `GradientBoostingClassifier`:
```python
# Train a standalone GradientBoostingClassifier
gb_clf = GradientBoostingClassifier(n_estimators=50, random_state=42)
gb_clf.fit(X_train, y_train)

# Predict and calculate accuracy for the GradientBoostingClassifier
y_pred_gb = gb_clf.predict(X_test)
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print(f"GradientBoosting Classifier Accuracy: {accuracy_gb:.2f}")
# GradientBoosting Classifier Accuracy: 0.96
```
We see that the performance of the `GradientBoostingClassifier` is comparable to that of our new `StackingClassifier`. But how do we choose between them in this case? Well, that is what the next course of this path is all about!
However, here is a short answer:
- Firstly, we will tune the models' hyperparameters; perhaps one of the models will outperform the other once tuned.
- Secondly, we will use a better validation technique than a simple `train_test_split` (see the cross-validation sketch after this list).
- Finally, if the two models still perform at the same level, we can consider other factors, such as training time and interpretability.
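As a quick taste of that second point, here is a minimal sketch of comparing both models with 5-fold cross-validation instead of a single split; the fold count is an illustrative choice:

```python
from sklearn.model_selection import cross_val_score

# Score both models on the same 5 folds instead of one train/test split
stack_scores = cross_val_score(stack_clf, X, y, cv=5)
gb_scores = cross_val_score(gb_clf, X, y, cv=5)

print(f"Stacking:         {stack_scores.mean():.3f} +/- {stack_scores.std():.3f}")
print(f"GradientBoosting: {gb_scores.mean():.3f} +/- {gb_scores.std():.3f}")
```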
Great job! Let's recap:
- Stacking: Combines multiple models (base models) using another model (meta-model) for a final prediction.
- How It Works: Train base models, generate meta-data, and train a meta-model.
- Loading and Splitting Data: Used the digits dataset and split it into training and testing sets.
- Model Setup: Defined base models (`RandomForestClassifier` and `GradientBoostingClassifier`) and a meta-model (`LogisticRegression`).
- Training and Evaluating: Implemented and trained the `StackingClassifier` and calculated its accuracy.
- Comparison: Compared the accuracy of the stacking classifier to that of a single `GradientBoostingClassifier`.
Awesome, you made it! Now it's time to practice. In the upcoming session, you'll build, train, and evaluate stacking models. Let's get started!