Lesson 5
Stacking in Machine Learning
Lesson Introduction

Hey there! 😊 In this lesson, we'll explore an exciting machine learning technique called "Stacking." You might wonder why we're learning about stacking and how it helps us make better predictions. Well, imagine asking several experts for their opinions and combining them to make a decision. By the end of this lesson, you'll know how to implement and use stacking to boost model performance!

Introduction to Stacking

Let's dive into stacking. Stacking is an ensemble technique that combines the predictions of multiple models (base models) through another model (the meta-model) to produce a final prediction. Think of the base models as chefs, and the meta-model as a food critic who tastes all the dishes and decides the final rating.

How Does Stacking Work?

  1. Training Base Models: We train multiple base models on the same dataset. Each model brings its own strengths and captures different aspects of the data.
  2. Generating Meta-Data: We collect the base models' predictions into a new dataset (the meta-data), where each row holds the predictions of all base models for one sample (a minimal sketch follows this list).
  3. Training the Meta-Model: We train a meta-model on this meta-data. It learns how to best combine the base models' predictions into the final prediction.
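
To make these three steps concrete, here is a minimal manual-stacking sketch. It is not the lesson's implementation (we will use scikit-learn's ready-made StackingClassifier below), and names like base_models and meta_model are purely illustrative; it uses cross_val_predict so the meta-data consists of out-of-fold predictions.

Python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

# The digits dataset used throughout this lesson
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 1: the base models
base_models = [
    RandomForestClassifier(n_estimators=20, random_state=42),
    GradientBoostingClassifier(n_estimators=20, random_state=42),
]

# Step 2: out-of-fold class probabilities from each base model form the meta-data
meta_features = np.column_stack([
    cross_val_predict(model, X_train, y_train, cv=5, method="predict_proba")
    for model in base_models
])

# Step 3: the meta-model learns how to combine the base models' predictions
meta_model = LogisticRegression(max_iter=1000)
meta_model.fit(meta_features, y_train)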

Why stacking?

  • Improved Accuracy: Combining different models captures various patterns and reduces errors.
  • Reduced Overfitting: Combining several models helps balance out their individual biases and variances.
Loading and Splitting the Dataset

To get hands-on with stacking, we need data. We'll again use the digits dataset from scikit-learn, which contains 8x8 images of handwritten digits; the task is to predict which digit each image represents.

Python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Load digit dataset
X, y = load_digits(return_X_y=True)

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
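
As a quick sanity check (not part of the original lesson code), you can print the shapes of the resulting splits:

Python
# Roughly 80% of the 1797 samples go to training, 20% to testing
print(X_train.shape, X_test.shape)
# Expected output: (1437, 64) (360, 64)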
Defining Base and Meta Models

We define our base models (chefs) and meta-model (food critic). Base models do the heavy lifting, while the meta-model combines predictions for the final output. We use RandomForestClassifier and GradientBoostingClassifier as base models and LogisticRegression as our meta-model:

Python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier

# Defining the base and meta models
estimators = [
    ('rf', RandomForestClassifier(n_estimators=20, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=20, random_state=42))
]
stack_clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack_clf.fit(X_train, y_train)

The estimators list contains tuples of each base model's name and the model itself. StackingClassifier combines these estimators using the final_estimator (the meta-model).
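
Under the hood, StackingClassifier builds the meta-data from cross-validated predictions of the base models (5 folds by default). If you want to adjust that behavior, here is a possible variation, shown only as a sketch, that exposes a few commonly used parameters:

Python
# A variation of the stacking classifier with a few explicit settings
stack_clf_tuned = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,               # folds used to generate the meta-features
    passthrough=True,   # also feed the original features to the meta-model
    n_jobs=-1           # train the base models in parallel
)
stack_clf_tuned.fit(X_train, y_train)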

For the base models, avoid using nearly identical models or models with very similar hyperparameter settings: if their predictions are highly correlated, stacking them is unlikely to bring a significant performance gain.
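
As an illustration only, a more diverse lineup could mix tree ensembles with margin- and distance-based learners (this hypothetical set reuses the imports from above):

Python
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical, more diverse set of base models
diverse_estimators = [
    ('rf', RandomForestClassifier(n_estimators=20, random_state=42)),
    ('svc', SVC(probability=True, random_state=42)),
    ('knn', KNeighborsClassifier(n_neighbors=5))
]
diverse_stack = StackingClassifier(
    estimators=diverse_estimators,
    final_estimator=LogisticRegression(max_iter=1000)
)
diverse_stack.fit(X_train, y_train)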

Calculating Accuracy on Testing Data

After training, we can evaluate our model's performance by calculating the accuracy on the testing data:

Python
from sklearn.metrics import accuracy_score

# Predict and calculate accuracy
y_pred = stack_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Stacking Classifier Accuracy: {accuracy:.2f}")
# Stacking Classifier Accuracy: 0.97
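
If you want a per-digit breakdown rather than a single number, scikit-learn's classification_report is a handy optional follow-up (not part of the original lesson code):

Python
from sklearn.metrics import classification_report

# Precision, recall, and F1 score for each digit class
print(classification_report(y_test, y_pred))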
Comparison with GradientBoostingClassifier

Finally, let's see how our stacking classifier stacks up against a single GradientBoostingClassifier:

Python
# Training GradientBoostingClassifier
gb_clf = GradientBoostingClassifier(n_estimators=50, random_state=42)
gb_clf.fit(X_train, y_train)

# Predict and calculate accuracy for GradientBoostingClassifier
y_pred_gb = gb_clf.predict(X_test)
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print(f"GradientBoosting Classifier Accuracy: {accuracy_gb:.2f}")
# GradientBoosting Classifier Accuracy: 0.96

We see that the performance of the GradientBoostingClassifier is comparable to that of our new StackingClassifier. So how do we choose between them in this case? Well, that is what the next course of this path is all about!

However, here is a short answer:

  • Firstly, we can tune each model's hyperparameters; one model may then clearly outperform the other.
  • Secondly, we can use a more robust validation technique than a simple train_test_split, such as cross-validation (see the sketch after this list).
  • Finally, if the two models still perform at the same level, we can consider other factors, such as training time and interpretability.
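
As a small preview of the second point, here is a minimal sketch (not from the original lesson) that scores both models with 5-fold cross-validation on the full dataset instead of a single train/test split:

Python
from sklearn.model_selection import cross_val_score

# Compare both models with 5-fold cross-validation
stack_scores = cross_val_score(stack_clf, X, y, cv=5)
gb_scores = cross_val_score(gb_clf, X, y, cv=5)
print(f"Stacking CV accuracy: {stack_scores.mean():.2f}")
print(f"GradientBoosting CV accuracy: {gb_scores.mean():.2f}")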
Lesson Summary

Great job! 🎉 Let's recap:

  • Stacking: Combines multiple models (base models) using another model (meta-model) for a final prediction.
  • How It Works: Includes training base models, generating meta-data, and training a meta-model.
  • Loading and Splitting Data: Used the digit dataset and split it into training and testing sets.
  • Model Setup: Defined base models (RandomForestClassifier and GradientBoostingClassifier) and a meta-model (LogisticRegression).
  • Training and Evaluating: Implemented and trained the StackingClassifier and calculated its accuracy.
  • Comparison: Compared the accuracy of the stacking classifier to that of a single GradientBoostingClassifier.

Awesome, you made it! Now it's time to practice. In the upcoming session, you'll build, train, and evaluate stacking models. Let's get started! 🚀
