Welcome! Today, we'll explore "Hyperparameter Tuning for Ensembles." It might sound complex, but we'll break it down step by step.
In machine learning, models learn from data and make predictions. Ensembles combine the predictions of multiple models to improve accuracy. However, tuning these models' settings (hyperparameters) is key to getting the best performance.
By the end of this lesson, you'll understand:
- What ensemble methods are.
- How to apply grid search (via `GridSearchCV`) to tune hyperparameters for ensemble models, specifically using the `AdaBoost` algorithm with a `DecisionTreeClassifier` as the base estimator.
Before diving into hyperparameter tuning, let's recall what ensemble methods are.
Ensemble methods use multiple models (base estimators) to make predictions. Think of them as a team of weather forecasters: each forecaster gives their prediction, and then you combine all their predictions to get a more accurate forecast. Using ensemble methods improves a model's performance and adds robustness.
One popular ensemble method is AdaBoost (Adaptive Boosting), which improves performance by combining multiple weak classifiers trained in sequence: each new classifier focuses on the examples the previous ones got wrong.
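To make this concrete, here is a minimal sketch comparing a single weak learner (a depth-1 decision tree, or "stump") against an AdaBoost ensemble of such stumps. The synthetic dataset from `make_classification` is our choice for illustration only; the lesson itself uses the wine dataset below.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustration only)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

# A single weak learner: a decision "stump" (depth-1 tree)
stump = DecisionTreeClassifier(max_depth=1)

# An ensemble of 100 stumps; each new stump focuses on the
# examples the previous ones misclassified
ensemble = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1), n_estimators=100)

print("Stump accuracy:   ", cross_val_score(stump, X_demo, y_demo, cv=5).mean())
print("Ensemble accuracy:", cross_val_score(ensemble, X_demo, y_demo, cv=5).mean())
```

You should typically see the ensemble score noticeably higher than the single stump on its own.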
Now, let's get hands-on by setting up our dataset. We'll use the wine dataset from Scikit-Learn. This dataset contains information about different types of wines.
We need to split our dataset into training and test sets to train our model and evaluate its performance.
```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Load the wine dataset
X, y = load_wine(return_X_y=True)
print(X.shape, y.shape)  # Output: (178, 13) (178,)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # Output: (142, 13) (36, 13)
```
Now, we need to define the hyperparameters we want the grid search to tune. This collection of candidate values is called the parameter grid.
For `AdaBoost`, we can tune:
- `n_estimators`: The number of boosting stages (weak learners) to train.
- `learning_rate`: The weight applied to each classifier's contribution; smaller values make each model's influence more gradual.
- `estimator__max_depth`: The maximum depth of each tree when using a `DecisionTreeClassifier` as the base estimator. The double underscore tells scikit-learn to route `max_depth` to the nested base estimator rather than to `AdaBoost` itself.
```python
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 1],
    'estimator__max_depth': [1, 3, 5]
}
```
This grid defines 3 × 3 × 3 = 27 combinations for the search to test, helping us find the best one.
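If you're curious how large the search really is, scikit-learn's `ParameterGrid` lets you enumerate the combinations explicitly. This is just a sanity check on the `param_grid` defined above, not a required step:

```python
from sklearn.model_selection import ParameterGrid

# Enumerate every combination the grid search will try
combinations = list(ParameterGrid(param_grid))
print(len(combinations))  # 27 (3 values * 3 values * 3 values)
print(combinations[0])    # one concrete combination of the three hyperparameters
```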
Next, we need to choose our base estimator. For this lesson, let's use a `DecisionTreeClassifier`.
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

# Initialize the base estimator
base_estimator = DecisionTreeClassifier()

# Initialize the AdaBoost classifier using the base estimator
ada_clf = AdaBoostClassifier(estimator=base_estimator)
```
By setting `estimator=base_estimator`, we are telling `AdaBoost` to use the decision tree as its base estimator.
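If you are ever unsure what a nested hyperparameter is called, every scikit-learn estimator exposes `get_params()`. Listing the parameter names of the `ada_clf` we just created shows that base-estimator settings carry the `estimator__` prefix, which is exactly the naming our parameter grid relies on:

```python
# List all tunable parameter names; settings of the nested base
# estimator appear with the "estimator__" prefix
for name in sorted(ada_clf.get_params()):
    print(name)
# The output includes, among others:
#   estimator__max_depth, learning_rate, n_estimators
```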
Now comes the exciting part: performing the grid search to tune the hyperparameters. We use `GridSearchCV` to search the hyperparameter grid.
`GridSearchCV` finds the best set of hyperparameters by systematically fitting and cross-validating the model on every combination in the grid.
```python
from sklearn.model_selection import GridSearchCV

# Set up GridSearchCV with 5-fold cross-validation
ada_grid_search = GridSearchCV(ada_clf, param_grid, cv=5)

# Fit the model
ada_grid_search.fit(X_train, y_train)
```
Next, let's interpret the results to find the best hyperparameters and understand their impact on the model's performance.
```python
print(f"Best parameters for AdaBoost: {ada_grid_search.best_params_}")
print(f"Best cross-validation score for AdaBoost: {ada_grid_search.best_score_}")
# Output:
# Best parameters for AdaBoost: {'estimator__max_depth': 1, 'learning_rate': 0.1, 'n_estimators': 50}
# Best cross-validation score for AdaBoost: 0.9617857142857144
```
This prints the combination of hyperparameters that performed best during the grid search. The `best_params_` attribute tells us which combination gave the best performance, and `best_score_` reports the mean accuracy the model achieved during cross-validation.
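The single best combination is not the whole story: `cv_results_` records the score of every combination tried. Here is one way to inspect the top candidates, using pandas purely for readable output (our addition, not part of the lesson's required code):

```python
import pandas as pd

# One row per tested combination, with its mean cross-validation score
results = pd.DataFrame(ada_grid_search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score"]
print(results.sort_values("rank_test_score")[columns].head())
```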
Now that we have the best hyperparameters, let's use them to make predictions on our test set and evaluate the model's performance.
```python
from sklearn.metrics import accuracy_score

# Use the best estimator to make predictions on the test set
best_ada_model = ada_grid_search.best_estimator_
y_pred = best_ada_model.predict(X_test)

# Calculate accuracy
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test set accuracy: {test_accuracy}")
# Output: Test set accuracy: 1.0 (your exact value may vary)
```
This code will help us understand how well our model generalizes to unseen data.
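Accuracy is a single summary number. If you want a per-class breakdown, scikit-learn's `classification_report` (an optional extra, reusing `y_test` and `y_pred` from above) shows precision, recall, and F1 for each wine class:

```python
from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 on the held-out test set
print(classification_report(y_test, y_pred))
```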
Great job on making it through the lesson! Today, we learned how to define a parameter grid for an ensemble model, perform hyperparameter tuning using `GridSearchCV`, and evaluate the model on a test set. Hyperparameter tuning is essential for improving the performance of your machine learning models, especially ensemble models like `AdaBoost`.
Now, it's time for you to apply what you've learned. You'll move to the practice section where you'll get hands-on experience with hyperparameter tuning for ensemble models. This practice will solidify your understanding and give you the confidence to use these techniques on your own projects. Good luck!