Lesson 2
Grid Search: Finding Optimal Model Parameters
Lesson Introduction

Welcome! Today, we're going to learn about an exciting and powerful tool in machine learning called Grid Search. Imagine trying to find the perfect pair of shoes that fit just right. Grid Search does something similar but for tuning machine learning models. By the end of this lesson, you'll understand how to use Grid Search to find the best settings (parameters) for your models.

The Concept of Grid Search

Imagine baking the perfect cake. You need to find the right proportions of sugar, flour, and baking soda. Grid Search does the same for machine learning models by trying different combinations of parameters to find the best one. Parameters are settings you can adjust to improve your model's performance. The right parameters can make your model more accurate.

The parameters we set when initializing the model are called hyperparameters. Finding the best combination of them is called hyperparameter tuning, or hypertuning for short.

We have already done some hypertuning before in this course path using for loops. But writing a for loop each time can be laborious, especially if you must check multiple models with multiple hyperparameters each. So, it is time for us to learn about a special tool that automates this process!
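As a reminder, a manual search with nested for loops might look like the sketch below. The specific hyperparameter values here are just illustrative:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

best_score, best_params = 0.0, None
for max_depth in [3, 5, 7]:
    for min_samples_split in [2, 5]:
        model = DecisionTreeClassifier(
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            random_state=42,
        )
        # Average accuracy across 5 cross-validation folds
        score = cross_val_score(model, X, y, cv=5).mean()
        if score > best_score:
            best_score, best_params = score, (max_depth, min_samples_split)

print(f"Best params: {best_params}, score: {best_score:.3f}")
```

This works, but every new model or hyperparameter means more nested loops and more bookkeeping, which is exactly what Grid Search automates.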

Data Preprocessing

Let's implement Grid Search using Scikit-Learn.

First, load the libraries and the Wine dataset:

Python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Load real dataset
X, y = load_wine(return_X_y=True)
# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Here, we're loading a dataset about different wine types. It contains data to help us classify different wine categories based on their attributes. We also split our data into a training set and a test set.

Parameter Grid

Next, let's define which parameters to test, similar to adjusting ingredients in a recipe. For DecisionTreeClassifier, try different values of max_depth (the maximum depth of the tree) and min_samples_split (the minimum number of samples required to split an internal node).

Grid Search requires a parameter grid, defined as a dictionary where the keys are the names of the model's hyperparameters and the values are lists of candidate values to try. Let's define it:

Python
# Defining the parameter grid
param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10]
}

Here, we say that max_depth can be 3, 5, 7, or 10, and min_samples_split can be 2, 5, or 10.
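Grid Search will try every combination of these values: 4 options for max_depth times 3 options for min_samples_split gives 12 candidate models. If you're curious, Scikit-Learn's ParameterGrid utility can enumerate the combinations for you:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above
param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10]
}

combos = list(ParameterGrid(param_grid))
print(len(combos))  # 4 * 3 = 12 combinations
print(combos[0])    # one candidate, e.g. {'max_depth': 3, 'min_samples_split': 2}
```

With 5-fold cross-validation, those 12 combinations mean 60 model fits in total, so grids can get expensive quickly.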

Performing Grid Search

The GridSearchCV class tests all parameter combinations and uses 5-fold cross-validation (cv=5). We'll also specify the scoring parameter to use accuracy as the evaluation metric.

Python
# Performing grid search
grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

Calling fit here doesn't train just one model: for every parameter combination in the grid, Grid Search trains a DecisionTreeClassifier, evaluates it with cross-validation, and keeps track of the best-performing combination.

Evaluating Results

After Grid Search, check the best parameters and the model's performance.

Python
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_}")
# Best parameters: {'max_depth': 3, 'min_samples_split': 2}
# Best cross-validation score: 0.9224137931034484

grid_search.best_params_ shows the best parameter combination, and grid_search.best_score_ gives that combination's mean cross-validation accuracy.
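Beyond the single best result, grid_search.cv_results_ records the mean cross-validation score for every combination tried, which is handy for seeing how close the runners-up were. A sketch of inspecting it (repeating the fit from above so the snippet is self-contained; random_state=42 is added here for reproducibility):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {'max_depth': [3, 5, 7, 10], 'min_samples_split': [2, 5, 10]}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                           cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# cv_results_ is a dict of arrays with one entry per parameter combination
results = grid_search.cv_results_
for params, mean_score in zip(results['params'], results['mean_test_score']):
    print(params, f"mean accuracy: {mean_score:.3f}")
```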

Making Predictions and Calculating Metrics

After finding the best parameters, use the best estimator to predict on the testing set and calculate the accuracy.

Python
from sklearn.metrics import accuracy_score

# Making predictions on the testing set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Calculating the accuracy on the testing set
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test set accuracy: {test_accuracy}")
# Test set accuracy: 0.9444444444444444

We make predictions on the test set using the model with the best parameters found by Grid Search, and then calculate the accuracy of these predictions.
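A small convenience worth knowing: by default, GridSearchCV refits the best estimator on the whole training set (refit=True), so calling predict on the grid search object itself is equivalent to using best_estimator_. A self-contained sketch (random_state=42 is added for reproducibility):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {'max_depth': [3, 5, 7, 10], 'min_samples_split': [2, 5, 10]}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                           cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# The two prediction paths give identical results
same = np.array_equal(grid_search.predict(X_test),
                      grid_search.best_estimator_.predict(X_test))
print(same)  # True
```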

Lesson Summary

We've made great progress! Here's a quick summary:

  1. What is Grid Search? It's a method to find the best parameters for your machine learning model.
  2. Why use it? Because the right parameters can make your model more accurate.
  3. How to use it with Scikit-Learn? Load a real dataset, define a parameter grid, split the dataset, perform Grid Search, train the model, evaluate the results, make predictions, and calculate the final accuracy.

Now, you're ready to move on to some practice exercises. You'll apply Grid Search to find the best parameters for your own machine learning models. Let's get started with the practice session!
