Welcome to today's lesson on Evaluating a Model with Cross-Validation! Our goal is to understand how to reliably assess the performance of our Gradient Boosting model using cross-validation techniques. This lesson will guide you through a quick review of data preparation, introduce the concept and importance of cross-validation, demonstrate how to implement cross-validation with the `cross_val_score` function, and visualize model predictions to better understand the model's performance.
Before we dive into evaluating our model with cross-validation, let's quickly review the data preparation steps we performed. This will ensure that we're on the same page regarding the dataset and features we're using.
First, we loaded the Tesla ($TSLA) historical prices dataset:
```python
from datasets import load_dataset
import pandas as pd

# Load dataset
tesla = load_dataset('codesignal/tsla-historic-prices')
tesla_df = pd.DataFrame(tesla['train'])

# Convert Date column to datetime type
tesla_df['Date'] = pd.to_datetime(tesla_df['Date'])
```
Next, we performed feature engineering to add technical indicators and the target variable:
```python
# Feature Engineering
tesla_df['Target'] = tesla_df['Adj Close'].shift(-1) - tesla_df['Adj Close']
tesla_df['SMA_5'] = tesla_df['Adj Close'].rolling(window=5).mean()
tesla_df['SMA_10'] = tesla_df['Adj Close'].rolling(window=10).mean()
tesla_df['EMA_5'] = tesla_df['Adj Close'].ewm(span=5, adjust=False).mean()
tesla_df['EMA_10'] = tesla_df['Adj Close'].ewm(span=10, adjust=False).mean()

# Drop NaN values created by moving averages
tesla_df.dropna(inplace=True)
```
Finally, we selected our features and target, and standardized the features:
```python
from sklearn.preprocessing import StandardScaler

# Select features and target
features = tesla_df[['Open', 'High', 'Low', 'Close', 'Volume', 'SMA_5', 'SMA_10', 'EMA_5', 'EMA_10']].values
target = tesla_df['Target'].values

# Standardizing features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
```
This brings us to the prepared data that we'll use for model training and evaluation.
Cross-validation is a key technique in evaluating the performance of machine learning models. It helps in assessing how well our model generalizes to an independent dataset. By using cross-validation, we minimize the risk of overfitting and ensure our model's robustness.
In K-Fold Cross-Validation, we split our dataset into k portions (folds). The model is trained on k - 1 folds and tested on the remaining fold. This process is repeated k times, each time using a different fold as the test set. The scores from each fold are then averaged to get a more reliable performance estimate.
Here's a quick explanation of how K-Fold Cross-Validation works:

- Split the dataset into k folds
- Train on k - 1 folds and test on the remaining fold
- Repeat k times, each time with a different fold as the test set

We will use the `cross_val_score` function from `sklearn.model_selection` to perform cross-validation efficiently.
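To make that procedure concrete before handing the work to `cross_val_score`, here is a minimal sketch of the same loop written out by hand with `KFold`. It reuses the `features_scaled` and `target` arrays prepared above and, for illustration, the same Gradient Boosting settings we configure in the next step, so treat it as a reference rather than required lesson code:

```python
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np

# Manual 5-fold cross-validation: essentially what cross_val_score automates
kf = KFold(n_splits=5)
fold_scores = []
for train_idx, test_idx in kf.split(features_scaled):
    fold_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                           max_depth=3, random_state=42)
    # Train on k - 1 folds
    fold_model.fit(features_scaled[train_idx], target[train_idx])
    # Score on the held-out fold (R^2 for a regressor)
    fold_scores.append(fold_model.score(features_scaled[test_idx], target[test_idx]))

# Average the per-fold scores for a more reliable estimate
print("Per-fold R^2 scores:", np.round(fold_scores, 4))
print("Mean R^2:", np.mean(fold_scores))
```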
Let's move on to implementing cross-validation with our Gradient Boosting model. We'll set up the model and use 5-fold cross-validation to evaluate its performance.
Start by importing the necessary functions and setting up the model:
```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingRegressor

# Instantiate model
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
```
Next, perform cross-validation and print the mean score:
```python
# Perform cross-validation
# With no scoring argument, cross_val_score uses the regressor's default
# score, R^2 (the coefficient of determination), which can be negative.
scores = cross_val_score(model, features_scaled, target, cv=5)
mean_score = scores.mean()
print("Mean cross-validation score: ", mean_score)
# Output:
# Mean cross-validation score:  -0.21139860331328936
```
A negative R² score indicates that the model's predictions are, on average, poorer than simply predicting the mean target value on each test fold. Such an outcome suggests the need for model improvement, or a reevaluation of the data preprocessing steps and feature selection.
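To see that mean-prediction baseline explicitly, one option is to cross-validate scikit-learn's `DummyRegressor`, which always predicts the mean of the training target. A minimal sketch, reusing the arrays from above:

```python
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

# A baseline that always predicts the mean of the training target
baseline = DummyRegressor(strategy='mean')
baseline_scores = cross_val_score(baseline, features_scaled, target, cv=5)

print("Baseline mean R^2:", baseline_scores.mean())
# The baseline's R^2 sits around (or slightly below) zero by construction,
# so a model with a clearly negative mean R^2 is underperforming it.
```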
The R² score above is hard to translate into the units of the target. To get a more interpretable number, let's repeat the cross-validation with an explicit scoring metric, the mean absolute error, and print the mean score:
```python
# Perform cross-validation with mean absolute error as the metric.
# scikit-learn reports it as a negative value (neg_mean_absolute_error)
# so that higher scores always mean better performance.
scores = cross_val_score(model, features_scaled, target, cv=5, scoring='neg_mean_absolute_error')

# Convert negative mean absolute error to positive for easier interpretation
mean_score = -scores.mean()
print("Mean cross-validation score (Mean Absolute Error): ", mean_score)
# Output:
# Mean cross-validation score (Mean Absolute Error):  0.21139860331328936
```
This score is the Mean Absolute Error (MAE) of the model, which tells us the average absolute difference between predicted and actual values. A lower MAE indicates better predictive accuracy. In this case, the MAE suggests that, on average, the model's predictions deviate from the actual values by approximately 0.21 units.
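Whether an MAE of about 0.21 is good depends entirely on the scale of the target, here the next-day change in adjusted close. A quick way to put it in context is to compare it with the typical magnitude of that change; a small sketch, reusing the variables from above:

```python
import numpy as np

# Typical size of the quantity we are predicting (next-day price change)
print("Mean absolute target value:", np.abs(target).mean())
print("Target standard deviation:", target.std())

# Compare with the cross-validated MAE computed above
print("Cross-validated MAE:", mean_score)
```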
Visualizing the model's predictions against actual values is crucial for understanding how well the model is performing. Let’s fit the model to our entire dataset and visualize its predictions.
Fit the model to the data:
```python
# Fit model to visualize predictions
model.fit(features_scaled, target)
predictions = model.predict(features_scaled)
```
Now, let's create a scatter plot comparing the actual values to the predicted values:
```python
import matplotlib.pyplot as plt

# Plotting predictions vs actual values
plt.figure(figsize=(10, 6))
plt.scatter(range(len(target)), target, label='Actual', alpha=0.7)
plt.scatter(range(len(target)), predictions, label='Predicted', alpha=0.7)
plt.title('Actual vs Predicted Values with Cross-Validation')
plt.xlabel('Sample Index')
plt.ylabel('Value')
plt.legend()
plt.show()
```
This plot will help us visually assess how close our model's predictions are to the actual target values, providing another layer of model evaluation.
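One caveat: the plot above shows in-sample predictions, because the model was fit on the entire dataset, so it will look more flattering than the cross-validation scores suggest. For an out-of-sample view that matches the cross-validation setup, one option is `cross_val_predict`, which gathers each sample's prediction from the fold in which it was held out. A minimal sketch:

```python
from sklearn.model_selection import cross_val_predict
import matplotlib.pyplot as plt

# Out-of-fold predictions: each sample is predicted by a model
# that never saw it during training
oof_predictions = cross_val_predict(model, features_scaled, target, cv=5)

plt.figure(figsize=(10, 6))
plt.scatter(range(len(target)), target, label='Actual', alpha=0.7)
plt.scatter(range(len(target)), oof_predictions, label='Out-of-fold Predicted', alpha=0.7)
plt.title('Actual vs Out-of-Fold Predicted Values')
plt.xlabel('Sample Index')
plt.ylabel('Value')
plt.legend()
plt.show()
```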
In this lesson, we covered the following:

- A quick review of the data preparation and feature engineering steps for the TSLA dataset
- The concept of K-Fold Cross-Validation and why it matters for reliable evaluation
- Implementing cross-validation for a Gradient Boosting model with `cross_val_score`
- Visualizing the model's predictions against the actual values

Cross-validation is a powerful tool to ensure your model's reliability and generalization. Visualizing the results helps in understanding the model's performance better.
Practice these techniques by applying cross-validation to different models and datasets, and explore changing the number of folds in cross-validation to see how it affects the performance. These exercises will help you better understand the importance of cross-validation and improve your machine learning skills.
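As a starting point for experimenting with the number of folds, here is a small sketch that reruns the evaluation for a few values of `cv` and prints the mean absolute error for each; the exact numbers will vary with your data and model settings.

```python
from sklearn.model_selection import cross_val_score

# Compare mean MAE across different numbers of folds
for k in [3, 5, 10]:
    scores = cross_val_score(model, features_scaled, target,
                             cv=k, scoring='neg_mean_absolute_error')
    print(f"{k}-fold mean MAE: {-scores.mean():.4f}")
```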