Lesson 5
Using Early Stopping to Prevent Overfitting in Gradient Boosting Models
Lesson Overview

Hello and welcome! In today's lesson, we will explore how to use early stopping to prevent overfitting. This technique is essential for keeping your gradient boosting models robust and accurate. We will introduce early stopping, revise the data preparation steps, implement early stopping in a gradient boosting model, evaluate its performance, and visualize the predictions vs. actual values.

By the end of this lesson, you will understand how to effectively use early stopping to manage overfitting in your models, especially within the context of financial data.

Introduction to Early Stopping

Early stopping is a regularization technique used to prevent overfitting in machine learning models, particularly those that learn iteratively, like gradient boosting models. It works by monitoring the model's performance on a validation set during training and halting the training process when no significant improvement is observed over a specified number of iterations.

Overfitting occurs when a model learns the noise in the training data rather than the actual signal, resulting in poor generalization to new, unseen data. Early stopping can help mitigate this by terminating the training process before the model becomes too specialized in the training data.
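
To make this rule concrete, below is a minimal, simplified sketch of the stopping decision written as a small standalone helper. Note that the should_stop function and the example loss history are purely illustrative additions; scikit-learn implements this logic internally.

Python
# Simplified sketch of the early-stopping rule: stop when the validation loss
# has not improved by at least `tol` for `n_iter_no_change` consecutive iterations.
def should_stop(val_losses, n_iter_no_change=5, tol=0.01):
    if len(val_losses) <= n_iter_no_change:
        return False
    best_before = min(val_losses[:-n_iter_no_change])  # best loss seen earlier
    recent = val_losses[-n_iter_no_change:]            # most recent losses
    return all(loss > best_before - tol for loss in recent)

# Hypothetical validation-loss history that stops improving after a few iterations
history = [0.90, 0.70, 0.55, 0.50, 0.49, 0.49, 0.50, 0.49, 0.50, 0.49]
print(should_stop(history))  # True -> training would be halted at this point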

Why Early Stopping?

  • Enhances model generalization.
  • Reduces training time by halting unproductive iterations.
  • Helps manage computational resources efficiently.

Revising Data Preparation Steps

Given that you already know how to load, prepare, and scale features, let's do a quick revision. We'll use the load_dataset function to load the TSLA dataset, create new features, and standardize them.

Python
import pandas as pd
from datasets import load_dataset
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load TSLA dataset
tesla = load_dataset('codesignal/tsla-historic-prices')
tesla_df = pd.DataFrame(tesla['train'])

# Convert Date column to datetime type
tesla_df['Date'] = pd.to_datetime(tesla_df['Date'])

# Feature Engineering
tesla_df['Prev_Close'] = tesla_df['Adj Close'].shift(1)
tesla_df['Day_Pct_Change'] = (tesla_df['Adj Close'] - tesla_df['Prev_Close']) / tesla_df['Prev_Close'] * 100
tesla_df['SMA_5'] = tesla_df['Adj Close'].rolling(window=5).mean()
tesla_df['SMA_10'] = tesla_df['Adj Close'].rolling(window=10).mean()
tesla_df['EMA_5'] = tesla_df['Adj Close'].ewm(span=5, adjust=False).mean()
tesla_df['EMA_10'] = tesla_df['Adj Close'].ewm(span=10, adjust=False).mean()
tesla_df.dropna(inplace=True)

# Select features and target
features = tesla_df[['Open', 'High', 'Low', 'Close', 'Volume', 'SMA_5', 'SMA_10', 'EMA_5', 'EMA_10']].values
target = tesla_df['Day_Pct_Change'].values

# Standardizing features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.25, random_state=42)

Here, we load the dataset, create new features (previous close, day percentage change, simple moving averages, exponential moving averages), and standardize features. Finally, we split the dataset into training and testing sets.
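
As an optional sanity check (not part of the original pipeline), you can confirm that standardization and the split behaved as expected: each scaled feature column should have a mean close to 0 and a standard deviation close to 1.

Python
import numpy as np

# Verify the scaling and the shapes of the resulting splits
print("Feature means:", np.round(features_scaled.mean(axis=0), 3))
print("Feature stds: ", np.round(features_scaled.std(axis=0), 3))
print("Train/test shapes:", X_train.shape, X_test.shape)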

Implementing Early Stopping in Gradient Boosting

Let's now incorporate early stopping into our Gradient Boosting model. The essential parameters for early stopping include validation_fraction, n_iter_no_change, and tol.

  • validation_fraction: The fraction of the data to be used as a validation set for early stopping.
  • n_iter_no_change: Number of iterations with no improvement to wait before stopping the training.
  • tol: The minimum improvement in the validation score that counts as progress; if the score does not improve by at least tol for n_iter_no_change consecutive iterations, training stops.

Here's how we instantiate and train the model with early stopping:

Python
from sklearn.ensemble import GradientBoostingRegressor

# Instantiate the model with early stopping
model = GradientBoostingRegressor(n_estimators=100, validation_fraction=0.1,
                                  n_iter_no_change=5, tol=0.01, random_state=42)

# Fit the model
model.fit(X_train, y_train)

# Predict and evaluate
predictions = model.predict(X_test)
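
Because early stopping can halt training before all n_estimators boosting stages are built, it is worth checking how many stages were actually fitted. Scikit-learn exposes this on the fitted model as the n_estimators_ attribute; the exact count you see will depend on the data and the random split.

Python
# How many boosting stages did early stopping actually keep?
print("Boosting stages requested:", model.n_estimators)
print("Boosting stages fitted:", model.n_estimators_)
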
Evaluating Model Performance with and without Early Stopping

We'll use Mean Squared Error (MSE) to evaluate our model's performance. Lower MSE values indicate better model performance. Let's compare the MSE values with and without early stopping.

First, we'll calculate the MSE for our model with early stopping:

Python
from sklearn.metrics import mean_squared_error

# Calculate MSE
mse = mean_squared_error(y_test, predictions)
print("Mean Squared Error with Early Stopping:", mse)
# Output:
# Mean Squared Error with Early Stopping: 12.433090244316602

This output is the average squared difference between the predicted and actual values, providing a simple measure of the model's prediction accuracy.
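
If the definition feels abstract, the same number can be reproduced by hand as the mean of the squared residuals; this quick check is an addition to the lesson code and should match the value printed above.

Python
import numpy as np

# MSE computed manually: the average of the squared differences
manual_mse = np.mean((y_test - predictions) ** 2)
print("Manual MSE:", manual_mse)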

Now, let's train a model without early stopping and compare the MSE:

Python
# Instantiate the model without early stopping
model_no_stop = GradientBoostingRegressor(n_estimators=100, random_state=42)

# Fit the model
model_no_stop.fit(X_train, y_train)

# Predict and evaluate
predictions_no_stop = model_no_stop.predict(X_test)
mse_no_stop = mean_squared_error(y_test, predictions_no_stop)
print("Mean Squared Error without Early Stopping:", mse_no_stop)
# Output:
# Mean Squared Error without Early Stopping: 11.456288894627543

This result shows that the model without early stopping performed slightly better in this instance, but this might vary with different datasets and model parameters. It's essential to evaluate models comprehensively before choosing the best approach.
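
To see where the extra iterations stop paying off, one optional diagnostic (an addition to the lesson, whose exact values will vary from run to run) is to track the test MSE after each boosting stage of the unrestricted model using staged_predict:

Python
import numpy as np

# Test MSE after each boosting stage of the model trained without early stopping
staged_mse = [mean_squared_error(y_test, pred)
              for pred in model_no_stop.staged_predict(X_test)]
best_stage = int(np.argmin(staged_mse)) + 1
print(f"Lowest test MSE ({min(staged_mse):.4f}) at stage {best_stage} of {len(staged_mse)}")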

Visualizing Predictions vs. Actual Values

Visualizing the predictions against the actual values helps in understanding how well the model performs. Let's plot these values for our model with early stopping.

Python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.scatter(range(len(y_test)), y_test, label='Actual', alpha=0.7)
plt.scatter(range(len(y_test)), predictions, label='Predicted', alpha=0.7)
plt.title('Actual vs Predicted Values with Early Stopping')
plt.xlabel('Sample Index')
plt.ylabel('Value')
plt.legend()
plt.show()

And for comparison, let's also visualize the predictions from the model without early stopping:

Python
plt.figure(figsize=(10, 6))
plt.scatter(range(len(y_test)), y_test, label='Actual', alpha=0.7)
plt.scatter(range(len(y_test)), predictions_no_stop, label='Predicted', alpha=0.7)
plt.title('Actual vs Predicted Values without Early Stopping')
plt.xlabel('Sample Index')
plt.ylabel('Value')
plt.legend()
plt.show()
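
As an optional variation on these plots (not part of the original lesson), plotting predicted values directly against actual values with a y = x reference line can make systematic over- or under-prediction easier to spot:

Python
# Predicted vs. actual on the same axes; points near the dashed line are accurate
plt.figure(figsize=(6, 6))
plt.scatter(y_test, predictions, alpha=0.7, label='With Early Stopping')
plt.scatter(y_test, predictions_no_stop, alpha=0.7, label='Without Early Stopping')
lims = [min(y_test.min(), predictions.min()), max(y_test.max(), predictions.max())]
plt.plot(lims, lims, linestyle='--', color='gray', label='Perfect prediction')
plt.xlabel('Actual Day_Pct_Change (%)')
plt.ylabel('Predicted Day_Pct_Change (%)')
plt.legend()
plt.show()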

Lesson Summary

In this lesson, we explored how early stopping can prevent overfitting in gradient boosting models. We revised essential steps of data preparation, implemented early stopping, evaluated model performance using Mean Squared Error, and visualized the predictions. Early stopping helps improve the model's generalization ability and prevents wasting computational resources on unnecessary iterations.

Next, you will practice applying early stopping to your predictive models. This hands-on practice will solidify your understanding and enhance your skill set in machine learning for financial trading. Happy learning!
