Welcome to today's lesson on Evaluating a Model with Cross-Validation! Our goal is to understand how to reliably assess the performance of our Gradient Boosting model using cross-validation techniques. This lesson will guide you through a quick review of data preparation, introduce the concept and importance of cross-validation, demonstrate how to implement cross-validation with the `cross_val_score` function, and visualize model predictions to better understand the model's performance.
Before we dive into evaluating our model with cross-validation, let's quickly review the data preparation steps we performed. This will ensure that we're on the same page regarding the dataset and features we're using.
First, we loaded the Tesla ($TSLA) historical prices dataset:
```python
from datasets import load_dataset
import pandas as pd

# Load dataset
tesla = load_dataset('codesignal/tsla-historic-prices')
tesla_df = pd.DataFrame(tesla['train'])

# Convert Date column to datetime type
tesla_df['Date'] = pd.to_datetime(tesla_df['Date'])
```
Next, we performed feature engineering to add technical indicators and the target variable:
```python
# Feature Engineering
tesla_df['Target'] = tesla_df['Adj Close'].shift(-1) - tesla_df['Adj Close']
tesla_df['SMA_5'] = tesla_df['Adj Close'].rolling(window=5).mean()
tesla_df['SMA_10'] = tesla_df['Adj Close'].rolling(window=10).mean()
tesla_df['EMA_5'] = tesla_df['Adj Close'].ewm(span=5, adjust=False).mean()
tesla_df['EMA_10'] = tesla_df['Adj Close'].ewm(span=10, adjust=False).mean()

# Drop NaN values created by moving averages
tesla_df.dropna(inplace=True)
```
Finally, we selected our features and target, and standardized the features:
```python
from sklearn.preprocessing import StandardScaler

# Select features and target
features = tesla_df[['Open', 'High', 'Low', 'Close', 'Volume', 'SMA_5', 'SMA_10', 'EMA_5', 'EMA_10']].values
target = tesla_df['Target'].values

# Standardizing features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
```
This brings us to the prepared data that we'll use for model training and evaluation.
Cross-validation is a key technique in evaluating the performance of machine learning models. It helps in assessing how well our model generalizes to an independent dataset. By using cross-validation, we minimize the risk of overfitting and ensure our model's robustness.
In K-Fold Cross-Validation, we split our dataset into k portions (folds). The model is trained on k - 1 folds and tested on the remaining fold. This process is repeated k times, each time using a different fold as the test set. The scores from each fold are then averaged to get a more reliable performance estimate.
Here's a quick explanation of how K-Fold Cross-Validation works:

- Split the dataset into k folds
- Train on k - 1 folds and test on the remaining fold
- Repeat k times, each time with a different fold as the test set

We will use the `cross_val_score` function from `sklearn.model_selection` to perform cross-validation efficiently.
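To make that procedure concrete before handing the work to `cross_val_score`, here is a minimal sketch of the same loop written out by hand with `KFold`. It reuses the `features_scaled` and `target` arrays prepared above and, for illustration, the same Gradient Boosting settings we configure in the next step, so treat it as a reference rather than required lesson code:

```python
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np

# Manual 5-fold cross-validation: essentially what cross_val_score automates
kf = KFold(n_splits=5)
fold_scores = []
for train_idx, test_idx in kf.split(features_scaled):
    fold_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                           max_depth=3, random_state=42)
    # Train on k - 1 folds
    fold_model.fit(features_scaled[train_idx], target[train_idx])
    # Score on the held-out fold (R^2 for a regressor)
    fold_scores.append(fold_model.score(features_scaled[test_idx], target[test_idx]))

# Average the per-fold scores for a more reliable estimate
print("Per-fold R^2 scores:", np.round(fold_scores, 4))
print("Mean R^2:", np.mean(fold_scores))
```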
Let's move on to implementing cross-validation with our Gradient Boosting model. We'll set up the model and use 5-fold cross-validation to evaluate its performance.
Start by importing the necessary functions and setting up the model:
```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingRegressor

# Instantiate model
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
```
Next, perform cross-validation and print the mean score:
```python
# Perform cross-validation
# With no scoring argument, cross_val_score uses the regressor's default
# score, R^2 (the coefficient of determination), which can be negative.
scores = cross_val_score(model, features_scaled, target, cv=5)
mean_score = scores.mean()
print("Mean cross-validation score: ", mean_score)
# Output:
# Mean cross-validation score:  -0.21139860331328936
```
A negative R² score indicates that the model's predictions are, on average, poorer than simply predicting the mean target value on each test fold. Such an outcome suggests the need for model improvement, or a reevaluation of the data preprocessing steps and feature selection.
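To see that mean-prediction baseline explicitly, one option is to cross-validate scikit-learn's `DummyRegressor`, which always predicts the mean of the training target. A minimal sketch, reusing the arrays from above:

```python
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

# A baseline that always predicts the mean of the training target
baseline = DummyRegressor(strategy='mean')
baseline_scores = cross_val_score(baseline, features_scaled, target, cv=5)

print("Baseline mean R^2:", baseline_scores.mean())
# The baseline's R^2 sits around (or slightly below) zero by construction,
# so a model with a clearly negative mean R^2 is underperforming it.
```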
The R² score above is hard to translate into the units of the target. To get a more interpretable number, let's repeat the cross-validation with an explicit scoring metric, the mean absolute error, and print the mean score:
```python
# Perform cross-validation with mean absolute error as the metric.
# scikit-learn reports it as a negative value (neg_mean_absolute_error)
# so that higher scores always mean better performance.
scores = cross_val_score(model, features_scaled, target, cv=5, scoring='neg_mean_absolute_error')

# Convert negative mean absolute error to positive for easier interpretation
mean_score = -scores.mean()
print("Mean cross-validation score (Mean Absolute Error): ", mean_score)
# Output:
# Mean cross-validation score (Mean Absolute Error):  0.21139860331328936
```
This score is the Mean Absolute Error (MAE) of the model, which tells us the average absolute difference between predicted and actual values. A lower MAE indicates better predictive accuracy. In this case, the MAE suggests that, on average, the model's predictions deviate from the actual values by approximately 0.21 units.
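Whether an MAE of about 0.21 is good depends entirely on the scale of the target, here the next-day change in adjusted close. A quick way to put it in context is to compare it with the typical magnitude of that change; a small sketch, reusing the variables from above:

```python
import numpy as np

# Typical size of the quantity we are predicting (next-day price change)
print("Mean absolute target value:", np.abs(target).mean())
print("Target standard deviation:", target.std())

# Compare with the cross-validated MAE computed above
print("Cross-validated MAE:", mean_score)
```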
Visualizing the model's predictions against actual values is crucial for understanding how well the model is performing. Let’s fit the model to our entire dataset and visualize its predictions.
Fit the model to the data:
```python
# Fit model to visualize predictions
model.fit(features_scaled, target)
predictions = model.predict(features_scaled)
```
Now, let's create a scatter plot comparing the actual values to the predicted values:
```python
import matplotlib.pyplot as plt

# Plotting predictions vs actual values
plt.figure(figsize=(10, 6))
plt.scatter(range(len(target)), target, label='Actual', alpha=0.7)
plt.scatter(range(len(target)), predictions, label='Predicted', alpha=0.7)
plt.title('Actual vs Predicted Values with Cross-Validation')
plt.xlabel('Sample Index')
plt.ylabel('Value')
plt.legend()
plt.show()
```
This plot will help us visually assess how close our model's predictions are to the actual target values, providing another layer of model evaluation.
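One caveat: the plot above shows in-sample predictions, because the model was fit on the entire dataset, so it will look more flattering than the cross-validation scores suggest. For an out-of-sample view that matches the cross-validation setup, one option is `cross_val_predict`, which gathers each sample's prediction from the fold in which it was held out. A minimal sketch:

```python
from sklearn.model_selection import cross_val_predict
import matplotlib.pyplot as plt

# Out-of-fold predictions: each sample is predicted by a model
# that never saw it during training
oof_predictions = cross_val_predict(model, features_scaled, target, cv=5)

plt.figure(figsize=(10, 6))
plt.scatter(range(len(target)), target, label='Actual', alpha=0.7)
plt.scatter(range(len(target)), oof_predictions, label='Out-of-fold Predicted', alpha=0.7)
plt.title('Actual vs Out-of-Fold Predicted Values')
plt.xlabel('Sample Index')
plt.ylabel('Value')
plt.legend()
plt.show()
```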
In this lesson, we covered the following:

- A quick review of the data preparation and feature engineering steps for the TSLA dataset
- The concept of K-Fold Cross-Validation and why it matters for reliable evaluation
- Implementing cross-validation for a Gradient Boosting model with `cross_val_score`
- Visualizing the model's predictions against the actual values

Cross-validation is a powerful tool to ensure your model's reliability and generalization. Visualizing the results helps in understanding the model's performance better.
Practice these techniques by applying cross-validation to different models and datasets, and explore changing the number of folds in cross-validation to see how it affects the performance. These exercises will help you better understand the importance of cross-validation and improve your machine learning skills.
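As a starting point for experimenting with the number of folds, here is a small sketch that reruns the evaluation for a few values of `cv` and prints the mean absolute error for each; the exact numbers will vary with your data and model settings.

```python
from sklearn.model_selection import cross_val_score

# Compare mean MAE across different numbers of folds
for k in [3, 5, 10]:
    scores = cross_val_score(model, features_scaled, target,
                             cv=k, scoring='neg_mean_absolute_error')
    print(f"{k}-fold mean MAE: {-scores.mean():.4f}")
```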