Addressing Data Leakage in Time Series

Intro to Machine Learning in Trading with $TSLA

Preparing Financial Data for Machine LearningLesson 4

Lesson 4

Addressing Data Leakage in Time Series

Lesson Overview

Welcome to today's lesson on addressing data leakage in time series data while preparing it for machine learning. In this lesson, you'll learn the importance of maintaining temporal order in your dataset splits to avoid forward-looking bias, which can misleadingly inflate your model's performance. We'll be using the Tesla ($TSLA) stock data as an example. By the end of this lesson, you'll understand how to partition your dataset correctly using TimeSeriesSplit from the sklearn.model_selection library.

Introduction to Data Leakage in Time Series

Data leakage occurs when information from outside the training dataset inadvertently makes its way into the model. This is particularly problematic in time series data, where the natural temporal ordering is crucial. Data leakage can lead to overestimation of a model's performance because it allows information from the future to be used in making predictions about the past.

When dealing with stock market data, using future prices to predict past prices would artificially inflate a model's accuracy and yield unreliable predictions for actual trading strategies. Hence, it's important to ensure that our training and testing sets are separated in a way that respects the temporal nature of the data.

Revisiting Feature Engineering and Scaling (Revision)

Let's quickly revise how to engineer features and scale them. These steps are foundational for preparing your data for machine learning models.

Python
1import pandas as pd
2import datasets
3from sklearn.preprocessing import StandardScaler
4
5# Load the dataset
6data = datasets.load_dataset('codesignal/tsla-historic-prices')
7tesla_df = pd.DataFrame(data['train'])
8
9# Feature Engineering: creating new features
10tesla_df['High-Low'] = tesla_df['High'] - tesla_df['Low']
11tesla_df['Price-Open'] = tesla_df['Close'] - tesla_df['Open']
12
13# Defining features and target
14features = tesla_df[['High-Low', 'Price-Open', 'Volume']].values
15target = tesla_df['Close'].values
16
17# Scaling
18scaler = StandardScaler()
19features_scaled = scaler.fit_transform(features)

In this snippet, we create two new features, High-Low and Price-Open, and scale these features using StandardScaler.

Correctly Splitting Time Series Data

To avoid data leakage in time series, we need to split our data so that future data points are not used to predict past data points. TimeSeriesSplit from the sklearn.model_selection library helps achieve this.

The TimeSeriesSplit class helps you create train/test splits that respect the temporal order of your data. One of the key arguments in TimeSeriesSplit is n_splits, which specifies the number of re-shuffling and splitting iterations. Essentially, this determines how many different train/test splits will be generated from your data.

Python
1from sklearn.model_selection import TimeSeriesSplit
2
3# Initiate TimeSeriesSplit
4tscv = TimeSeriesSplit(n_splits=3)
5
6# Splitting with TimeSeriesSplit
7for fold, (train_index, test_index) in enumerate(tscv.split(features_scaled)):
8    print(f"Fold {fold + 1}")
9    print(f"TRAIN indices (first 5): {train_index[:5]}, TEST indices (first 5): {test_index[:5]}")
10    
11    # Splitting the features and target
12    X_train, X_test = features_scaled[train_index], features_scaled[test_index]
13    y_train, y_test = target[train_index], target[test_index]
14    
15    # Print a small sample of the data
16    print(f"X_train sample:\n {X_train[:2]}")
17    print(f"y_train sample:\n {y_train[:2]}")
18    print(f"X_test sample:\n {X_test[:2]}")
19    print(f"y_test sample:\n {y_test[:2]}")
20    print("-" * 10)

To elaborate, TimeSeriesSplit generates indices for multiple train/test splits, where the training set for each split consists of all data points up to a specific point in time, and the test set includes the subsequent data points in time. This sequential process respects the chronological order of the data. As a result, no future data points are included in the training set of any fold, which effectively prevents data leakage. This method ensures that our model training and evaluation simulate real-world scenarios more accurately, thereby providing reliable performance metrics.

Analyzing Fold Results

Let's analyze the output from each fold to ensure correct data splitting. The output of the above code will be:

Plain text
1Fold 1
2TRAIN indices (first 5): [0 1 2 3 4], TEST indices (first 5): [839 840 841 842 843]
3X_train sample:
4 [[-0.48165383  0.08560547  2.29693712]
5 [-0.48579183 -0.02912844  2.00292929]]
6y_train sample:
7 [1.592667 1.588667]
8X_test sample:
9 [[-0.4714307  -0.11890593  0.26304787]
10 [-0.42092366  0.03234206  1.43036618]]
11y_test sample:
12 [10.857333 10.964667]
13----------
14Fold 2
15TRAIN indices (first 5): [0 1 2 3 4], TEST indices (first 5): [1675 1676 1677 1678 1679]
16X_train sample:
17 [[-0.48165383  0.08560547  2.29693712]
18 [-0.48579183 -0.02912844  2.00292929]]
19y_train sample:
20 [1.592667 1.588667]
21X_test sample:
22 [[-0.46169462 -0.13046308  1.57995793]
23 [-0.47447336  0.07639316  0.32446706]]
24y_test sample:
25 [17.066    17.133333]
26----------
27Fold 3
28TRAIN indices (first 5): [0 1 2 3 4], TEST indices (first 5): [2511 2512 2513 2514 2515]
29X_train sample:
30 [[-0.48165383  0.08560547  2.29693712]
31 [-0.48579183 -0.02912844  2.00292929]]
32y_train sample:
33 [1.592667 1.588667]
34X_test sample:
35 [[-0.27268857 -0.19528365  0.41906266]
36 [-0.34291165 -0.09059793 -0.01236106]]
37y_test sample:
38 [66.726669 66.288002]
39----------

This output confirms the correct operation of TimeSeriesSplit, showing how each set of training and testing indices progresses through the data without overlap, respecting the temporal order. This ensures that no future data is used when training the model.

Best Practices in Time Series Data Splitting

Summarizing the key points:

Always maintain temporal order when splitting time series data.
Use TimeSeriesSplit to avoid data leakage.
Verify the indices to ensure no future data is used in training.

Adhering to these practices ensures the reliability of your model's performance metrics and the validity of your predictions for real-world scenarios.

Lesson Summary

In this lesson, you learned the importance of avoiding data leakage in time series datasets by using TimeSeriesSplit. With this method, you can maintain the integrity of your machine learning models and ensure their predictions are trustworthy for real-world financial trading tasks. Practicing these concepts and techniques will solidify your understanding and prepare you for more advanced machine-learning challenges.

Enjoy this lesson? Now it's time to practice with Cosmo!

Practice is how you turn knowledge into actual skills.