Creating Lag Features for Time Series Prediction

Lesson 5

Lesson Overview

Hello! Today, we'll explore creating lag features for time series prediction using Tesla ($TSLA) stock data. Let's start by reviewing how to load the dataset and create basic features.

Reviewing Dataset and Basic Feature Creation

First, let's load the dataset and create two new features, High-Low and Price-Open, from the existing price columns.

Python
import pandas as pd
import datasets

# Loading the dataset (revision)
data = datasets.load_dataset('codesignal/tsla-historic-prices')
tesla_df = pd.DataFrame(data['train'])

# Creating basic features (revision)
tesla_df['High-Low'] = tesla_df['High'] - tesla_df['Low']
tesla_df['Price-Open'] = tesla_df['Close'] - tesla_df['Open']

# Displaying the DataFrame structure
print(tesla_df.head())

Here, we calculate High-Low (the difference between the highest and lowest price of the day) and Price-Open (the difference between the closing and opening price) to create new features.

Introduction to Lag Features

Lag features are new columns built from past values of a series. They help a model capture temporal patterns in time series data, because they let it use earlier observations to predict later ones.

For instance, predicting today's closing price of Tesla stock might depend on the previous day's closing price. Here, the previous day's closing price would be a lagged feature.

Creating and Implementing Lag Features

Let's see how to create lag features using the shift() method in Pandas. We will add a new column, Close_lag1, to capture the previous day’s closing price.

Python
# Creating a lag feature
tesla_df['Close_lag1'] = tesla_df['Close'].shift(1)

# Displaying a sample of the DataFrame with the lag feature
print(tesla_df[['Close', 'Close_lag1']].head())

The output of the above code will be:

Plain text
      Close  Close_lag1
0  1.592667         NaN
1  1.588667    1.592667
2  1.464000    1.588667
3  1.280000    1.464000
4  1.074000    1.280000

This output shows that shift(1) moves the Close values down by one row, so each row's Close_lag1 holds the previous day's closing price. The first row's Close_lag1 is NaN because there is no earlier row to take a value from.
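Because shift() works on row position rather than on dates, the lag only means "previous day" when the rows are in chronological order. The sketch below illustrates this with toy values; the Date column name and the toy_df variable are assumptions for illustration, so check the column names in your own DataFrame.

```python
import pandas as pd

# Toy values, deliberately out of chronological order (illustrative only)
toy_df = pd.DataFrame({
    'Date': pd.to_datetime(['2024-01-03', '2024-01-01', '2024-01-02']),
    'Close': [12.0, 10.0, 11.0],
})

# Sort chronologically first, then shift, so each lag is the prior day's value
toy_df = toy_df.sort_values('Date').reset_index(drop=True)
toy_df['Close_lag1'] = toy_df['Close'].shift(1)

print(toy_df)
```

If the data is already sorted by date, the sort is harmless, so it can serve as a cheap safeguard before creating lags.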

Handling NaN Values Resulting from Lag Features

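Creating a lag feature leaves NaN in the first row, since that row has no previous day to refer to. Let's handle these NaN values by dropping the affected rows. Note that calling dropna() without arguments removes any row with a NaN in any column, not just in Close_lag1.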

Python
# Dropping NaN values
tesla_df.dropna(inplace=True)

# Verifying the removal of NaN values
print(tesla_df[['Close', 'Close_lag1']].head())

The output of the above code will be:

Plain text
      Close  Close_lag1
1  1.588667    1.592667
2  1.464000    1.588667
3  1.280000    1.464000
4  1.074000    1.280000
5  1.053333    1.074000

This output confirms that the row containing NaN has been removed: the DataFrame now starts at index 1, and every remaining row has a valid Close_lag1 value, so the dataset is ready for model training.
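If other columns could contain missing values that you want to keep, one alternative is to restrict the drop to the lag column with the subset parameter of dropna(). This is a minimal sketch on toy data, not a change to the lesson's approach; the Note column and the toy_df and cleaned names are hypothetical.

```python
import pandas as pd

# Toy frame: the lag column has its expected leading NaN, and a
# hypothetical 'Note' column has an unrelated missing value
toy_df = pd.DataFrame({
    'Close': [1.592667, 1.588667, 1.464000],
    'Note': ['a', None, 'c'],
})
toy_df['Close_lag1'] = toy_df['Close'].shift(1)

# Drop only the rows where the lag feature itself is missing
cleaned = toy_df.dropna(subset=['Close_lag1']).reset_index(drop=True)

print(cleaned)
```

Here the row with a missing Note survives because only Close_lag1 is checked. The reset_index(drop=True) call is optional; it renumbers the rows from 0 after the drop.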

Defining Features and Target Variables

Next, we'll define the features and target variables for our model. Our features will include Close_lag1, High-Low, Price-Open, and Volume. Our target variable will be the Close price.

Python
# Defining features and the target
features = tesla_df[['Close_lag1', 'High-Low', 'Price-Open', 'Volume']].values
target = tesla_df['Close'].values

# Displaying features and target
print("Features (first 5 rows):\n", features[:5])
print("Target (first 5 rows):\n", target[:5])

The output of the above code will be:

Plain text
Features (first 5 rows):
 [[ 1.592667e+00  4.746670e-01 -1.306660e-01  2.578065e+08]
 [ 1.588667e+00  3.766670e-01 -2.026670e-01  1.232820e+08]
 [ 1.464000e+00  2.926670e-01 -2.533330e-01  7.709700e+07]
 [ 1.280000e+00  2.780000e-01 -2.593330e-01  1.030035e+08]
 [ 1.074000e+00  1.100000e-01 -4.000000e-02  1.038255e+08]]
Target (first 5 rows):
 [1.588667 1.464    1.28     1.074    1.053333]

This output shows the features as a 2D NumPy array with one row per trading day and one column per feature (Close_lag1, High-Low, Price-Open, Volume), alongside a 1D array of target values. Each row of features is paired with the Close value at the same position in target, which sets up the dataset for a machine learning model to learn from past data.
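One design consideration: High-Low and Price-Open are computed from the same day's prices as the target, so they are only known once that day has closed. If the goal is to predict a close before the day ends, one option is to lag those columns as well. The following is a minimal sketch with toy values; the High-Low_lag1 column name and the toy_df variable are illustrative, not part of the lesson's dataset.

```python
import pandas as pd

# Toy values for illustration only
toy_df = pd.DataFrame({
    'High':  [10.8, 11.2, 11.4, 11.0],
    'Low':   [9.9, 10.4, 10.7, 10.5],
    'Close': [10.5, 11.0, 10.8, 10.9],
})
toy_df['High-Low'] = toy_df['High'] - toy_df['Low']

# Lag both the close and the same-day range so that every feature
# comes from the previous row
toy_df['Close_lag1'] = toy_df['Close'].shift(1)
toy_df['High-Low_lag1'] = toy_df['High-Low'].shift(1)

# Remove the first row, which has no previous day
toy_df = toy_df.dropna(subset=['Close_lag1', 'High-Low_lag1'])

features = toy_df[['Close_lag1', 'High-Low_lag1']].values
target = toy_df['Close'].values

print(features.shape, target.shape)
```

With this layout, each target value is paired only with information from the preceding row.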

Lesson Summary

In this lesson, you've learned how to create and implement lag features in the Tesla ($TSLA) stock dataset. Lag features capture temporal dependencies in time series data, which can improve model performance. You reviewed the dataset, created new features, introduced lag features, handled NaN values, and defined features and target variables.

Next, practice creating lag features with different shifting intervals and explore their impact on predictive performance. This will deepen your understanding and enhance your skills in time series forecasting. Happy coding!
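As a starting point for that practice, here is a minimal sketch that builds several lag columns in a loop. It hard-codes the first six closing prices from the outputs above so it runs without downloading the dataset; the toy_df name and the choice of lags 1 to 3 are illustrative.

```python
import pandas as pd

# First six closing prices from the outputs above, hard-coded so the
# example runs without downloading the dataset
toy_df = pd.DataFrame(
    {'Close': [1.592667, 1.588667, 1.464000, 1.280000, 1.074000, 1.053333]}
)

# Create one lag column per interval; shift(k) looks k rows back
for k in (1, 2, 3):
    toy_df[f'Close_lag{k}'] = toy_df['Close'].shift(k)

# The largest lag determines how many leading rows contain NaN
toy_df = toy_df.dropna()

print(toy_df)
```

Note the trade-off: each additional lag gives the model more history per row, but a lag of k costs k rows at the start of the dataset once the NaN values are dropped.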
