Hello! Today, we'll explore creating lag features for time series prediction using Tesla ($TSLA) stock data. Let's start by reviewing how to load the dataset and create basic features.
First, let's load the dataset and create new features based on existing columns such as High-Low
and Price-Open
.
Python1import pandas as pd 2import datasets 3 4# Loading the dataset (revision) 5data = datasets.load_dataset('codesignal/tsla-historic-prices') 6tesla_df = pd.DataFrame(data['train']) 7 8# Creating basic features (revision) 9tesla_df['High-Low'] = tesla_df['High'] - tesla_df['Low'] 10tesla_df['Price-Open'] = tesla_df['Close'] - tesla_df['Open'] 11 12# Displaying the DataFrame structure 13print(tesla_df.head())
Here, we calculate High-Low
(the difference between the highest and lowest price of the day) and Price-Open
(the difference between the closing and opening price) to create new features.
Lag features are essential in time series prediction as they help capture temporal patterns in the data by generating new features from past values. Essentially, these features allow us to use past values to predict future ones.
For instance, predicting today's closing price of Tesla stock might depend on the previous day's closing price. Here, the previous day's closing price would be a lagged feature.
Let's see how to create lag features using the shift()
method in Pandas. We will add a new column, Close_lag1
, to capture the previous day’s closing price.
Python1# Creating a lag feature 2tesla_df['Close_lag1'] = tesla_df['Close'].shift(1) 3 4# Displaying a sample of the DataFrame with the lag feature 5print(tesla_df[['Close', 'Close_lag1']].head())
The output of the above code will be:
Plain text1 Close Close_lag1 20 1.592667 NaN 31 1.588667 1.592667 42 1.464000 1.588667 53 1.280000 1.464000 64 1.074000 1.280000
This output shows how the Close_lag1
column shifts the Close
column values down by one row, making the first row's Close_lag1
value NaN because there is no previous row to shift from.
By using shift(1)
, we shift the closing price values down by one row, effectively capturing the previous day's closing price in a new column.
Introducing lag features usually results in NaN values since the first row doesn't have a previous day to refer to. Let's handle these NaN values by dropping them.
Python1# Dropping NaN values 2tesla_df.dropna(inplace=True) 3 4# Verifying the removal of NaN values 5print(tesla_df[['Close', 'Close_lag1']].head())
The output of the above code will be:
Plain text1 Close Close_lag1 21 1.588667 1.592667 32 1.464000 1.588667 43 1.280000 1.464000 54 1.074000 1.280000 65 1.053333 1.074000
This output verifies the effective removal of NaN values resulting from the creation of lag features, with the dataset now cleaned and ready for further processing. Dropping the NaN values ensures that our dataset is clean and ready for model training.
Next, we'll define the features and target variables for our model. Our features will include Close_lag1
, High-Low
, Price-Open
, and Volume
. Our target variable will be the Close
price.
Python1# Defining features and the target 2features = tesla_df[['Close_lag1', 'High-Low', 'Price-Open', 'Volume']].values 3target = tesla_df['Close'].values 4 5# Displaying features and target 6print("Features (first 5 rows):\n", features[:5]) 7print("Target (first 5 rows):\n", target[:5])
The output of the above code will be:
Plain text1Features (first 5 rows): 2 [[ 1.592667e+00 4.746670e-01 -1.306660e-01 2.578065e+08] 3 [ 1.588667e+00 3.766670e-01 -2.026670e-01 1.232820e+08] 4 [ 1.464000e+00 2.926670e-01 -2.533330e-01 7.709700e+07] 5 [ 1.280000e+00 2.780000e-01 -2.593330e-01 1.030035e+08] 6 [ 1.074000e+00 1.100000e-01 -4.000000e-02 1.038255e+08]] 7Target (first 5 rows): 8 [1.588667 1.464 1.28 1.074 1.053333]
This output demonstrates the structured array format of features selected for the machine learning model training, including lag features and the immediate target values for the prediction. By defining these features and targets, we set up our dataset for machine learning models, enabling them to learn from past data to predict future stock prices.
In this lesson, you've learned how to create and implement lag features in the Tesla ($TSLA) stock dataset. Lag features are crucial for capturing temporal dependencies in time series data, significantly improving model performance. You reviewed the dataset, created new features, introduced lag features, handled NaN values, and defined features and target variables.
Next, practice creating lag features with different shifting intervals and explore their impact on predictive performance. This will deepen your understanding and enhance your skills in time series forecasting. Happy coding!