In today's lesson, you'll learn how to standardize financial data using the StandardScaler from the sklearn library. Scaling features ensures that all of them contribute equally to machine learning models, improving their performance and robustness.
Lesson Goal: By the end of this lesson, you will be able to effectively scale financial features and understand the importance of this step in preparing data for machine learning.
Let's quickly recall how to load and preprocess the Tesla stock dataset:
```python
import pandas as pd
import datasets

# Load the dataset
data = datasets.load_dataset('codesignal/tsla-historic-prices')
tesla_df = pd.DataFrame(data['train'])

# Feature Engineering: creating new features
tesla_df['High-Low'] = tesla_df['High'] - tesla_df['Low']
tesla_df['Price-Open'] = tesla_df['Close'] - tesla_df['Open']
```
We've successfully loaded the Tesla dataset and created two new features: High-Low and Price-Open.
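If you want to double-check the feature engineering, a quick peek at the relevant columns (assuming tesla_df from the snippet above) is enough:

```python
# Verify the engineered columns exist and look sensible (tesla_df comes from the snippet above)
print(tesla_df[['High', 'Low', 'Open', 'Close', 'High-Low', 'Price-Open']].head())
```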
Feature scaling is crucial for machine learning for several reasons: it keeps features with large numeric ranges, such as trading volume, from dominating features with small ranges, and it helps gradient-based optimizers converge faster and more stably.

Feature scaling is particularly useful in scenarios like distance-based models (for example, k-nearest neighbors or clustering) and regularized linear models, where feature magnitudes directly affect the outcome.

These examples highlight the importance of scaling to ensure uniform treatment of features, thereby enhancing model performance. You can see the scale disparity in our own data with the quick check below.
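As a quick illustration, this sketch (assuming tesla_df from the previous snippet) compares the raw magnitudes of our features; Volume is typically several orders of magnitude larger than the engineered price features, which is exactly what scaling corrects:

```python
# Compare raw feature magnitudes before scaling (assumes tesla_df is already loaded)
raw_features = tesla_df[['High-Low', 'Price-Open', 'Volume']]
print(raw_features.describe().loc[['mean', 'std', 'min', 'max']])
```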
Standardization involves transforming your data so that the mean of each feature is 0 and the standard deviation is 1. This process puts all features on the same scale, improving the performance and robustness of machine learning models. The formula for standardization is:

z = (x − μ) / σ

where:

- x is the original feature value,
- μ is the mean of the feature,
- σ is the standard deviation of the feature.
By applying this formula, each feature will have a mean of 0 and a standard deviation of 1, enabling more stable and faster convergence during the training of machine learning models.
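To make the formula concrete, here is a minimal sketch that standardizes a single column by hand with NumPy (assuming tesla_df from the earlier snippet); in practice we will use StandardScaler instead:

```python
import numpy as np

# Manual standardization of one column, following z = (x - mu) / sigma
x = tesla_df['High-Low'].to_numpy()
z = (x - x.mean()) / x.std()

print("Mean after standardization:", round(z.mean(), 6))  # approximately 0
print("Std after standardization:", round(z.std(), 6))    # approximately 1
```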
Let's proceed to scale our features using StandardScaler from sklearn. The StandardScaler standardizes features by removing the mean and scaling to unit variance.
First, we define our features:
```python
from sklearn.preprocessing import StandardScaler

# Defining features
features = tesla_df[['High-Low', 'Price-Open', 'Volume']].values
```
Now, let's initialize the scaler and apply it to our features:
```python
# Scaling
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
```
Here, fit_transform computes the mean and standard deviation needed to scale the data and then returns the transformed version.
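One practical note, beyond what this lesson covers: if you later split your data into training and test sets, fit the scaler on the training portion only and reuse it on the test portion so no information leaks from the test set. A minimal sketch, assuming a simple chronological 80/20 split of features:

```python
from sklearn.preprocessing import StandardScaler

# Hypothetical chronological split: first 80% for training, remaining 20% for testing
split = int(len(features) * 0.8)
train_features, test_features = features[:split], features[split:]

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_features)  # learn mean/std from training data only
test_scaled = scaler.transform(test_features)        # reuse the same mean/std on test data
```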
It's essential to inspect and validate the scaled features to ensure they have been correctly normalized. Let's display the first few rows of the scaled features:
```python
# Displaying the first few scaled features
print("Scaled features (first 5 rows):\n", features_scaled[:5])
```
The output of the above code will be:
```
Scaled features (first 5 rows):
 [[-0.48165383  0.08560547  2.29693712]
 [-0.48579183 -0.02912844  2.00292929]
 [-0.50368231 -0.04721815  0.33325453]
 [-0.51901702 -0.0599476  -0.23997882]
 [-0.52169457 -0.06145506  0.08156432]]
```
This output shows that our features have been standardized: each column now has a mean close to 0 and a standard deviation close to 1, so all features contribute on a comparable scale to the machine learning model.
After scaling your features, it's important to check the mean and standard deviation to ensure they are correctly standardized. You can do this using the following code:
```python
# Checking mean values and standard deviations of scaled features
scaled_means = features_scaled.mean(axis=0)
scaled_stds = features_scaled.std(axis=0)

print("\nMean values of scaled features:", scaled_means)
print("Standard deviations of scaled features:", scaled_stds)
```
The output will show that the means are close to 0 and the standard deviations are close to 1:
```
Mean values of scaled features: [ 3.39667875e-17  5.57267607e-18 -6.79335750e-17]
Standard deviations of scaled features: [1. 1. 1.]
```
This validation confirms that your features have been successfully scaled.
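If you ever need to map scaled values back to their original units, the scaler's inverse_transform method reverses the transformation; a quick sketch using the objects defined above:

```python
import numpy as np

# Recover the original values from the scaled ones and confirm the round trip
features_restored = scaler.inverse_transform(features_scaled)
print("Round-trip matches original:", np.allclose(features_restored, features))  # True
```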
In this lesson, we revisited loading and preprocessing the Tesla stock dataset, discussed the importance of scaling features, and implemented StandardScaler to standardize our financial data features. By inspecting the scaled features, we confirmed they were correctly standardized.
Experiment with scaling other features in the dataset to further explore their impact, as in the sketch below. This practice will reinforce your understanding of data preprocessing, which is vital for building effective and reliable machine learning models. Happy coding!
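For instance, a sketch like the following scales some of the raw price columns the same way (the column names are assumed to match the dataset used above):

```python
from sklearn.preprocessing import StandardScaler

# Scale additional raw price columns alongside the engineered features
extra_features = tesla_df[['Open', 'Close', 'High', 'Low']].values
extra_scaled = StandardScaler().fit_transform(extra_features)

print("Scaled extra features (first 3 rows):\n", extra_scaled[:3])
```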