Hey there! Today, we're going to learn about feature scaling. You might be wondering, what is feature scaling, and why should we care? Simply put, feature scaling is like making sure all the ingredients in your recipe are measured in the same unit. Imagine trying to mix pounds of flour and teaspoons of salt without converting one to the other — it wouldn't make sense, right?
Our goal is to understand why feature scaling is crucial in machine learning and to learn how to do it using Python and a library called scikit-learn.
Feature scaling ensures that all your data features contribute equally when building a machine learning model. Without scaling, features with large values can dominate, leading to biased outcomes. For example, if you were predicting house prices and one feature was measured in the thousands (like square footage) while another stayed in single digits (like the number of rooms), the model might overlook the smaller feature simply because its values look less significant.
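To make this concrete, here's a tiny sketch (with toy numbers of our own, not from a real dataset) showing how an unscaled feature can dominate a distance calculation, which is exactly what happens inside distance-based models like k-nearest neighbors:

```python
import numpy as np

# Two houses described as [square footage, number of rooms]
house_a = np.array([2000, 3])
house_b = np.array([2100, 8])

# Euclidean distance without scaling: the 100-sqft gap contributes
# 10,000 to the squared distance, while the 5-room gap contributes only 25
distance = np.linalg.norm(house_a - house_b)
print(distance)  # ~100.12, driven almost entirely by square footage
```

The number of rooms barely registers, even though a 5-room difference is huge in practice. Scaling puts both features on comparable footing.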
There are two common types:

- Standardization: Transforms data to have a mean (μ) of 0 and a standard deviation (σ) of 1.
  Formula: z = (x - μ) / σ, where x is the original feature value, μ is the mean of the feature, and σ is the standard deviation of the feature.
- Normalization: Rescales data to range between 0 and 1.
  Formula: x' = (x - x_min) / (x_max - x_min), where x is the original feature value, x_min is the minimum value of the feature, and x_max is the maximum value of the feature.
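To see the formulas in action before we reach for the library, here's a minimal NumPy sketch applying both by hand (the variable names are our own):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# Standardization: z = (x - mu) / sigma
# np.std defaults to the population standard deviation (ddof=0)
z = (x - x.mean()) / x.std()
print(z)  # [-1.3416 -0.4472  0.4472  1.3416]

# Normalization: x' = (x - x_min) / (x_max - x_min)
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)  # [0.     0.3333 0.6667 1.    ]
```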
Today, we'll focus on both standardization using `StandardScaler` and normalization using `MinMaxScaler` from scikit-learn.
Let's create a small sample dataset to see how feature scaling works.
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sample dataset
data = {'Feature1': [1, 2, 3, 4], 'Feature2': [10, 20, 30, 40]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
```
Output:
```
Original DataFrame:
   Feature1  Feature2
0         1        10
1         2        20
2         3        30
3         4        40
```
Before scaling, `Feature1` ranges from 1 to 4, and `Feature2` ranges from 10 to 40. Let's scale this dataset using `StandardScaler`.
We'll use the `StandardScaler` to perform the scaling. The `fit_transform` method will calculate the mean and standard deviation for scaling, and then apply the scaling to the data.
```python
# Feature scaling with StandardScaler
standard_scaler = StandardScaler()
scaled_data_standard = standard_scaler.fit_transform(df)
```
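As a quick aside, the fitted scaler keeps the statistics it learned, so you can inspect them through its `mean_` and `scale_` attributes:

```python
# The per-feature statistics StandardScaler learned during fit
print("Learned means:", standard_scaler.mean_)   # [ 2.5 25. ]
print("Learned stds:", standard_scaler.scale_)   # [ 1.11803399 11.18033989]
```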
Continuing from where we left off: `fit_transform` returns a NumPy array, so we need to convert the scaled data back to a DataFrame for better readability.
```python
# Convert the scaled data back to a DataFrame for better readability
scaled_df_standard = pd.DataFrame(scaled_data_standard, columns=df.columns)
print("Scaled DataFrame (StandardScaler):")
print(scaled_df_standard)
```
Output:
```
Scaled DataFrame (StandardScaler):
   Feature1  Feature2
0 -1.341641 -1.341641
1 -0.447214 -0.447214
2  0.447214  0.447214
3  1.341641  1.341641
```
Let's check that the data is scaled correctly by calculating the mean and standard deviation of each feature. One subtlety: pandas computes the sample standard deviation (`ddof=1`) by default, while `StandardScaler` uses the population standard deviation (`ddof=0`), so we pass `ddof=0` to match:
```python
print("Mean of each feature after scaling (should be close to 0):")
print(scaled_df_standard.mean())
print("Standard deviation of each feature after scaling (should be close to 1):")
print(scaled_df_standard.std(ddof=0))  # ddof=0 matches StandardScaler
```
Here is the output:
```
Mean of each feature after scaling (should be close to 0):
Feature1    0.0
Feature2    0.0
dtype: float64

Standard deviation of each feature after scaling (should be close to 1):
Feature1    1.0
Feature2    1.0
dtype: float64
```
The mean of each feature in the scaled DataFrame is 0, and the standard deviation is 1. This makes it easier for the machine learning model to treat all features equally.
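One practical caveat before we move on: in a real project you would fit the scaler on the training data only and reuse those learned statistics on the test data, so that no information about the test set leaks into training. Here's a minimal sketch, using a made-up split of our toy DataFrame:

```python
# Fit on training rows only, then apply the same statistics to test rows
train_df = df.iloc[:3]  # first three rows stand in for a training set
test_df = df.iloc[3:]   # last row stands in for unseen test data

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_df)  # learns mean/std from train only
test_scaled = scaler.transform(test_df)        # reuses the training statistics
print(test_scaled)
```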
Let's also apply feature scaling using the `MinMaxScaler` to see how normalization works. The good news is that using the `MinMaxScaler` is exactly the same as using the `StandardScaler`: you literally just change the scaler's name and everything works!
```python
# Feature scaling with MinMaxScaler
minmax_scaler = MinMaxScaler()
scaled_data_minmax = minmax_scaler.fit_transform(df)
```
Convert the normalized data back to a DataFrame for better readability and verify the range.
```python
# Convert the scaled data back to a DataFrame for better readability
scaled_df_minmax = pd.DataFrame(scaled_data_minmax, columns=df.columns)
print("Scaled DataFrame (MinMaxScaler):")
print(scaled_df_minmax)
```
Output:
```
Scaled DataFrame (MinMaxScaler):
   Feature1  Feature2
0  0.000000  0.000000
1  0.333333  0.333333
2  0.666667  0.666667
3  1.000000  1.000000
```
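Because both scalers share the same `fit_transform` interface, you can even treat them interchangeably. Here's a small sketch of our own that runs both through identical code:

```python
# One loop handles both scalers, since they share the same API
for scaler in (StandardScaler(), MinMaxScaler()):
    scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
    print(type(scaler).__name__)
    print(scaled)
```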
Let's validate the `MinMaxScaler` results:
```python
print("Minimum of each feature after scaling (should be 0):")
print(scaled_df_minmax.min())
print("Maximum of each feature after scaling (should be 1):")
print(scaled_df_minmax.max())
```
Output:
```
Minimum of each feature after scaling (should be 0):
Feature1    0.0
Feature2    0.0
dtype: float64

Maximum of each feature after scaling (should be 1):
Feature1    1.0
Feature2    1.0
dtype: float64
```
The minimum of each feature in the scaled DataFrame is 0, and the maximum is 1, ensuring that all data points fall within this range.
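If you ever need the original units back, say to report predictions in square feet rather than scaled values, both scalers also provide an `inverse_transform` method that undoes the scaling:

```python
# Undo the scaling to recover the original values
restored = minmax_scaler.inverse_transform(scaled_df_minmax)
print(pd.DataFrame(restored, columns=df.columns))  # matches the original df
```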
Great job! You learned what feature scaling is and why it is essential in machine learning. By scaling your features, you ensure that all of them contribute equally to the model. You also got hands-on with Python, `StandardScaler`, and `MinMaxScaler` from scikit-learn to both standardize and normalize a sample dataset.
Now it's time to move on to some practice exercises. You'll get the chance to apply what you learned and become even more confident in your ability to scale features effectively. Let's get started!