Lesson 3
Feature Scaling
Lesson Introduction

Hey there! Today, we're going to learn about feature scaling. You might be wondering, what is feature scaling, and why should we care? Simply put, feature scaling is like making sure all the ingredients in your recipe are measured in the same unit. Imagine trying to mix pounds of flour and teaspoons of salt without converting one to the other — it wouldn't make sense, right?

Our goal is to understand why feature scaling is crucial in machine learning and to learn how to do it using Python and a library called scikit-learn.

What is Feature Scaling?

Feature scaling ensures that all your data features contribute equally when building a machine learning model. Without scaling, features with large values can dominate, leading to biased outcomes. For example, if you were predicting house prices and one feature was measured in the thousands (like square footage) while another was in single digits (like the number of rooms), the model might overlook the smaller values simply because they appear less significant.

There are two common types:

  1. Standardization: Transforms data to have a mean (μ) of 0 and a standard deviation (σ) of 1.

    Formula: z = (x − μ) / σ, where x is the original feature value, μ is the mean of the feature, and σ is the standard deviation of the feature.

  2. Normalization: Rescales data to range between 0 and 1.

    Formula: x' = (x − min(x)) / (max(x) − min(x)), where x is the original feature value, min(x) is the minimum value of the feature, and max(x) is the maximum value of the feature.

Today, we'll focus on both standardization using StandardScaler and normalization using MinMaxScaler from scikit-learn.
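Before reaching for a library, the two formulas above can be checked by hand. The following sketch applies both to a small toy array with NumPy (the values are illustrative only):

```python
import numpy as np

# Toy feature values (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0])

# Standardization: z = (x - mean) / std, using the population standard deviation
z = (x - x.mean()) / x.std()

# Normalization: x' = (x - min(x)) / (max(x) - min(x))
x_norm = (x - x.min()) / (x.max() - x.min())

print("Standardized:", z)
print("Normalized:", x_norm)
```

Note that NumPy's `std` defaults to the population standard deviation (dividing by n), which is also what StandardScaler uses internally.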

Example of Feature Scaling with `StandardScaler`

Let's create a small sample dataset to see how feature scaling works.

Python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sample dataset
data = {'Feature1': [1, 2, 3, 4], 'Feature2': [10, 20, 30, 40]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Output:

Original DataFrame:
   Feature1  Feature2
0         1        10
1         2        20
2         3        30
3         4        40

Before scaling, Feature1 ranges from 1 to 4, and Feature2 ranges from 10 to 40. Let's scale this dataset using StandardScaler.

Applying Feature Scaling with `StandardScaler`

We’ll use the StandardScaler to perform the scaling. The fit_transform method will calculate the mean and standard deviation for scaling, and then apply the scaling to the data.

Python
# Feature scaling with StandardScaler
standard_scaler = StandardScaler()
scaled_data_standard = standard_scaler.fit_transform(df)

Continuing from where we left off, we need to convert this scaled data back to a DataFrame for better readability.

Python
# Convert the scaled data back to a DataFrame for better readability
scaled_df_standard = pd.DataFrame(scaled_data_standard, columns=df.columns)
print("Scaled DataFrame (StandardScaler):")
print(scaled_df_standard)

Output:

Scaled DataFrame (StandardScaler):
   Feature1  Feature2
0 -1.341641 -1.341641
1 -0.447214 -0.447214
2  0.447214  0.447214
3  1.341641  1.341641

Scaling Double-Check

Let's check if the data is scaled correctly. We will calculate mean and standard deviation for both features:

Python
print("Mean of each feature after scaling (should be close to 0):")
print(scaled_df_standard.mean())
print("Standard deviation of each feature after scaling (should be close to 1):")
# ddof=0 matches the population standard deviation used by StandardScaler;
# pandas' default (ddof=1) would report about 1.15 here instead of 1.
print(scaled_df_standard.std(ddof=0))

Here is the output:

Mean of each feature after scaling (should be close to 0):
Feature1    0.0
Feature2    0.0
dtype: float64

Standard deviation of each feature after scaling (should be close to 1):
Feature1    1.0
Feature2    1.0
dtype: float64

The mean of each feature in the scaled DataFrame is 0, and the standard deviation is 1. This makes it easier for the machine learning model to treat all features equally.
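One practical point worth knowing: fit_transform is shorthand for calling fit (which learns the mean and standard deviation) and then transform (which applies them). The fitted scaler can then transform data it has never seen, reusing the same statistics. A minimal sketch, where new_df is a hypothetical unseen row:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'Feature1': [1, 2, 3, 4], 'Feature2': [10, 20, 30, 40]})
new_df = pd.DataFrame({'Feature1': [5], 'Feature2': [50]})  # hypothetical unseen row

scaler = StandardScaler()
scaler.fit(df)                         # learn mean and std from the original data
scaled = scaler.transform(df)          # same result as fit_transform(df)
new_scaled = scaler.transform(new_df)  # reuse the learned statistics

print(new_scaled)
```

Fitting only once matters in practice: scaling training and test data with separately computed statistics would make their values incomparable.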

Example of Feature Scaling with `MinMaxScaler`

Let's also apply feature scaling using the MinMaxScaler to see how normalization works. The good news is that the workflow is exactly the same as with StandardScaler: you just swap in the new scaler, and everything else stays the same.

Python
# Feature scaling with MinMaxScaler
minmax_scaler = MinMaxScaler()
scaled_data_minmax = minmax_scaler.fit_transform(df)

Convert the normalized data back to a DataFrame for better readability and verify the range.

Python
# Convert the scaled data back to a DataFrame for better readability
scaled_df_minmax = pd.DataFrame(scaled_data_minmax, columns=df.columns)
print("Scaled DataFrame (MinMaxScaler):")
print(scaled_df_minmax)

Output:

Scaled DataFrame (MinMaxScaler):
   Feature1  Feature2
0  0.000000  0.000000
1  0.333333  0.333333
2  0.666667  0.666667
3  1.000000  1.000000

Scaling Double-Check

Let's validate the results:

Python
print("Minimum of each feature after scaling (should be 0):")
print(scaled_df_minmax.min())
print("Maximum of each feature after scaling (should be 1):")
print(scaled_df_minmax.max())

Output:

Minimum of each feature after scaling (should be 0):
Feature1    0.0
Feature2    0.0
dtype: float64

Maximum of each feature after scaling (should be 1):
Feature1    1.0
Feature2    1.0
dtype: float64

The minimum of each feature in the scaled DataFrame is 0, and the maximum is 1, ensuring that all data points fall within this range.
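One caveat worth flagging: MinMaxScaler guarantees the 0-to-1 range only for the data it was fitted on. By default it does not clip, so transforming later values that fall outside the fitted minimum and maximum produces results below 0 or above 1. A small sketch using the same sample dataset and a hypothetical out-of-range row:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'Feature1': [1, 2, 3, 4], 'Feature2': [10, 20, 30, 40]})

scaler = MinMaxScaler()
scaler.fit(df)

# Hypothetical new row with values outside the fitted [min, max] ranges
new_df = pd.DataFrame({'Feature1': [6], 'Feature2': [5]})
new_scaled = scaler.transform(new_df)
print(new_scaled)  # Feature1 maps above 1, Feature2 below 0
```

If you do need transformed values forced into the 0-to-1 range, recent versions of scikit-learn accept MinMaxScaler(clip=True).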

Lesson Summary

Great job! You learned what feature scaling is and why it is essential in machine learning. By scaling your features, you ensure that all data points contribute equally to the model. You also got hands-on with Python, StandardScaler, and MinMaxScaler from scikit-learn to both standardize and normalize a sample dataset.

Now it's time to move on to some practice exercises. You'll get the chance to apply what you learned and become even more confident in your ability to scale features effectively. Let's get started!

Enjoy this lesson? Now it's time to practice with Cosmo!
Practice is how you turn knowledge into actual skills.