Hello! Today, we are diving into the world of data scaling techniques. Imagine you are playing a game where you need to fit different shapes into matching holes. If your shapes vary greatly in size, it can be challenging. Similarly, in data analysis and machine learning, features (or columns) in your dataset may have vastly different scales. This can hurt the performance of your analysis or model, because features with large values can overshadow features with small ones.
Our goal for this lesson is to understand two key data scaling techniques: Standard Scaling and Min-Max Scaling. By the end of this lesson, you'll be able to apply these techniques to scale features in a dataset, making them easier to work with.
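Standard Scaling is like leveling the playing field for your data. It transforms your data so it has a mean (average) of 0 and a standard deviation (how spread out the numbers are) of 1. This is especially useful when features are measured in different units and you want them on a comparable scale. Note that it only shifts and rescales the values; it does not change the shape of their distribution, so data that is not normally distributed will not become normal.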
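The formula for standard scaling is:

$$x_{\text{scaled}} = \frac{x - \mu}{\sigma}$$

Where:
- $x$ is the original value,
- $\mu$ is the mean of the feature,
- $\sigma$ is the standard deviation of the feature.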
In simpler terms, you subtract the average value from each data point and then divide by how much your data varies from the average.
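To make the subtract-then-divide step concrete, here is a minimal sketch on a handful of made-up numbers. The values are hypothetical, and the example assumes the sample standard deviation (dividing by n - 1), which is also what pandas' `std()` uses by default.

```python
import statistics

# Hypothetical sample values, chosen only for illustration
values = [2.0, 4.0, 6.0, 8.0]

mean = statistics.mean(values)   # average of the values
std = statistics.stdev(values)   # sample standard deviation (divides by n - 1)

# Subtract the mean from each value, then divide by the standard deviation
scaled = [(v - mean) / std for v in values]

print([round(s, 4) for s in scaled])
# → [-1.1619, -0.3873, 0.3873, 1.1619]
```

The scaled values are centered on 0, and values that were equally far from the average end up equally far from 0.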
Let's use the Titanic dataset to perform Standard Scaling on the `age` and `fare` columns.
```python
import pandas as pd
import seaborn as sns

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Calculate mean and standard deviation for 'age' and 'fare'
age_mean = titanic['age'].mean()
age_std = titanic['age'].std()
fare_mean = titanic['fare'].mean()
fare_std = titanic['fare'].std()

# Standard Scaling
titanic['age_standard'] = (titanic['age'] - age_mean) / age_std
titanic['fare_standard'] = (titanic['fare'] - fare_mean) / fare_std

print(titanic[['age', 'age_standard', 'fare', 'fare_standard']].head())
```
Output:
```
    age  age_standard     fare  fare_standard
0  22.0     -0.530005   7.2500      -0.502445
1  38.0      0.571499  71.2833       0.786845
2  26.0     -0.254046   7.9250      -0.488854
3  35.0      0.432593  53.1000       0.420731
4  35.0      0.432593   8.0500      -0.485866
```
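A quick way to check a standard-scaled column is to confirm that its mean is close to 0 and its standard deviation is close to 1. The sketch below does this on a small, hypothetical DataFrame rather than the Titanic data, and includes one missing value to show how pandas treats it: `mean()` and `std()` skip missing values by default, and a missing input stays missing after scaling.

```python
import pandas as pd

# Hypothetical toy data; NaN stands in for a missing age
df = pd.DataFrame({"age": [22.0, 38.0, float("nan"), 35.0]})

# pandas skips missing values when computing mean() and std()
age_mean = df["age"].mean()
age_std = df["age"].std()  # sample standard deviation (ddof=1 by default)

df["age_standard"] = (df["age"] - age_mean) / age_std

# Sanity checks on the scaled column
scaled_mean = df["age_standard"].mean()
scaled_std = df["age_standard"].std()
missing_after = int(df["age_standard"].isna().sum())

print(scaled_mean, scaled_std, missing_after)
```

Because of floating-point rounding, the mean may print as a tiny number such as `1e-17` rather than exactly 0.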
Min-Max Scaling adjusts the scale of your data to fit within a specific range, typically between 0 and 1. This is like resizing shapes to fit in a smaller box, making them easier to compare.
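The formula for Min-Max Scaling is:

$$x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Where:
- $x$ is the original value,
- $x_{\min}$ is the smallest value of the feature,
- $x_{\max}$ is the largest value of the feature.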
In simpler terms, you subtract the smallest value from each data point and then divide by the range (difference between the largest and smallest values).
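The same idea can be sketched as a small helper function. The name `min_max_scale` and its `new_min`/`new_max` parameters are hypothetical, introduced here only to show how the 0 to 1 result can be stretched to another target range. The sketch also raises an error when every value is identical, since the range would be zero and the formula would divide by zero.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # The range is zero, so the formula would divide by zero
        raise ValueError("cannot min-max scale a constant column")
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

# Hypothetical sample values, chosen only for illustration
values = [10.0, 20.0, 40.0]

unit_range = min_max_scale(values)            # default range: 0 to 1
symmetric = min_max_scale(values, -1.0, 1.0)  # custom range: -1 to 1

print([round(v, 4) for v in unit_range])  # → [0.0, 0.3333, 1.0]
print([round(v, 4) for v in symmetric])   # → [-1.0, -0.3333, 1.0]
```

The smallest value always maps to the bottom of the range and the largest to the top; everything else keeps its relative position in between.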
Let's apply Min-Max Scaling to the `age` and `fare` columns in the Titanic dataset.
```python
# Calculate min and max for 'age' and 'fare'
age_min = titanic['age'].min()
age_max = titanic['age'].max()
fare_min = titanic['fare'].min()
fare_max = titanic['fare'].max()

# Min-Max Scaling
titanic['age_minmax'] = (titanic['age'] - age_min) / (age_max - age_min)
titanic['fare_minmax'] = (titanic['fare'] - fare_min) / (fare_max - fare_min)

print(titanic[['age', 'age_minmax', 'fare', 'fare_minmax']].head())
```
Output:
```
    age  age_minmax     fare  fare_minmax
0  22.0    0.271174   7.2500     0.014151
1  38.0    0.472229  71.2833     0.139136
2  26.0    0.321438   7.9250     0.015469
3  35.0    0.434531  53.1000     0.103644
4  35.0    0.434531   8.0500     0.015713
```
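When several columns need the same treatment, the calculation can be written once and looped over the column names. The sketch below uses a small, hypothetical DataFrame in place of the Titanic data, and then checks that each scaled column spans exactly 0 to 1.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'age' and 'fare' columns
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.2833, 7.925, 53.1],
})

# Apply the same Min-Max calculation to each column in turn
for col in ["age", "fare"]:
    col_min = df[col].min()
    col_max = df[col].max()
    df[f"{col}_minmax"] = (df[col] - col_min) / (col_max - col_min)

# Each scaled column should now have a minimum of 0 and a maximum of 1
print(df[["age_minmax", "fare_minmax"]].agg(["min", "max"]))
```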
Great job! Today, you learned about the importance of data scaling and explored two common techniques: Standard Scaling and Min-Max Scaling. These techniques help bring features to a common scale, making them easier to analyze and to work with in machine learning models.
Now it's time for some hands-on practice. You'll apply Standard Scaling and Min-Max Scaling to different columns in a dataset using the CodeSignal IDE. This will solidify your understanding and give you practical experience in scaling data. Enjoy scaling your data!