Data Scaling Techniques

Comprehensive Data Wrangling and Analysis with Pandas and Numpy: Lesson 2

Lesson Introduction

Hello! Today, we are diving into the world of data scaling techniques. Imagine you are playing a game where you need to fit different shapes into matching holes. If your shapes vary greatly in size, it can be challenging. Similarly, in data analysis and machine learning, features (or columns) in your dataset may have vastly different scales. This can affect the performance of your analysis or model, because features with large values can dominate calculations such as distances between data points.

Our goal for this lesson is to understand two key data scaling techniques: Standard Scaling and Min-Max Scaling. By the end of this lesson, you'll be able to apply these techniques to scale features in a dataset, making them easier to work with.

Understanding Standard Scaling

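Standard Scaling is like leveling the playing field for your data. It transforms your data so it has a mean (average) of 0 and a standard deviation (how spread out the numbers are) of 1. This is especially useful for methods that expect features to be centered and on comparable scales. Note that it only shifts and rescales the values; it does not change the shape of their distribution, so skewed data stays skewed.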

The formula for standard scaling is:

$z = \frac{(X - \mu)}{\sigma}$

Where:

$X$ is the original value.
$\mu$ is the mean of the values.
$\sigma$ is the standard deviation of the values.

In simpler terms, you subtract the average value from each data point and then divide by how much your data varies from the average.
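To make the formula concrete, here is a minimal sketch on a small made-up series (the values are illustrative and not taken from the Titanic dataset). It assumes pandas' default `std()`, which computes the sample standard deviation.

```python
import pandas as pd

# Illustrative values only; not taken from any real dataset
values = pd.Series([2.0, 4.0, 6.0, 8.0])

mu = values.mean()      # 5.0
sigma = values.std()    # sample standard deviation (ddof=1), about 2.582

# Apply z = (X - mu) / sigma to every element at once
z = (values - mu) / sigma

print(z.round(4).tolist())  # [-1.1619, -0.3873, 0.3873, 1.1619]
```

After scaling, `z.mean()` is 0 and `z.std()` is 1, up to floating-point rounding.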

Applying Standard Scaling

Let's use the Titanic dataset to perform Standard Scaling on the age and fare columns.

Python
import pandas as pd
import seaborn as sns

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Calculate mean and standard deviation for 'age' and 'fare'
age_mean = titanic['age'].mean()
age_std = titanic['age'].std()
fare_mean = titanic['fare'].mean()
fare_std = titanic['fare'].std()

# Standard Scaling
titanic['age_standard'] = (titanic['age'] - age_mean) / age_std
titanic['fare_standard'] = (titanic['fare'] - fare_mean) / fare_std

print(titanic[['age', 'age_standard', 'fare', 'fare_standard']].head())

Output:


    age  age_standard     fare  fare_standard
0  22.0     -0.530005   7.2500      -0.502445
1  38.0      0.571499  71.2833       0.786845
2  26.0     -0.254046   7.9250      -0.488854
3  35.0      0.432593  53.1000       0.420731
4  35.0      0.432593   8.0500      -0.485866
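The same steps can be wrapped in a small helper so they are not repeated for every column. The sketch below is one possible way to do this, not the only one: `standard_scale` and its `ddof` parameter are hypothetical names introduced here, and the toy DataFrame merely stands in for the Titanic data. Two details are worth knowing. First, pandas' `std()` uses the sample standard deviation (`ddof=1`) by default, while `ddof=0` gives the population standard deviation. Second, missing values are skipped when pandas computes the mean and standard deviation, and they stay missing after scaling.

```python
import numpy as np
import pandas as pd


def standard_scale(series: pd.Series, ddof: int = 1) -> pd.Series:
    """Return (series - mean) / std; hypothetical helper for illustration."""
    return (series - series.mean()) / series.std(ddof=ddof)


# Toy data standing in for the Titanic columns; one age is missing
df = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

# Scale several columns in one pass
for col in ["age", "fare"]:
    df[f"{col}_standard"] = standard_scale(df[col])

print(df)
```

The row with the missing age keeps `NaN` in `age_standard`, while the remaining scaled values still have a mean of 0 and a sample standard deviation of 1.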

Understanding Min-Max Scaling

Min-Max Scaling adjusts the scale of your data to fit within a specific range, typically between 0 and 1. This is like resizing shapes to fit in a smaller box, making them easier to compare.

The formula for Min-Max Scaling is:

$X' = \frac{(X - X_{min})}{(X_{max} - X_{min})}$

Where:

$X$ is the original value.
$X_{min}$ is the minimum value in the feature.
$X_{max}$ is the maximum value in the feature.

In simpler terms, you subtract the smallest value from each data point and then divide by the range (difference between the largest and smallest values).
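As a quick worked example of the formula, the sketch below scales a small made-up series (illustrative values only). The smallest value maps to 0, the largest to 1, and everything else lands proportionally in between.

```python
import pandas as pd

# Illustrative values only
values = pd.Series([10.0, 20.0, 30.0, 50.0])

x_min = values.min()   # 10.0
x_max = values.max()   # 50.0

# Apply X' = (X - X_min) / (X_max - X_min)
scaled = (values - x_min) / (x_max - x_min)

print(scaled.tolist())  # [0.0, 0.25, 0.5, 1.0]
```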

Applying Min-Max Scaling

Let's apply Min-Max Scaling to the age and fare columns in the Titanic dataset.

Python
# Calculate min and max for 'age' and 'fare'
age_min = titanic['age'].min()
age_max = titanic['age'].max()
fare_min = titanic['fare'].min()
fare_max = titanic['fare'].max()

# Min-Max Scaling
titanic['age_minmax'] = (titanic['age'] - age_min) / (age_max - age_min)
titanic['fare_minmax'] = (titanic['fare'] - fare_min) / (fare_max - fare_min)

print(titanic[['age', 'age_minmax', 'fare', 'fare_minmax']].head())

Output:


    age  age_minmax     fare  fare_minmax
0  22.0    0.271174   7.2500     0.014151
1  38.0    0.472229  71.2833     0.139136
2  26.0    0.321438   7.9250     0.015469
3  35.0    0.434531  53.1000     0.103644
4  35.0    0.434531   8.0500     0.015713
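The small `fare_minmax` values above suggest that the largest fare is far above the typical ones: because Min-Max Scaling depends only on the minimum and maximum, a single extreme value squeezes everything else toward 0. The sketch below illustrates this with made-up fares and also shows how the same idea extends to a range other than 0 to 1. `min_max_scale`, `low`, and `high` are hypothetical names introduced for illustration, and raising an error for a constant column is an assumption about how you might want to handle a zero range.

```python
import pandas as pd


def min_max_scale(series: pd.Series, low: float = 0.0, high: float = 1.0) -> pd.Series:
    """Rescale a series linearly to [low, high]; hypothetical helper."""
    x_min, x_max = series.min(), series.max()
    if x_max == x_min:
        # Assumption: a constant column has no range to scale, so fail loudly
        raise ValueError("cannot min-max scale a constant column")
    unit = (series - x_min) / (x_max - x_min)   # values in [0, 1]
    return low + unit * (high - low)            # stretch and shift to [low, high]


# Toy fares with one large outlier (illustrative values only)
fares = pd.Series([5.0, 10.0, 15.0, 505.0])

print(min_max_scale(fares).tolist())             # [0.0, 0.01, 0.02, 1.0]
print(min_max_scale(fares, -1.0, 1.0).tolist())  # same shape, rescaled to [-1, 1]
```

Here the three ordinary fares end up between 0 and 0.02, while the outlier alone occupies the top of the range.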

Lesson Summary

Great job! Today, you learned about the importance of data scaling and explored two common techniques: Standard Scaling and Min-Max Scaling. These techniques bring features to a common scale, making them easier to analyze and to work with in machine learning models.

Now it's time for some hands-on practice. You'll apply Standard Scaling and Min-Max Scaling to different columns in a dataset using the CodeSignal IDE. This will solidify your understanding and give you practical experience in scaling data. Enjoy scaling your data!
