Ready for another deep dive? Today, we'll explore Data Transformation and Scaling Techniques, an essential part of the data cleaning and preprocessing workflow for machine learning. We will learn how to transform numerical data to different ranges using various scaling techniques, such as Standard Scaling, Min-Max Scaling, and Robust Scaling.
Data scaling is crucial because machine learning algorithms perform more effectively when numerical features are on a similar scale. Without scaling, features with larger ranges can dominate the others, reducing the model's accuracy.
For example, imagine having two features, age and income, in your Titanic dataset. Age varies between 0 and 100, while income may range from 0 to many thousands. A machine learning model could be biased towards income because of its larger magnitude, leading to poor model performance.
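To make that concrete, here is a minimal sketch with invented numbers (the real Titanic data has no income column; the ages and incomes below are made up for illustration) showing how a distance-based model would compare two passengers before scaling: the income gap swamps the age gap almost entirely.

```python
import numpy as np

# Two hypothetical passengers described by [age, income] (made-up values for illustration)
passenger_a = np.array([22, 50_000])
passenger_b = np.array([60, 52_000])

# Without scaling, the Euclidean distance is driven almost entirely by income:
# a 38-year age gap contributes next to nothing beside a 2,000-unit income gap.
diff = passenger_a - passenger_b
print(np.abs(diff))          # [  38 2000]
print(np.linalg.norm(diff))  # ~2000.36
```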
Ready to dive in? Let's go!
Before we move into the hands-on part, let's briefly discuss three popular techniques to standardize numerical data.
Standard Scaler: Scales data to have zero mean and unit variance. It's best used when the data is approximately normally distributed; in other words, when the values of a feature follow a bell curve, the Standard Scaler is a good option for standardizing that feature.
Min-Max Scaler: Also known as normalization, this technique scales data to a fixed range, typically 0 to 1 (or another range, such as -1 to 1, if you specify one). It's commonly used with algorithms that don't assume any particular distribution of the data, so if your data doesn't follow a specific shape or form, the Min-Max Scaler is worth considering.
Robust Scaler: As its name suggests, this scaler is robust to outliers. It uses the median and the Interquartile Range (IQR) to scale data, which makes it suitable when the dataset contains outliers. Outliers are data points that deviate significantly from the other observations and can distort the results of an analysis.
There's no "one size fits all" scaler. You'll need to choose the appropriate scaler based on your data's characteristics and your machine-learning algorithm's requirements.
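To see those differences side by side, here is a small sketch using a made-up list of values with one deliberate outlier (200) that runs all three scalers on the same column. Notice how the outlier squeezes the Min-Max output of the ordinary values toward 0, while the Robust Scaler keeps them spread out.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# A tiny made-up feature with one obvious outlier (200)
values = np.array([[10], [12], [14], [16], [200]], dtype=float)

# Apply each scaler to the same data and compare the results
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(values).ravel()
    print(f"{scaler.__class__.__name__:>14}: {np.round(scaled, 2)}")

# StandardScaler: the outlier inflates the standard deviation, compressing the rest
# MinMaxScaler:   the four ordinary values end up squeezed into roughly [0, 0.03]
# RobustScaler:   the ordinary values stay spread out (-1 to 0.5); only the outlier is extreme
```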
We'll start with the Standard Scaler. It scales data based on its mean (μ) and standard deviation (σ), using the formula for the z-score: z = (x - μ) / σ.
Let's try it on the age column of the Titanic dataset:
```python
import numpy as np
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# Load the dataset and drop rows with missing values
titanic_df = sns.load_dataset('titanic').dropna()

# Initialize the StandardScaler
std_scaler = StandardScaler()

# Fit and transform the 'age' column
titanic_df['age'] = std_scaler.fit_transform(np.array(titanic_df['age']).reshape(-1, 1))

# Check the transformed 'age' column
print(titanic_df['age'].head())
"""
1     0.152082
3    -0.039875
6     1.175852
10   -2.023430
11    1.431795
Name: age, dtype: float64
"""
```
Note how the transformed age values are not easily interpretable: they have been converted into their respective z-scores. The important thing to understand is that the data is now standardized and can be readily included in a machine learning model.
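As a quick sanity check (assuming titanic_df and std_scaler from the snippet above are still in scope), you could confirm that the scaled column really does have zero mean and unit variance, and that the scaler can map the z-scores back to the original ages:

```python
# Mean should be ~0 and (population) standard deviation ~1 after standard scaling
print(round(titanic_df['age'].mean(), 6))        # ~0.0
print(round(titanic_df['age'].std(ddof=0), 6))   # ~1.0 (ddof=0 matches StandardScaler's formula)

# inverse_transform recovers the original ages (e.g. 38, 35, 54 for the first rows above)
original_ages = std_scaler.inverse_transform(titanic_df[['age']].values)
print(original_ages[:3].ravel())
```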
Next, we'll explore Min-Max Scaling, which scales your data to a specified range. The formula used here is: x_scaled = (x - x_min) / (x_max - x_min). This essentially resizes your data to fit within the range of 0 to 1.
Let's apply the Min-Max Scaler to the fare column:
```python
from sklearn.preprocessing import MinMaxScaler

# Initialize the MinMaxScaler
min_max_scaler = MinMaxScaler()

# Fit and transform the 'fare' column
titanic_df['fare'] = min_max_scaler.fit_transform(np.array(titanic_df['fare']).reshape(-1, 1))

# Check the transformed 'fare' column
print(titanic_df['fare'].head())
"""
1     0.139136
3     0.103644
6     0.101229
10    0.032596
11    0.051822
Name: fare, dtype: float64
"""
```
All fare values are now within the range of 0 to 1, with the smallest fare being 0 and the largest being 1. Intermediate fare values are spread out proportionally between 0 and 1.
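A quick check you could run here (assuming titanic_df from the snippet above is still in scope) confirms the new extremes, and shows that the same class can target a different range, such as -1 to 1, through its feature_range parameter:

```python
from sklearn.preprocessing import MinMaxScaler

# The scaled column's extremes are now exactly 0 and 1
print(titanic_df['fare'].min(), titanic_df['fare'].max())  # 0.0 1.0

# MinMaxScaler can also target other ranges, e.g. -1 to 1
alt_scaler = MinMaxScaler(feature_range=(-1, 1))
rescaled = alt_scaler.fit_transform(titanic_df[['fare']].values)
print(rescaled.min(), rescaled.max())  # -1.0 1.0
```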
Last but not least, we have Robust Scaling, which is useful when dealing with outliers because it scales data using the median and the Interquartile Range (IQR): x_scaled = (x - median) / IQR. Since extreme values have little effect on the median and the IQR, the resulting scale is robust against outliers.
Let's apply it to the fare column:
```python
from sklearn.preprocessing import RobustScaler

# Initialize the RobustScaler
robust_scaler = RobustScaler()

# Fit and transform the 'fare' column
titanic_df['fare'] = robust_scaler.fit_transform(np.array(titanic_df['fare']).reshape(-1, 1))

# Check the transformed 'fare' column
print(titanic_df['fare'].head())
"""
1     0.236871
3    -0.064677
6    -0.085199
10   -0.668325
11   -0.504975
Name: fare, dtype: float64
"""
```
The fare values now reflect how many IQRs they are away from the median. This scaling method is resilient to outliers: because the median and IQR are insensitive to extreme values, outliers no longer distort the scale of the rest of the data.
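To verify the scaling (again assuming titanic_df from the snippet above is still in scope), you can check that the robust-scaled column now has its median at 0 and an interquartile range of 1:

```python
import numpy as np

# After robust scaling, the median sits at 0 and the IQR equals 1
q1, median, q3 = np.percentile(titanic_df['fare'], [25, 50, 75])
print(round(median, 6))   # 0.0
print(round(q3 - q1, 6))  # 1.0
```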
You should now understand why data scaling is essential in machine learning and how to implement three common data scaling techniques in Python: Standard Scaling, Min-Max Scaling, and Robust Scaling.
Remember, the choice of scaling technique depends on the nature of your data and the specific requirements of your machine learning algorithm. Each scaler has its strengths: the Standard Scaler works best with normally distributed data, the Min-Max Scaler works with data of any shape, and the Robust Scaler handles outliers gracefully.
Great work on assimilating the essentials of data transformation and scaling! Let's move to the next part—practice! The exercises are designed to deepen your understanding of data scaling techniques. You'll code, implement the learning, and apply these techniques to various data distributions. So roll up your sleeves and get ready for some coding action! You can expect to gain much deeper insights and develop your data scaling expertise during the practice session, so don't miss it!