Welcome to our enlightening session on Normalization and Standardization of Passenger Data. These two techniques play a crucial role in preparing your data for machine learning algorithms. During this lesson, our focus will particularly be on the historical Titanic dataset, where we will practice cleaning, normalizing, and standardizing certain features, such as passenger ages and fares. By the end of this lesson, you should have a solid understanding of normalization and standardization and be able to apply these techniques in any data preprocessing assignment using Python and Pandas.
Normalization is a critical preprocessing step that scales the numerical features in a dataset to a fixed range, usually from 0 to 1. By bringing every value onto a comparable scale, it prevents features with large ranges from dominating features with small ones, which is why normalization plays a significant role in algorithms that rely on a distance measure.
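As a quick illustration of the arithmetic, here is a minimal sketch using a few made-up values (they are not from the Titanic dataset):

```python
# Hypothetical values used only to illustrate the min-max formula
values = [10, 20, 35, 50]

# Min-max normalization: (x - min) / (max - min)
lo, hi = min(values), max(values)
normalized = [(x - lo) / (hi - lo) for x in values]

print(normalized)  # [0.0, 0.25, 0.625, 1.0]
```

The smallest value always maps to 0 and the largest to 1, with everything else landing proportionally in between.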
To better illustrate how normalization works, let's apply it to the 'age' column of our Titanic dataset. Normalization will transform the age values so that they fall within a range from 0 to 1:
```python
# Import necessary libraries
import seaborn as sns
import pandas as pd

# Load the Titanic Dataset
titanic_df = sns.load_dataset('titanic')

# Normalize 'age'
titanic_df['age'] = (titanic_df['age'] - titanic_df['age'].min()) / (titanic_df['age'].max() - titanic_df['age'].min())

# Display the normalized ages
print(titanic_df['age'])
```
Output:
```
0      0.271174
1      0.472229
2      0.321438
3      0.434531
4      0.434531
         ...
886    0.334004
887    0.233476
888         NaN
889    0.321438
890    0.396833
Name: age, Length: 891, dtype: float64
```
In this code snippet, we first subtract the minimum age from each age value, then divide by the range of ages. The ages are scaled to the range [0, 1]. Normalized columns are easier for some machine-learning models to process.
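A quick sanity check you can run yourself is to print the minimum and maximum of the column; after min-max normalization they should be exactly 0 and 1 (pandas skips the NaN entries by default):

```python
# Sanity check: the normalized ages should span [0, 1]
print(titanic_df['age'].min(), titanic_df['age'].max())  # expected: 0.0 1.0
```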
Unlike normalization, standardization does not scale the data to a bounded range. Instead, standardization subtracts the mean of the feature and then divides by the feature's standard deviation, transforming the feature values to have a mean of 0 and a standard deviation of 1. This method is often used when you want to compare data that was measured on different scales.
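To get a feel for that arithmetic, here is a minimal sketch with made-up numbers (not taken from the Titanic dataset):

```python
# Hypothetical values used only to illustrate the z-score formula
values = [8.0, 12.0, 20.0]

# Standardization: (x - mean) / standard deviation
mean = sum(values) / len(values)
std = (sum((x - mean) ** 2 for x in values) / len(values)) ** 0.5  # population std
standardized = [(x - mean) / std for x in values]

print([round(z, 2) for z in standardized])  # [-1.07, -0.27, 1.34]
```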
Let's apply standardization to the 'fare' column of the Titanic dataset. This column represents how much each passenger paid for their ticket:
```python
# Standardize 'fare'
titanic_df['fare'] = (titanic_df['fare'] - titanic_df['fare'].mean()) / titanic_df['fare'].std()

# Display the standardized fares
print(titanic_df['fare'])
```
Output:
```
0     -0.502163
1      0.786404
2     -0.488580
3      0.420494
4     -0.486064
         ...
886   -0.386454
887   -0.044356
888   -0.176164
889   -0.044356
890   -0.492101
Name: fare, Length: 891, dtype: float64
```
Now, the 'fare' column is re-scaled so the fares have an average value of 0 and a standard deviation of 1. Notice that the values are not within the [0, 1] range like normalized data.
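You can verify the new center and spread directly; because the code above divides by .std(), which is the sample standard deviation, that is the statistic that comes out as 1:

```python
# Verify: the standardized fares have mean ~0 and sample standard deviation ~1
print(round(titanic_df['fare'].mean(), 6))  # ~0.0
print(round(titanic_df['fare'].std(), 6))   # ~1.0
```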
Armed with an understanding of normalization, let's dig a little deeper with scikit-learn. We'll use MinMaxScaler() from the sklearn.preprocessing module, a handy tool for normalizing data held in a pandas DataFrame:
```python
from sklearn.preprocessing import MinMaxScaler

# Select 'age' column and drop NaN values
age = titanic_df[['age']].dropna()

# Create a MinMaxScaler object
scaler = MinMaxScaler()

# Use the scaler
titanic_df['norm_age'] = pd.DataFrame(scaler.fit_transform(age), columns=age.columns, index=age.index)

# Display normalized age values
print(titanic_df['norm_age'])
```
Output:
```
0      0.271174
1      0.472229
2      0.321438
3      0.434531
4      0.434531
         ...
886    0.334004
887    0.233476
888         NaN
889    0.321438
890    0.396833
Name: norm_age, Length: 891, dtype: float64
```
The MinMaxScaler scales and translates each feature individually so that it falls within the given range on the training set, in our case between 0 and 1.
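Because the scaler learns the minimum and maximum when it is fitted, you can reuse the fitted object to scale new data with transform() rather than refitting it. Here is a minimal sketch, assuming a small hypothetical DataFrame of new passengers (the ages are made up):

```python
# Hypothetical new passengers, scaled with the min/max learned from the Titanic ages
new_passengers = pd.DataFrame({'age': [5.0, 30.0, 80.0]})
print(scaler.transform(new_passengers[['age']]))
```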
To standardize our data with scikit-learn, we'll make use of the StandardScaler() class from the sklearn.preprocessing module, which standardizes features by subtracting the mean and scaling to unit variance:
```python
from sklearn.preprocessing import StandardScaler

# Select 'fare' column and drop NaN values
fare = titanic_df[['fare']].dropna()

# Create a StandardScaler object
scaler = StandardScaler()

# Use the scaler
titanic_df['stand_fare'] = pd.DataFrame(scaler.fit_transform(fare), columns=fare.columns, index=fare.index)

# Display standardized fare values
print(titanic_df['stand_fare'])
```
Output:
```
0     -0.502445
1      0.786845
2     -0.488854
3      0.420730
4     -0.486337
         ...
886   -0.386671
887   -0.044381
888   -0.176263
889   -0.044381
890   -0.492378
Name: stand_fare, Length: 891, dtype: float64
```
StandardScaler standardizes a feature by subtracting the mean and scaling to unit variance, and it does so for each feature independently. Notice how our standardized fares now have a mean of 0 and a standard deviation of 1.
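One small technical point explains why these numbers differ slightly from the manually standardized fares earlier in the lesson: pandas' .std() divides by n - 1 (the sample standard deviation), whereas StandardScaler divides by n (the population standard deviation). As a sketch, you can reproduce the StandardScaler values by forcing ddof=0:

```python
# Population standard deviation (ddof=0) matches what StandardScaler uses internally
manual = (titanic_df['fare'] - titanic_df['fare'].mean()) / titanic_df['fare'].std(ddof=0)
print(manual.head())  # matches the 'stand_fare' values above
```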
Choose normalization when your data needs to be bounded within a specific range (0 to 1, for example) and is not heavily influenced by outliers. This is particularly useful for algorithms that are sensitive to the scale of the data, such as neural networks and k-nearest neighbors. On the other hand, standardization is more effective when your data has a Gaussian distribution, and you are dealing with algorithms that assume this, such as linear regression, logistic regression, and linear discriminant analysis.
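To see the two scalings side by side, one quick check is to compare summary statistics of the columns created above: the normalized ages stay between 0 and 1, while the standardized fares are centered on 0 with unit standard deviation.

```python
# Compare the two scalings: bounded [0, 1] range vs. mean 0 / standard deviation 1
print(titanic_df[['norm_age', 'stand_fare']].describe())
```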
Now that you've experienced both normalization and standardization, it's safe to say each technique is practical and useful, but under different circumstances. Their primary purpose is to handle the varying ranges of data; which one you select depends on the algorithm deployed and the desired output distribution. Remember that not all algorithms benefit from normalization or standardization.
Give yourself a pat on the back as you've made it through the session on data preprocessing techniques! We explored the concepts of normalization and standardization, their practical applications, and how to implement these techniques using Python and Pandas. It's key to remember that these techniques are vital tools in enhancing the performance of your machine-learning models.
Next up, we have some hands-on practice sessions to get your hands dirty with real-world datasets. Remember, the best way to absorb knowledge is by applying it practically. Looking forward to seeing you in the next session! Happy learning!