Lesson 5

Data Preprocessing: Mastering Normalization and Standardization Techniques

Lesson Introduction

Welcome to our enlightening session on Normalization and Standardization of Passenger Data. These two techniques play a crucial role in preparing your data for machine learning algorithms. During this lesson, our focus will particularly be on the historical Titanic dataset, where we will practice cleaning, normalizing, and standardizing certain features, such as passenger ages and fares. By the end of this lesson, you should have a solid understanding of normalization and standardization and be able to apply these techniques in any data preprocessing assignment using Python and Pandas.

Understanding Normalization

Normalization is a critical preprocessing step, which primarily involves scaling the numerical data in the dataset to a fixed range, usually from 0 to 1. It reduces skewness and bias in the data by bringing all the values to a similar range. Therefore, normalization plays a significant role in algorithms that use a distance measure.

To better illustrate how normalization works, let's apply it to the 'age' column of our Titanic dataset. Normalization will transform the age values so that they fall within a range from 0 to 1:

Python
1# Import necessary libraries 2import seaborn as sns 3import pandas as pd 4 5# Load the Titanic Dataset 6titanic_df = sns.load_dataset('titanic') 7 8# Normalize 'age' 9titanic_df['age'] = (titanic_df['age'] - titanic_df['age'].min()) / (titanic_df['age'].max() - titanic_df['age'].min()) 10 11# Display the normalized ages 12print(titanic_df['age'])

Output:

Markdown
10 0.271174 21 0.472229 32 0.321438 43 0.434531 54 0.434531 6 ... 7886 0.334004 8887 0.233476 9888 NaN 10889 0.321438 11890 0.396833 12Name: age, Length: 891, dtype: float64

In this code snippet, we first subtract the minimum age from each age value, then divide by the range of ages. The ages are scaled to the range [0, 1]. Normalized columns are easier for some machine-learning models to process.

Understanding Standardization

Unlike normalization, standardization does not scale the data to a limited range. Instead, standardization subtracts the mean value of the feature and then divides it by the feature’s standard deviation, transforming the feature values to have a mean of 0 and a standard deviation of 1. This method is often used when you want to compare data that was measured on different scales.

Let's apply standardization to the 'fare' column of the Titanic dataset. This column represents how much each passenger paid for their ticket:

Python
1# Standardize 'fare' 2titanic_df['fare'] = (titanic_df['fare'] - titanic_df['fare'].mean()) / titanic_df['fare'].std() 3 4# Display the standardized fares 5print(titanic_df['fare'])

Output:

Markdown
10 -0.502163 21 0.786404 32 -0.488580 43 0.420494 54 -0.486064 6 ... 7886 -0.386454 8887 -0.044356 9888 -0.176164 10889 -0.044356 11890 -0.492101 12Name: fare, Length: 891, dtype: float64

Now, the 'fare' column is re-scaled so the fares have an average value of 0 and a standard deviation of 1. Notice that the values are not within the [0, 1] range like normalized data.

Implementing Normalization with Pandas

Armed with an understanding of normalization, let's dig a little deeper with the Pandas library. We'll use MinMaxScaler() from the sklearn.preprocessing module, a handy technique for normalizing data in pandas:

Python
1from sklearn.preprocessing import MinMaxScaler 2 3# Select 'age' column and drop NaN values 4age = titanic_df[['age']].dropna() 5 6# Create a MinMaxScaler object 7scaler = MinMaxScaler() 8 9# Use the scaler 10titanic_df['norm_age'] = pd.DataFrame(scaler.fit_transform(age), columns=age.columns, index=age.index) 11 12# Display normalized age values 13print(titanic_df['norm_age'])

Output:

Markdown
10 0.271174 21 0.472229 32 0.321438 43 0.434531 54 0.434531 6 ... 7886 0.334004 8887 0.233476 9888 NaN 10889 0.321438 11890 0.396833 12Name: norm_age, Length: 891, dtype: float64

The MinMaxScaler scales and translates each feature individually so that it falls in the given range on the training set, in our case, between 0 and 1.

Implementing Standardization with Pandas

To standardize our data with pandas, we'll make use of the StandardScaler() function from the sklearn.preprocessing module that standardizes features by deducting the mean and scaling to unit variance:

Python
1from sklearn.preprocessing import StandardScaler 2 3# Select 'fare' column and drop NaN values 4fare = titanic_df[['fare']].dropna() 5 6# Create a StandardScaler object 7scaler = StandardScaler() 8 9# Use the scaler 10titanic_df['stand_fare'] = pd.DataFrame(scaler.fit_transform(fare), columns=fare.columns, index=fare.index) 11 12# Display standardized fare values 13print(titanic_df['stand_fare'])

Output:

Markdown
10 -0.502445 21 0.786845 32 -0.488854 43 0.420730 54 -0.486337 6 ... 7886 -0.386671 8887 -0.044381 9888 -0.176263 10889 -0.044381 11890 -0.492378 12Name: stand_fare, Length: 891, dtype: float64

StandardScaler standardizes a feature by deducting the mean and scaling to unit variance. This operation is performed feature-wise in an independent way. Notice how our standardized fares now have a mean of 0 and a standard deviation of 1.

Comparing Normalization and Standardization

Choose normalization when your data needs to be bounded within a specific range (0 to 1, for example) and is not heavily influenced by outliers. This is particularly useful for algorithms that are sensitive to the scale of the data, such as neural networks and k-nearest neighbors. On the other hand, standardization is more effective when your data has a Gaussian distribution, and you are dealing with algorithms that assume this, such as linear regression, logistic regression, and linear discriminant analysis.

Now that you've got to experience both normalization and standardization, it's safe to say each technique is practical and useful but under different circumstances. Their primary purpose is to handle the varying ranges of data. However, depending on the algorithm deployed and the desired output distribution, normalization or standardization is selected. Remember that not all algorithms benefit from normalization or standardization.

Lesson Summary and Practice

Give yourself a pat on the back as you've made it through the session on data preprocessing techniques! We explored the concepts of normalization and standardization, their practical applications, and how to implement these techniques using Python and Pandas. It's key to remember that these techniques are vital tools in enhancing the performance of your machine-learning models.

Next up, we have some hands-on practice sessions to get your hands dirty with real-world datasets. Remember, the best way to absorb knowledge is by applying it practically. Looking forward to seeing you in the next session! Happy learning!

Enjoy this lesson? Now it's time to practice with Cosmo!

Practice is how you turn knowledge into actual skills.