Welcome to our enlightening session on Normalization and Standardization of Passenger Data. These two techniques play a crucial role in preparing your data for machine learning algorithms. During this lesson, our focus will particularly be on the historical Titanic dataset, where we will practice cleaning, normalizing, and standardizing certain features, such as passenger ages and fares. By the end of this lesson, you should have a solid understanding of normalization and standardization and be able to apply these techniques in any data preprocessing assignment using Python and Pandas.
Normalization is a critical preprocessing step that scales the numerical features in a dataset to a fixed range, usually from 0 to 1. By bringing every value onto a comparable scale, it prevents features with large ranges from dominating features with small ones, which is why normalization plays a significant role in algorithms that rely on a distance measure.
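As a quick illustration of the arithmetic, here is a minimal sketch using a few made-up values (they are not from the Titanic dataset):

```python
# Hypothetical values used only to illustrate the min-max formula
values = [10, 20, 35, 50]

# Min-max normalization: (x - min) / (max - min)
lo, hi = min(values), max(values)
normalized = [(x - lo) / (hi - lo) for x in values]

print(normalized)  # [0.0, 0.25, 0.625, 1.0]
```

The smallest value always maps to 0 and the largest to 1, with everything else landing proportionally in between.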
To better illustrate how normalization works, let's apply it to the 'age' column of our Titanic dataset. Normalization will transform the age values so that they fall within a range from 0 to 1:
```python
# Import necessary libraries
import seaborn as sns
import pandas as pd

# Load the Titanic Dataset
titanic_df = sns.load_dataset('titanic')

# Normalize 'age'
titanic_df['age'] = (titanic_df['age'] - titanic_df['age'].min()) / (titanic_df['age'].max() - titanic_df['age'].min())

# Display the normalized ages
print(titanic_df['age'])
```
Output:
```
0      0.271174
1      0.472229
2      0.321438
3      0.434531
4      0.434531
         ...
886    0.334004
887    0.233476
888         NaN
889    0.321438
890    0.396833
Name: age, Length: 891, dtype: float64
```
In this code snippet, we first subtract the minimum age from each age value, then divide by the range of ages. The ages are scaled to the range [0, 1]. Normalized columns are easier for some machine-learning models to process.
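A quick sanity check you can run yourself is to print the minimum and maximum of the column; after min-max normalization they should be exactly 0 and 1 (pandas skips the NaN entries by default):

```python
# Sanity check: the normalized ages should span [0, 1]
print(titanic_df['age'].min(), titanic_df['age'].max())  # expected: 0.0 1.0
```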
Unlike normalization, standardization does not scale the data to a bounded range. Instead, standardization subtracts the mean of the feature and then divides by the feature's standard deviation, transforming the feature values to have a mean of 0 and a standard deviation of 1. This method is often used when you want to compare data that was measured on different scales.
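To get a feel for that arithmetic, here is a minimal sketch with made-up numbers (not taken from the Titanic dataset):

```python
# Hypothetical values used only to illustrate the z-score formula
values = [8.0, 12.0, 20.0]

# Standardization: (x - mean) / standard deviation
mean = sum(values) / len(values)
std = (sum((x - mean) ** 2 for x in values) / len(values)) ** 0.5  # population std
standardized = [(x - mean) / std for x in values]

print([round(z, 2) for z in standardized])  # [-1.07, -0.27, 1.34]
```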
Let's apply standardization to the 'fare' column of the Titanic dataset. This column represents how much each passenger paid for their ticket:
```python
# Standardize 'fare'
titanic_df['fare'] = (titanic_df['fare'] - titanic_df['fare'].mean()) / titanic_df['fare'].std()

# Display the standardized fares
print(titanic_df['fare'])
```
Output:
```
0     -0.502163
1      0.786404
2     -0.488580
3      0.420494
4     -0.486064
         ...
886   -0.386454
887   -0.044356
888   -0.176164
889   -0.044356
890   -0.492101
Name: fare, Length: 891, dtype: float64
```
Now, the 'fare' column is re-scaled so the fares have an average value of 0 and a standard deviation of 1. Notice that the values are not within the [0, 1] range like normalized data.
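You can verify the new center and spread directly; because the code above divides by .std(), which is the sample standard deviation, that is the statistic that comes out as 1:

```python
# Verify: the standardized fares have mean ~0 and sample standard deviation ~1
print(round(titanic_df['fare'].mean(), 6))  # ~0.0
print(round(titanic_df['fare'].std(), 6))   # ~1.0
```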
Armed with an understanding of normalization, let's dig a little deeper with scikit-learn. We'll use MinMaxScaler() from the sklearn.preprocessing module, a handy tool for normalizing data held in a pandas DataFrame:
```python
from sklearn.preprocessing import MinMaxScaler

# Select 'age' column and drop NaN values
age = titanic_df[['age']].dropna()

# Create a MinMaxScaler object
scaler = MinMaxScaler()

# Use the scaler
titanic_df['norm_age'] = pd.DataFrame(scaler.fit_transform(age), columns=age.columns, index=age.index)

# Display normalized age values
print(titanic_df['norm_age'])
```
Output:
```
0      0.271174
1      0.472229
2      0.321438
3      0.434531
4      0.434531
         ...
886    0.334004
887    0.233476
888         NaN
889    0.321438
890    0.396833
Name: norm_age, Length: 891, dtype: float64
```
The MinMaxScaler scales and translates each feature individually so that it falls within the given range on the training set, in our case between 0 and 1.
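Because the scaler learns the minimum and maximum when it is fitted, you can reuse the fitted object to scale new data with transform() rather than refitting it. Here is a minimal sketch, assuming a small hypothetical DataFrame of new passengers (the ages are made up):

```python
# Hypothetical new passengers, scaled with the min/max learned from the Titanic ages
new_passengers = pd.DataFrame({'age': [5.0, 30.0, 80.0]})
print(scaler.transform(new_passengers[['age']]))
```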
To standardize our data with scikit-learn, we'll make use of the StandardScaler() class from the sklearn.preprocessing module, which standardizes features by subtracting the mean and scaling to unit variance:
```python
from sklearn.preprocessing import StandardScaler

# Select 'fare' column and drop NaN values
fare = titanic_df[['fare']].dropna()

# Create a StandardScaler object
scaler = StandardScaler()

# Use the scaler
titanic_df['stand_fare'] = pd.DataFrame(scaler.fit_transform(fare), columns=fare.columns, index=fare.index)

# Display standardized fare values
print(titanic_df['stand_fare'])
```
Output:
```
0     -0.502445
1      0.786845
2     -0.488854
3      0.420730
4     -0.486337
         ...
886   -0.386671
887   -0.044381
888   -0.176263
889   -0.044381
890   -0.492378
Name: stand_fare, Length: 891, dtype: float64
```
StandardScaler standardizes a feature by subtracting the mean and scaling to unit variance, and it does so for each feature independently. Notice how our standardized fares now have a mean of 0 and a standard deviation of 1.
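One small technical point explains why these numbers differ slightly from the manually standardized fares earlier in the lesson: pandas' .std() divides by n - 1 (the sample standard deviation), whereas StandardScaler divides by n (the population standard deviation). As a sketch, you can reproduce the StandardScaler values by forcing ddof=0:

```python
# Population standard deviation (ddof=0) matches what StandardScaler uses internally
manual = (titanic_df['fare'] - titanic_df['fare'].mean()) / titanic_df['fare'].std(ddof=0)
print(manual.head())  # matches the 'stand_fare' values above
```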
Choose normalization when your data needs to be bounded within a specific range (0 to 1, for example) and is not heavily influenced by outliers. This is particularly useful for algorithms that are sensitive to the scale of the data, such as neural networks and k-nearest neighbors. On the other hand, standardization is more effective when your data has a Gaussian distribution, and you are dealing with algorithms that assume this, such as linear regression, logistic regression, and linear discriminant analysis.
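To see the two scalings side by side, one quick check is to compare summary statistics of the columns created above: the normalized ages stay between 0 and 1, while the standardized fares are centered on 0 with unit standard deviation.

```python
# Compare the two scalings: bounded [0, 1] range vs. mean 0 / standard deviation 1
print(titanic_df[['norm_age', 'stand_fare']].describe())
```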
Now that you've experienced both normalization and standardization, it's safe to say each technique is practical and useful, but under different circumstances. Their primary purpose is to handle the varying ranges of data; which one you select depends on the algorithm deployed and the desired output distribution. Remember that not all algorithms benefit from normalization or standardization.
Give yourself a pat on the back as you've made it through the session on data preprocessing techniques! We explored the concepts of normalization and standardization, their practical applications, and how to implement these techniques using Python and Pandas. It's key to remember that these techniques are vital tools in enhancing the performance of your machine-learning models.
Next up, we have some hands-on practice sessions to get your hands dirty with real-world datasets. Remember, the best way to absorb knowledge is by applying it practically. Looking forward to seeing you in the next session! Happy learning!