In today's lesson, we'll examine an important preprocessing step for predictive modeling: normalizing features. Normalization adjusts the scale of feature data so that no feature dominates the model simply because its values span a larger numeric range. Our mission is to learn why normalization is necessary and to understand two primary normalization methods, applying both to the California Housing Dataset using Python.
Normalization addresses the issue of features having different ranges. Without scaling, features with larger value ranges could unfairly influence the results of our predictive model. In simple terms, if one feature has values ranging from 0 to 100 and another from 0 to 1, the first feature might dominate the model training process. As we work with features like house age and median income, normalizing helps ensure that each feature contributes to the model based on its importance, not merely its scale.
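To make the scale problem concrete, here is a minimal sketch using made-up numbers rather than the real dataset. It assumes a distance-based model (such as k-nearest neighbors), where the effect is easiest to see; the three houses, the feature ranges, and the helper `min_max` are all hypothetical.

```python
import math

# Hypothetical houses described by (median income, population).
# The values and the ranges below are illustrative, not taken from the dataset.
house_a = (8.3, 322.0)
house_b = (2.1, 340.0)   # very different income, similar population
house_c = (8.2, 400.0)   # nearly identical income, different population

# Raw Euclidean distances: population's larger range drives the result
raw_ab = math.dist(house_a, house_b)
raw_ac = math.dist(house_a, house_c)

# Assumed feature ranges used to rescale each feature to [0, 1]
ranges = [(0.0, 15.0), (0.0, 2000.0)]

def min_max(house):
    """Rescale each feature of a house to [0, 1] using the assumed ranges."""
    return tuple((x - lo) / (hi - lo) for x, (lo, hi) in zip(house, ranges))

scaled_ab = math.dist(min_max(house_a), min_max(house_b))
scaled_ac = math.dist(min_max(house_a), min_max(house_c))

print(f"raw:    a-b = {raw_ab:.2f}, a-c = {raw_ac:.2f}")
print(f"scaled: a-b = {scaled_ab:.3f}, a-c = {scaled_ac:.3f}")
```

On the raw values, house A looks closer to house B because the population gap is numerically larger than the income gap. After rescaling, A is closer to C, whose income is nearly the same.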
Standard scaling is a method that rescales the features so that they have a mean of zero and a standard deviation of one. This method calculates the z-score of each data point, which represents how many standard deviations a data point is from the mean. Let's apply standard scaling using Python:
```python
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler

# Fetch the dataset and create the DataFrame
housing_data = fetch_california_housing()
df = pd.DataFrame(housing_data.data, columns=housing_data.feature_names)
df['MedHouseValue'] = housing_data.target

# StandardScaler object
scaler = StandardScaler()

# Compute the mean and standard deviation of the 'HouseAge' column
scaler.fit(df[['HouseAge']])

# Perform the standardization by centering and scaling
housing_median_age_scaled = scaler.transform(df[['HouseAge']])

# Original vs. Standard Scaled Data
print("Original 'HouseAge' Head:")
print(df[['HouseAge']].head())
print("\nScaled 'HouseAge' Head:")
print(pd.DataFrame(housing_median_age_scaled, columns=['HouseAge']).head())
```
```text
Original HouseAge Head: 41.0, 21.0, 52.0, 52.0, 52.0

Scaled HouseAge Head: 0.982143, -0.607019, 1.856182, 1.856182, 1.856182
```
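The z-score behind standard scaling is `z = (x - mean) / std`. As a rough sketch of that arithmetic, the snippet below applies it by hand to the five `HouseAge` values shown above, treating them as a small standalone sample. Because the mean and standard deviation here come from five values rather than the full column, the results will not match the scaled output above. The variable names are illustrative.

```python
import statistics

# The five 'HouseAge' values shown above, treated here as a tiny standalone sample
ages = [41.0, 21.0, 52.0, 52.0, 52.0]

mean = statistics.fmean(ages)
# Population standard deviation (divides by n), matching StandardScaler's convention
std = statistics.pstdev(ages)

# z-score: how many standard deviations each value lies from the mean
z_scores = [(x - mean) / std for x in ages]

print("mean:", mean)
print("z-scores:", [round(z, 3) for z in z_scores])
print("mean of z-scores:", round(statistics.fmean(z_scores), 6))
print("std of z-scores:", round(statistics.pstdev(z_scores), 6))
```

After the transformation, the z-scores have a mean of zero and a standard deviation of one (up to floating-point rounding), which is the property standard scaling is designed to produce.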
Min-max scaling is another technique, which rescales the data so that all values of a feature fall in the range 0 to 1. The minimum of the raw data maps to 0 and the maximum maps to 1, so scaled values near 0 lie close to the raw minimum and scaled values near 1 lie close to the raw maximum. Let's see how this is done in practice:
```python
from sklearn.preprocessing import MinMaxScaler

# Initializing MinMaxScaler
min_max_scaler = MinMaxScaler()

# Fit the scaler to the data
min_max_scaler.fit(df[['MedInc']])

# Transform the data using the fitted scaler
median_income_min_max_scaled = min_max_scaler.transform(df[['MedInc']])

# Original vs. scaled data
print("Original 'MedInc' Head:")
print(df[['MedInc']].head())
print("\nMin-Max Scaled 'MedInc' Head:")
print(pd.DataFrame(median_income_min_max_scaled, columns=['MedInc']).head())
```
```text
Original MedInc Head: 8.3252, 8.3014, 7.2574, 5.6431, 3.8462

Min-Max Scaled MedInc Head: 0.539668, 0.538027, 0.466028, 0.354699, 0.230776
```
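The calculation behind min-max scaling with the default 0 to 1 range is `x_scaled = (x - min) / (max - min)`. The sketch below applies that formula by hand to a hypothetical list of house ages; the values are made up for illustration and are not read from the dataset.

```python
# Hypothetical house ages; only the smallest and largest values set the scale
ages = [1.0, 21.0, 41.0, 52.0]

lo, hi = min(ages), max(ages)

# x_scaled = (x - min) / (max - min)
scaled = [(x - lo) / (hi - lo) for x in ages]

for original, value in zip(ages, scaled):
    print(f"{original:5.1f} -> {value:.6f}")
```

Note that if a feature were constant, `hi - lo` would be zero and this sketch would raise `ZeroDivisionError`, so hand-rolled code needs a guard for that case.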
Normalization techniques can be applied to feature data in two steps, with separate `fit` and `transform` calls, or in a single step using the `fit_transform` method, which is a convenient way to perform both in one call. This method is particularly useful during the initial model training phase when preparing your dataset:
```python
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import pandas as pd

housing_data = fetch_california_housing()
df = pd.DataFrame(housing_data.data, columns=housing_data.feature_names)

# StandardScaler and MinMaxScaler for all features
scalers = {'StandardScaler': StandardScaler(),
           'MinMaxScaler': MinMaxScaler()}

for scaler_name, scaler in scalers.items():
    scaled_data = scaler.fit_transform(df)  # combining fit and transform for convenience
    print(f"\n{scaler_name} Scaled Data:")
    print(pd.DataFrame(scaled_data, columns=housing_data.feature_names).iloc[0])
```
```text
StandardScaler Scaled Data:
MedInc        2.344766
HouseAge      0.982143
AveRooms      0.628559
AveBedrms    -0.153758
Population   -0.974429
AveOccup     -0.049597
Latitude      1.052548
Longitude    -1.327835
Name: 0, dtype: float64

MinMaxScaler Scaled Data:
MedInc        0.539668
HouseAge      0.784314
AveRooms      0.043512
AveBedrms     0.020469
Population    0.008941
AveOccup      0.001499
Latitude      0.567481
Longitude     0.211155
Name: 0, dtype: float64
```
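One caveat applies when a dataset is later split into training and test sets: calling `fit_transform` on all rows lets the test rows influence the scaling statistics. A common pattern, sketched here on a hypothetical toy column rather than the housing data, is to call `fit_transform` on the training rows and then reuse the same fitted scaler with `transform` on the remaining rows. The arrays `X_train` and `X_test` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data, already split by hand into train and test rows
X_train = np.array([[10.0], [20.0], [30.0], [40.0]])
X_test = np.array([[25.0], [50.0]])

scaler = StandardScaler()

# Learn the mean and standard deviation from the training rows only, then scale them
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training statistics on the test rows; nothing is re-fitted here
X_test_scaled = scaler.transform(X_test)

print("learned mean:", scaler.mean_)
print("learned std: ", scaler.scale_)
print("scaled test rows:", X_test_scaled.ravel())
```

The test value 25.0 equals the training mean, so it scales to zero, while 50.0 lies outside the training range and scales to a value larger than any scaled training row.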
Normalization puts all features on a similar scale, which helps each feature contribute to the predictive model in a balanced way. In this lesson, we've covered two normalization techniques, standardization and min-max scaling, and applied both to features of the California Housing Dataset, first with separate `fit` and `transform` calls and then with `fit_transform`. Next, you'll apply these techniques yourself in hands-on exercises, which will show how different normalization methods can affect your model's predictive power.