Ready for another deep dive? Today, we'll explore Data Transformation and Scaling Techniques, an essential part of the data cleaning and preprocessing process for machine learning. We will learn how to transform numerical data to different ranges using various scaling techniques, such as Standard Scaling, Min-Max Scaling, and Robust Scaling.
Data scaling is crucial because many machine learning algorithms, especially those based on distances or gradient descent, perform more effectively when numerical features are on the same scale. Without scaling, variables with larger ranges may dominate the others, reducing the model's accuracy.
For example, imagine a dataset with two features, age and income. Age varies between 0 and 100, while income may range from 0 to many thousands. A machine learning model could be biased towards income because of its higher magnitude, leading to poor model performance.
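To make the scale problem concrete, here is a minimal sketch using made-up numbers (the three people and their age and income values are hypothetical, not taken from any dataset). It compares Euclidean distances, which distance-based models such as k-nearest neighbors rely on, before and after rescaling each column to zero mean and unit variance.

```python
import numpy as np

# Hypothetical people: columns are [age, income]; values are made up for illustration
people = np.array([
    [25.0, 50_000.0],   # person A
    [60.0, 50_500.0],   # person B: very different age, similar income
    [26.0, 58_000.0],   # person C: similar age, different income
])


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))


# On the raw data, income differences swamp age differences
raw_ab = distance(people[0], people[1])
raw_ac = distance(people[0], people[2])
print(f"raw:    A-B = {raw_ab:.1f}, A-C = {raw_ac:.1f}")

# Rescale each column: subtract its mean, divide by its standard deviation
scaled = (people - people.mean(axis=0)) / people.std(axis=0)
scaled_ab = distance(scaled[0], scaled[1])
scaled_ac = distance(scaled[0], scaled[2])
print(f"scaled: A-B = {scaled_ab:.2f}, A-C = {scaled_ac:.2f}")
```

On the raw values, the income gap makes A look far closer to B than to C, even though A and B differ by 35 years in age. After rescaling, the two distances are of similar size, so both features contribute comparably.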
Ready to dive in? Let's go!
Before we move into the hands-on part, let's briefly discuss three popular techniques for scaling numerical data.
Standard Scaler: It rescales data to have zero mean and unit variance. It does not strictly require normally distributed data, but it works best when the values of a feature roughly follow a bell curve, because the mean and standard deviation then summarize the feature well.
Min-Max Scaler: Also known as normalization, this technique scales data to a fixed range, which is 0 to 1 by default (other ranges, such as -1 to 1, can be chosen explicitly). It's commonly used for algorithms that don't assume any distribution of the data. This means if your data doesn't follow a specific shape or form, you might consider using Min-Max Scaler. Keep in mind that it is sensitive to outliers, since the minimum and maximum values define the range.
Robust Scaler: As its name suggests, this scaler is robust to outliers. It centers data on the median and scales it by the Interquartile Range (IQR), the spread between the 25th and 75th percentiles, so it's suitable when the dataset contains outliers. Outliers are data points that deviate significantly from other observations. They can be problematic because they distort statistics such as the mean, standard deviation, minimum, and maximum, which the other two scalers rely on.
There's no "one size fits all" scaler. You'll need to choose the appropriate scaler based on your data's characteristics and your machine-learning algorithm's requirements.
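As a rough illustration of how the three scalers differ, the sketch below applies each one to the same made-up feature containing a single large outlier. The values are invented for this example; only the scikit-learn scaler classes come from the lesson.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Made-up single feature: five ordinary values and one large outlier
values = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 500.0]).reshape(-1, 1)

scalers = {
    "standard": StandardScaler(),
    "min-max": MinMaxScaler(),
    "robust": RobustScaler(),
}

results = {}
for name, scaler in scalers.items():
    # fit_transform returns a 2-D array; flatten it for printing
    results[name] = scaler.fit_transform(values).ravel()
    print(f"{name:>8}: {np.round(results[name], 3)}")
```

With min-max scaling, the outlier squeezes the ordinary values into a narrow band near 0. Robust scaling keeps the ordinary values spread around 0, while the outlier itself remains large.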
We'll start with the Standard Scaler. It scales data based on its mean ($\mu$) and standard deviation ($\sigma$), using the formula to calculate the z-score: $z = \frac{x - \mu}{\sigma}$.
Let's try it on the age column of the Titanic dataset:

```python
import numpy as np
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# Load the dataset and drop rows with missing values
titanic_df = sns.load_dataset('titanic').dropna()

# Initialize the StandardScaler
std_scaler = StandardScaler()

# Fit and transform the 'age' column
titanic_df['age'] = std_scaler.fit_transform(np.array(titanic_df['age']).reshape(-1, 1))

# Check the transformed 'age' column
print(titanic_df['age'].head())
"""
1     0.152082
3    -0.039875
6     1.175852
10   -2.023430
11    1.431795
Name: age, dtype: float64
"""
```
Note how the transformed age values are not easily interpretable. That's because they've been transformed into their respective z-scores: each value now states how many standard deviations the original age lies above or below the mean. The important thing to understand is that the transformed data is standardized and can be readily included in a machine learning model.
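To see what the z-score formula does, here is a small sketch that applies it by hand with NumPy. The ages are made up for illustration and are not rows from the Titanic dataset.

```python
import numpy as np

# Made-up ages, for illustration only
ages = np.array([22.0, 38.0, 26.0, 35.0, 54.0])

mu = ages.mean()
sigma = ages.std()  # population standard deviation (ddof=0)

# z = (x - mu) / sigma, applied element-wise
z_scores = (ages - mu) / sigma

print(np.round(z_scores, 3))
print("mean:", round(float(z_scores.mean()), 6), "std:", round(float(z_scores.std()), 6))
```

StandardScaler also uses the population standard deviation, so this matches what it computes. Note that pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`), which would give slightly different values.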
Next, we'll explore Min-Max Scaling, which scales your data to a specified range. The formula used here is: $x_{new} = \frac{x - x_{min}}{x_{max} - x_{min}}$. This formula essentially resizes your data to fit within the range of 0 to 1.
Let's apply Min-Max Scaler on the fare column:

```python
from sklearn.preprocessing import MinMaxScaler

# Initialize the MinMaxScaler
min_max_scaler = MinMaxScaler()

# Fit and transform the 'fare' column
titanic_df['fare'] = min_max_scaler.fit_transform(np.array(titanic_df['fare']).reshape(-1, 1))

# Check the transformed 'fare' column
print(titanic_df['fare'].head())
"""
1     0.139136
3     0.103644
6     0.101229
10    0.032596
11    0.051822
Name: fare, dtype: float64
"""
```
All fare values are now within the range of 0 to 1, with the smallest fare being 0 and the largest being 1. Intermediate fare values are spread out proportionally between 0 and 1.
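The sketch below, on made-up fares rather than the Titanic data, checks the min-max formula by hand against MinMaxScaler and shows how the scaler's `feature_range` parameter maps the same values onto a different interval.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up fares, for illustration only
fares = np.array([7.25, 71.28, 8.05, 53.10, 26.55]).reshape(-1, 1)

# Apply the min-max formula by hand
manual = (fares - fares.min()) / (fares.max() - fares.min())

# The default target range is [0, 1]
default_scaled = MinMaxScaler().fit_transform(fares)

# feature_range maps the same values onto another interval, here [-1, 1]
symmetric_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(fares)

print(np.round(default_scaled.ravel(), 3))
print(np.round(symmetric_scaled.ravel(), 3))
```

The target range is something you choose. It does not change automatically when the data contains negative values.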
Last but not least, we have Robust Scaling, which is useful when dealing with outliers. It subtracts the median and divides by the IQR (Interquartile Range): $x_{new} = \frac{x - \text{median}}{\text{IQR}}$. Because the median and the IQR depend only on the middle portion of the data, extreme values have little influence on them.
Let's apply it to the fare column:

```python
from sklearn.preprocessing import RobustScaler

# Initialize the RobustScaler
robust_scaler = RobustScaler()

# Fit and transform the 'fare' column.
# Note: 'fare' was already min-max scaled above. The result is the same as for
# the raw fares, because a prior linear rescaling does not affect robust scaling.
titanic_df['fare'] = robust_scaler.fit_transform(np.array(titanic_df['fare']).reshape(-1, 1))

# Check the transformed 'fare' column
print(titanic_df['fare'].head())
"""
1     0.236871
3    -0.064677
6    -0.085199
10   -0.668325
11   -0.504975
Name: fare, dtype: float64
"""
```
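The fare values now reflect how many IQRs each fare lies from the median. The scaling is resilient to outliers because they no longer distort the center and spread used for scaling. The outliers themselves are not shrunk, though: they remain large in magnitude after the transformation.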
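As a minimal sketch of the computation behind robust scaling, the example below applies the median and IQR by hand with NumPy. The fares are made up and include one extreme value. The 25th and 75th percentiles used here correspond to RobustScaler's default `quantile_range`.

```python
import numpy as np

# Made-up fares with one extreme value, for illustration only
fares = np.array([8.0, 10.0, 13.0, 26.0, 30.0, 512.0])

median = np.median(fares)
q1, q3 = np.percentile(fares, [25, 75])
iqr = q3 - q1

# Subtract the median, then divide by the IQR
robust_scaled = (fares - median) / iqr

print("median:", median, "IQR:", iqr)
print(np.round(robust_scaled, 3))
```

The extreme fare barely moves the median or the IQR, so the ordinary fares end up close to 0. The extreme fare itself is still far from the rest after scaling.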
You should now understand why data scaling is essential in machine learning and how to implement three common data scaling techniques in Python: Standard Scaling, Min-Max Scaling, and Robust Scaling.
Remember, the choice of scaling technique depends on the nature of your data and the specific requirements of your machine-learning algorithm. Each scaler has its strengths: Standard Scaler works best with data that is roughly normally distributed, Min-Max Scaler makes no assumption about the shape of the distribution but is sensitive to outliers, and Robust Scaler is the better choice when outliers are present.
Great work on learning the essentials of data transformation and scaling! Let's move to the next part: practice! The exercises are designed to deepen your understanding of data scaling techniques. You'll write code and apply these techniques to various data distributions. So roll up your sleeves and get ready for some coding action!