A warm welcome to our lesson on Data Transformation. An exciting journey awaits us as we explore different transformations using the Titanic dataset. We'll focus specifically on Pandas, a Python library known for its powerful data manipulation abilities. Data transformation is crucial when handling historical data, such as the Titanic passenger dataset, to prepare it for advanced Machine Learning models. Everything you learn in this lesson is foundational and applicable to other types of data. So, let's buckle up and enjoy our adventure into the world of data transformation.
Data transformation is at the heart of data analysis and machine learning. It's about converting raw data into a format that machine learning models can consume, improving their performance. To illustrate, imagine a dataset containing passengers' ages and incomes. Age could range from 1 to 90, while income ranges from 1000 to 90000. Notice how different these scales are? To reduce the bias these vastly differing scales introduce into machine learning models, we normalize the features with numerical scaling.
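To make the idea concrete, min-max scaling maps each value x to (x - min) / (max - min), so every feature lands in the same 0-to-1 range. Here's a minimal sketch applying that formula by hand to hypothetical age and income values; this helper isn't part of the lesson's code, and we'll use scikit-learn's ready-made scaler later on:

```python
# Hypothetical values for illustration only
ages = [1, 30, 90]              # roughly 1 to 90
incomes = [1000, 30000, 90000]  # roughly 1000 to 90000

def min_max_scale(values):
    """Map each value x to (x - min) / (max - min), i.e. into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

print(min_max_scale(ages))     # [0.0, ~0.33, 1.0]
print(min_max_scale(incomes))  # [0.0, ~0.33, 1.0]
```

After scaling, both features live on the same scale, so neither dominates simply because its raw numbers are larger.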
On the other hand, we may have categorical features like the 'Embarked' port in the Titanic dataset. Machine learning models don't handle categorical data well, so we need to convert it into a numeric format through One-Hot Encoding.
Let's have a quick look at an example DataFrame before any transformation.
```python
import pandas as pd

example_data = pd.DataFrame({
    'age': [23, 45, 56, 78, 21],
    'income': [5000, 7000, 11000, 8000, 7600],
    'embarked': ['S', 'C', 'Q', 'S', 'S']
})

print("Before Transformation:\n", example_data)
```
The output will be:
```
Before Transformation:
    age  income embarked
0   23    5000        S
1   45    7000        C
2   56   11000        Q
3   78    8000        S
4   21    7600        S
```
This dataset is our starting point: raw, unprocessed, and unprepared for Machine Learning modeling.
We can apply an array of transformation techniques to numerical features. In the Titanic dataset, 'age' and 'fare' are such numerical features. For the purpose of this course, we'll demonstrate how to apply MinMaxScaler, a class from the sklearn.preprocessing module. This technique scales numerical features to a specified range (the default is 0 to 1), bringing all values onto a comparable scale. Here's how:
```python
from sklearn.preprocessing import MinMaxScaler

# titanic_df is assumed to be loaded already, e.g.:
# import seaborn as sns
# titanic_df = sns.load_dataset('titanic')

# Scale the 'age' and 'fare' columns into the default [0, 1] range
titanic_df[['age', 'fare']] = MinMaxScaler().fit_transform(titanic_df[['age', 'fare']])

print('After numerical feature scaling:\n', titanic_df[['age', 'fare']].head())
```
The output will be:
```
After numerical feature scaling:
         age      fare
0   0.271174  0.014151
1   0.472229  0.139136
2   0.321438  0.015469
3   0.434531  0.103644
4   0.434531  0.015713
```
Now our numerical features, age and fare, both lie on a similar scale of 0 to 1. This reduces potential bias and can improve the predictive accuracy of our future model. To adjust the scale range, set the feature_range parameter of MinMaxScaler, like MinMaxScaler(feature_range=(0, 10)), to scale between 0 and 10 instead of the default 0 to 1, tailoring the scale to your specific model requirements; a quick sketch follows below.
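As a quick illustration on hypothetical toy data (not part of the lesson's Titanic code), the sketch below scales a column into the 0 to 10 range:

```python
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Hypothetical toy data for illustration only
df = pd.DataFrame({'age': [23, 45, 56, 78, 21]})

# Scale into [0, 10] instead of the default [0, 1]
scaler = MinMaxScaler(feature_range=(0, 10))
df[['age']] = scaler.fit_transform(df[['age']])

print(df)
# The youngest passenger (21) maps to 0.0, the oldest (78) to 10.0
```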
One-Hot Encoding is a common way to transform categorical features into a form that machine learning algorithms can understand. It turns each category value into a new column and assigns a True/False (or 1/0) value to it: True represents the presence of that category, and False indicates its absence. Let's try it:
```python
# Transform the categorical features with One-Hot Encoding
titanic_df = pd.get_dummies(titanic_df, columns=['sex', 'embarked'])

print('After one-hot encoding of categorical features:\n', titanic_df.head())
```
The output will be:
```
After one-hot encoding of categorical features:
   survived  pclass       age  sibsp  parch      fare  class    who  adult_male deck  embark_town alive  alone  sex_female  sex_male  embarked_C  embarked_Q  embarked_S
0         0       3  0.271174      1      0  0.014151  Third    man        True  NaN  Southampton    no  False       False      True       False       False        True
1         1       1  0.472229      1      0  0.139136  First  woman       False    C    Cherbourg   yes  False        True     False        True       False       False
2         1       3  0.321438      0      0  0.015469  Third  woman       False  NaN  Southampton   yes   True        True     False       False       False        True
3         1       1  0.434531      1      0  0.103644  First  woman       False    C  Southampton   yes  False        True     False       False       False        True
4         0       3  0.434531      0      0  0.015713  Third    man        True  NaN  Southampton    no   True       False      True       False       False        True
```
Notice how our dataset now has 5 new columns: sex_female, sex_male, embarked_C, embarked_Q, and embarked_S. Instead of a single sex column with the values male and female, a male passenger now has a sex_male column set to True and a sex_female column set to False. Note that instead of True/False values, you could also create 1/0 values with the following syntax: sex_dummies = pd.get_dummies(titanic_df['sex'], dtype=int).
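As a quick, self-contained illustration (using a hypothetical toy column rather than the full Titanic DataFrame), dtype=int produces 0/1 columns:

```python
import pandas as pd

# Hypothetical toy 'sex' column for illustration
toy_df = pd.DataFrame({'sex': ['male', 'female', 'female', 'male']})

# dtype=int yields 0/1 columns instead of False/True
sex_dummies = pd.get_dummies(toy_df['sex'], dtype=int)
print(sex_dummies)
#    female  male
# 0       0     1
# 1       1     0
# 2       1     0
# 3       0     1
```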
Now that we've preprocessed our data, we can bring our transformed features together. This step combines the transformed numerical and categorical features into the same DataFrame. In our current example, we performed the transformations one after the other, so the transformed features already live in the same DataFrame, titanic_df. But if you perform transformations separately, you can use the pandas pd.concat function to merge your data, as sketched below.
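Here's a minimal sketch of that pattern, assuming the numerical and categorical transformations produced two separate DataFrames; the names scaled_num and encoded_cat are hypothetical:

```python
import pandas as pd

# Hypothetical, separately transformed pieces of the same rows
scaled_num = pd.DataFrame({'age': [0.27, 0.47], 'fare': [0.01, 0.14]})
encoded_cat = pd.DataFrame({'embarked_C': [False, True], 'embarked_S': [True, False]})

# axis=1 places the DataFrames side by side, aligned on the row index
combined = pd.concat([scaled_num, encoded_cat], axis=1)
print(combined)
```

Because pd.concat aligns on the row index, make sure both pieces still share the same index before combining them.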
Finally, let's look at what we've achieved with our data transformations. Such transformations, when chosen and applied wisely, can dramatically improve the performance of machine learning models. Numerical scaling or normalization can help linear models and other scale-sensitive algorithms, while One-Hot Encoding makes categorical variables usable by models that can't handle them directly. The trick is to understand your model's assumptions and the nature of your data so that you choose the right transformations.
Congratulations! You've navigated the path of transforming the Titanic Dataset passenger features. You acquired knowledge about crucial data transformations, applied it to the Titanic passenger features, and appreciated the impact of these transformations on machine learning performance. These skills are essential in data analytics, and the more you practice, the better you'll get at choosing and applying the right transformations. Stay tuned for some exciting practice exercises that will solidify this knowledge and prepare you for future data analysis endeavors. Happy learning!