Lesson 4
Data Transformation with the Titanic Dataset
Lesson Introduction

A warm welcome to our lesson on Data Transformation. An exciting journey awaits us as we explore different transformations using the Titanic dataset. We'll specifically focus on Pandas, a Python library known for its powerful data manipulation abilities. Data transformation is crucial in handling historical data, such as the Titanic passenger dataset, to prepare it for advanced Machine Learning models. Everything learned in this lesson is foundational and applicable to other types of data. So, let's buckle up and enjoy our adventure into the world of data transformation.

Understanding Data Transformation

Data transformation is at the heart of data analysis and machine learning. It's about converting raw data into a format that's amenable to machine learning models and improving their performance. To illustrate better, imagine you have a dataset containing passengers' ages and incomes. Age could range from 1 to 90, while income ranges from 1000 to 90000. Notice how different these scales are? To reduce the bias in machine learning models due to these vastly differing scales, we would normalize the features with numerical scaling.

On the other hand, we may have categorical features like the 'Embarked' port in the Titanic dataset. Machine learning models don't handle categorical data well, so we need to convert them into a numeric format through One-Hot Encoding.

Let's have a quick look at an example DataFrame before any transformation.

Python
1import pandas as pd 2 3example_data = pd.DataFrame({ 4 'age': [23, 45, 56, 78, 21], 5 'income': [5000, 7000, 11000, 8000, 7600], 6 'embarked': ['S', 'C', 'Q', 'S', 'S'] 7}) 8 9print("Before Transformation:\n", example_data)

The output will be:

Markdown
1Before Transformation: 2 age income embarked 30 23 5000 S 41 45 7000 C 52 56 11000 Q 63 78 8000 S 74 21 7600 S

This dataset is our starting point: raw, unprocessed, and unprepared for Machine Learning modeling.

Numerical Features Transformation

We can apply an array of transformation techniques to numerical features. In the Titanic dataset, 'age' and 'fare' are such numerical features. For the purpose of this course, we'll demonstrate how to apply the MinMaxScaler, a function from the sklearn.preprocessing library. This technique scales the numerical features within a specified range (the default range is 0 to 1), reducing high variations and standardizing the data. Here's how:

Python
1from sklearn.preprocessing import MinMaxScaler 2 3# Converting dataframe into an array for the scaler 4titanic_df[['age', 'fare']] = MinMaxScaler().fit_transform(titanic_df[['age', 'fare']]) 5 6print('After numerical feature scaling:\n', titanic_df[['age','fare']].head())

The output will be:

Markdown
1After numerical feature scaling: 2 age fare 30 0.271174 0.014151 41 0.472229 0.139136 52 0.321438 0.015469 63 0.434531 0.103644 74 0.434531 0.015713

Now, our numerical features - age and fare, both range in a similar scale of 0 to 1. This reduces the potential of bias and greatly enhances the predictive accuracy of our future model. To adjust the scale range, you can set the feature_range parameter in MinMaxScaler, like MinMaxScaler(feature_range=(0, 10)), to scale between 0 and 10 instead of the default 0 to 1, tailoring the scale to your specific model requirements.

Categorical Features Transformation

One-Hot Encoding is a common way to transform categorical features into a form that can be understood by machine learning algorithms. It involves changing each category value into a new column and assigning a 1 or 0 (True/False) value to the column. Here, 'True' represents the presence of the feature, and 'False' states its absence. Let's try it:

Python
1# Transform the categorical features with One-Hot Encoding 2titanic_df = pd.get_dummies(titanic_df, columns=['sex', 'embarked']) 3 4print('After one-hot encoding of categorical features:\n', titanic_df.head())

The output will be:

Markdown
1After one-hot encoding of categorical features: 2 survived pclass age sibsp parch fare class who adult_male deck embark_town alive alone sex_female sex_male embarked_C embarked_Q embarked_S 30 0 3 0.271174 1 0 0.014151 Third man True NaN Southampton no False False True False False True 41 1 1 0.472229 1 0 0.139136 First woman False C Cherbourg yes False True False True False False 52 1 3 0.321438 0 0 0.015469 Third woman False NaN Southampton yes True True False False False True 63 1 1 0.434531 1 0 0.103644 First woman False C Southampton yes False True False False False True 74 0 3 0.434531 0 0 0.015713 Third man True NaN Southampton no True False True False False True

Notice how our dataset has 5 new columns now: sex_female, sex_male, embarked_C, embarked_Q, embarked_S. Instead of the sex columns with values M and F, a male passenger now has a sex_male column that is set to True and a sex_female set to False. Note that instead of True/False values, you could also create 1/0 values with the following syntax: sex_dummies = pd.get_dummies(titanic_df['sex'], dtype=int).

Combining Processed Features

Now that we've preprocessed our data, we can bring our transformed features together. This step combines our transformed Numerical and Categorical features into the same DataFrame. For our current example, we conducted transformations subsequently, and therefore, the transformed features are in the same DataFrame titanic_df. But if you perform transformations separately, you can use the pandas pd.concat function to merge your data.

Insights from Data Transformation

Finally, let's look at what we've achieved with our data transformations. Such transformations, when chosen and utilized wisely, can dramatically improve the performance of machine learning models. Numerical scaling or normalization might improve performance for linear models, while One-Hot encoding can greatly cater to models that struggle with categorical variables. But the trick here is to understand your model's assumptions and the nature of your data to choose the right transformations.

Lesson Summary and Practice

Congratulations! You've navigated the path of transforming the Titanic Dataset passenger features. You acquired knowledge about crucial data transformations, applied it to the Titanic passenger features, and appreciated the impact of these transformations on machine learning performance. These skills are essential in data analytics, and the more you practice, the better you'll get at choosing and applying the right transformations. Stay tuned for some exciting practice exercises that will solidify this knowledge and prepare you for future data analysis endeavors. Happy learning!

Enjoy this lesson? Now it's time to practice with Cosmo!
Practice is how you turn knowledge into actual skills.