A warm welcome to our lesson on Data Transformation. An exciting journey awaits us as we explore different transformations using the Titanic dataset. We'll focus specifically on Pandas, a Python library known for its powerful data manipulation abilities. Data transformation is crucial when handling historical data, such as the Titanic passenger dataset, to prepare it for advanced Machine Learning models. Everything you learn in this lesson is foundational and applicable to other types of data. So, let's buckle up and enjoy our adventure into the world of data transformation.
Data transformation is at the heart of data analysis and machine learning. It's about converting raw data into a format that machine learning models can consume, improving their performance. To illustrate, imagine a dataset containing passengers' ages and incomes. Age could range from 1 to 90, while income ranges from 1000 to 90000. Notice how different these scales are? To reduce the bias these vastly differing scales introduce into machine learning models, we normalize the features with numerical scaling.
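To make the idea concrete, min-max scaling maps each value x to (x - min) / (max - min), so every feature lands in the same 0-to-1 range. Here's a minimal sketch applying that formula by hand to hypothetical age and income values; this helper isn't part of the lesson's code, and we'll use scikit-learn's ready-made scaler later on:

```python
# Hypothetical values for illustration only
ages = [1, 30, 90]              # roughly 1 to 90
incomes = [1000, 30000, 90000]  # roughly 1000 to 90000

def min_max_scale(values):
    """Map each value x to (x - min) / (max - min), i.e. into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

print(min_max_scale(ages))     # [0.0, ~0.33, 1.0]
print(min_max_scale(incomes))  # [0.0, ~0.33, 1.0]
```

After scaling, both features live on the same scale, so neither dominates simply because its raw numbers are larger.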
On the other hand, we may have categorical features like the 'Embarked' port in the Titanic dataset. Machine learning models don't handle categorical data well, so we need to convert it into a numeric format through One-Hot Encoding.
Let's have a quick look at an example DataFrame before any transformation.
```python
import pandas as pd

example_data = pd.DataFrame({
    'age': [23, 45, 56, 78, 21],
    'income': [5000, 7000, 11000, 8000, 7600],
    'embarked': ['S', 'C', 'Q', 'S', 'S']
})

print("Before Transformation:\n", example_data)
```
The output will be:
```
Before Transformation:
    age  income embarked
0   23    5000        S
1   45    7000        C
2   56   11000        Q
3   78    8000        S
4   21    7600        S
```
This dataset is our starting point: raw, unprocessed, and unprepared for Machine Learning modeling.
We can apply an array of transformation techniques to numerical features. In the Titanic dataset, 'age' and 'fare' are such numerical features. For the purpose of this course, we'll demonstrate how to apply MinMaxScaler, a class from the sklearn.preprocessing module. This technique scales numerical features to a specified range (the default is 0 to 1), bringing all values onto a comparable scale. Here's how:
```python
from sklearn.preprocessing import MinMaxScaler

# titanic_df is assumed to be loaded already, e.g.:
# import seaborn as sns
# titanic_df = sns.load_dataset('titanic')

# Scale the 'age' and 'fare' columns into the default [0, 1] range
titanic_df[['age', 'fare']] = MinMaxScaler().fit_transform(titanic_df[['age', 'fare']])

print('After numerical feature scaling:\n', titanic_df[['age', 'fare']].head())
```
The output will be:
```
After numerical feature scaling:
         age      fare
0   0.271174  0.014151
1   0.472229  0.139136
2   0.321438  0.015469
3   0.434531  0.103644
4   0.434531  0.015713
```
Now our numerical features, age and fare, both lie on a similar scale of 0 to 1. This reduces potential bias and can improve the predictive accuracy of our future model. To adjust the scale range, set the feature_range parameter of MinMaxScaler, like MinMaxScaler(feature_range=(0, 10)), to scale between 0 and 10 instead of the default 0 to 1, tailoring the scale to your specific model requirements; a quick sketch follows below.
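As a quick illustration on hypothetical toy data (not part of the lesson's Titanic code), the sketch below scales a column into the 0 to 10 range:

```python
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Hypothetical toy data for illustration only
df = pd.DataFrame({'age': [23, 45, 56, 78, 21]})

# Scale into [0, 10] instead of the default [0, 1]
scaler = MinMaxScaler(feature_range=(0, 10))
df[['age']] = scaler.fit_transform(df[['age']])

print(df)
# The youngest passenger (21) maps to 0.0, the oldest (78) to 10.0
```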
One-Hot Encoding is a common way to transform categorical features into a form that machine learning algorithms can understand. It turns each category value into a new column and assigns a True/False (or 1/0) value to it: True represents the presence of that category, and False indicates its absence. Let's try it:
```python
# Transform the categorical features with One-Hot Encoding
titanic_df = pd.get_dummies(titanic_df, columns=['sex', 'embarked'])

print('After one-hot encoding of categorical features:\n', titanic_df.head())
```
The output will be:
```
After one-hot encoding of categorical features:
   survived  pclass       age  sibsp  parch      fare  class    who  adult_male deck  embark_town alive  alone  sex_female  sex_male  embarked_C  embarked_Q  embarked_S
0         0       3  0.271174      1      0  0.014151  Third    man        True  NaN  Southampton    no  False       False      True       False       False        True
1         1       1  0.472229      1      0  0.139136  First  woman       False    C    Cherbourg   yes  False        True     False        True       False       False
2         1       3  0.321438      0      0  0.015469  Third  woman       False  NaN  Southampton   yes   True        True     False       False       False        True
3         1       1  0.434531      1      0  0.103644  First  woman       False    C  Southampton   yes  False        True     False       False       False        True
4         0       3  0.434531      0      0  0.015713  Third    man        True  NaN  Southampton    no   True       False      True       False       False        True
```
Notice how our dataset now has 5 new columns: sex_female, sex_male, embarked_C, embarked_Q, and embarked_S. Instead of a single sex column with the values male and female, a male passenger now has a sex_male column set to True and a sex_female column set to False. Note that instead of True/False values, you could also create 1/0 values with the following syntax: sex_dummies = pd.get_dummies(titanic_df['sex'], dtype=int).
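As a quick, self-contained illustration (using a hypothetical toy column rather than the full Titanic DataFrame), dtype=int produces 0/1 columns:

```python
import pandas as pd

# Hypothetical toy 'sex' column for illustration
toy_df = pd.DataFrame({'sex': ['male', 'female', 'female', 'male']})

# dtype=int yields 0/1 columns instead of False/True
sex_dummies = pd.get_dummies(toy_df['sex'], dtype=int)
print(sex_dummies)
#    female  male
# 0       0     1
# 1       1     0
# 2       1     0
# 3       0     1
```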
Now that we've preprocessed our data, we can bring our transformed features together. This step combines the transformed numerical and categorical features into the same DataFrame. In our current example, we performed the transformations one after the other, so the transformed features already live in the same DataFrame, titanic_df. But if you perform transformations separately, you can use the pandas pd.concat function to merge your data, as sketched below.
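Here's a minimal sketch of that pattern, assuming the numerical and categorical transformations produced two separate DataFrames; the names scaled_num and encoded_cat are hypothetical:

```python
import pandas as pd

# Hypothetical, separately transformed pieces of the same rows
scaled_num = pd.DataFrame({'age': [0.27, 0.47], 'fare': [0.01, 0.14]})
encoded_cat = pd.DataFrame({'embarked_C': [False, True], 'embarked_S': [True, False]})

# axis=1 places the DataFrames side by side, aligned on the row index
combined = pd.concat([scaled_num, encoded_cat], axis=1)
print(combined)
```

Because pd.concat aligns on the row index, make sure both pieces still share the same index before combining them.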
Finally, let's look at what we've achieved with our data transformations. Such transformations, when chosen and applied wisely, can dramatically improve the performance of machine learning models. Numerical scaling or normalization can help linear models and other scale-sensitive algorithms, while One-Hot Encoding makes categorical variables usable by models that can't handle them directly. The trick is to understand your model's assumptions and the nature of your data so that you choose the right transformations.
Congratulations! You've navigated the path of transforming the Titanic Dataset passenger features. You acquired knowledge about crucial data transformations, applied it to the Titanic passenger features, and appreciated the impact of these transformations on machine learning performance. These skills are essential in data analytics, and the more you practice, the better you'll get at choosing and applying the right transformations. Stay tuned for some exciting practice exercises that will solidify this knowledge and prepare you for future data analysis endeavors. Happy learning!