Lesson 5
Building a Full Preprocessing Pipeline for the Titanic Dataset
Lesson Introduction

Welcome! Today, we’ll learn how to build a full preprocessing pipeline for the Titanic dataset. In real work, you will often deal with large datasets that have many features and rows.

We aim to learn how to prepare real data for machine learning models by handling missing values, encoding categorical features, scaling numerical features, and splitting the data into training and test sets.

Imagine you have a messy jigsaw puzzle. You need to organize the pieces, find the edges first, and then start assembling. Data preprocessing is like organizing the pieces before starting the puzzle.

Load and Prepare the Data

Let’s start by loading the Titanic dataset with Seaborn. The dataset contains information about each passenger, such as age, fare, and whether they survived. We'll drop some columns we won’t use.

Python
import pandas as pd
import seaborn as sns

# Load the Titanic dataset
df = sns.load_dataset('titanic')

# Drop columns that won't be used
df = df.drop(columns=['deck', 'embarked', 'alive'])

print(df.head())

Expected output:

   survived  pclass     sex   age  sibsp  parch     fare    who  adult_male  \
0         0       3    male  22.0      1      0   7.2500    man        True
1         1       1  female  38.0      1      0  71.2833  woman       False
2         1       3  female  26.0      0      0   7.9250  woman       False
3         1       1  female  35.0      1      0  53.1000  woman       False
4         0       3    male  35.0      0      0   8.0500    man        True

   embark_town  alone
0  Southampton  False
1    Cherbourg  False
2  Southampton   True
3  Southampton  False
4  Southampton   True

We loaded the dataset and dropped the deck, embarked, and alive columns: deck has too many missing values to be useful, embarked is an abbreviated duplicate of embark_town (which we keep), and alive duplicates the survived target, so none of them adds value as a feature.
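
Before dropping anything, it can help to check how much data each column is actually missing. The quick inspection below is an optional sketch, not part of the lesson's pipeline; the df_raw name is used only so it doesn't interfere with the df we already prepared.

Python
# Optional: count missing values per column in the raw dataset
import seaborn as sns

df_raw = sns.load_dataset('titanic')
print(df_raw.isnull().sum().sort_values(ascending=False))

You should see that deck is missing for most rows, while age has far fewer gaps, which is why deck is dropped but age is imputed in the next step.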

Handle Missing Values

Next, let's handle missing values using SimpleImputer from scikit-learn.

Python
from sklearn.impute import SimpleImputer

# Handle missing values
imputer_num = SimpleImputer(strategy='mean')
imputer_cat = SimpleImputer(strategy='most_frequent')

df['age'] = imputer_num.fit_transform(df[['age']])
df['embark_town'] = imputer_cat.fit_transform(df[['embark_town']].values.reshape(-1, 1)).ravel()
df['fare'] = imputer_num.fit_transform(df[['fare']])

print(df.head())

As a reminder, ravel() is a NumPy method that returns a contiguous flattened array. Here it flattens the 2-D column returned by fit_transform() into a 1-D array, so the result fits back into the embark_town column of the DataFrame correctly.

Expected output:

   survived  pclass     sex   age  sibsp  parch     fare    who  adult_male  \
0         0       3    male  22.0      1      0   7.2500    man        True
1         1       1  female  38.0      1      0  71.2833  woman       False
2         1       3  female  26.0      0      0   7.9250  woman       False
3         1       1  female  35.0      1      0  53.1000  woman       False
4         0       3    male  35.0      0      0   8.0500    man        True

   embark_town  alone
0  Southampton  False
1    Cherbourg  False
2  Southampton   True
3  Southampton  False
4  Southampton   True

We filled missing numerical data (age, fare) using the mean and categorical data (embark_town) using the most frequent value. This is like guessing a missing puzzle piece based on surrounding ones.
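
To make the reshape(-1, 1) and ravel() round trip described above easier to picture, here is a tiny standalone NumPy sketch; it is purely illustrative and not part of the pipeline.

Python
import numpy as np

values = np.array(['S', 'C', 'S'])   # 1-D array, shape (3,)
column = values.reshape(-1, 1)       # 2-D column vector, shape (3, 1), what the imputer expects
flat = column.ravel()                # flattened back to 1-D, shape (3,), what a DataFrame column expects

print(column.shape, flat.shape)      # (3, 1) (3,)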

Encode Categorical Features: Part 1

Machine learning models need numerical data, so we use OneHotEncoder to convert categorical features into numbers.

Python
from sklearn.preprocessing import OneHotEncoder

# Encode categorical features
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded_columns = encoder.fit_transform(df[['sex', 'class', 'embark_town', 'who', 'adult_male', 'alone']])
encoded_df = pd.DataFrame(encoded_columns, columns=encoder.get_feature_names_out(['sex', 'class', 'embark_town', 'who', 'adult_male', 'alone']))

Encode Categorical Features: Part 2

Next, we drop the original categorical columns and concatenate the new encoded columns with the DataFrame.

Python
# Drop and concatenate
df = df.drop(columns=['sex', 'class', 'embark_town', 'who', 'adult_male', 'alone'])
df = pd.concat([df.reset_index(drop=True), encoded_df], axis=1)

print(df.head())

Expected output:

   survived  pclass   age  sibsp  parch     fare  alone  sex_male  \
0         0       3  22.0      1      0   7.2500  False       1.0
1         1       1  38.0      1      0  71.2833  False       0.0
2         1       3  26.0      0      0   7.9250   True       0.0
3         1       1  35.0      1      0  53.1000  False       0.0
4         0       3  35.0      0      0   8.0500   True       1.0

   class_2  class_3  embark_town_Queenstown  embark_town_Southampton  \
0      0.0      1.0                     0.0                      1.0
1      0.0      0.0                     0.0                      0.0
2      0.0      1.0                     0.0                      1.0
3      0.0      0.0                     0.0                      1.0
4      0.0      1.0                     0.0                      1.0

   who_man  who_woman  adult_male_True
0      1.0        0.0              1.0
1      0.0        1.0              0.0
2      0.0        1.0              0.0
3      0.0        1.0              0.0
4      1.0        0.0              1.0

We converted the categorical columns into numerical ones, dropped the originals, and added the new encoded columns. It's like translating words into a secret code for a robot.
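
If you want to see exactly what drop='first' does, here is a minimal toy example separate from the Titanic data (the color values are made up for illustration). With three categories, only two indicator columns are produced, and the dropped first category shows up as all zeros.

Python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})
toy_encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded = toy_encoder.fit_transform(toy[['color']])

print(toy_encoder.get_feature_names_out(['color']))  # ['color_green' 'color_red'] -- 'blue' was dropped
print(encoded)                                       # the 'blue' row becomes [0. 0.]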

Feature Scaling

Feature scaling ensures all numerical values are on a similar scale. We use StandardScaler for this.

Python
from sklearn.preprocessing import StandardScaler

# Feature scaling
scaler = StandardScaler()
scaled_columns = scaler.fit_transform(df[['age', 'fare']])
scaled_df = pd.DataFrame(scaled_columns, columns=['age', 'fare'])

# Drop and concatenate
df = df.drop(columns=['age', 'fare'])
df = pd.concat([df.reset_index(drop=True), scaled_df], axis=1)

print(df.head())

Expected output:

   survived  pclass  sibsp  parch  alone  sex_male  class_2  class_3  \
0         0       3      1      0  False       1.0      0.0      1.0
1         1       1      1      0  False       0.0      0.0      0.0
2         1       3      0      0   True       0.0      0.0      1.0
3         1       1      1      0  False       0.0      0.0      0.0
4         0       3      0      0   True       1.0      0.0      1.0

   embark_town_Queenstown  embark_town_Southampton  who_man  who_woman  \
0                     0.0                      1.0      1.0        0.0
1                     0.0                      0.0      0.0        1.0
2                     0.0                      1.0      0.0        1.0
3                     0.0                      1.0      0.0        1.0
4                     0.0                      1.0      1.0        0.0

   adult_male_True       age      fare
0              1.0 -0.530376 -0.502445
1              0.0  0.571829  0.788947
2              0.0 -0.254596 -0.488854
3              0.0  0.400810  0.420731
4              1.0  0.400810 -0.486337

We scaled our numerical data (age, fare) to have a mean of 0 and a standard deviation of 1. This is like resizing puzzle pieces to fit perfectly.
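
As an optional sanity check (not shown in the lesson's expected output), you can confirm that the scaled columns now have a mean close to 0 and a standard deviation close to 1.

Python
# Optional check: means should be ~0 and standard deviations ~1 after scaling
print(df[['age', 'fare']].mean().round(6))
print(df[['age', 'fare']].std().round(6))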

Separate Features and Target Variable

Next, we separate our features (used for predictions) and the target variable (the outcome we predict).

Python
# Separate features and target variable
X = df.drop(columns=['survived'])
y = df['survived']

print("X:\n", X.head())
print("\ny:\n", y.head())

Expected output:

X:
    pclass  sibsp  parch  alone  sex_male  class_2  class_3  embark_town_Queenstown  \
0        3      1      0  False       1.0      0.0      1.0                     0.0
1        1      1      0  False       0.0      0.0      0.0                     0.0
2        3      0      0   True       0.0      0.0      1.0                     0.0
3        1      1      0  False       0.0      0.0      0.0                     0.0
4        3      0      0   True       1.0      0.0      1.0                     0.0

   embark_town_Southampton  who_man  who_woman  adult_male_True       age  \
0                      1.0      1.0        0.0              1.0 -0.530376
1                      0.0      0.0        1.0              0.0  0.571829
2                      1.0      0.0        1.0              0.0 -0.254596
3                      1.0      0.0        1.0              0.0  0.400810
4                      1.0      1.0        0.0              1.0  0.400810

       fare
0 -0.502445
1  0.788947
2 -0.488854
3  0.420731
4 -0.486337

y:
0    0
1    1
2    1
3    1
4    0
Name: survived, dtype: int64

Here, X contains all the features except survived, and y contains the survived column. Keeping them separate is necessary for training, since scikit-learn models take the features and the target as separate inputs.
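
A quick way to confirm the separation (an optional check, not part of the lesson's output) is to compare shapes: X keeps every column except survived, and y has one value per row.

Python
# Optional: X has one column fewer than df; y is a single column with the same number of rows
print(df.shape, X.shape, y.shape)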

Train-Test Split

Finally, we split the dataset into training and test sets using train_test_split. This lets us train the model on one part of the data and test it on another.

Python
from sklearn.model_selection import train_test_split

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {len(X_train)}, Test set size: {len(X_test)}")

Expected output:

Training set size: 712, Test set size: 179

We split the data so 80% is used for training and 20% for testing. This step is like practicing with some pieces before trying the whole puzzle.
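
When the classes are imbalanced, a common variant is to pass stratify=y so that both sets keep roughly the same proportion of survivors. This is an optional alternative to the split above, not what the lesson uses; the *_s variable names here are just for illustration.

Python
from sklearn.model_selection import train_test_split

# Optional variant: a stratified split keeps the survived ratio similar in both sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The proportion of survivors should be nearly identical in the two parts
print(y_train_s.mean().round(3), y_test_s.mean().round(3))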

Lesson Summary

Today, we:

  1. Loaded and prepared the Titanic dataset.
  2. Handled missing values.
  3. Encoded categorical features.
  4. Scaled numerical features.
  5. Separated features and the target variable.
  6. Split the dataset into training and test sets.

Now, you'll get to practice these steps hands-on. Happy learning!
