Lesson 5
Building a Full Preprocessing Pipeline for the Titanic Dataset
Lesson Introduction

Welcome! Today, we’ll learn how to build a full preprocessing pipeline for the Titanic dataset. In real work, you will often deal with large datasets that have many features and rows.

We aim to learn how to prepare real data for machine learning models by handling missing values, encoding categorical features, scaling numerical features, and splitting the data into training and test sets.

Imagine you have a messy jigsaw puzzle. You need to organize the pieces, find the edges first, and then start assembling. Data preprocessing is like organizing the pieces before starting the puzzle.

Load and Prepare the Data

Let’s start by loading the Titanic dataset with Seaborn. The dataset contains information about each passenger, such as age, fare, and whether they survived. We'll drop some columns we won’t use.

Python
import pandas as pd
import seaborn as sns

# Load the Titanic dataset
df = sns.load_dataset('titanic')

# Drop columns that won't be used
df = df.drop(columns=['deck', 'embarked', 'alive'])

print(df.head())

Expected output:

   survived  pclass     sex   age  sibsp  parch     fare    who  adult_male  \
0         0       3    male  22.0      1      0   7.2500    man        True
1         1       1  female  38.0      1      0  71.2833  woman       False
2         1       3  female  26.0      0      0   7.9250  woman       False
3         1       1  female  35.0      1      0  53.1000  woman       False
4         0       3    male  35.0      0      0   8.0500    man        True

   embark_town  alone
0  Southampton  False
1    Cherbourg  False
2  Southampton   True
3  Southampton  False
4  Southampton   True

We loaded the dataset and dropped the deck, embarked, and alive columns: deck has too many missing values to be useful, embarked is an abbreviated duplicate of embark_town (which we keep), and alive duplicates the survived target, so none of them adds value as a feature.
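
Before dropping anything, it can help to check how much data each column is actually missing. The quick inspection below is an optional sketch, not part of the lesson's pipeline; the df_raw name is used only so it doesn't interfere with the df we already prepared.

Python
# Optional: count missing values per column in the raw dataset
import seaborn as sns

df_raw = sns.load_dataset('titanic')
print(df_raw.isnull().sum().sort_values(ascending=False))

You should see that deck is missing for most rows, while age has far fewer gaps, which is why deck is dropped but age is imputed in the next step.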

Handle Missing Values

Next, let's handle missing values using SimpleImputer from scikit-learn.

Python
from sklearn.impute import SimpleImputer

# Handle missing values
imputer_num = SimpleImputer(strategy='mean')
imputer_cat = SimpleImputer(strategy='most_frequent')

df['age'] = imputer_num.fit_transform(df[['age']])
df['embark_town'] = imputer_cat.fit_transform(df[['embark_town']].values.reshape(-1, 1)).ravel()
df['fare'] = imputer_num.fit_transform(df[['fare']])

print(df.head())

As a reminder, ravel() is a NumPy method that returns a contiguous flattened array. Here it flattens the 2-D column returned by fit_transform() into a 1-D array, so the result fits back into the embark_town column of the DataFrame correctly.

Expected output:

   survived  pclass     sex   age  sibsp  parch     fare    who  adult_male  \
0         0       3    male  22.0      1      0   7.2500    man        True
1         1       1  female  38.0      1      0  71.2833  woman       False
2         1       3  female  26.0      0      0   7.9250  woman       False
3         1       1  female  35.0      1      0  53.1000  woman       False
4         0       3    male  35.0      0      0   8.0500    man        True

   embark_town  alone
0  Southampton  False
1    Cherbourg  False
2  Southampton   True
3  Southampton  False
4  Southampton   True

We filled missing numerical data (age, fare) using the mean and categorical data (embark_town) using the most frequent value. This is like guessing a missing puzzle piece based on surrounding ones.
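
To make the reshape(-1, 1) and ravel() round trip described above easier to picture, here is a tiny standalone NumPy sketch; it is purely illustrative and not part of the pipeline.

Python
import numpy as np

values = np.array(['S', 'C', 'S'])   # 1-D array, shape (3,)
column = values.reshape(-1, 1)       # 2-D column vector, shape (3, 1), what the imputer expects
flat = column.ravel()                # flattened back to 1-D, shape (3,), what a DataFrame column expects

print(column.shape, flat.shape)      # (3, 1) (3,)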

Encode Categorical Features: Part 1

Machine learning models need numerical data, so we use OneHotEncoder to convert categorical features into numbers.

Python
from sklearn.preprocessing import OneHotEncoder

# Encode categorical features
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded_columns = encoder.fit_transform(df[['sex', 'class', 'embark_town', 'who', 'adult_male', 'alone']])
encoded_df = pd.DataFrame(encoded_columns, columns=encoder.get_feature_names_out(['sex', 'class', 'embark_town', 'who', 'adult_male', 'alone']))

Encode Categorical Features: Part 2

Next, we drop the original categorical columns and concatenate the new encoded columns with the DataFrame.

Python
# Drop and concatenate
df = df.drop(columns=['sex', 'class', 'embark_town', 'who', 'adult_male', 'alone'])
df = pd.concat([df.reset_index(drop=True), encoded_df], axis=1)

print(df.head())

Expected output:

   survived  pclass   age  sibsp  parch     fare  alone  sex_male  \
0         0       3  22.0      1      0   7.2500  False       1.0
1         1       1  38.0      1      0  71.2833  False       0.0
2         1       3  26.0      0      0   7.9250   True       0.0
3         1       1  35.0      1      0  53.1000  False       0.0
4         0       3  35.0      0      0   8.0500   True       1.0

   class_2  class_3  embark_town_Queenstown  embark_town_Southampton  \
0      0.0      1.0                     0.0                      1.0
1      0.0      0.0                     0.0                      0.0
2      0.0      1.0                     0.0                      1.0
3      0.0      0.0                     0.0                      1.0
4      0.0      1.0                     0.0                      1.0

   who_man  who_woman  adult_male_True
0      1.0        0.0              1.0
1      0.0        1.0              0.0
2      0.0        1.0              0.0
3      0.0        1.0              0.0
4      1.0        0.0              1.0

We converted the categorical columns into numerical ones, dropped the originals, and added the new encoded columns. It's like translating words into a secret code for a robot.
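
If you want to see exactly what drop='first' does, here is a minimal toy example separate from the Titanic data (the color values are made up for illustration). With three categories, only two indicator columns are produced, and the dropped first category shows up as all zeros.

Python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})
toy_encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded = toy_encoder.fit_transform(toy[['color']])

print(toy_encoder.get_feature_names_out(['color']))  # ['color_green' 'color_red'] -- 'blue' was dropped
print(encoded)                                       # the 'blue' row becomes [0. 0.]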

Feature Scaling

Feature scaling ensures all numerical values are on a similar scale. We use StandardScaler for this.

Python
from sklearn.preprocessing import StandardScaler

# Feature scaling
scaler = StandardScaler()
scaled_columns = scaler.fit_transform(df[['age', 'fare']])
scaled_df = pd.DataFrame(scaled_columns, columns=['age', 'fare'])

# Drop and concatenate
df = df.drop(columns=['age', 'fare'])
df = pd.concat([df.reset_index(drop=True), scaled_df], axis=1)

print(df.head())

Expected output:

   survived  pclass  sibsp  parch  alone  sex_male  class_2  class_3  \
0         0       3      1      0  False       1.0      0.0      1.0
1         1       1      1      0  False       0.0      0.0      0.0
2         1       3      0      0   True       0.0      0.0      1.0
3         1       1      1      0  False       0.0      0.0      0.0
4         0       3      0      0   True       1.0      0.0      1.0

   embark_town_Queenstown  embark_town_Southampton  who_man  who_woman  \
0                     0.0                      1.0      1.0        0.0
1                     0.0                      0.0      0.0        1.0
2                     0.0                      1.0      0.0        1.0
3                     0.0                      1.0      0.0        1.0
4                     0.0                      1.0      1.0        0.0

   adult_male_True       age      fare
0              1.0 -0.530376 -0.502445
1              0.0  0.571829  0.788947
2              0.0 -0.254596 -0.488854
3              0.0  0.400810  0.420731
4              1.0  0.400810 -0.486337

We scaled our numerical data (age, fare) to have a mean of 0 and a standard deviation of 1. This is like resizing puzzle pieces to fit perfectly.
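
As an optional sanity check (not shown in the lesson's expected output), you can confirm that the scaled columns now have a mean close to 0 and a standard deviation close to 1.

Python
# Optional check: means should be ~0 and standard deviations ~1 after scaling
print(df[['age', 'fare']].mean().round(6))
print(df[['age', 'fare']].std().round(6))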

Separate Features and Target Variable

Next, we separate our features (used for predictions) and the target variable (the outcome we predict).

Python
# Separate features and target variable
X = df.drop(columns=['survived'])
y = df['survived']

print("X:\n", X.head())
print("\ny:\n", y.head())

Expected output:

X:
    pclass  sibsp  parch  alone  sex_male  class_2  class_3  embark_town_Queenstown  \
0        3      1      0  False       1.0      0.0      1.0                     0.0
1        1      1      0  False       0.0      0.0      0.0                     0.0
2        3      0      0   True       0.0      0.0      1.0                     0.0
3        1      1      0  False       0.0      0.0      0.0                     0.0
4        3      0      0   True       1.0      0.0      1.0                     0.0

   embark_town_Southampton  who_man  who_woman  adult_male_True       age  \
0                      1.0      1.0        0.0              1.0 -0.530376
1                      0.0      0.0        1.0              0.0  0.571829
2                      1.0      0.0        1.0              0.0 -0.254596
3                      1.0      0.0        1.0              0.0  0.400810
4                      1.0      1.0        0.0              1.0  0.400810

       fare
0 -0.502445
1  0.788947
2 -0.488854
3  0.420731
4 -0.486337

y:
0    0
1    1
2    1
3    1
4    0
Name: survived, dtype: int64

Here, X contains all the features except survived, and y contains the survived column. Keeping them separate is necessary for training, since scikit-learn models take the features and the target as separate inputs.
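
A quick way to confirm the separation (an optional check, not part of the lesson's output) is to compare shapes: X keeps every column except survived, and y has one value per row.

Python
# Optional: X has one column fewer than df; y is a single column with the same number of rows
print(df.shape, X.shape, y.shape)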

Train-Test Split

Finally, we split the dataset into training and test sets using train_test_split. This lets us train the model on one part of the data and test it on another.

Python
from sklearn.model_selection import train_test_split

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {len(X_train)}, Test set size: {len(X_test)}")

Expected output:

Training set size: 712, Test set size: 179

We split the data so 80% is used for training and 20% for testing. This step is like practicing with some pieces before trying the whole puzzle.
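
When the classes are imbalanced, a common variant is to pass stratify=y so that both sets keep roughly the same proportion of survivors. This is an optional alternative to the split above, not what the lesson uses; the *_s variable names here are just for illustration.

Python
from sklearn.model_selection import train_test_split

# Optional variant: a stratified split keeps the survived ratio similar in both sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The proportion of survivors should be nearly identical in the two parts
print(y_train_s.mean().round(3), y_test_s.mean().round(3))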

Lesson Summary

Today, we:

  1. Loaded and prepared the Titanic dataset.
  2. Handled missing values.
  3. Encoded categorical features.
  4. Scaled numerical features.
  5. Separated features and the target variable.
  6. Split the dataset into training and test sets.

Now, you'll get to practice these steps hands-on. Happy learning!
