Wrangling Missing Data: Techniques Applied to the Titanic Dataset

Lesson 2

Lesson Introduction

Welcome to this lesson on handling missing data! Today, we're diving into the Titanic dataset, a window into the early 20th century. Our main aim is to wrangle missing data using Python and Pandas. Don't worry if you're unfamiliar with these terms yet; we'll break them down one by one.

Python: A high-level, interpreted programming language that is easy to learn yet powerful. It has a large ecosystem of libraries, such as Pandas, that simplify data manipulation.
Pandas: A Python library providing high-performance, easy-to-use data structures and data analysis tools.

By the end of this lesson, you'll understand the basics of handling missing data, which is an essential step in preparing your data for machine learning models. So let's get started!

Understanding Missing Data

As an analyst or data scientist, it's important to understand why data might be missing, because the reason helps you choose the best strategy for handling it. Missing data, much like missing puzzle pieces, can arise for several reasons: the values were never collected, they were recorded incorrectly, or they were lost over time.

Furthermore, missing data can be categorised as:

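Missing completely at random (MCAR): The chance that a value is missing is unrelated to any data, whether observed or unobserved.
Missing at random (MAR): The chance that a value is missing depends on the observed values of other variables, but not on the missing value itself.
Missing not at random (MNAR): The chance that a value is missing depends on the missing value itself, even after accounting for the other variables.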
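
One rough way to probe which category might apply is to check whether the rate of missingness in one column varies with another observed column. The sketch below uses a small made-up DataFrame (the values are illustrative and are not taken from the Titanic dataset). A clear difference between groups hints that the data are not MCAR, although this check alone cannot tell MAR apart from MNAR.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'age' is missing more often in third class
toy_df = pd.DataFrame({
    "pclass": [1, 1, 1, 1, 3, 3, 3, 3],
    "age": [38.0, 35.0, np.nan, 54.0, np.nan, np.nan, 22.0, np.nan],
})

# Share of missing 'age' values within each passenger class
missing_rate_by_class = toy_df["age"].isnull().groupby(toy_df["pclass"]).mean()
print(missing_rate_by_class)
```

In this toy data, a quarter of the first-class ages are missing compared with three quarters of the third-class ages, so the missingness is related to pclass.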

Identifying Missing Values in the Titanic Dataset

Before we can consider how to handle missing data, let's learn how to identify it. We'll chain the Pandas DataFrame methods isnull() and sum() to count the missing values in each column of our Titanic dataset:

Python
import seaborn as sns
import pandas as pd

# Import Titanic dataset
titanic_df = sns.load_dataset('titanic')

# Identify missing values
missing_values = titanic_df.isnull().sum()
print(missing_values)

The output from this code will be:

Markdown
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In the output, each column name is followed by the count of missing values in that column. Here, the age, embarked, deck, and embark_town columns contain gaps, with deck missing the most.
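
Raw counts are easier to judge when expressed as a share of all rows. As a minimal sketch, the example below computes the percentage of missing values per column on a small made-up DataFrame (the values are illustrative only), relying on the fact that the mean of a boolean column is the fraction of True entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "deck": [np.nan, "C", np.nan, np.nan],
})

# isnull() marks each missing cell as True; mean() gives the fraction per column
missing_pct = toy_df.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))
```

Applied to the Titanic counts above, the same idea shows why deck stands out: the seaborn Titanic dataset has 891 rows, so 688 missing deck values is roughly 77% of the column, while 177 missing ages is about 20%.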

Strategies to Handle Missing Data

Armed with the knowledge of missing data and its types, it's time to decide how to handle them. Broadly, you can consider three main strategies:

Deletion: This involves removing the rows or columns that contain missing data. However, this can discard valuable information.
Imputation: This involves filling missing values with substitutes such as the mean, median, or mode (the most common value in the column).
Prediction: This involves using a predictive model to estimate the missing values.

A balance of intuition, experience, and technical know-how usually dictates the best method to use.
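
To make the first two strategies concrete, here is a minimal sketch on a small made-up DataFrame (the values are illustrative only). It applies deletion with dropna() and imputation with fillna(), using the median for a numeric column and the mode for a categorical one. Prediction is not shown, since it requires training a separate model.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a numeric and a categorical column
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "embarked": ["S", "C", None, "S"],
})

# Deletion: keep only the rows that have no missing values
deleted_df = toy_df.dropna()

# Imputation: median for the numeric column, mode for the categorical one
imputed_df = toy_df.copy()
imputed_df["age"] = imputed_df["age"].fillna(imputed_df["age"].median())
imputed_df["embarked"] = imputed_df["embarked"].fillna(imputed_df["embarked"].mode()[0])

print(len(deleted_df))                   # rows left after deletion
print(imputed_df.isnull().sum().sum())   # missing cells left after imputation
```

Deletion halves this toy dataset, while imputation keeps every row at the cost of introducing estimated values. That trade-off is what the choice between the two usually comes down to.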

Handling Missing Data in the Titanic Dataset

Let's get our hands dirty and handle missing data firsthand in the Titanic dataset. For the “age” feature, we'll fill in missing entries with the median passenger age. And, for the “deck” feature, where most entries are missing, we'll delete the entire column.

Python
# Dealing with missing values

# Dropping columns with excessive missing data
new_titanic_df = titanic_df.drop(columns=['deck'])

# Imputing median age for missing age data
# (assigning the result back avoids the chained inplace fillna, which newer Pandas versions warn about)
new_titanic_df['age'] = new_titanic_df['age'].fillna(new_titanic_df['age'].median())

# Display the number of missing values post-imputation
missing_values_updated = new_titanic_df.isnull().sum()
print(missing_values_updated)

The updated missing values count comes out to be:

Markdown
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       2
class          0
who            0
adult_male     0
embark_town    2
alive          0
alone          0
dtype: int64

As the updated count shows, the age column no longer has missing values and the deck column is gone. The embarked and embark_town columns still have two missing values each, which we have left untouched here. Note that we could also use the dropna() method to remove rows with missing values, for example titanic_df.dropna(). However, we should be cautious, as this can remove a significant portion of our data.
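
The following sketch shows why caution with dropna() is warranted, again on a small made-up DataFrame (the values are illustrative only). It compares a plain dropna() call with one that uses the subset parameter to check only selected columns.

```python
import pandas as pd

# Hypothetical toy data: 'deck' is mostly missing, 'embarked' has one gap
toy_df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 54.0],
    "embarked": ["S", "C", None, "S", "Q"],
    "deck": [None, "C", None, None, "E"],
})

# dropna() with no arguments removes every row that has any missing value
rows_after_full_drop = len(toy_df.dropna())

# subset= restricts the check to the named columns
rows_after_subset_drop = len(toy_df.dropna(subset=["embarked"]))

print(rows_after_full_drop, rows_after_subset_drop)
```

Here the plain call keeps only two of the five rows because of the sparse deck column, whereas restricting the check to embarked keeps four. On a column with as many gaps as deck, dropping the column first and then dropping rows is usually the gentler option.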

Lesson Summary and Practice

Well done! You have now explored the basics of handling missing data, an essential pre-processing step for any machine learning model. Dealing with missing data is a key arrow in any data scientist's quiver, helping ensure that your data is clean and ready for modeling.

Get set for some upcoming practice sessions that will give you opportunities to apply and reinforce what you've learned today as we venture deeper into the world of data processing. It's time to wield your new skills!
