Welcome to this lesson on handling missing data! Today, we're diving into the Titanic dataset, a snapshot of the early 20th century. Our main aim? To wrangle missing data using Python and Pandas. Don't worry if you're unfamiliar with these terms yet; we'll break them down one by one!
By the end of this lesson, you'll understand the basics of handling missing data, which is an essential step in preparing your data for machine learning models. So let's get started!
As an analyst or data scientist, it's important to understand why data might be missing, because the cause helps you choose the best strategy for handling it. Missing values, like missing puzzle pieces, can arise for several reasons: the data was never collected, it was recorded incorrectly, or it was lost over time.
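Furthermore, missing data is commonly categorised as:
- Missing completely at random (MCAR): the missingness is unrelated to any data, observed or unobserved.
- Missing at random (MAR): the missingness is related to other observed data, but not to the missing value itself.
- Missing not at random (MNAR): the missingness is related to the missing value itself.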
Before we can consider how to handle missing data, let's learn how to identify it. We'll use the `isnull()` and `sum()` methods from the Pandas library to find the number of missing values in each column of our Titanic dataset:
```python
import seaborn as sns
import pandas as pd

# Import Titanic dataset
titanic_df = sns.load_dataset('titanic')

# Identify missing values
missing_values = titanic_df.isnull().sum()
print(missing_values)
```
The output from this code will be:
```
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
```
In the output, you'll see each column name accompanied by a number that denotes the number of missing values in that column.
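Raw counts are easier to judge next to the share of rows they represent. The sketch below uses a small hand-made DataFrame (the name `toy_df` and its values are illustrative assumptions, not Titanic data) to show both the count and the fraction of missing values per column. The same calls should work on `titanic_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few Titanic-style columns
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

# isnull() marks each missing cell as True; sum() counts the Trues per column
missing_counts = toy_df.isnull().sum()

# mean() of the same boolean mask gives the fraction of missing rows per column
missing_fraction = toy_df.isnull().mean()

summary = pd.DataFrame({"missing": missing_counts, "fraction": missing_fraction})
print(summary)
```

A column that is mostly empty, like `deck` in this toy frame, stands out far more clearly as a fraction than as a bare count.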
Armed with the knowledge of missing data and its types, it's time to decide how to handle them. Broadly, you can consider three main strategies:
A balance of intuition, experience, and technical know-how usually dictates the best method to use.
Let's get our hands dirty and handle missing data firsthand in the Titanic dataset. For the `age` feature, we'll fill in missing entries with the median passenger age. And, for the `deck` feature, where most entries are missing, we'll delete the entire column.
```python
# Dealing with missing values

# Dropping columns with excessive missing data
new_titanic_df = titanic_df.drop(columns=['deck'])

# Imputing median age for missing age data
# (assigning the result back avoids a chained in-place fillna, which recent pandas versions warn about)
new_titanic_df['age'] = new_titanic_df['age'].fillna(new_titanic_df['age'].median())

# Display the number of missing values post-imputation
missing_values_updated = new_titanic_df.isnull().sum()
print(missing_values_updated)
```
The updated missing values count comes out to be:
```
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       2
class          0
who            0
adult_male     0
embark_town    2
alive          0
alone          0
dtype: int64
```
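To see what median imputation does on a small scale, here is a minimal sketch on a hand-made Series. The values are hypothetical and chosen only to include one extreme age. It illustrates one common reason for preferring the median: it is less affected by outliers than the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one extreme value and two gaps
ages = pd.Series([20.0, 22.0, 24.0, 80.0, np.nan, np.nan], name="age")

# Both statistics skip NaN by default
median_age = ages.median()  # middle of 20, 22, 24, 80
mean_age = ages.mean()      # pulled upward by the 80

# fillna returns a new Series; the original `ages` is left unchanged
filled = ages.fillna(median_age)

print(median_age, mean_age)
print(filled.isnull().sum())
```

Here the median is 23.0 while the mean is 36.5, so filling with the median keeps the imputed ages close to the typical passenger in this toy sample.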
As the updated counts show, `age` no longer has missing values and the `deck` column is gone; `embarked` and `embark_town` still have two missing entries each, which we have left untouched. Note that we could also use the `dropna()` function to handle missing data by removing rows with missing values. However, we should be cautious, as this might remove a significant portion of our data. Here's how you can do it: `titanic_df.dropna()`.
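The cost of dropping rows is easy to check before committing to it. The following sketch uses a hypothetical toy DataFrame (not the Titanic data) to compare the default `dropna()` with the `subset` parameter, which limits the check to the columns you name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: every row has at least one missing value
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "embarked": ["S", "C", np.nan, "Q"],
})

# Default behaviour: drop every row with at least one missing value
all_dropped = toy_df.dropna()

# subset= restricts the check to the listed columns
embarked_only = toy_df.dropna(subset=["embarked"])

print(len(toy_df), len(all_dropped), len(embarked_only))
```

In this toy frame the default call removes all four rows, while restricting the check to `embarked` removes only one. Comparing row counts like this is a quick way to gauge how much data a deletion strategy would cost.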
Well done! You have now explored the basics of handling missing data, an essential pre-processing step for any machine-learning model. The skill of dealing with missing data is a key arrow in any data scientist's quiver, ensuring that your data is clean and ready for modeling.
Get set for the upcoming practice sessions, which will give you opportunities to apply and reinforce what you've learned today. Feel the thrill as we continue venturing deeper into the world of data processing! It's time to wield your new skills!