Data Cleaning Techniques: Detecting and Handling Missing Data

Intro to Data Cleaning and Preprocessing with Titanic: Lesson 1

Intro to Handling Missing Data

In today's lesson, we look at handling missing data, a common problem in data cleaning and manipulation. In any domain, whether retail, healthcare, finance, or another, dealing with missing data is a crucial step in maintaining the integrity of the dataset and delivering accurate analyses or predictions.

Dealing with missing values is a cornerstone of the data preprocessing pipeline. Data could be missing in real-life scenarios for various reasons - it might not have been collected, perhaps due to human error or system problems. Regardless of why the data is missing, we need to identify and handle these values to ensure that we make accurate and reliable predictions from our data.

Detecting Missing Values in Pandas

Our first step in handling missing data is to detect it. Pandas provides the isnull() method, which returns a Boolean DataFrame of the same shape as the input, with True wherever a value is missing and False everywhere else.

Using our Titanic dataset as an example, let's demonstrate this process:

Python
import seaborn as sns

# Load the dataset
titanic_df = sns.load_dataset('titanic')

# Detect missing values
missing_values = titanic_df.isnull()
print(missing_values.head(10))
"""
   survived  pclass    sex    age  ...   deck  embark_town  alive  alone
0     False   False  False  False  ...   True        False  False  False
1     False   False  False  False  ...  False        False  False  False
2     False   False  False  False  ...   True        False  False  False
3     False   False  False  False  ...  False        False  False  False
4     False   False  False  False  ...   True        False  False  False
5     False   False  False   True  ...   True        False  False  False
6     False   False  False  False  ...  False        False  False  False
7     False   False  False  False  ...   True        False  False  False
8     False   False  False  False  ...   True        False  False  False
9     False   False  False  False  ...   True        False  False  False

[10 rows x 15 columns]
"""

Here, we have a DataFrame of the same size as titanic_df, but instead of actual data, it holds Boolean values with True indicating the presence of a missing datapoint and False standing for a valid existing data point.
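
To see the same idea on a smaller scale, here is a minimal sketch on a hypothetical three-row DataFrame. The toy_df name and its values are made up for illustration and are not taken from the Titanic data. It also uses isna(), which is an alias of isnull(), and any() to flag the columns that contain at least one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not part of the Titanic dataset
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "deck": [None, "C", None],
    "fare": [7.25, 71.28, 8.05],
})

# Boolean mask: True where a value is missing
mask = toy_df.isnull()
print(mask)

# isna() is an alias of isnull() and produces the same mask
same_mask = mask.equals(toy_df.isna())
print(same_mask)

# Which columns contain at least one missing value?
columns_with_missing = mask.any()
print(columns_with_missing)
```

Both None and NaN are treated as missing, which is why the 'deck' and 'age' columns are flagged while 'fare' is not.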

Counting Missing Values in Each Column

While the step above provides a granular look at our missing data, a higher-level view is often more useful: the number of missing values in each column. To get this, Pandas provides a handy method: sum(). Chained after isnull(), it treats each True as 1 and adds them up column by column, giving the total number of missing values per column.

Python
missing_values_count = titanic_df.isnull().sum()
print(missing_values_count)
"""
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
"""

This code calculates and prints the number of missing data points in each column, providing an overview of the completeness of the data in our DataFrame.
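
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on hypothetical toy data (toy_df and the summary table are illustrative names, not part of the lesson's dataset): because True counts as 1, taking mean() of the Boolean mask gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "deck": [None, "C", None, None],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

# Number of missing values per column
missing_count = toy_df.isnull().sum()

# Share of missing values per column, as a percentage
missing_pct = toy_df.isnull().mean() * 100

# Combine both views and list the worst-affected columns first
summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary.sort_values("percent", ascending=False))
```

A column that is mostly empty is usually a candidate for dropping, while one with only a few gaps is a better fit for imputation.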

Dealing with Missing Values: Dropping

Before we proceed to imputation, note that sometimes the simplest way to handle missing data is to drop the rows or columns that contain it. This works best when only a small fraction of the data is missing, so that removing it won't distort our analysis or predictions.

Pandas provides the dropna() method for this purpose. Here's a demonstration:

Python
# Copy the original dataset
titanic_df_copy = titanic_df.copy()

# Drop rows with missing values
titanic_df_copy.dropna(inplace=True)

# Check the dataframe
print(titanic_df_copy.isnull().sum())
# Every column now reports 0 missing values

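In this example, inplace=True makes dropna() modify titanic_df_copy directly instead of returning a new DataFrame. The original titanic_df is untouched because we worked on a copy. Keep in mind that dropping every row with any missing value is costly here: 'deck' alone is missing in 688 of the dataset's 891 rows, so most rows are removed.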
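
When dropping every incomplete row removes too much, dropna() can be narrowed with its subset, axis, and thresh parameters. Below is a minimal sketch on hypothetical toy data (the toy_df values are invented for illustration), showing how to drop rows based on one column only and how to drop sparse columns instead of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, loosely shaped like a few Titanic columns
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0],
    "deck": [None, "C", None, None],
    "embarked": ["S", "C", None, "Q"],
})

# Drop rows only when 'embarked' is missing; gaps elsewhere are tolerated
rows_dropped = toy_df.dropna(subset=["embarked"])

# Drop columns instead of rows: keep columns with at least 3 non-missing values
cols_dropped = toy_df.dropna(axis=1, thresh=3)

print(rows_dropped.shape)
print(list(cols_dropped.columns))
```

Without inplace=True, dropna() returns a new DataFrame each time, so toy_df itself still holds all four rows and three columns.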

Visualizing Missing Data with Seaborn

Visualizing missing data often reveals patterns that counts alone do not. Seaborn's heatmap() function offers a convenient way to do this: given the Boolean DataFrame from isnull(), it draws missing and present values in two contrasting colors, so gaps show up as bands within each column:

Python
import matplotlib.pyplot as plt
import seaborn as sns

# Visualize the detected missing values
plt.figure(figsize=(10,6))
sns.heatmap(titanic_df.isnull(), cmap='viridis')
plt.show()

[Figure: Visualization of Detected Missing Data]

Handling Missing Values: Imputation

It's time to handle the detected missing values. One common strategy is to fill in the missing data values, known as "imputation". We can do this in several ways based on the nature and distribution of our data.

In the case of the 'age' variable in our Titanic dataset (which is numerical), we can fill in missing values with either the mean, median, or mode of the available values. Here's the method demonstrated with mean:

Python
# Impute missing values using mean
# (assigning the result back avoids relying on in-place changes to a column selection)
titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].mean())

# Check the dataframe
print(titanic_df.isnull().sum())
"""
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
"""

Here, the missing values in the 'age' column are replaced with the mean age, so the column no longer contains any missing values.
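
The median and mode mentioned earlier work the same way. Here is a minimal sketch on hypothetical toy data (toy_df and its values are invented for illustration): the median is used for a numerical column with an outlier, and the mode for a categorical column, where a mean is not defined.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numerical and one categorical column
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 80.0],
    "embarked": ["S", "C", None, "S"],
})

# The median is less sensitive to outliers (such as 80.0) than the mean
age_median = toy_df["age"].median()
toy_df["age"] = toy_df["age"].fillna(age_median)

# mode() returns a Series because ties are possible, so take the first entry
embarked_mode = toy_df["embarked"].mode()[0]
toy_df["embarked"] = toy_df["embarked"].fillna(embarked_mode)

print(toy_df)
```

Both median() and mode() skip missing values by default, so the statistics are computed from the observed values only.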

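Another option is forward fill or backward fill, where each missing value is filled with the previous or next valid value in the column, respectively. Pandas exposes these as the ffill() and bfill() methods; the older fillna(method='bfill') spelling is deprecated in recent versions. Because the mean imputation above already filled every gap in 'age', we reload the dataset first so there is something to fill:

Python
# Reload the dataset so 'age' contains missing values again
titanic_df = sns.load_dataset('titanic')

# Impute missing values using backward fill
titanic_df['age'] = titanic_df['age'].bfill()

# Check the dataframe
print(titanic_df.isnull().sum())
# The output is the same as in the previous example

In the above example, each missing value in the 'age' column is filled with the next valid value below it. bfill() returns a new Series, so we assign the result back to the column for the change to be reflected in the DataFrame.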
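
One caveat with directional fills is that they cannot fill a gap that has no neighbor in the fill direction. The following minimal sketch uses a hypothetical Series (ages is an illustrative name with made-up values) that starts and ends with a missing value to show the difference between the two directions.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with gaps at the start, middle, and end
ages = pd.Series([np.nan, 22.0, np.nan, 35.0, np.nan])

# Forward fill copies the previous valid value downward;
# the leading gap has no previous value and stays missing
forward = ages.ffill()

# Backward fill copies the next valid value upward;
# the trailing gap has no next value and stays missing
backward = ages.bfill()

print(forward.tolist())
print(backward.tolist())

# Chaining both directions closes the gaps at either end
filled = ages.ffill().bfill()
print(filled.tolist())
```

It is worth re-running isnull().sum() after a directional fill to confirm that no leading or trailing gaps remain.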

Wrapping Up

We have now navigated through the important topic of missing data handling. We have learned, with hands-on examples, how to detect, analyze, visualize, and then handle missing data effectively. Remember, the choice of method to handle missing data depends largely on the nature of the data and the domain requirement, making it an essential skill in the field of data preprocessing and analysis.

Now that we have learned about the problem of missing data and explored various strategies to handle it, it's time to apply this knowledge in some practice exercises. Up next is your opportunity to polish these new skills and evaluate your understanding.

Carry on coding!
