Lesson 1

Data Cleaning Techniques: Detecting and Handling Missing Data

Intro to Handling Missing Data

In today's lesson, we look at handling missing data, a common task in data cleaning and manipulation. In any domain, whether retail, healthcare, or finance, dealing with missing data is a crucial step in preserving the integrity of a dataset and producing accurate analyses or predictions.

Dealing with missing values is a cornerstone of the data preprocessing pipeline. Data could be missing in real-life scenarios for various reasons - it might not have been collected, perhaps due to human error or system problems. Regardless of why the data is missing, we need to identify and handle these values to ensure that we make accurate and reliable predictions from our data.

Detecting Missing Values in Pandas

Our first step in handling missing data is to detect the missing values. Pandas provides the isnull() method, which returns a Boolean DataFrame of the same shape as the input, with True wherever a value is missing and False everywhere else.

Using our Titanic dataset as an example, let's demonstrate this process:

    import seaborn as sns

    # Load the dataset
    titanic_df = sns.load_dataset('titanic')

    # Detect missing values
    missing_values = titanic_df.isnull()
    print(missing_values.head(10))
    """
       survived  pclass    sex    age  ...   deck  embark_town  alive  alone
    0     False   False  False  False  ...   True        False  False  False
    1     False   False  False  False  ...  False        False  False  False
    2     False   False  False  False  ...   True        False  False  False
    3     False   False  False  False  ...  False        False  False  False
    4     False   False  False  False  ...   True        False  False  False
    5     False   False  False   True  ...   True        False  False  False
    6     False   False  False  False  ...  False        False  False  False
    7     False   False  False  False  ...   True        False  False  False
    8     False   False  False  False  ...   True        False  False  False
    9     False   False  False  False  ...   True        False  False  False

    [10 rows x 15 columns]
    """

Here, we have a DataFrame of the same size as titanic_df, but instead of actual data, it holds Boolean values with True indicating the presence of a missing datapoint and False standing for a valid existing data point.
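To see what isnull() flags on a frame small enough to check by eye, here is a minimal sketch. The `toy_df` name and its values are made up for illustration and are not part of the Titanic data. It also shows isna(), which pandas offers as an alias of isnull(), and notnull(), its element-wise inverse.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: NaN and None both count as missing
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "deck": ["C", None, None],
})

mask = toy_df.isnull()  # True where a value is missing
print(mask)

# isna() is an alias of isnull(); notnull() is the element-wise inverse
same_as_isna = mask.equals(toy_df.isna())
inverse_ok = (~mask).equals(toy_df.notnull())
print(same_as_isna, inverse_ok)
```

Which spelling to use is a matter of taste; the two names behave identically.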

Counting Missing Values in Each Column

While the step above provides a granular look at our missing data, a higher-level view is often more useful: the number of missing values in each column. To get this, we chain sum() after isnull(). Because True is treated as 1 and False as 0, the result is the total number of missing values in each column.

    missing_values_count = titanic_df.isnull().sum()
    print(missing_values_count)
    """
    survived         0
    pclass           0
    sex              0
    age            177
    sibsp            0
    parch            0
    fare             0
    embarked         2
    class            0
    who              0
    adult_male       0
    deck           688
    embark_town      2
    alive            0
    alone            0
    dtype: int64
    """

This code calculates and prints the number of missing data points in each column, providing an overview of the completeness of the data in our DataFrame.
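Raw counts are easier to interpret next to the share of rows they represent. The sketch below is one possible way to get both, using a made-up `toy_df` (an assumption for illustration, not the Titanic data): since the mean of a Boolean column is the fraction of True values, isnull().mean() gives the fraction missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.28, 8.05, 53.10],
    "deck": [None, "C", None, None],
})

missing_count = toy_df.isnull().sum()         # True counts as 1
missing_share = toy_df.isnull().mean() * 100  # percentage of rows missing

# Combine both views into one summary table, worst columns first
summary = pd.DataFrame({"count": missing_count, "percent": missing_share})
print(summary.sort_values("percent", ascending=False))
```

A column that is mostly empty may call for a different treatment than one with only a few gaps.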

Dealing with Missing Values: Dropping

Before we proceed to imputation methods, it is worth mentioning that sometimes the simplest way to handle missing data is to drop the rows or columns that contain it, especially when only a small fraction of the data is missing and removing it would not affect our analysis or predictions.

Pandas provides the dropna() method for this purpose. Here's a demonstration:

    # Copy the original dataset
    titanic_df_copy = titanic_df.copy()

    # Drop rows with missing values
    titanic_df_copy.dropna(inplace=True)

    # Check the dataframe
    print(titanic_df_copy.isnull().sum())
    # Every column now reports 0 missing values

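In this example, inplace=True makes dropna() modify titanic_df_copy directly instead of returning a new DataFrame. The original titanic_df is left untouched because we worked on a copy.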
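By default, dropna() removes every row that has at least one missing value, which can discard far more data than intended when a single column is mostly empty. The sketch below, on a made-up `toy_df` (an assumption for illustration), contrasts the default with the `subset` and `axis` parameters, which narrow what gets dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'deck' is mostly missing, 'age' only occasionally
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 54.0],
    "fare": [7.25, 71.28, 8.05, 51.86],
    "deck": [None, "C", None, None],
})

# Default: drop every row that has at least one missing value
all_complete = toy_df.dropna()

# Only drop rows where 'age' is missing; gaps in 'deck' are tolerated
age_known = toy_df.dropna(subset=["age"])

# Drop columns (axis=1) that contain any missing value
complete_columns = toy_df.dropna(axis=1)

print(len(all_complete), len(age_known), list(complete_columns.columns))
```

In this toy frame the default leaves no rows at all, while restricting the check to 'age' keeps three of the four.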

Visualizing Missing Data with Seaborn

Visualizing missing data often reveals patterns that counts alone do not. Seaborn's heatmap() function is a convenient tool for this: when it is given the Boolean DataFrame from isnull(), it draws missing and present values in two contrasting colors:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Visualize the detected missing values
    plt.figure(figsize=(10, 6))
    sns.heatmap(titanic_df.isnull(), cmap='viridis')
    plt.show()

[Figure: Visualization of detected missing data]
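A heat map shows where the gaps are; a bar chart of the missing share per column can complement it when the question is simply which columns are worst. This is a rough sketch using plain matplotlib on a made-up `toy_df`. The data, the output filename, and the choice of the non-interactive Agg backend are all assumptions made so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.28, 8.05, 53.10],
    "deck": [None, "C", None, None],
})

# Fraction of rows missing in each column, largest first
missing_share = toy_df.isnull().mean().sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(list(missing_share.index), missing_share.values)
ax.set_ylabel("Share of rows missing")
ax.set_title("Missing data by column")
fig.savefig("missing_by_column.png")  # hypothetical output filename
```

In an interactive session, plt.show() can replace the savefig() call.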

Handling Missing Values: Imputation

It's time to handle the detected missing values. One common strategy is to fill in the missing data values, known as "imputation". We can do this in several ways based on the nature and distribution of our data.

In the case of the 'age' variable in our Titanic dataset (which is numerical), we can fill in missing values with either the mean, median, or mode of the available values. Here's the method demonstrated with mean:

    # Impute missing values using mean
    # (assign the result back; inplace=True on a selected column is unreliable in recent pandas versions)
    titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].mean())

    # Check the dataframe
    print(titanic_df.isnull().sum())
    """
    survived         0
    pclass           0
    sex              0
    age              0
    sibsp            0
    parch            0
    fare             0
    embarked         2
    class            0
    who              0
    adult_male       0
    deck           688
    embark_town      2
    alive            0
    alone            0
    dtype: int64
    """

Here, the missing values in the 'age' column are filled with the mean age, so the column no longer has any missing values.
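The median and mode mentioned earlier follow the same pattern as the mean. The sketch below uses a made-up `toy_df` (an assumption for illustration): the median fills a numeric column and is less sensitive to outliers than the mean, while the mode, the most frequent value, is a common choice for a categorical column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column
toy_df = pd.DataFrame({
    "age": [20.0, np.nan, 30.0, 70.0],
    "embarked": ["S", "C", None, "S"],
})

# Median of the available ages, ignoring the missing one
median_age = toy_df["age"].median()
toy_df["age"] = toy_df["age"].fillna(median_age)

# mode() returns a Series (there can be ties), so take the first entry
most_common_port = toy_df["embarked"].mode()[0]
toy_df["embarked"] = toy_df["embarked"].fillna(most_common_port)

print(toy_df)
```

Which statistic fits best depends on the column: a skewed numeric column usually favors the median, and a categorical one has no mean at all.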

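Another option is forward fill or backward fill, where each missing value is replaced with the previous or next valid value in the column, respectively. Because 'age' was already filled with the mean above, we reload the dataset first so that there is something left to fill:

    # Reload the dataset so 'age' contains missing values again
    titanic_df = sns.load_dataset('titanic')

    # Impute missing values using backward fill
    # (fillna(method='bfill') is deprecated in recent pandas versions; bfill() is the direct equivalent)
    titanic_df['age'] = titanic_df['age'].bfill()

    # Check the dataframe
    print(titanic_df.isnull().sum())
    # 'age' again shows 0 missing values, as in the previous example

In the above example, each missing value in the 'age' column is filled with the next valid value below it. Note that backward fill cannot fill a missing value in the last row, because nothing follows it; forward fill has the same limitation for the first row.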
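The difference between the two directions is easiest to see side by side. Here is a minimal sketch on a made-up Series (the `ages` values are an assumption for illustration) with gaps at the start, in the middle, and at the end.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with gaps at the start, middle, and end
ages = pd.Series([np.nan, 22.0, np.nan, 35.0, np.nan])

forward = ages.ffill()   # copy the previous valid value downward
backward = ages.bfill()  # copy the next valid value upward

print(pd.DataFrame({"original": ages, "ffill": forward, "bfill": backward}))
```

Each direction leaves one edge gap unfilled here, so it is worth re-checking isnull().sum() after either fill.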

Wrapping Up

We have now worked through the important topic of handling missing data. We have learned, with hands-on examples, how to detect, count, visualize, and then handle missing values. Remember, the right method depends largely on the nature of the data and the requirements of the domain, which is why choosing it well is an essential skill in data preprocessing and analysis.

Now that we have learned about the problem of missing data and explored various strategies to handle it, it's time to apply this knowledge in some practice exercises. Up next is your opportunity to polish these new skills and evaluate your understanding.

Carry on coding!
