Lesson 2

Handling Missing Values

Introduction

Welcome to our Handling Missing Values lesson. Missing values in data sheets can complicate data analysis. Incorrect handling can lead to inaccurate results. So, we'll learn how to manage these values using Python's Pandas.

Missing Data in Datasets

Missing data in datasets is common. It occurs when no data values are stored for certain variable observations. It can cause bias, make some functions inapplicable, and obscure insightful data patterns. Consider a dataset of student scores:

Python
1import pandas as pd 2 3data = {'Name': ['Anna', 'Bob', 'Charlie', 'David', None], 4 'Score': [85, 88, None, 92, 90]} 5df = pd.DataFrame(data) 6print(df) 7# Output: 8# Name Score 9# 0 Anna 85.0 10# 1 Bob 88.0 11# 2 Charlie NaN 12# 3 David 92.0 13# 4 None 90.0

"Charlie" has a missing score (None).

Identifying Missing Values with Pandas

Before handling missing values, we must identify them. Pandas' functions isnull() and notnull() can perform this task. isnull() returns a DataFrame where each cell is either True or False depending on that cell's null status.

From our student scores data:

Python
1print(df.isnull()) 2# Output: 3# Name Score 4# 0 False False 5# 1 False False 6# 2 False True 7# 3 False False 8# 4 True False

The None (missing) value for "Charlie" returns True when isnull() is used. notnull works similarly, but returns exactly opposite values: True is for present value!:

Python
1print(df.notnull()) 2# Output: 3# Name Score 4# 0 True True 5# 1 True True 6# 2 True False 7# 3 True True 8# 4 False True
Handling Missing Values: Removal, Part 1

After identifying missing values, the next step is handling them. The strategy depends on the nature of our data and analysis purpose. A common strategy is to remove rows with None values using the dropna() function:

Python
1print(df.dropna()) 2# Output: 3# Name Score 4# 0 Anna 85.0 5# 1 Bob 88.0 6# 3 David 92.0

"Charlie"'s row is removed because it contained a null value. Also the one row with a missing name is removed.

Handling Missing Values: Removal, Part 2

To scan only specific columns for missing values with dropna(), you can use the subset argument to specify which columns to check for missing values. Here's an example:

Python
1# Drop rows where 'Score' column has missing values 2print(df.dropna(subset=['Score'])) 3# Output: 4# Name Score 5# 0 Anna 85.0 6# 1 Bob 88.0 7# 3 David 92.0 8# 4 None 90.0

As you can see, the fourth row is not removed. Though it contains a missing value in the Name column, this time we only remove rows with missing Score

Handling Missing Values: Replacement

Another strategy is to fill missing values with a specific value or a calculated value such as the mean, median, or mode. The fillna() function can achieve this:

Python
1print(df.fillna(0)) 2# Output: 3# Name Score 4# 0 Anna 85.0 5# 1 Bob 88.0 6# 2 Charlie 0.0 7# 3 David 92.0

"Charlie"'s score is replaced with 0.

Additionally, you can use forward fill (ffill) or backward fill (bfill) to propagate the next or previous value:

Python
1# Using backward fill 2print(df.fillna(method='bfill')) 3# Output: 4# Name Score 5# 0 Anna 85.0 6# 1 Bob 88.0 7# 2 Charlie 92.0 8# 3 David 92.0

Here, "Charlie"'s score is filled using the next available score, which is 92.0 from "David". In the newest pandas versions, there are separate methods df.ffill() and df.bfill() for this.

Handling Missing Values in Real-world Scenarios

The real-world strategy to handle missing values relies on the data's nature and the analysis ambition. If we're analyzing average student scores, it may be better to fill missing values with the non-missing values' mean. Here's an example:

Python
1mean_score = df['Score'].mean() 2print(df.fillna(mean_score)) 3# Output 4# Name Score 5# 0 Anna 85.000000 6# 1 Bob 88.000000 7# 2 Charlie 88.333333 8# 3 David 92.000000
Summary

Handling missing values creates clean datasets, which offer a better basis for data analysis. You've now learned how to handle missing values in Python datasets! Are you ready to practice what we've covered? A hands-on approach is the best way to learn and understand data analysis. You're doing great! Let's keep going!

Enjoy this lesson? Now it's time to practice with Cosmo!

Practice is how you turn knowledge into actual skills.