Handling Missing Values

Introduction to Data Cleaning and TransformationLesson 2

Lesson 2

Introduction

Welcome to our Handling Missing Values lesson. Missing values in data sheets can complicate data analysis. Incorrect handling can lead to inaccurate results. So, we'll learn how to manage these values using Python's Pandas.

Missing Data in Datasets

Missing data in datasets is common. It occurs when no data values are stored for certain variable observations. It can cause bias, make some functions inapplicable, and obscure insightful data patterns. Consider a dataset of student scores:

Python
1import pandas as pd
2
3data = {'Name': ['Anna', 'Bob', 'Charlie', 'David', None],
4        'Score': [85, 88, None, 92, 90]}
5df = pd.DataFrame(data)
6print(df)
7# Output:
8#       Name  Score
9# 0     Anna   85.0
10# 1      Bob   88.0
11# 2  Charlie    NaN
12# 3    David   92.0
13# 4     None   90.0

"Charlie" has a missing score (None).

Identifying Missing Values with Pandas

Before handling missing values, we must identify them. Pandas' functions isnull() and notnull() can perform this task. isnull() returns a DataFrame where each cell is either True or False depending on that cell's null status.

From our student scores data:

Python
1print(df.isnull())
2# Output: 
3#    Name  Score
4# 0 False  False
5# 1 False  False
6# 2 False   True
7# 3 False  False
8# 4  True  False

The None (missing) value for "Charlie" returns True when isnull() is used. notnull works similarly, but returns exactly opposite values: True is for present value!:

Python
1print(df.notnull())
2# Output:
3#    Name  Score
4# 0  True   True
5# 1  True   True
6# 2  True  False
7# 3  True   True
8# 4  False  True

Handling Missing Values: Removal, Part 1

After identifying missing values, the next step is handling them. The strategy depends on the nature of our data and analysis purpose. A common strategy is to remove rows with None values using the dropna() function:

Python
1print(df.dropna())
2# Output:
3#    Name  Score
4# 0  Anna   85.0
5# 1   Bob   88.0
6# 3 David   92.0

"Charlie"'s row is removed because it contained a null value. Also the one row with a missing name is removed.

Handling Missing Values: Removal, Part 2

To scan only specific columns for missing values with dropna(), you can use the subset argument to specify which columns to check for missing values. Here's an example:

Python
1# Drop rows where 'Score' column has missing values
2print(df.dropna(subset=['Score']))
3# Output:
4#    Name  Score
5# 0  Anna   85.0
6# 1   Bob   88.0
7# 3 David   92.0
8# 4  None   90.0

As you can see, the fourth row is not removed. Though it contains a missing value in the Name column, this time we only remove rows with missing Score

Handling Missing Values: Replacement

Another strategy is to fill missing values with a specific value or a calculated value such as the mean, median, or mode. The fillna() function can achieve this:

Python
1print(df.fillna(0))
2# Output:
3#       Name  Score
4# 0     Anna   85.0
5# 1      Bob   88.0
6# 2  Charlie    0.0
7# 3    David   92.0

"Charlie"'s score is replaced with 0.

Additionally, you can use forward fill (ffill) or backward fill (bfill) to propagate the next or previous value:

Python
1# Using backward fill
2print(df.fillna(method='bfill'))
3# Output:
4#       Name  Score
5# 0     Anna   85.0
6# 1      Bob   88.0
7# 2  Charlie   92.0
8# 3    David   92.0

Here, "Charlie"'s score is filled using the next available score, which is 92.0 from "David". In the newest pandas versions, there are separate methods df.ffill() and df.bfill() for this.

Handling Missing Values in Real-world Scenarios

The real-world strategy to handle missing values relies on the data's nature and the analysis ambition. If we're analyzing average student scores, it may be better to fill missing values with the non-missing values' mean. Here's an example:

Python
1mean_score = df['Score'].mean()
2print(df.fillna(mean_score))
3# Output
4#       Name      Score
5# 0     Anna  85.000000
6# 1      Bob  88.000000
7# 2  Charlie  88.333333
8# 3    David  92.000000

Summary

Handling missing values creates clean datasets, which offer a better basis for data analysis. You've now learned how to handle missing values in Python datasets! Are you ready to practice what we've covered? A hands-on approach is the best way to learn and understand data analysis. You're doing great! Let's keep going!

Enjoy this lesson? Now it's time to practice with Cosmo!

Practice is how you turn knowledge into actual skills.