Welcome to our Handling Missing Values lesson. Missing values in data sheets can complicate data analysis. Incorrect handling can lead to inaccurate results. So, we'll learn how to manage these values using Python's Pandas
.
Missing data in datasets is common. It occurs when no data values are stored for certain variable observations. It can cause bias, make some functions inapplicable, and obscure insightful data patterns. Consider a dataset of student scores:
Python1import pandas as pd 2 3data = {'Name': ['Anna', 'Bob', 'Charlie', 'David', None], 4 'Score': [85, 88, None, 92, 90]} 5df = pd.DataFrame(data) 6print(df) 7# Output: 8# Name Score 9# 0 Anna 85.0 10# 1 Bob 88.0 11# 2 Charlie NaN 12# 3 David 92.0 13# 4 None 90.0
"Charlie" has a missing score (None
).
Before handling missing values, we must identify them. Pandas' functions isnull()
and notnull()
can perform this task. isnull()
returns a DataFrame where each cell is either True
or False
depending on that cell's null status.
From our student scores data:
Python1print(df.isnull()) 2# Output: 3# Name Score 4# 0 False False 5# 1 False False 6# 2 False True 7# 3 False False 8# 4 True False
The None
(missing) value for "Charlie" returns True
when isnull()
is used. notnull
works similarly, but returns exactly opposite values: True
is for present value!:
Python1print(df.notnull()) 2# Output: 3# Name Score 4# 0 True True 5# 1 True True 6# 2 True False 7# 3 True True 8# 4 False True
After identifying missing values, the next step is handling them. The strategy depends on the nature of our data and analysis purpose. A common strategy is to remove rows with None
values using the dropna()
function:
Python1print(df.dropna()) 2# Output: 3# Name Score 4# 0 Anna 85.0 5# 1 Bob 88.0 6# 3 David 92.0
"Charlie"'s row is removed because it contained a null value. Also the one row with a missing name is removed.
To scan only specific columns for missing values with dropna()
, you can use the subset
argument to specify which columns to check for missing values. Here's an example:
Python1# Drop rows where 'Score' column has missing values 2print(df.dropna(subset=['Score'])) 3# Output: 4# Name Score 5# 0 Anna 85.0 6# 1 Bob 88.0 7# 3 David 92.0 8# 4 None 90.0
As you can see, the fourth row is not removed. Though it contains a missing value in the Name
column, this time we only remove rows with missing Score
Another strategy is to fill missing values with a specific value or a calculated value such as the mean, median, or mode. The fillna()
function can achieve this:
Python1print(df.fillna(0)) 2# Output: 3# Name Score 4# 0 Anna 85.0 5# 1 Bob 88.0 6# 2 Charlie 0.0 7# 3 David 92.0
"Charlie"'s score is replaced with 0
.
Additionally, you can use forward fill (ffill
) or backward fill (bfill
) to propagate the next or previous value:
Python1# Using backward fill 2print(df.fillna(method='bfill')) 3# Output: 4# Name Score 5# 0 Anna 85.0 6# 1 Bob 88.0 7# 2 Charlie 92.0 8# 3 David 92.0
Here, "Charlie"'s score is filled using the next available score, which is 92.0
from "David". In the newest pandas versions, there are separate methods df.ffill()
and df.bfill()
for this.
The real-world strategy to handle missing values relies on the data's nature and the analysis ambition. If we're analyzing average student scores, it may be better to fill missing values with the non-missing values' mean. Here's an example:
Python1mean_score = df['Score'].mean() 2print(df.fillna(mean_score)) 3# Output 4# Name Score 5# 0 Anna 85.000000 6# 1 Bob 88.000000 7# 2 Charlie 88.333333 8# 3 David 92.000000
Handling missing values creates clean datasets, which offer a better basis for data analysis. You've now learned how to handle missing values in Python datasets! Are you ready to practice what we've covered? A hands-on approach is the best way to learn and understand data analysis. You're doing great! Let's keep going!