Hello there! Today, we're going to talk about handling missing values in datasets for machine learning. Why is this important? Imagine you are building a model to predict house prices, but some houses are missing information about their size or the number of bedrooms. These missing values can affect the performance of your model. In this lesson, you'll learn why data might be missing, different ways to handle it, and how to use Python libraries to do so.
By the end of this lesson, you'll know why handling missing values is crucial, understand different strategies to deal with them, and be able to use Python tools to handle missing data efficiently.
Why does data go missing? There are many reasons:
- Human Error: Sometimes, people forget to fill in all the fields when entering data.
- System Error: The system collecting the data might malfunction and fail to record some values.
- Other Reasons: Data may be intentionally left out for privacy reasons.
There are three common types of missing data:
- MCAR (Missing Completely at Random): The data is missing purely by chance, with no pattern at all (for example, a sensor that occasionally fails to record a reading).
- MAR (Missing at Random): The missingness follows a pattern, but the pattern depends on other observed data rather than on the missing values themselves (for example, younger respondents skipping an income question more often).
- MNAR (Missing Not at Random): The missingness depends on the missing values themselves (for example, people with very high incomes declining to report them).
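To make the first type concrete, here's a minimal sketch of MCAR-style missingness (the column name and rate are made up for illustration): every value gets the same chance of going missing, independent of the data itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({'income': rng.normal(50_000, 10_000, size=100)})

# MCAR: each value has the same 10% chance of being dropped,
# regardless of its own value or any other column
mcar_mask = rng.random(len(df)) < 0.1
df.loc[mcar_mask, 'income'] = np.nan
```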
Handling missing values can be done in several ways:
- Deletion: If the missing data is a small percentage, you might just delete those rows or columns. But be careful: if you remove too much data, you might lose important information.
- Imputation: You can replace the missing values with a substitute, such as the mean, median, or mode of the column, or a constant value. This method is often more suitable because it keeps the rest of the data intact.
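Before we get to scikit-learn's tools, note that pandas can do simple imputation on its own with `fillna()`. Here's a minimal sketch, assuming a numeric `Score` column:

```python
import pandas as pd

df = pd.DataFrame({'Score': [85, 88, None, 92, 90]})

# Replace the missing score with the mean of the observed scores
df['Score'] = df['Score'].fillna(df['Score'].mean())
print(df)
```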
Dropping missing values is straightforward with pandas DataFrames. Let's recall how it works.
Let's consider this simple dataset:
```python
import pandas as pd

data = {'Name': ['Anna', 'Bob', 'Charlie', 'David', None],
        'Score': [85, 88, None, 92, 90]}
df = pd.DataFrame(data)
```
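Printing `df` shows the two missing entries; pandas displays the missing name as `None` (an object column) and the missing score as `NaN` (a float column):

```
      Name  Score
0     Anna   85.0
1      Bob   88.0
2  Charlie    NaN
3    David   92.0
4     None   90.0
```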
Let's remove all rows containing missing values using the `dropna()` function:
```python
print(df.dropna())
```
The output is:
```
    Name  Score
0   Anna   85.0
1    Bob   88.0
3  David   92.0
```
"Charlie"'s row is removed because it contained a null value. Also the one row with a missing name is removed.
To check only specific columns for missing values, pass the `subset` argument to `dropna()`. Here's an example:
```python
# Drop rows where the 'Score' column has missing values
print(df.dropna(subset=['Score']))
```
The output is:
```
    Name  Score
0   Anna   85.0
1    Bob   88.0
3  David   92.0
4   None   90.0
```
As you can see, the row with index 4 is no longer removed. Though it contains a missing value in the `Name` column, this time we only drop rows with a missing `Score`.
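`dropna()` can also remove entire columns instead of rows. Here's a quick sketch on the same DataFrame, and a good illustration of the earlier warning about deleting too much:

```python
# axis=1 drops every column that contains at least one missing value.
# Here both 'Name' and 'Score' have a missing entry, so both are
# dropped, leaving an empty DataFrame!
print(df.dropna(axis=1))
```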
One of the easiest ways to handle missing values in Python is by using the `SimpleImputer` class from the `sklearn.impute` module. Let's break it down.
The `SimpleImputer` has a few strategies you can use:
- `mean`: Replaces missing values with the mean of each column.
- `median`: Replaces missing values with the median of each column.
- `most_frequent`: Replaces missing values with the most frequent value in each column.
- `constant`: Replaces missing values with a constant value you provide.
Let's walk through some code that handles missing values using the `SimpleImputer`.
First, we need a dataset. We'll use the `pandas` library to create one with some missing values.
```python
import numpy as np
import pandas as pd

# Create a sample dataset with missing values
data = {
    'Feature1': [1, 2, np.nan, 4],
    'Feature2': [7, 6, 5, np.nan]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
```
Output:
```
   Feature1  Feature2
0       1.0       7.0
1       2.0       6.0
2       NaN       5.0
3       4.0       NaN
```
Note that we use `np.nan` here instead of `None`. `None` is a Python singleton object representing missing values across all data types, while `np.nan` is a floating-point "Not a Number" value from the `numpy` library, used specifically for numeric data. `None` is versatile and not tied to any library, but it may cause errors in operations unless explicitly handled. In contrast, `np.nan` is tailored for numerical computations, supporting vectorized operations in `numpy` and `pandas`, which makes it more suitable for missing numerical values.
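You can see the difference directly: `np.nan` propagates quietly through arithmetic, while `None` raises an error unless you handle it explicitly.

```python
import numpy as np

print(np.nan + 1)   # nan: arithmetic propagates the missing value

try:
    None + 1        # None does not support arithmetic
except TypeError as e:
    print(f"TypeError: {e}")
```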
Here, we use the `SimpleImputer` from `sklearn.impute` to handle the missing values. In this case, we'll use the `mean` strategy, meaning the missing values are replaced with the mean value of the corresponding column. Note that missing values are excluded when calculating the mean: `Feature1`'s mean is computed from 1, 2, and 4 only, giving 7/3 ≈ 2.33.
```python
from sklearn.impute import SimpleImputer

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(df)
print("Imputed Data:")
print(imputed_data)
```
Output:
```
[[1.         7.        ]
 [2.         6.        ]
 [2.33333333 5.        ]
 [4.         6.        ]]
```
The result of the imputation is a NumPy array. Let's convert it back to a `DataFrame` for better readability.
```python
# Convert the NumPy array back to a DataFrame for better readability
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
print("DataFrame after handling missing values:")
print(imputed_df)
```
Output:
```
   Feature1  Feature2
0  1.000000       7.0
1  2.000000       6.0
2  2.333333       5.0
3  4.000000       6.0
```
Notice how we use `df.columns` to assign the same column names we had before.
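The other strategies work the same way; only the arguments change. Here's a quick sketch of `most_frequent` and `constant` applied to the same data:

```python
from sklearn.impute import SimpleImputer

# most_frequent: fill each column's gaps with its most common value
mode_imputer = SimpleImputer(strategy='most_frequent')
print(mode_imputer.fit_transform(df))

# constant: fill every gap with a fixed value you supply via fill_value
constant_imputer = SimpleImputer(strategy='constant', fill_value=0)
print(constant_imputer.fit_transform(df))
```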
Sometimes, you may want to impute only specific columns in your dataset. You can achieve this by selecting those columns and applying the `SimpleImputer` to them. Here's how you can do it.
Let's use the same dataset that we created earlier:
```python
from sklearn.impute import SimpleImputer

# Select the column to impute (double brackets keep it 2D, as the imputer expects)
feature1 = df[['Feature1']]

# Create the SimpleImputer instance
imputer = SimpleImputer(strategy='mean')

# Fit and transform the data
feature1_imputed = imputer.fit_transform(feature1)

# Update the DataFrame
df['Feature1'] = feature1_imputed
print("DataFrame after imputing Feature1:")
print(df)
```
Output:
```
   Feature1  Feature2
0  1.000000       7.0
1  2.000000       6.0
2  2.333333       5.0
3  4.000000       NaN
```
In this example, the missing value in `Feature1` is replaced by the mean of the other values in that column. The `Feature2` column remains unchanged. This approach allows you to target specific columns that need imputation while leaving others untouched.
In the same manner, you can impute values into any subset of columns.
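For instance, here's a minimal sketch that imputes two columns in one call; `fit_transform` accepts a multi-column selection and returns an array of the same shape:

```python
from sklearn.impute import SimpleImputer

# Impute both feature columns in a single call
cols = ['Feature1', 'Feature2']
imputer = SimpleImputer(strategy='mean')
df[cols] = imputer.fit_transform(df[cols])
print(df)
```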
Great job! 🎉 You've learned why handling missing values is crucial, discovered different strategies to tackle missing data, and practiced using `SimpleImputer` to handle missing values in a sample dataset. Missing data is a common issue, but now you have the tools to manage it and improve the quality of your datasets.
Now that you've learned the theory, it's time to get hands-on practice! In the practice session, you'll handle missing values in various datasets, experiment with different imputation strategies, and observe the outcomes. This practice will help solidify your understanding and make you more confident in managing missing data for your machine learning projects. Let's get started! 🚀