Handling Missing Values

Introduction to Machine Learning with SciKit Learn

Data Preprocessing For Machine Learning: Lesson 1


Lesson Introduction

Hello there! Today, we're going to talk about handling missing values in datasets for machine learning. Why is this important? Imagine you are building a model to predict house prices, but some houses are missing information about their size or the number of bedrooms. These missing values can affect the performance of your model. In this lesson, you'll learn why data might be missing, different ways to handle it, and how to use Python libraries to do so.

By the end of this lesson, you'll know why handling missing values is crucial, understand different strategies to deal with them, and be able to use Python tools to handle missing data efficiently.

Understanding Missing Values

Why does data go missing? There are many reasons:

Human Error: Sometimes, people forget to fill in all the fields when entering data.
System Error: Occasionally, the system that collects the data might have problems.
Other Reasons: Data may be intentionally left out for privacy reasons.

There are three common types of missing data:

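MCAR (Missing Completely at Random): The chance that a value is missing is unrelated to any data, observed or missing. For example, a sensor occasionally drops a reading.
MAR (Missing at Random): The chance that a value is missing depends on other observed columns, but not on the missing value itself. For example, older respondents skip an income question more often, regardless of their income.
MNAR (Missing Not at Random): The chance that a value is missing depends on the missing value itself. For example, people with high incomes are less likely to report them.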
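
To make the three types concrete, here is a minimal simulation sketch. The age and income columns, the sample size, and the missingness rates are all made up for illustration; they do not come from a real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical survey data, invented for this illustration
df = pd.DataFrame({
    "age": rng.integers(20, 80, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every income value has the same 10% chance of being missing
mcar = df["income"].mask(rng.random(n) < 0.10)

# MAR: missingness depends on the observed 'age' column, not on income itself
mar = df["income"].mask((df["age"] > 60) & (rng.random(n) < 0.5))

# MNAR: missingness depends on the income value itself (the top 10% is hidden)
mnar = df["income"].mask(df["income"] > df["income"].quantile(0.9))

print("True mean income:       ", round(df["income"].mean()))
print("Mean of observed (MCAR):", round(mcar.mean()))
print("Mean of observed (MNAR):", round(mnar.mean()))
```

In the MNAR case the observed mean is lower than the true mean, because the hidden values are exactly the largest ones. This is why the type of missingness matters when you choose how to handle it.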

Strategies for Handling Missing Values

Handling missing values can be done in several ways:

If the missing data is a small percentage, you might just delete those rows or columns. But be careful: if you remove too much data, you might lose important information.
You can also replace each missing value with a substitute such as the column's mean, median, or mode. This approach, called imputation, is often more suitable because it keeps every row, so the values observed in the other columns are not lost.
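
Before choosing between the two approaches, it helps to measure how much data is actually missing. The sketch below does this with pandas on the small Name/Score table used later in this lesson.

```python
import pandas as pd

data = {'Name': ['Anna', 'Bob', 'Charlie', 'David', None],
        'Score': [85, 88, None, 92, 90]}
df = pd.DataFrame(data)

# Fraction of missing values in each column (isna() gives True/False, mean() averages them)
missing_share = df.isna().mean()
print(missing_share)

# Fraction of rows that contain at least one missing value
rows_with_gap = df.isna().any(axis=1).mean()
print(rows_with_gap)
```

Here each column is 20% missing, but 40% of the rows have at least one gap, so dropping rows would discard far more data than the per-column numbers suggest.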

Dropping Missing Values: Part 1

Dropping missing values is easy and straightforward with pandas dataframes. Let's recall it quickly.

Let's consider this simple dataset:

Python
import pandas as pd

data = {'Name': ['Anna', 'Bob', 'Charlie', 'David', None],
        'Score': [85, 88, None, 92, 90]}
df = pd.DataFrame(data)

Let's remove rows with None values using the dropna() function:

Python
print(df.dropna())

The output is:


    Name  Score
0   Anna   85.0
1    Bob   88.0
3  David   92.0

Charlie's row (index 2) is removed because its Score is missing, and the last row (index 4) is removed because its Name is missing. Note that dropna() returns a new DataFrame and leaves df itself unchanged.

Dropping Missing Values: Part 2

To check only specific columns for missing values, pass them to dropna() through the subset argument. Here's an example:

Python
# Drop rows where 'Score' column has missing values
print(df.dropna(subset=['Score']))

The output is:


    Name  Score
0   Anna   85.0
1    Bob   88.0
3  David   92.0
4   None   90.0

As you can see, the last row (index 4) is not removed this time. Although its Name is missing, we only drop rows with a missing Score, so only Charlie's row (index 2) is gone.
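
The strategies section also mentions deleting columns instead of rows. A minimal sketch of that with dropna(axis=1) follows; the Age column is a hypothetical addition to the lesson's dataset, included only so that one complete column survives.

```python
import pandas as pd

# The lesson's dataset plus a hypothetical 'Age' column with no missing values
data = {'Name': ['Anna', 'Bob', 'Charlie', 'David', None],
        'Score': [85, 88, None, 92, 90],
        'Age': [23, 25, 31, 28, 22]}
df = pd.DataFrame(data)

# axis=1 drops every column that contains at least one missing value
complete_columns = df.dropna(axis=1)
print(complete_columns.columns.tolist())  # ['Age']

# how='all' drops a row only if all of its values are missing
mostly_complete_rows = df.dropna(how='all')
print(len(mostly_complete_rows))  # 5
```

Dropping columns is a blunt tool: Name and Score each have only one gap, yet both are removed entirely.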

Using `SciKit Learn` to Impute Missing Values: Part 1

One of the easiest ways to handle missing values in Python is by using the SimpleImputer class from the sklearn.impute module. Let's break it down.

The SimpleImputer has a few strategies you can use:

mean: Replaces missing values with the mean of each column.
median: Replaces missing values with the median of each column.
most_frequent: Replaces missing values with the most frequent value in each column.
constant: Replaces missing values with a constant value you provide through the fill_value parameter.
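
As a quick comparison, the sketch below applies each strategy to the same single-column array. The values are invented for illustration and include an outlier (10) so that the mean and the median differ.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One column with a single missing value at row 3 (made-up numbers)
X = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    # Keep only the value that was filled in for the missing entry
    filled[strategy] = float(imputer.fit_transform(X)[3, 0])

# The 'constant' strategy needs the replacement value passed as fill_value
constant_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
filled["constant"] = float(constant_imputer.fit_transform(X)[3, 0])

for strategy, value in filled.items():
    print(strategy, value)
```

The mean (3.75) is pulled upward by the outlier, while the median and the most frequent value are both 2.0. This is one reason the median is often preferred for skewed columns.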

Let's walk through some code that handles missing values using the SimpleImputer.

First, we need a dataset. We'll use the pandas library to create one with some missing values.

Python
import numpy as np
import pandas as pd

# Create a sample dataset with missing values
data = {
    'Feature1': [1, 2, np.nan, 4],
    'Feature2': [7, 6, 5, np.nan]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Output:


   Feature1  Feature2
0       1.0       7.0
1       2.0       6.0
2       NaN       5.0
3       4.0       NaN

Note that we use np.nan here instead of None. None is Python's built-in null object. It can mark a missing value of any data type, but arithmetic on it fails (for example, None + 1 raises a TypeError). In contrast, np.nan is a floating-point "Not a Number" value from the numpy library, used for numeric data. It supports vectorized operations in numpy and pandas, which makes it better suited to missing numerical values. In a numeric column, pandas converts None to NaN automatically, which is why the Score column in the earlier example was printed with floats.
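
A short sketch of how None and np.nan behave in practice, using a made-up Series of scores:

```python
import numpy as np
import pandas as pd

# In a numeric Series, pandas stores None as NaN and uses a float dtype
scores = pd.Series([85, None, 90])
print(scores.dtype)            # float64
print(scores.isna().tolist())  # [False, True, False]

# NaN is never equal to anything, including itself, so use isna() to detect it
print(np.nan == np.nan)        # False
print(pd.isna(None), pd.isna(np.nan))  # True True

# pandas aggregations skip NaN by default
print(scores.mean())           # 87.5
```

Because NaN does not compare equal to itself, a check such as value == np.nan never finds missing entries; pd.isna() or the isna() method is the reliable way to test for them.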

Using `SciKit Learn` to Impute Missing Values: Part 2

Here, we use the SimpleImputer from sklearn.impute to handle the missing values. In this case, we'll use the mean strategy, meaning the missing values are replaced with the mean value of the corresponding column. Missing values are ignored when the mean is calculated: for Feature1 it is (1 + 2 + 4) / 3 ≈ 2.33, and for Feature2 it is (7 + 6 + 5) / 3 = 6.

Python
from sklearn.impute import SimpleImputer

# Handling missing values
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(df)
print("Imputed Data:")
print(imputed_data)

Output:


[[1.         7.        ]
 [2.         6.        ]
 [2.33333333 5.        ]
 [4.         6.        ]]
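
The fit_transform call does two things: fit learns one replacement value per column, and transform fills the gaps with it. The sketch below separates the two steps and inspects the learned values through the statistics_ attribute. The new_rows DataFrame is a made-up example of data that arrives later, such as a test set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'Feature1': [1, 2, np.nan, 4],
                   'Feature2': [7, 6, 5, np.nan]})

# fit() only learns the column means; it does not change any data
imputer = SimpleImputer(strategy='mean')
imputer.fit(df)
print(imputer.statistics_)  # one learned mean per column

# Hypothetical new rows: they are filled with the means learned from df
new_rows = pd.DataFrame({'Feature1': [np.nan, 3.0],
                         'Feature2': [8.0, np.nan]})
new_imputed = imputer.transform(new_rows)
print(new_imputed)
```

Fitting once and reusing the same imputer keeps the replacement values consistent between the data you train on and the data you predict on.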

Converting `NumPy` Array Back to `DataFrame`

The result of the imputation is a NumPy array. Let's convert it back to a DataFrame for better readability.

Python
# Convert the numpy array back to a DataFrame for better readability
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
print("DataFrame after handling missing values:")
print(imputed_df)

Output:


   Feature1  Feature2
0  1.000000       7.0
1  2.000000       6.0
2  2.333333       5.0
3  4.000000       6.0

Notice how we use df.columns to assign the same column names we had before.
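
The NumPy array loses the row index as well as the column names. If your DataFrame has a meaningful index, you can pass index=df.index in the same way. The index labels below (10, 20, 30, 40) are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same data as in the lesson, but with a custom (hypothetical) index
df = pd.DataFrame({'Feature1': [1, 2, np.nan, 4],
                   'Feature2': [7, 6, 5, np.nan]},
                  index=[10, 20, 30, 40])

imputed_data = SimpleImputer(strategy='mean').fit_transform(df)

# Restore both the column names and the original row labels
imputed_df = pd.DataFrame(imputed_data, columns=df.columns, index=df.index)
print(imputed_df)
```

Without the index argument, the new DataFrame would be labeled 0 to 3, and the rows would no longer line up with the original data by label.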

Using `SciKit Learn` to Impute Missing Values for Specific Columns

Sometimes, you may want to impute only specific columns in your dataset. You can achieve this by selecting those columns and applying the SimpleImputer to them. Here's how you can do it.

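Let's use the same dataset that we created earlier. It still contains its missing values, because fit_transform returned a new array instead of modifying df: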

Python
from sklearn.impute import SimpleImputer

# Select the column to impute (double brackets keep it 2-D, as SimpleImputer expects)
feature1 = df[['Feature1']]

# Create the SimpleImputer instance
imputer = SimpleImputer(strategy='mean')

# Fit and transform the data; the result is a 2-D array with one column
feature1_imputed = imputer.fit_transform(feature1)

# Update the DataFrame, flattening the result to 1-D first
df['Feature1'] = feature1_imputed.ravel()
print("DataFrame after imputing Feature1:")
print(df)

Output:


   Feature1  Feature2
0  1.000000       7.0
1  2.000000       6.0
2  2.333333       5.0
3  4.000000       NaN

In this example, the missing value in Feature1 is replaced by the mean of the other values in that column. The Feature2 column remains unchanged. This approach allows you to target specific columns that need imputation while leaving others untouched.

In the same manner, you can impute values into any subset of columns.
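
For instance, a minimal sketch that imputes two of three columns at once might look like the following. Feature3 is a hypothetical extra column added for this illustration, and the median strategy is chosen only to show a different option from the mean.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# The lesson's dataset plus a hypothetical 'Feature3' column
df = pd.DataFrame({'Feature1': [1, 2, np.nan, 4],
                   'Feature2': [7, 6, 5, np.nan],
                   'Feature3': [np.nan, 1, 1, 3]})

# Impute only these columns; 'Feature2' is left as it is
columns_to_impute = ['Feature1', 'Feature3']
imputer = SimpleImputer(strategy='median')
df[columns_to_impute] = imputer.fit_transform(df[columns_to_impute])

print(df)
```

Each selected column is filled with its own median (2 for Feature1, 1 for Feature3), while the missing value in Feature2 stays in place.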

Lesson Summary

Great job! 🎉 You've learned why handling missing values is crucial, discovered different strategies to tackle missing data, and practiced using SimpleImputer to handle missing values in a sample dataset. Missing data is a common issue, but now you have the tools to manage it and improve the quality of your datasets.

Now that you've learned the theory, it's time to get hands-on practice! In the practice session, you'll handle missing values in various datasets, experimenting with different imputation strategies, and observing the outcomes. This practice will help solidify your understanding and make you more confident in managing missing data for your machine learning projects. Let's get started! 🚀
