Lesson 1
Handling Missing Values
Lesson Introduction

Hello there! Today, we're going to talk about handling missing values in datasets for machine learning. Why is this important? Imagine you are building a model to predict house prices, but some houses are missing information about their size or the number of bedrooms. These missing values can affect the performance of your model. In this lesson, you'll learn why data might be missing, different ways to handle it, and how to use Python libraries to do so.

By the end of this lesson, you'll know why handling missing values is crucial, understand different strategies to deal with them, and be able to use Python tools to handle missing data efficiently.

Understanding Missing Values

Why does data go missing? There are many reasons:

  1. Human Error: Sometimes, people forget to fill in all the fields when entering data.
  2. System Error: Occasionally, the system that collects the data might have problems.
  3. Other Reasons: Data may be intentionally left out for privacy reasons.

There are three common types of missing data:

  • MCAR (Missing Completely at Random): The data is missing randomly without any pattern.
  • MAR (Missing at Random): The missingness follows a pattern, but the pattern depends on other observed data, not on the missing values themselves.
  • MNAR (Missing Not at Random): The missingness depends on the missing values themselves (for example, people with very high incomes declining to report their income).

Strategies for Handling Missing Values

Handling missing values can be done in several ways:

  1. If the missing data is a small percentage, you might just delete those rows or columns. But be careful: if you remove too much data, you might lose important information.

  2. You can also replace the missing values with a substitute such as the mean, median, or mode of the column, or with a constant value. This method, called imputation, is often preferable because it keeps the rest of the row intact.
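As a quick sketch of the second strategy, pandas' fillna method can replace missing values directly (the Score values here are a made-up example):

```python
import pandas as pd

# Hypothetical column with one missing value
scores = pd.Series([85, 88, None, 92, 90], name='Score')

# Replace the missing value with the column mean (88.75)
filled = scores.fillna(scores.mean())
print(filled)
```

This is handy for one-off fixes; the SimpleImputer shown later is better suited to full machine-learning pipelines.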

Dropping Missing Values: Part 1

Dropping missing values is easy and straightforward with pandas DataFrames. Let's recall it quickly.

Let's consider this simple dataset:

Python
import pandas as pd

data = {'Name': ['Anna', 'Bob', 'Charlie', 'David', None],
        'Score': [85, 88, None, 92, 90]}
df = pd.DataFrame(data)

Let's remove rows with None values using the dropna() function:

Python
print(df.dropna())

The output is:

    Name  Score
0   Anna   85.0
1    Bob   88.0
3  David   92.0

"Charlie"'s row is removed because it contained a null value. Also the one row with a missing name is removed.

Dropping Missing Values: Part 2

To scan only specific columns for missing values, pass the subset argument to dropna() to specify which columns to check. Here's an example:

Python
# Drop rows where 'Score' column has missing values
print(df.dropna(subset=['Score']))

The output is:

    Name  Score
0   Anna   85.0
1    Bob   88.0
3  David   92.0
4   None   90.0

As you can see, the last row (index 4) is kept this time. Though it contains a missing value in the Name column, we now only remove rows with a missing Score.
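A related option is the thresh argument, which keeps only rows with at least that many non-missing values. A short sketch on the same data:

```python
import pandas as pd

data = {'Name': ['Anna', 'Bob', 'Charlie', 'David', None],
        'Score': [85, 88, None, 92, 90]}
df = pd.DataFrame(data)

# Keep only rows that have at least 2 non-missing values;
# rows 2 and 4 each have just one, so they are dropped
print(df.dropna(thresh=2))
```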

Using `scikit-learn` to Impute Missing Values: Part 1

One of the easiest ways to handle missing values in Python is by using the SimpleImputer class from the sklearn.impute module. Let's break it down.

The SimpleImputer has a few strategies you can use:

  • mean: Replaces missing values with the mean of each column.
  • median: Replaces missing values with the median of each column.
  • most_frequent: Replaces missing values with the most frequent value in each column.
  • constant: Replaces missing values with a constant value you provide.
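For instance, the constant strategy pairs with a fill_value argument (0 here is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = [[1.0], [np.nan], [3.0]]

# Replace every missing value with the constant 0
imputer = SimpleImputer(strategy='constant', fill_value=0)
result = imputer.fit_transform(X)
print(result)
```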

Let's walk through some code that handles missing values using the SimpleImputer.

First, we need a dataset. We'll use the pandas library to create one with some missing values.

Python
import numpy as np
import pandas as pd

# Create a sample dataset with missing values
data = {
    'Feature1': [1, 2, np.nan, 4],
    'Feature2': [7, 6, 5, np.nan]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Output:

   Feature1  Feature2
0       1.0       7.0
1       2.0       6.0
2       NaN       5.0
3       4.0       NaN

Note that we use np.nan here instead of None. None is a Python singleton object representing missing values across all data types, while np.nan is a floating-point "Not a Number" value from the numpy library, specifically used for numeric data. None is versatile and not tied to any library, but it may cause errors in operations unless explicitly handled. In contrast, np.nan is tailored for numerical computations, supporting vectorized operations in numpy and pandas, making it more suitable for handling missing numerical values.
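A short sketch of this difference, including one well-known NaN quirk:

```python
import numpy as np

# np.nan is a plain float, so it fits into numeric arrays
print(type(np.nan))        # <class 'float'>

# NaN is not equal to itself, so equality checks can't detect it;
# use np.isnan (or pd.isna) instead
print(np.nan == np.nan)    # False
print(np.isnan(np.nan))    # True
```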

Using `scikit-learn` to Impute Missing Values: Part 2

Here, we use the SimpleImputer from sklearn.impute to handle the missing values. In this case, we'll use the mean strategy, meaning the missing values are replaced with the mean value of the corresponding column. Note that missing values won't be taken into account when calculating the mean.
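Before running the imputer, you can check the means it will use yourself; pandas' mean skips missing values by default, so Feature1's mean is (1 + 2 + 4) / 3 ≈ 2.33:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Feature1': [1, 2, np.nan, 4],
    'Feature2': [7, 6, 5, np.nan]
})

# skipna=True is the default, so NaN is excluded from the calculation
print(df['Feature1'].mean())
print(df['Feature2'].mean())
```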

Python
from sklearn.impute import SimpleImputer

# Handling missing values
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(df)
print("Imputed Data:")
print(imputed_data)

Output:

[[1.         7.        ]
 [2.         6.        ]
 [2.33333333 5.        ]
 [4.         6.        ]]

Converting `NumPy` Array Back to `DataFrame`

The result of the imputation is a NumPy array. Let's convert it back to a DataFrame for better readability.

Python
# Convert the numpy array back to a DataFrame for better readability
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
print("DataFrame after handling missing values:")
print(imputed_df)

Output:

   Feature1  Feature2
0  1.000000       7.0
1  2.000000       6.0
2  2.333333       5.0
3  4.000000       6.0

Notice how we use df.columns to assign the same column names we had before.

Using `scikit-learn` to Impute Missing Values for Specific Columns

Sometimes, you may want to impute only specific columns in your dataset. You can achieve this by selecting those columns and applying the SimpleImputer to them. Here's how you can do it.

Let's use the same dataset that we created earlier:

Python
from sklearn.impute import SimpleImputer

# Select the column to impute
feature1 = df[['Feature1']]

# Create the SimpleImputer instance
imputer = SimpleImputer(strategy='mean')

# Fit and transform the data
feature1_imputed = imputer.fit_transform(feature1)

# Update the DataFrame
df['Feature1'] = feature1_imputed
print("DataFrame after imputing Feature1:")
print(df)

Output:

   Feature1  Feature2
0  1.000000       7.0
1  2.000000       6.0
2  2.333333       5.0
3  4.000000       NaN

In this example, the missing value in Feature1 is replaced by the mean of the other values in that column. The Feature2 column remains unchanged. This approach allows you to target specific columns that need imputation while leaving others untouched.

In the same manner, you can impute values into any subset of columns.
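As a minimal sketch, here is how two columns could be imputed in one step (using a fresh copy of the sample data, since df was modified above):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'Feature1': [1, 2, np.nan, 4],
    'Feature2': [7, 6, 5, np.nan]
})

cols = ['Feature1', 'Feature2']
imputer = SimpleImputer(strategy='mean')

# fit_transform returns a numpy array shaped like the selection,
# which assigns cleanly back into the same columns
df[cols] = imputer.fit_transform(df[cols])
print(df)
```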

Lesson Summary

Great job! 🎉 You've learned why handling missing values is crucial, discovered different strategies to tackle missing data, and practiced using SimpleImputer to handle missing values in a sample dataset. Missing data is a common issue, but now you have the tools to manage it and improve the quality of your datasets.

Now that you've learned the theory, it's time to get hands-on practice! In the practice session, you'll handle missing values in various datasets, experimenting with different imputation strategies, and observing the outcomes. This practice will help solidify your understanding and make you more confident in managing missing data for your machine learning projects. Let's get started! 🚀

Enjoy this lesson? Now it's time to practice with Cosmo!
Practice is how you turn knowledge into actual skills.