Greetings! Our topic today is 'Identifying and Handling Missing Values', a critical step in data cleaning. Because accurate analysis depends on knowing where data is absent and how the gaps were treated, we'll walk through both identifying and handling missing values.
Imagine untangling a heap of necklaces: it's tedious, but necessary before you can use each piece. Similarly, datasets may contain problems such as misspellings, incorrect data types, and missing values, all of which need to be sorted out. This sorting process is known as 'Data Cleaning'.
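Missing values often appear as 'NA', 'None', 'NaN', or zeros. Python's Pandas library simplifies spotting them with the isnull() function, which returns a DataFrame of the same shape with True in place of each missing cell and False everywhere else. Note that isnull() only flags genuine missing markers such as None and NaN; text placeholders like 'NA' and zeros are ordinary values to Pandas and must be converted before they are detected.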
Take a look at this mini-dataset:
```python
import pandas as pd

data = {'A': [2, 4, None, 8], 'B': [5, None, 7, 9], 'C': [12, 13, 14, None]}
df = pd.DataFrame(data)

# Spot missing values
print(df.isnull())

'''Output:
       A      B      C
0  False  False  False
1  False   True  False
2   True  False  False
3  False  False   True
'''
```
Using this, we can identify the missing values.
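On a larger table, reading a grid of True and False values quickly becomes impractical. The sketch below, which assumes the same mini-dataset, sums the boolean result to count missing cells per column. It also illustrates that a text placeholder such as 'NA' is not flagged until it is converted; the raw Series and the variable names are made up for this illustration.

```python
import pandas as pd

data = {'A': [2, 4, None, 8], 'B': [5, None, 7, 9], 'C': [12, 13, 14, None]}
df = pd.DataFrame(data)

# True counts as 1, so summing gives the number of missing cells per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Summing again gives the total number of missing cells in the DataFrame
total_missing = int(missing_per_column.sum())
print(total_missing)

# Hypothetical column where a missing entry was typed in as the text 'NA'
raw = pd.Series([10, 'NA', 30])
print(raw.isnull().sum())  # the string 'NA' is not treated as missing

# Coerce entries that are not numbers (such as 'NA') to a real NaN
cleaned = pd.to_numeric(raw, errors='coerce')
print(cleaned.isnull().sum())
```

Converting placeholders first keeps the later counting and filling steps honest, since they only act on values Pandas recognizes as missing.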
After identification, missing values need to be dealt with. Pandas provides two basic strategies:
- fillna(): Fills the missing values with a value you supply.
- dropna(): Removes the rows (by default) that contain missing values.
Let's apply these strategies:
```python
# Fill missing values with 0
print(df.fillna(0))
'''Output:
     A    B     C
0  2.0  5.0  12.0
1  4.0  0.0  13.0
2  0.0  7.0  14.0
3  8.0  9.0   0.0
'''
```
```python
# Remove rows with missing values
print(df.dropna())
'''Output:
     A    B     C
0  2.0  5.0  12.0
'''
```
In the last example, every row containing a missing value was deleted, leaving us just one row.
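Dropping every incomplete row can discard most of a dataset, as it did here. As a hedged sketch, dropna() accepts parameters that narrow what gets removed; the example below reuses the same mini-dataset, and the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 4, None, 8], 'B': [5, None, 7, 9], 'C': [12, 13, 14, None]})

# Drop a row only when column "A" is missing; gaps in other columns are kept
rows_with_a = df.dropna(subset=['A'])
print(rows_with_a)

# Keep only rows that have at least two non-missing values
mostly_complete = df.dropna(thresh=2)
print(mostly_complete)

# Drop columns (instead of rows) that contain any missing value
complete_columns = df.dropna(axis=1)
print(complete_columns)
```

In this tiny dataset every column has one gap, so the column-wise variant removes all three columns. That is a useful reminder to check how much data survives before committing to a drop.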
Note that both functions return a new DataFrame. If you want to update the original df, you'll need to re-assign it:
```python
df = df.fillna(0)
```
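To confirm that the original really is left untouched until you re-assign, a quick check along these lines can help. It is a minimal sketch on the same mini-dataset, and the name filled is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 4, None, 8], 'B': [5, None, 7, 9], 'C': [12, 13, 14, None]})

# fillna() builds a new DataFrame and leaves df as it was
filled = df.fillna(0)

missing_in_original = int(df.isnull().sum().sum())
missing_in_filled = int(filled.isnull().sum().sum())

print(missing_in_original)  # the original still contains its gaps
print(missing_in_filled)    # the new DataFrame has none
```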
The df.fillna function applies to the whole dataset, which might not be the best strategy. Most of the time, you'll want to handle missing values in individual columns separately. It's simple! Because a DataFrame can be indexed by column name, much like a dictionary, we can access and re-assign a single column using its key:
```python
# Fill missing values of the "A" column with 0
df["A"] = df["A"].fillna(0)
print(df)
'''Output:
     A    B     C
0  2.0  5.0  12.0
1  4.0  NaN  13.0
2  0.0  7.0  14.0
3  8.0  9.0   NaN
'''
```
Note that there are no more missing values in the "A" column.
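When several columns each need their own fill value, fillna() also accepts a dictionary that maps column names to values. The sketch below assumes the same mini-dataset; the fill values (0 for "A", the column mean for "B") are chosen purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 4, None, 8], 'B': [5, None, 7, 9], 'C': [12, 13, 14, None]})

# Map each column name to its own fill value; columns not listed stay untouched
fill_values = {'A': 0, 'B': df['B'].mean()}
filled = df.fillna(fill_values)
print(filled)
```

Column "C" is not in the dictionary, so its missing value remains in place.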
In the real world, missing values are expected. Whether it's a company's financial data or a hospital's patient medical records, missing values exist and must be appropriately addressed, as they significantly influence the results of our analysis.
A common way to fill in missing values is to use the mean value. Let's consider a simple example where some values of the age column are missing:
```python
import pandas as pd
import numpy as np

# Creating a simple dataframe
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, np.nan, 35, np.nan, 45]
})

# Filling missing values with mean
mean_age = df['age'].mean()  # 35.0
# Assign back to the column so the rest of the dataframe is kept
df['age'] = df['age'].fillna(mean_age)

print(df)
'''Output:
      name   age
0    Alice  25.0
1      Bob  35.0
2  Charlie  35.0
3    David  35.0
4      Eve  45.0
'''
```
In the above example, we first create a dataframe with names and ages, where some age values are missing (represented as np.nan in the dataframe). To fill the missing age values with the mean age, we use the fillna() function with df['age'].mean() as the argument. Conveniently, df['age'].mean() ignores missing values by default, so it works correctly without any workarounds.
When we print out the resulting dataframe, we see that missing values are replaced with 35, the mean age.
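The claim that the mean skips missing values can be checked directly. This is a small sketch on the same ages; skipna is a real parameter of mean(), while the variable names are illustrative.

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, np.nan, 35, np.nan, 45], name='age')

# By default, missing entries are skipped: (25 + 35 + 45) / 3
default_mean = ages.mean()
print(default_mean)

# With skipna=False, a single NaN makes the whole result NaN
strict_mean = ages.mean(skipna=False)
print(strict_mean)
```

The default behavior is what makes the one-line fill above work; with skipna=False, the fill value itself would be missing.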
Great job! You've learned to identify and handle missing values using Python's Pandas library. Now, gear up for some hands-on tasks to apply these techniques to different datasets. It's an opportunity to solidify your understanding and hone your skills in handling missing values. Happy coding!