Today, we target duplicates and outliers to clean our data for more accurate analysis.
Let's consider a dataset from a school containing students' details. If a student's information appears more than once, that is regarded as a duplicate. Duplicates distort data, leading to inaccurate statistics.
The pandas library provides efficient, easy-to-use functions for dealing with duplicates. Let's start by creating a small DataFrame:
```python
import pandas as pd

# Create DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'John', 'Anna'],
        'Age': [16, 15, 13, 16, 15],
        'Grade': [9, 10, 7, 9, 10]}
df = pd.DataFrame(data)
```
The duplicated() function flags duplicate rows:
```python
print(df.duplicated())
'''Output:
0    False
1    False
2    False
3     True
4     True
dtype: bool
'''
```
A `True` in the output denotes a row that repeats an earlier one. Note that the first occurrence of each repeated row is marked `False`, so one copy is kept if we decide to drop all the duplicates.
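By default, `duplicated()` leaves the first occurrence unflagged; if you would rather see every copy, its standard `keep` parameter controls this. A quick sketch:

```python
# keep='first' (the default) flags only later copies;
# keep='last' flags earlier copies instead;
# keep=False flags every row that appears more than once
print(df.duplicated(keep=False))
print(df.duplicated(keep='last'))
```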
The `drop_duplicates()` function helps to discard these duplicates:
```python
df = df.drop_duplicates()
print(df)
'''Output:
    Name  Age  Grade
0   John   16      9
1   Anna   15     10
2  Peter   13      7
'''
```
There are no more duplicates, cool!
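Two small, optional touches often follow this step (both use standard pandas arguments): renumbering the index after rows were dropped, and deciding duplicates by a subset of columns. A minimal sketch:

```python
# Renumber the index after dropping rows
df = df.drop_duplicates().reset_index(drop=True)

# Treat rows as duplicates when selected columns match,
# keeping the last occurrence instead of the first
df = df.drop_duplicates(subset=['Name'], keep='last')
```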
An outlier is a data point significantly different from others. In our dataset of primary school students' ages, we might find an age like 98 — this would be an outlier.
Outliers can be detected visually with tools like box plots and scatter plots, or with statistical methods such as the Z-score or the IQR (interquartile range). We'll use the IQR method for identifying outliers.
As a short reminder, we consider a value an outlier if it is more than `1.5 * IQR` below `Q1` (the first quartile) or more than `1.5 * IQR` above `Q3` (the third quartile).
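Since this rule is purely mechanical, it can be wrapped in a small reusable helper. The `iqr_bounds` function below is just a sketch of that idea (it is not part of pandas); we will compute the same quantities step by step next:

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return (lower, upper) outlier bounds for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr
```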
Here's how you can utilize the IQR method with pandas. Let's start by defining the dataset of students' scores:
```python
import pandas as pd

# Create dataset
data = {
    'students': ['Alice', 'Bob', 'John', 'Ann', 'Rob'],
    'scores': [56, 11, 50, 98, 47]
}
df = pd.DataFrame(data)
```
Now, compute Q1, Q3, and IQR:
```python
Q1 = df['scores'].quantile(0.25)  # 47.0
Q3 = df['scores'].quantile(0.75)  # 56.0
IQR = Q3 - Q1                     # 9.0
```
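If you want a quick sanity check, NumPy's `percentile` uses the same linear interpolation by default and should give identical numbers:

```python
import numpy as np

# Compute the 25th and 75th percentiles of the scores
q1, q3 = np.percentile(df['scores'], [25, 75])
print(q1, q3)  # expected to match Q1 and Q3 above
```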
After that, we can define the lower and upper bounds and find outliers:
```python
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['scores'] < lower_bound) | (df['scores'] > upper_bound)]
print(outliers)
'''Output:
  students  scores
1      Bob      11
3      Ann      98
'''
```
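For completeness, the Z-score method mentioned earlier can be sketched in a couple of lines; the 3-standard-deviation cutoff is a common convention, and on a sample this small it may well flag nothing:

```python
# Z-score approach: flag values more than 3 standard deviations from the mean
z_scores = (df['scores'] - df['scores'].mean()) / df['scores'].std()
z_outliers = df[z_scores.abs() > 3]
```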
There are two common strategies for dealing with outliers: remove them, or replace them with the median value.
Removing outliers is the easiest option, but it is rarely the best one, since you essentially throw away part of your data. To apply it, let's reverse the condition and keep everything except the outliers.
```python
df = df[(df['scores'] >= lower_bound) & (df['scores'] <= upper_bound)]
print(df)
'''Output:
  students  scores
0    Alice      56
2     John      50
4      Rob      47
'''
```
The second strategy is to replace outliers with the median, which, unlike the mean, is barely affected by extreme values and therefore makes a safe replacement.
The most obvious way to apply this replacement is to first set the outliers to `np.nan` and then fill the missing values. However, this can cause problems: if the DataFrame already contains missing values, those will be filled as well.
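To make that pitfall concrete, here is a sketch of the two-step approach (assuming `df` still contains the outliers); note that `fillna` would also overwrite any value that was missing before the first step:

```python
import numpy as np

# Step 1: turn the outliers into NaN
df.loc[(df['scores'] > upper_bound) | (df['scores'] < lower_bound), 'scores'] = np.nan

# Step 2: fill NaN with the median -- but this also fills
# any values that were missing before step 1
df['scores'] = df['scores'].fillna(df['scores'].median())
```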
Instead, we could use the `np.where` function:
```python
import numpy as np

median = df['scores'].median()
df['scores'] = np.where(
    (df['scores'] > upper_bound) | (df['scores'] < lower_bound),
    median,
    df['scores']
)
```
`np.where(condition, x, y)` returns `x` where the condition holds and `y` where it doesn't. Here it picks `median` for every value flagged as an outlier and keeps the original value from `df['scores']` otherwise; in other words, each outlier is swapped for the median and everything else is left untouched.
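A pandas-only alternative is `Series.mask`, which performs the same replacement; and if you would rather cap extreme values at the bounds instead of replacing them (winsorizing), `clip` is a common choice. Both are standard pandas methods:

```python
# Same replacement, pandas only: set masked values to the median
df['scores'] = df['scores'].mask(
    (df['scores'] > upper_bound) | (df['scores'] < lower_bound),
    df['scores'].median()
)

# Or cap values at the bounds instead of replacing them
df['scores'] = df['scores'].clip(lower=lower_bound, upper=upper_bound)
```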
We've covered what duplicates and outliers are, their impact on data analysis, and how to manage them. A clean dataset is a prerequisite for accurate data analysis. Now, it's time to apply your skills to real-world data. Let's dive into some practical exercises!