Lesson 2

Handling Duplicates and Outliers in Datasets

Topic Overview and Actualization

Today, we target duplicates and outliers to clean our data for more accurate analysis.

Understanding Duplicates in Data

Let's consider a dataset from a school containing students' details. If a student's information appears more than once, that is regarded as a duplicate. Duplicates distort data, leading to inaccurate statistics.

Python Tools for Handling Duplicates

pandas library provides efficient and easy-to-use functions for dealing with duplicates.

Python
1import pandas as pd 2 3# Create DataFrame 4data = {'Name': ['John', 'Anna', 'Peter', 'John', 'Anna'], 5 'Age': [16, 15, 13, 16, 15], 6 'Grade': [9, 10, 7, 9, 10]} 7df = pd.DataFrame(data)

The duplicated() function flags duplicate rows:

Python
1print(df.duplicated()) 2'''Output: 30 False 41 False 52 False 63 True 74 True 8dtype: bool 9'''

A True in the output denotes a row in the DataFrame that repeats. Note, that one of the repeating rows is marked as False – to keep one in case we decide to drop all the duplicates.

The drop_duplicates() function helps to discard these duplicates:

Python
1df = df.drop_duplicates() 2print(df) 3'''Output: 4 Name Age Grade 50 John 16 9 61 Anna 15 10 72 Peter 13 7 8'''

There is no more duplicates, cool!

Understanding Outliers in Data

An outlier is a data point significantly different from others. In our dataset of primary school students' ages, we might find an age like 98 — this would be an outlier.

Identifying Outliers

Outliers can be detected visually using tools like box plots, scatter plots, or statistical methods such as Z-score or IQR. Let's consider a data point that's significantly different from the rest. We'll use the IQR method for identifying outliers.

As a short reminder, we consider a value an outlier if it is either at least 1.5 * IQR less than Q1 (first quartile) or at least 1.5 * IQR greater than Q3 (third quartile).

Python Tools for Handling Outliers

Here's how you can utilize the IQR method with pandas. Let's start with defining the dataset of students' scores:

Python
1import pandas as pd 2 3# Create dataset 4data = pd.DataFrame({ 5 'students': ['Alice', 'Bob', 'John', 'Ann', 'Rob'], 6 'scores': [56, 11, 50, 98, 47] 7}) 8df = pd.DataFrame(data)

Now, compute Q1, Q3, and IQR:

Python
1Q1 = df['scores'].quantile(0.25) # 47.0 2Q3 = df['scores'].quantile(0.75) # 56.0 3IQR = Q3 - Q1 # 9.0

After that, we can define the lower and upper bounds and find outliers:

Python
1lower_bound = Q1 - 1.5 * IQR 2upper_bound = Q3 + 1.5 * IQR 3outliers = df[(df['scores'] < lower_bound) | (df['scores'] > upper_bound)] 4print(outliers) 5'''Output: 6 students scores 71 Bob 11 83 Ann 98 9'''
Handling Outliers: Removal

Typically, there are two common strategies for dealing with outliers: remove them or replace them with a median value.

Removing outliers is the easiest method. However, there are better methods than this since you essentially throw away your data. To apply it, let's reverse the condition to choose everything except outliers.

Python
1df = df[(df['scores'] >= lower_bound) & (df['scores'] <= upper_bound)] 2print(df) 3'''Output: 4 students scores 50 Alice 56 62 John 50 74 Rob 47 8'''
Handling Outliers: Replacement

The second strategy is replacing outliers with median values - they are less susceptible to outliers, so we can use them for replacement.

The easiest way to apply this replacement is to first replace outliers with np.nan and then use the fill method. It could lead to problems, as there could already be some missing values in the dataframe, which will also be filled.

Instead, we could use the np.where function:

Python
1median = df['scores'].median() 2df['scores'] = np.where((df['scores'] > upper_bound) | (df['scores'] < lower_bound), median, df['scores'])

It works by choosing elements from df['scores'] if the condition is not met (e.g., value is not an outlier) and from median otherwise. In other words, whenever this function meets an outlier, it will ignore it and use median instead of it.

Summary

We've covered what duplicates and outliers are, their impact on data analysis, and how to manage them. A clean dataset is a prerequisite for accurate data analysis. Now, it's time to apply your skills to real-world data. Let's dive into some practical exercises!

Enjoy this lesson? Now it's time to practice with Cosmo!

Practice is how you turn knowledge into actual skills.