Handling Duplicates and Outliers in Datasets

Lesson 2

Topic Overview and Actualization

Today, we target duplicates and outliers to clean our data for more accurate analysis.

Understanding Duplicates in Data

Let's consider a dataset from a school containing students' details. If a student's information appears more than once, that is regarded as a duplicate. Duplicates distort data, leading to inaccurate statistics.

Python Tools for Handling Duplicates

The pandas library provides efficient, easy-to-use functions for dealing with duplicates.

Python
import pandas as pd

# Create DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'John', 'Anna'],
        'Age': [16, 15, 13, 16, 15],
        'Grade': [9, 10, 7, 9, 10]}
df = pd.DataFrame(data)

The duplicated() function flags duplicate rows:

Python
print(df.duplicated())
'''Output:
0    False
1    False
2    False
3     True
4     True
dtype: bool
'''

A True in the output marks a row that repeats an earlier row. Note that the first occurrence of each repeated row is marked False (the default, keep='first'), so one copy survives when we drop the duplicates.
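Which copy is treated as the original can be changed through the keep parameter of duplicated(). The sketch below rebuilds the same student records and compares the three accepted values; the variable names first_flags, last_flags, and all_flags are only illustrative.

```python
import pandas as pd

# Same student records as in the lesson's example
df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'John', 'Anna'],
    'Age': [16, 15, 13, 16, 15],
    'Grade': [9, 10, 7, 9, 10],
})

# keep='first' (the default): every occurrence after the first is flagged
first_flags = df.duplicated(keep='first').tolist()

# keep='last': every occurrence before the last is flagged
last_flags = df.duplicated(keep='last').tolist()

# keep=False: every row that has a copy anywhere is flagged
all_flags = df.duplicated(keep=False).tolist()

print(first_flags)  # [False, False, False, True, True]
print(last_flags)   # [True, True, False, False, False]
print(all_flags)    # [True, True, False, True, True]
```

keep=False is handy for inspecting all copies side by side before deciding which one to keep.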

The drop_duplicates() function helps to discard these duplicates:

Python
df = df.drop_duplicates()
print(df)
'''Output:
    Name  Age  Grade
0   John   16      9
1   Anna   15     10
2  Peter   13      7
'''

There are no more duplicates!
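By default, drop_duplicates() compares entire rows. When a record should count as a duplicate based on only some columns, the subset parameter can restrict the comparison. The following is a minimal sketch on hypothetical records (not part of the lesson's dataset) in which the same name appears with two different ages.

```python
import pandas as pd

# Hypothetical records: John appears twice, but with different ages
records = pd.DataFrame({
    'Name': ['John', 'Anna', 'John'],
    'Age': [16, 15, 17],
    'Grade': [9, 10, 10],
})

# Full-row comparison: no two rows match in every column, so nothing is dropped
full_row = records.drop_duplicates()

# Compare on 'Name' only and keep the first occurrence of each name
by_name = records.drop_duplicates(subset=['Name'], keep='first').reset_index(drop=True)

print(len(full_row))             # 3
print(by_name['Name'].tolist())  # ['John', 'Anna']
```

Whether matching on a single column is appropriate depends on the data: two different students can share a name, so a unique identifier is usually a safer key when one is available.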

Understanding Outliers in Data

An outlier is a data point significantly different from others. In our dataset of primary school students' ages, we might find an age like 98 — this would be an outlier.

Identifying Outliers

Outliers can be detected visually with box plots or scatter plots, or statistically with methods such as the Z-score or the IQR (interquartile range). We'll use the IQR method for identifying outliers.

As a short reminder, a value is considered an outlier if it lies more than 1.5 * IQR below Q1 (the first quartile) or more than 1.5 * IQR above Q3 (the third quartile), where IQR = Q3 - Q1.
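For comparison, here is a rough sketch of the Z-score method mentioned above. The helper zscore_outliers, its threshold of 3, and the ages data are assumptions made for illustration; they do not come from the lesson.

```python
import pandas as pd

def zscore_outliers(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return a boolean mask that is True where |z-score| exceeds threshold."""
    # Series.std() computes the sample standard deviation (ddof=1)
    z = (values - values.mean()) / values.std()
    return z.abs() > threshold

# Hypothetical ages: nineteen students aged 15 and one data-entry error
ages = pd.Series([15] * 19 + [98])
mask = zscore_outliers(ages)
print(ages[mask].tolist())  # [98]
```

Because an extreme value inflates both the mean and the standard deviation, a Z-score threshold of 3 may flag nothing on very small samples. That is one reason the quartile-based IQR rule is often preferred for small datasets.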

Python Tools for Handling Outliers

Here's how you can utilize the IQR method with pandas. Let's start with defining the dataset of students' scores:

Python
import pandas as pd

# Create dataset
df = pd.DataFrame({
    'students': ['Alice', 'Bob', 'John', 'Ann', 'Rob'],
    'scores': [56, 11, 50, 98, 47]
})

Now, compute Q1, Q3, and IQR:

Python
Q1 = df['scores'].quantile(0.25)  # 47.0
Q3 = df['scores'].quantile(0.75)  # 56.0
IQR = Q3 - Q1  # 9.0

After that, we can define the lower and upper bounds and find outliers:

Python
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['scores'] < lower_bound) | (df['scores'] > upper_bound)]
print(outliers)
'''Output:
  students  scores
1      Bob      11
3      Ann      98
'''
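If the same check is needed for several columns, the quartile arithmetic can be wrapped in a small helper. The function iqr_bounds and its multiplier parameter k are hypothetical names introduced for this sketch, not part of pandas.

```python
from typing import Tuple

import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> Tuple[float, float]:
    """Return (lower, upper) bounds: Q1 - k * IQR and Q3 + k * IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)

# Same scores as in the lesson's example
scores = pd.Series([56, 11, 50, 98, 47])
lower, upper = iqr_bounds(scores)
print(lower, upper)  # 33.5 69.5

is_outlier = (scores < lower) | (scores > upper)
print(scores[is_outlier].tolist())  # [11, 98]
```

With Q1 = 47 and Q3 = 56, the IQR is 9, so the bounds are 47 - 13.5 = 33.5 and 56 + 13.5 = 69.5, which is why 11 and 98 are flagged.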

Handling Outliers: Removal

There are two common strategies for dealing with outliers: remove them or replace them with the median value.

Removing outliers is the easiest method, but it is not always the best one, since you are throwing away data. To apply it, we invert the outlier condition so that it selects every row except the outliers.

Python
df = df[(df['scores'] >= lower_bound) & (df['scores'] <= upper_bound)]
print(df)
'''Output:
  students  scores
0    Alice      56
2     John      50
4      Rob      47
'''
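An equivalent filter can be written with Series.between, which is inclusive on both ends by default. The sketch below stores the result in a new variable (cleaned, a name chosen here for illustration) so that the original DataFrame stays available, and it hard-codes the bounds computed earlier in the lesson.

```python
import pandas as pd

df = pd.DataFrame({
    'students': ['Alice', 'Bob', 'John', 'Ann', 'Rob'],
    'scores': [56, 11, 50, 98, 47],
})
lower_bound, upper_bound = 33.5, 69.5  # bounds computed earlier in the lesson

# Keep rows whose score lies within [lower_bound, upper_bound]
cleaned = df[df['scores'].between(lower_bound, upper_bound)].reset_index(drop=True)
removed = len(df) - len(cleaned)

print(cleaned['students'].tolist())  # ['Alice', 'John', 'Rob']
print(removed)                       # 2
```

Keeping the unfiltered df around makes it possible to report how many rows were dropped, or to try a different strategy on the same data afterwards.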

Handling Outliers: Replacement

The second strategy is to replace outliers with the median. Unlike the mean, the median is barely affected by extreme values, which makes it a sensible replacement.

The simplest approach is to replace outliers with np.nan and then fill them with the fillna() method. However, this can cause problems: any missing values already present in the DataFrame will be filled as well.
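The pitfall can be seen in a small sketch. The data is hypothetical: the lesson's five scores plus one genuinely missing value. Series.mask is used here as one way to write NaN over the outliers, and the bounds are hard-coded from the lesson's example.

```python
import numpy as np
import pandas as pd

# Hypothetical scores: 11 and 98 are outliers, and the last score is missing
scores = pd.Series([56, 11, 50, 98, 47, np.nan])
lower_bound, upper_bound = 33.5, 69.5  # bounds from the lesson's example
median = scores.median()  # NaN is skipped, so this is 50.0

is_outlier = (scores < lower_bound) | (scores > upper_bound)

# Series.mask writes NaN wherever the condition is True
masked = scores.mask(is_outlier)

# fillna cannot tell the new NaNs from the one that was already there
filled = masked.fillna(median)
print(filled.tolist())  # [56.0, 50.0, 50.0, 50.0, 47.0, 50.0]
```

The last value was never an outlier, yet it ends up as 50.0 too, so the information that it was missing is lost.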

Instead, we could use the np.where function:

Python
import numpy as np

median = df['scores'].median()  # 50.0 when df still holds all five original scores
df['scores'] = np.where((df['scores'] > upper_bound) | (df['scores'] < lower_bound), median, df['scores'])

np.where takes the value from df['scores'] where the condition is False (i.e., the value is not an outlier) and uses median where it is True. In other words, every outlier is replaced by the median, and all other values are left unchanged.
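Put together, the replacement strategy might look like the self-contained sketch below. It rebuilds the original five-row dataset, because the removal step earlier in the lesson would otherwise have already filtered the outliers out; the is_outlier variable is introduced here only for readability.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'students': ['Alice', 'Bob', 'John', 'Ann', 'Rob'],
    'scores': [56, 11, 50, 98, 47],
})

# IQR bounds, as computed earlier in the lesson
q1 = df['scores'].quantile(0.25)
q3 = df['scores'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Replace each outlier with the median; keep every other score as it is
median = df['scores'].median()
is_outlier = (df['scores'] < lower_bound) | (df['scores'] > upper_bound)
df['scores'] = np.where(is_outlier, median, df['scores'])

print(df['scores'].tolist())  # [56.0, 50.0, 50.0, 50.0, 47.0]
```

The column becomes floating point because the median is returned as a float. Bob's 11 and Ann's 98 are both replaced by 50.0.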

Summary

We've covered what duplicates and outliers are, their impact on data analysis, and how to manage them. A clean dataset is a prerequisite for accurate data analysis. Now, it's time to apply your skills to real-world data. Let's dive into some practical exercises!
