Data Cleaning Techniques: Managing Duplicates and Outliers in R

Lesson 2

Topic Overview and Actualization

In today's lesson, we will focus on identifying and handling duplicates and outliers to clean our dataset for a more precise analysis.

R Tools for Handling Duplicates

Consider a dataset containing students' details from a school. If a student's information is repeated in the dataset, we classify that as a duplicate. Duplicates can distort our data, leading to inaccurate results during the analysis.

R
# Create DataFrame
df <- data.frame(
    Name = c('John', 'Anna', 'Peter', 'John', 'Anna'),
    Age = c(16, 15, 13, 16, 15),
    Grade = c(9, 10, 7, 9, 10)
)

R provides built-in functions for handling duplicates in a dataset. Here's how you can identify them:

R
# Identify duplicates
print(df[duplicated(df), ])
#   Name Age Grade
# 4 John  16     9
# 5 Anna  15    10

The duplicated() function returns a logical vector that marks every row repeating an earlier row; the first occurrence of each row is not flagged. Negating that vector lets us remove the duplicate rows:

R
# Remove duplicates
df <- df[!duplicated(df), ]
print(df)

After removing the duplicates, only the unique rows remain:


   Name Age Grade
1  John  16     9
2  Anna  15    10
3 Peter  13     7
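Two related idioms may be useful here. The sketch below rebuilds the sample data under the hypothetical name students_df so that it runs on its own. It shows that unique() should give the same rows as the !duplicated() filter, and that passing only selected columns to duplicated() flags rows that repeat on those columns alone. Treating Name and Age as the identifying columns is an assumption made purely for illustration.

```r
# Rebuild the sample data so this sketch runs on its own
# (students_df is a hypothetical name, not part of the lesson's code)
students_df <- data.frame(
    Name = c('John', 'Anna', 'Peter', 'John', 'Anna'),
    Age = c(16, 15, 13, 16, 15),
    Grade = c(9, 10, 7, 9, 10)
)

# unique() keeps the first occurrence of each distinct row
unique_rows <- unique(students_df)
print(unique_rows)

# Count how many rows repeat an earlier row
n_duplicates <- sum(duplicated(students_df))
print(n_duplicates)

# Flag duplicates using only some columns (assumed here to identify a student)
dup_by_key <- duplicated(students_df[, c('Name', 'Age')])
print(students_df[!dup_by_key, ])
```

Checking a subset of columns can help when two records describe the same student but differ in an unrelated field.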

Identifying Outliers

An outlier is a data point that differs markedly from the other data points in the same dataset. For instance, in a dataset of school students' ages, an age of 98 would be an outlier.

Outliers can be detected visually with box plots and scatter plots, or statistically with methods such as the Z-score or the interquartile range (IQR). Today, we will use the IQR method.

Here's a brief reminder: the IQR is the difference between the third quartile (Q3) and the first quartile (Q1). A value is considered an outlier if it lies more than 1.5 * IQR below Q1 or more than 1.5 * IQR above Q3.
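To make the rule concrete, here is a minimal sketch that applies it step by step to a small vector. The numbers are made up for illustration and are not part of the lesson's data; the variable names are likewise assumptions.

```r
# Illustrative values only (not from the lesson's dataset)
values <- c(2, 4, 5, 6, 30)

q1 <- quantile(values, 0.25, names = FALSE)  # first quartile
q3 <- quantile(values, 0.75, names = FALSE)  # third quartile
iqr <- q3 - q1                               # same quantity IQR(values) returns

# Anything outside [lower, upper] counts as an outlier
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# TRUE marks values outside the bounds
is_outlier <- values < lower | values > upper
print(values[is_outlier])
```

With these values the quartiles are 4 and 6, so the bounds are 1 and 9, and only 30 falls outside them.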

R Tools for Handling Outliers

Let's use the IQR method in R. First, let's define our dataset:

R
# Create dataset
df <- data.frame(
    students = c('Alice', 'Bob', 'John', 'Ann', 'Rob'),
    scores = c(56, 11, 50, 98, 47)
)

Now, let's compute the IQR, Q1, Q3, and detect outliers:

R
# Compute Q1, Q3, and IQR
IQR_scores <- IQR(df$scores)  # 9
Q1_scores <- quantile(df$scores, 0.25)  # 47
Q3_scores <- quantile(df$scores, 0.75)  # 56

# Lower and upper bounds
lower_bound <- Q1_scores - 1.5 * IQR_scores  # 33.5
upper_bound <- Q3_scores + 1.5 * IQR_scores  # 69.5

# Detect outliers
outliers <- df[(df$scores < lower_bound) | (df$scores > upper_bound), ]
print(outliers)

Here is the output:


  students scores
2      Bob     11
4      Ann     98
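As a possible cross-check, base R's boxplot.stats() reports the points a box plot would draw beyond its whiskers, using the same 1.5 multiplier by default. The sketch below rebuilds the scores as a plain vector (the name scores is an assumption) so it can run on its own. Note that boxplot.stats() works from hinges rather than quantile() quartiles; for these five values the two coincide, but for other sample sizes they can differ slightly.

```r
# Same scores as in the lesson, rebuilt so the sketch runs on its own
scores <- c(56, 11, 50, 98, 47)

# coef = 1.5 is the default multiplier, written out here for clarity
stats <- boxplot.stats(scores, coef = 1.5)

# $out holds the values lying beyond the whiskers
print(stats$out)
```

If the two approaches disagree on a larger dataset, the difference in how quartiles are computed is the first thing to check.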

Handling Outliers: Removal

There are generally two strategies for dealing with outliers: removing them or replacing them with a representative value such as the median or the mean.

Removing outliers is the most straightforward method, but it discards rows, so you may prefer replacement when data is scarce. To remove them, we reverse the detection condition and keep every row whose score lies within the bounds.

R
# Remove outliers, storing the result separately so df keeps all five rows
df_clean <- df[(df$scores >= lower_bound & df$scores <= upper_bound), ]
print(df_clean)

The resulting data contains no outliers:


  students scores
1    Alice     56
3     John     50
5      Rob     47
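If the same filter is needed for several columns or datasets, it could be wrapped in a small function. The following is only a sketch: remove_iqr_outliers, its arguments, and scores_df are hypothetical names introduced for illustration, and the function returns a new data frame rather than modifying its input.

```r
# Hypothetical helper: keep only rows whose `column` lies within the IQR bounds
remove_iqr_outliers <- function(data, column, k = 1.5) {
    values <- data[[column]]
    quartiles <- quantile(values, c(0.25, 0.75), names = FALSE)
    spread <- quartiles[2] - quartiles[1]
    keep <- values >= quartiles[1] - k * spread &
        values <= quartiles[2] + k * spread
    # drop = FALSE keeps the result a data frame even with one column
    data[keep, , drop = FALSE]
}

# Same scores as in the lesson, rebuilt so the sketch runs on its own
scores_df <- data.frame(
    students = c('Alice', 'Bob', 'John', 'Ann', 'Rob'),
    scores = c(56, 11, 50, 98, 47)
)

cleaned <- remove_iqr_outliers(scores_df, 'scores')
print(cleaned)
```

Because the input is left untouched, the original rows stay available for trying a different strategy afterwards.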

Handling Outliers: Replacement

Alternatively, outliers can be replaced with the median. The median is much less sensitive to extreme values than the mean, which makes it a suitable replacement. The code below assumes df still holds all five original rows.

R
# Replace outliers with the median score, working on a copy of df
median_score <- median(df$scores)  # 50 for the original five scores
df_median <- df
df_median$scores[df$scores < lower_bound | df$scores > upper_bound] <- median_score
print(df_median)

Here, we select the outliers with a logical condition and set them to the median score. The median is 50, so both outlier scores are replaced with 50:


  students scores
1    Alice     56
2      Bob     50
3     John     50
4      Ann     50
5      Rob     47
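The claim that the median resists extreme values can be checked with a quick experiment. In this sketch, the second vector imagines a data-entry error in which 98 was typed as 980; that error and both variable names are hypothetical, introduced only for illustration.

```r
# Lesson scores, plus a hypothetical variant where 98 was mistyped as 980
scores_ok <- c(56, 11, 50, 98, 47)
scores_typo <- c(56, 11, 50, 980, 47)

# The median does not move when the extreme value grows
print(median(scores_ok))    # 50
print(median(scores_typo))  # 50

# The mean is pulled strongly toward the extreme value
print(mean(scores_ok))      # 52.4
print(mean(scores_typo))    # 228.8
```

This is why the median can be computed on the full column, outliers included, and still serve as a reasonable replacement value.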

Handling Outliers: Replacement with Mean

An alternative to the median is the mean of the dataset computed without the outliers. Excluding the outliers keeps the replacement value from being skewed by the very values it is meant to replace.

First, we calculate the mean of the scores that fall within the bounds, again assuming df holds the original five rows:

R
# Calculate the mean without outliers
mean_scores <- mean(df$scores[df$scores >= lower_bound & df$scores <= upper_bound])  # 51 for the original scores

Then, replace the outliers with this mean value:

R
df$scores[df$scores < lower_bound | df$scores > upper_bound] <- mean_scores
print(df)

This approach replaces each outlier with a mean that represents the bulk of the data:


  students scores
1    Alice     56
2      Bob     51
3     John     50
4      Ann     51
5      Rob     47

Note that the mean value, 51, is calculated from the three non-outlier scores (56, 50, and 47), so it reflects the central value of most data points.
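To see how the three strategies compare, the sketch below applies each of them to a fresh copy of the scores and summarizes the results. All variable names here are assumptions introduced for illustration, and replace() is used in place of indexed assignment so that the original vector is never modified.

```r
# Same scores as in the lesson, rebuilt so the sketch runs on its own
scores <- c(56, 11, 50, 98, 47)

q1 <- quantile(scores, 0.25, names = FALSE)
q3 <- quantile(scores, 0.75, names = FALSE)
is_outlier <- scores < q1 - 1.5 * (q3 - q1) | scores > q3 + 1.5 * (q3 - q1)

# Apply each strategy to the untouched original vector
removed <- scores[!is_outlier]
median_replaced <- replace(scores, is_outlier, median(scores))
mean_replaced <- replace(scores, is_outlier, mean(scores[!is_outlier]))

# Summarize how many values remain and where the mean ends up
comparison <- data.frame(
    strategy = c('original', 'removed', 'median_replaced', 'mean_replaced'),
    n = c(length(scores), length(removed),
          length(median_replaced), length(mean_replaced)),
    mean = c(mean(scores), mean(removed),
             mean(median_replaced), mean(mean_replaced))
)
print(comparison)
```

Removal leaves three values, while both replacement strategies keep all five; which trade-off is acceptable depends on how much data you can afford to lose.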

Summary

This lesson discussed what duplicates and outliers are, how they affect data analysis, and how to handle them using R. The key to accurate data analysis is clean data. Now is the best time to apply these concepts to real-world data! Let's dive into some practical exercises!
