Lesson 2

In today's lesson, we will focus on **identifying and handling duplicates and outliers** to clean our dataset for a more precise analysis.

Consider a dataset containing students' details from a school. If a student's information is repeated in the dataset, we classify that as a duplicate. Duplicates can distort our data, leading to inaccurate results during the analysis.

```r
# Create DataFrame
df <- data.frame(
  Name = c('John', 'Anna', 'Peter', 'John', 'Anna'),
  Age = c(16, 15, 13, 16, 15),
  Grade = c(9, 10, 7, 9, 10)
)
```

R provides built-in functions for handling duplicates in a dataset. Here's how you can identify duplicates:

```r
# Identify duplicates
print(df[duplicated(df),])
# 4 John 16 9
# 5 Anna 15 10
```

The `duplicated()` function in R flags duplicate rows. This function can also be used to remove duplicate rows:

```r
# Remove duplicates
df <- df[!duplicated(df),]
print(df)
```

After removing the duplicates, your data is clean and ready!

```
   Name Age Grade
1  John  16     9
2  Anna  15    10
3 Peter  13     7
```
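By default, `duplicated()` compares entire rows. The following is a minimal sketch, using the same student data under the hypothetical name `students`, of two common variations: counting duplicates before dropping them, and flagging duplicates on a single column. Treating `Name` as the key is an assumption made purely for illustration.

```r
# Recreate the example data so this snippet runs on its own
students <- data.frame(
  Name = c('John', 'Anna', 'Peter', 'John', 'Anna'),
  Age = c(16, 15, 13, 16, 15),
  Grade = c(9, 10, 7, 9, 10)
)

# Count fully duplicated rows before removing anything
n_dupes <- sum(duplicated(students))
print(n_dupes)

# unique() on a data frame keeps the first occurrence of each row,
# giving the same result as students[!duplicated(students), ]
deduped <- unique(students)
print(deduped)

# Flag duplicates on one key column instead of the whole row
# (using Name as the key is an assumption for illustration)
dupes_by_name <- students[duplicated(students$Name), ]
print(dupes_by_name)
```

Checking the count first is a cheap way to see how much data a deduplication step is about to drop.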

An outlier is a data point that differs markedly from the other data points in the same dataset. For instance, in a dataset of primary school students' ages, an age like 98 would be an outlier.

Outliers can be detected visually using tools like box plots and scatter plots, or through statistical methods such as the Z-score or the IQR (interquartile range). Today, we will use the *IQR method* to detect outliers.

Here's a brief reminder: a value is considered an outlier if it lies more than `1.5 * IQR` below `Q1` (the first quartile) or more than `1.5 * IQR` above `Q3` (the third quartile), where `IQR = Q3 - Q1`.

Let's use the IQR method in R. First, let's define our dataset:

```r
# Create dataset
df <- data.frame(
  students = c('Alice', 'Bob', 'John', 'Ann', 'Rob'),
  scores = c(56, 11, 50, 98, 47)
)
```

Now, let's compute the IQR, Q1, Q3, and detect outliers:

```r
# Compute Q1, Q3, and IQR
IQR_scores <- IQR(df$scores) # 9
Q1_scores <- quantile(df$scores, 0.25) # 47
Q3_scores <- quantile(df$scores, 0.75) # 56

# Lower and Upper Bounds
lower_bound <- Q1_scores - 1.5 * IQR_scores # 33.5
upper_bound <- Q3_scores + 1.5 * IQR_scores # 69.5

# Detect outliers
outliers <- df[(df$scores < lower_bound) | (df$scores > upper_bound),]
print(outliers)
```

Here is the output:

```
  students scores
2      Bob     11
4      Ann     98
```
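The detection steps above can be wrapped into a small reusable function. This is a minimal sketch: `iqr_outlier_flags` is a hypothetical helper name rather than a built-in, and its default multiplier `k = 1.5` follows the rule described earlier. The last line cross-checks the result against base R's `boxplot.stats()`, which applies a similar rule.

```r
# Hypothetical helper: returns TRUE for values outside the IQR fences
iqr_outlier_flags <- function(x, k = 1.5) {
  q1 <- quantile(x, 0.25, names = FALSE)
  q3 <- quantile(x, 0.75, names = FALSE)
  iqr <- q3 - q1
  (x < q1 - k * iqr) | (x > q3 + k * iqr)
}

scores <- c(56, 11, 50, 98, 47)
flags <- iqr_outlier_flags(scores)
print(scores[flags])

# boxplot.stats() uses hinges rather than quantile(), so on other
# datasets its fences (and flagged values) can differ slightly
print(boxplot.stats(scores)$out)
```

Returning a logical vector instead of the outlier rows keeps the helper flexible: the same flags can be used to select, remove, or replace values.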

There are generally two strategies for dealing with outliers: removing them, or replacing them with a representative value such as the median.

Removing outliers is the most straightforward method. However, it discards data, so you might prefer one of the other methods. To apply it, let's reverse the condition to select everything except the outliers.

```r
# Remove outliers from data
df <- df[(df$scores >= lower_bound & df$scores <= upper_bound),]
print(df)
```

Here is the resulting data, with no outliers included:

```
  students scores
1    Alice     56
3     John     50
5      Rob     47
```
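Because the removal step overwrites `df`, the outlier rows are no longer available to any later step. A minimal sketch of a non-destructive alternative follows, assuming the same scores data; the names `scores_df`, `keep`, and `df_no_outliers` are hypothetical and chosen for illustration.

```r
# Recreate the example data so this snippet runs on its own
scores_df <- data.frame(
  students = c('Alice', 'Bob', 'John', 'Ann', 'Rob'),
  scores = c(56, 11, 50, 98, 47)
)

# Same IQR fences as in the lesson
q1 <- quantile(scores_df$scores, 0.25)
q3 <- quantile(scores_df$scores, 0.75)
iqr <- IQR(scores_df$scores)
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Keep the original intact and store the filtered copy separately
keep <- scores_df$scores >= lower & scores_df$scores <= upper
df_no_outliers <- scores_df[keep, ]

# Report how much data the filter removed
cat("Rows removed:", sum(!keep), "of", nrow(scores_df), "\n")
```

Keeping the original data frame untouched makes it possible to compare removal against the replacement strategies on the same starting data.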

Alternatively, outliers can be replaced with median values. The median value is less susceptible to outliers and hence suitable for replacement.

```r
# Replace outliers with the median score
# (run on the original five-row df, before any outliers were removed)
median_score <- median(df$scores)
df$scores[df$scores < lower_bound | df$scores > upper_bound] <- median_score
print(df)
```

Here, we select outliers using boolean selection and make them equal to the median score. The median is `50`, hence outlier scores are replaced with `50`:

```
  students scores
1    Alice     56
2      Bob     50
3     John     50
4      Ann     50
5      Rob     47
```
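As a variation on the replacement above, the adjusted values can be written to a new column so the raw scores remain available for comparison. This is a minimal sketch; the data frame name `scores_df` and the column name `scores_adj` are assumptions made for illustration.

```r
# Recreate the example data so this snippet runs on its own
scores_df <- data.frame(
  students = c('Alice', 'Bob', 'John', 'Ann', 'Rob'),
  scores = c(56, 11, 50, 98, 47)
)

# IQR fences computed from the raw scores
q1 <- quantile(scores_df$scores, 0.25, names = FALSE)
q3 <- quantile(scores_df$scores, 0.75, names = FALSE)
lower <- q1 - 1.5 * (q3 - q1)
upper <- q3 + 1.5 * (q3 - q1)

# Median of the raw scores, computed before any replacement
median_score <- median(scores_df$scores)

# ifelse() picks the median for outliers and the raw score otherwise
is_outlier <- scores_df$scores < lower | scores_df$scores > upper
scores_df$scores_adj <- ifelse(is_outlier, median_score, scores_df$scores)
print(scores_df)
```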

An alternative to replacing outliers with the median is using the dataset's mean, excluding the outliers. This method ensures that the replacement value reflects the central tendency of the main distribution of data without being skewed by the extreme values.

First, we need to calculate the mean of the data, excluding the outliers:

```r
# Calculate the mean without outliers
# (run on the original five-row df, not the median-replaced one)
mean_scores <- mean(df$scores[(df$scores >= lower_bound & df$scores <= upper_bound)])
```

Then, replace the outliers with this mean value:

```r
df$scores[df$scores < lower_bound | df$scores > upper_bound] <- mean_scores
```

This approach replaces outliers with a mean score that is representative of the bulk of the data, ensuring a more balanced dataset:

```
  students scores
1    Alice     56
2      Bob     51
3     John     50
4      Ann     51
5      Rob     47
```

Note that the mean value `51` is calculated without the outliers (`(56 + 50 + 47) / 3 = 51`), offering a more accurate depiction of the central value of most data points.
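To see why the outliers are excluded before averaging, the sketch below compares the mean of all scores with the mean of the non-outlier scores, then performs the replacement on a fresh copy of the data. The names `scores_df`, `is_outlier`, `mean_all`, and `mean_inliers` are hypothetical and used only for illustration.

```r
# Recreate the example data so this snippet runs on its own
scores_df <- data.frame(
  students = c('Alice', 'Bob', 'John', 'Ann', 'Rob'),
  scores = c(56, 11, 50, 98, 47)
)

# IQR fences computed from the raw scores
q1 <- quantile(scores_df$scores, 0.25, names = FALSE)
q3 <- quantile(scores_df$scores, 0.75, names = FALSE)
lower <- q1 - 1.5 * (q3 - q1)
upper <- q3 + 1.5 * (q3 - q1)

# Flag outliers once, before the scores are modified
is_outlier <- scores_df$scores < lower | scores_df$scores > upper

# Mean of all scores versus mean of the non-outlier scores only
mean_all <- mean(scores_df$scores)
mean_inliers <- mean(scores_df$scores[!is_outlier])
print(c(all = mean_all, without_outliers = mean_inliers))

# Replace the flagged values with the non-outlier mean
scores_df$scores[is_outlier] <- mean_inliers
print(scores_df)
```

Computing the flags once and reusing them avoids re-evaluating the condition after the column has already been changed.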

This lesson discussed what duplicates and outliers are, their implications on data analysis, and how to handle them using R. The key to accurate data analysis is clean data. Now is the best time to apply these concepts to real-world data! Let's dive into some practical exercises!