Lesson 1

Identifying and Handling Missing Values in the R Data Cleaning Process

Introduction and Overview

Welcome! Today's focus is 'Identifying and Handling Missing Values,' a crucial step in data cleaning that helps ensure our dataset is complete. Because reliable analysis depends on it, we'll look closely at how to find and handle missing values.

The Art of Data Cleaning

Imagine untangling a pile of necklaces; it's a tedious but necessary task for utilizing each piece. Likewise, datasets may contain chaos, such as misspellings, incorrect data types, and even missing values, all of which need order. This sorting process is known as 'Data Cleaning'.

Identifying Missing Values

Missing values in R usually appear as NA. The is.na() function detects them: it returns a logical object of the same shape as its input (a vector for a vector, a matrix for a data frame) that holds TRUE where a value is missing and FALSE where it is not.

Let's examine this functionality using a small dataset:

R
data <- data.frame(
  "A" = c(2, 4, NA, 8),
  "B" = c(5, NA, 7, 9),
  "C" = c(12, 13, 14, NA)
)

# Identify missing values
print(is.na(data))

Output is:

         A     B     C
[1,] FALSE FALSE FALSE
[2,] FALSE  TRUE FALSE
[3,]  TRUE FALSE FALSE
[4,] FALSE FALSE  TRUE

Using this, we've identified the missing values.
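On a larger dataset the full logical matrix is hard to read, so it often helps to summarize it. The following is a minimal sketch using only base R on the same example data; the variable names (any_missing, missing_per_column, rows_with_na) are illustrative choices, not part of the lesson.

```r
# Same example data frame as above
data <- data.frame(
  A = c(2, 4, NA, 8),
  B = c(5, NA, 7, 9),
  C = c(12, 13, 14, NA)
)

# TRUE if at least one value in the whole data frame is missing
any_missing <- any(is.na(data))

# Number of missing values in each column
missing_per_column <- colSums(is.na(data))

# Row numbers that contain at least one missing value
rows_with_na <- which(!complete.cases(data))

print(any_missing)
print(missing_per_column)
print(rows_with_na)
```

For this data, each column has exactly one missing value, and rows 2, 3, and 4 are the incomplete ones.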

Handling Missing Values

After identifying missing values, R provides several strategies to handle them. We will use two functions:

  • replace_na(): Replaces the missing values with a value you specify. It comes from the tidyr package, which is loaded as part of the tidyverse.
  • na.omit(): Removes the rows that contain missing values. It is part of base R (the stats package), so it works even without the tidyverse.

Note that both functions return a new data frame. If you want to update the original data, you'll need to reassign the result.

Let's apply these strategies:

  • Replacing:
R
library(tidyverse)

# Fill missing values with 0
data <- replace_na(data, list("A" = 0, "B" = 0, "C" = 0))
print(data)

Output is:

  A B  C
1 2 5 12
2 4 0 13
3 0 7 14
4 8 9  0

The missing values were replaced with 0.

  • Removing:
R
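# Start again from the original data, since the previous step replaced every NA
data <- data.frame("A" = c(2, 4, NA, 8), "B" = c(5, NA, 7, 9), "C" = c(12, 13, 14, NA))

# Remove rows with missing values
data <- na.omit(data)
print(data)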

Output is:

  A B  C
1 2 5 12

In the latter example, all rows with NA were deleted, leaving us with just one row.
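Dropping every incomplete row can discard a lot of data, as the single surviving row shows. If only some columns matter for an analysis, one option is to drop rows based on those columns alone. The sketch below uses base R's complete.cases(); the choice of columns A and B and the name data_ab are assumptions made purely for illustration.

```r
# Same example data frame as above
data <- data.frame(
  A = c(2, 4, NA, 8),
  B = c(5, NA, 7, 9),
  C = c(12, 13, 14, NA)
)

# TRUE for rows where both A and B are present; column C is not checked
keep <- complete.cases(data[, c("A", "B")])

# Keep only those rows; a missing value in C may remain
data_ab <- data[keep, ]
print(data_ab)
```

Here rows 1 and 4 are kept, and row 4 still carries its NA in column C, which can then be handled separately.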

Note that when you load the tidyverse library, R prints a startup message. It is not an error: the message shows the versions of the core packages loaded and highlights any function name conflicts between packages. For this course these conflicts are not a problem, so you may ignore the message. In general, a systematic way to handle conflicts is to use the conflicted package, but we won't cover it in this course.
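Independently of any conflict-handling package, a function can always be called with its package name spelled out using the :: operator, which removes any ambiguity about which version runs. A minimal sketch with na.omit(), which lives in the stats package; the variable name clean is illustrative.

```r
# Same example data frame as above
data <- data.frame(
  A = c(2, 4, NA, 8),
  B = c(5, NA, 7, 9),
  C = c(12, 13, 14, NA)
)

# stats::na.omit() names the package explicitly, so the call does not depend
# on which other packages are loaded or in what order
clean <- stats::na.omit(data)
print(clean)
```

The same pattern applies to names that the tidyverse message reports as conflicting, such as dplyr::filter() and stats::filter().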

Handling Missing Values in One Column

The replace_na() call shown earlier supplied a replacement for every column at once, which might not always be ideal. Often you'll want to handle missing values in specific columns individually. In an R data frame, a single column can be accessed with the $ operator, just like an element of a list, and replace_na() also accepts a single vector:

R
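library(tidyverse)

# Recreate the original data so that all three columns contain an NA again
data <- data.frame("A" = c(2, 4, NA, 8), "B" = c(5, NA, 7, 9), "C" = c(12, 13, 14, NA))

# Fill missing values of column "A" with 0
data$A <- replace_na(data$A, 0)
print(data)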

Output is:

  A  B  C
1 2  5 12
2 4 NA 13
3 0  7 14
4 8  9 NA

Now, there are no more missing values in the "A" column.
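The same single-column fill can also be written without the tidyverse. This sketch uses base R logical indexing and is offered as an equivalent alternative, not as the lesson's own method.

```r
# Same example data frame as above
data <- data.frame(
  A = c(2, 4, NA, 8),
  B = c(5, NA, 7, 9),
  C = c(12, 13, 14, NA)
)

# is.na(data$A) marks the missing positions in column A;
# assigning to that selection overwrites only those positions
data$A[is.na(data$A)] <- 0
print(data)
```

Columns B and C are untouched, so their NA values remain, just as in the replace_na() version.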

Real-World Implications

Missing values in real-world datasets are inevitable. Whether it's a company's financial data or a hospital's patient records, missing values are present and need to be properly managed, because they can significantly affect the outcome of our analysis.

A common technique for filling missing values is to use the average value. Here's an example using incomplete age values:

R
library(tidyverse)

# Create a simple dataframe
data <- data.frame(
  "name" = c('Alice', 'Bob', 'Charlie', 'David', 'Eve'),
  "age" = c(25, NA, 35, NA, 45)
)

# Filling missing values with mean
mean_age <- mean(data$age, na.rm = TRUE)
data$age <- replace_na(data$age, mean_age)

print(data)

Output is:

     name age
1   Alice  25
2     Bob  35
3 Charlie  35
4   David  35
5     Eve  45

In the above example, we first create a dataframe with names and ages, where some age values are missing (represented as NA in the dataframe). To fill the missing age values with the mean age, we first compute mean(data$age, na.rm = TRUE) and then pass the result to replace_na() as the replacement value. Setting the na.rm parameter to TRUE makes mean() ignore missing values; without it, the mean of a vector that contains NA is itself NA.
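Once values are filled in, the data no longer shows which ages were originally missing, and the mean can be pulled around by extreme values. One possible variation, sketched here with base R only, records a flag column first and fills with the median instead. The column name age_was_missing is a hypothetical choice, and using the median is an alternative to the lesson's mean-based approach, not a replacement for it.

```r
# Same example data frame as above
data <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age = c(25, NA, 35, NA, 45)
)

# Record which ages were missing before filling them (hypothetical column name)
data$age_was_missing <- is.na(data$age)

# Median of the known ages; na.rm = TRUE skips the NA entries
median_age <- median(data$age, na.rm = TRUE)

# Overwrite only the missing positions with the median
data$age[is.na(data$age)] <- median_age

print(data)
```

For this small dataset the median of the known ages is also 35, so the filled values match the mean-based result, while the flag column keeps track of the rows for Bob and David.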

Lesson Summary

Congratulations! You've learned how to identify and handle missing values using R. Now, prepare for some hands-on exercises to apply these concepts to various datasets. This lesson is an opportunity to solidify your understanding and refine your skills in managing missing values. Happy coding!
