Welcome! Today's focus is 'Identifying and Handling Missing Values,' a crucial step in data cleansing that ensures the completeness of our dataset. Because complete data is essential for reliable analysis, we'll delve into the complexities of finding and handling missing values.
Imagine untangling a pile of necklaces; it's a tedious but necessary task for utilizing each piece. Likewise, datasets may contain chaos, such as misspellings, incorrect data types, and even missing values, all of which need order. This sorting process is known as 'Data Cleaning'.
Missing values often appear as NA. The R language simplifies their detection using the is.na() function. This function returns a logical result with the same shape as its input, containing TRUE in place of missing values and FALSE in place of non-missing values. For a data frame, the result is a logical matrix.
Let's examine this functionality using a small dataset:
    data <- data.frame(
      "A" = c(2, 4, NA, 8),
      "B" = c(5, NA, 7, 9),
      "C" = c(12, 13, 14, NA)
    )

    # Identify missing values
    print(is.na(data))
Output is:
             A     B     C
    [1,] FALSE FALSE FALSE
    [2,] FALSE  TRUE FALSE
    [3,]  TRUE FALSE FALSE
    [4,] FALSE FALSE  TRUE
Using this, we've identified the missing values.
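Once the logical matrix is available, a natural follow-up is to summarize it. The sketch below uses only base R and the same toy data frame to count the missing values per column and list their positions. The variable names na_counts and na_positions are illustrative and not part of the lesson.

```r
# Same toy data frame as above
data <- data.frame(
  "A" = c(2, 4, NA, 8),
  "B" = c(5, NA, 7, 9),
  "C" = c(12, 13, 14, NA)
)

# TRUE counts as 1, so column sums give the number of NAs per column
na_counts <- colSums(is.na(data))
print(na_counts)

# Does the data frame contain any missing value at all?
print(anyNA(data))

# Row and column indices of each missing value
na_positions <- which(is.na(data), arr.ind = TRUE)
print(na_positions)
```

A per-column count like this can help decide whether a column is better filled in or dropped.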
After identifying missing values, R provides several strategies for handling them. We will use two functions:
- replace_na(): Replaces the missing values. It comes from the tidyr package, which is loaded as part of tidyverse.
- na.omit(): Removes the rows that contain missing values. It is part of base R (the stats package), so it works even without tidyverse.
Note that both functions return a new data frame. If you want to update the original data, you'll need to re-assign it.
Let's apply these strategies:
- Replacing:
    library(tidyverse)

    # Fill missing values with 0
    data <- replace_na(data, list("A" = 0, "B" = 0, "C" = 0))
    print(data)
Output is:
      A B  C
    1 2 5 12
    2 4 0 13
    3 0 7 14
    4 8 9  0
The missing values were replaced with 0.
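- Removing (applied to the original data from the first example, before any replacement):
    # Assumes data is the original data frame, which still contains NA values
    # Remove rows with missing values
    data <- na.omit(data)
    print(data)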
Output is:
      A B  C
    1 2 5 12
In this example, every row containing an NA was deleted, leaving us with just one row.
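Because removing rows can discard a large share of the data, it may help to check how many rows are affected first. Here is a minimal base R sketch, assuming the original data frame with its NA values. complete.cases() is a base R function, while the names n_complete and cleaned are illustrative.

```r
data <- data.frame(
  "A" = c(2, 4, NA, 8),
  "B" = c(5, NA, 7, 9),
  "C" = c(12, 13, 14, NA)
)

# complete.cases() returns TRUE for rows without any NA
n_complete <- sum(complete.cases(data))
print(n_complete)

# Store the result under a new name so the original stays available
cleaned <- na.omit(data)
print(nrow(cleaned))
print(nrow(data))
```

Assigning the result to a new name, rather than overwriting data, keeps the original rows around in case a different strategy turns out to be preferable.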
Note that when you load the tidyverse library, you get a startup message. It is not an error: it shows the versions of the core packages loaded and highlights any function name conflicts between packages. For this course these conflicts are not a problem, so you may ignore the message. In general, a systematic way to handle conflicts is to use the conflicted package. However, we won't cover it in this particular course.
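For a single call, one lightweight way to sidestep a name conflict is to qualify the function with its package using R's :: operator. The sketch below uses only a package that ships with R, and the data frame df is illustrative. The same pattern would apply to tidyverse functions, for example tidyr::replace_na().

```r
df <- data.frame(x = c(1, NA, 3))

# Explicitly request na.omit() from the stats package,
# regardless of what other loaded packages define
kept <- stats::na.omit(df)
print(kept)
```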
In the example above, replace_na() was applied to the entire data frame, which might not always be ideal. In most instances, you'll want to handle missing values in specific columns individually. In an R data frame, columns can be accessed like list elements using the $ operator:
    library(tidyverse)

    # Assumes data is the original data frame, which still contains NA values
    # Fill missing values of column "A" with 0
    data$A <- replace_na(data$A, 0)
    print(data)
Output is:
      A  B  C
    1 2  5 12
    2 4 NA 13
    3 0  7 14
    4 8  9 NA
Now, there are no more missing values in the "A" column, while columns "B" and "C" are left unchanged.
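The same column-by-column idea can also be expressed in base R with logical indexing, which makes it easy to pick a different fill value for each column. This is a sketch and not the lesson's method; the placeholder -1 for column "B" is an arbitrary choice made for illustration.

```r
data <- data.frame(
  "A" = c(2, 4, NA, 8),
  "B" = c(5, NA, 7, 9),
  "C" = c(12, 13, 14, NA)
)

# Logical indexing: assign only where the column is NA
data$A[is.na(data$A)] <- 0
data$B[is.na(data$B)] <- -1  # a different, arbitrary placeholder for column B

# Column C is left untouched and still contains its NA
print(data)
```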
Missing values in real-world datasets are inevitable. Whether it's a company's financial data or a hospital's patient records, missing values are present and need to be properly managed, as they significantly affect the outcome of our analysis.
A common technique for filling missing values is to use the average value. Here's an example using incomplete age values:
    library(tidyverse)

    # Create a simple dataframe
    data <- data.frame(
      "name" = c('Alice', 'Bob', 'Charlie', 'David', 'Eve'),
      "age" = c(25, NA, 35, NA, 45)
    )

    # Filling missing values with mean
    mean_age <- mean(data$age, na.rm = TRUE)
    data$age <- replace_na(data$age, mean_age)

    print(data)
Output is:
         name age
    1   Alice  25
    2     Bob  35
    3 Charlie  35
    4   David  35
    5     Eve  45
In the above example, we first create a data frame with names and ages, where some age values are missing (represented as NA in the data frame). To fill the missing age values with the mean age, we compute mean(data$age, na.rm = TRUE) and pass the result to replace_na() as the replacement value. The mean() function ignores missing values (i.e., NAs) when the na.rm parameter is set to TRUE, so the mean is computed from the three known ages (25, 35, and 45) and equals 35.
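To see why na.rm matters, the base R sketch below compares mean() with and without it on the same ages. It also shows the median as an alternative fill value, which is an addition to the lesson and not part of it; the names median_age and age_filled are illustrative.

```r
age <- c(25, NA, 35, NA, 45)

# Without na.rm = TRUE, a single NA makes the result NA
print(mean(age))

# With na.rm = TRUE, the mean is taken over the observed values only
print(mean(age, na.rm = TRUE))

# The median is an alternative that is less sensitive to outliers
median_age <- median(age, na.rm = TRUE)
age_filled <- ifelse(is.na(age), median_age, age)
print(age_filled)
```

In this small example the mean and the median coincide, but on skewed data they can differ noticeably, so the choice of fill value is worth considering case by case.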
Congratulations! You've learned how to identify and handle missing values using R. Now, prepare for some hands-on exercises to apply these concepts to various datasets. This lesson is an opportunity to solidify your understanding and refine your skills in managing missing values. Happy coding!