Are you prepared for the next chapter in our statistical journey with R? In this lesson, we're focusing on quantiles and the Interquartile Range (IQR). Quantiles divide sorted data into parts that each hold an equal share of the observations, and the IQR is the range within which the middle half of our data lies. Understanding these tools is vital for describing the distribution of data and detecting outliers. Using R's built-in functions along with the `dplyr` package, we'll calculate these measures step by step.
Quantiles split sorted data into groups containing equal shares of the observations. Take, for instance, student scores divided into quartiles (four equal parts). The cut points are commonly denoted as Q1 (the 25th percentile, the point below which 25% of the data falls), Q2 (the median or 50th percentile, marking the middle point), and Q3 (the 75th percentile, the point below which 75% of the data falls).
The Interquartile Range (IQR) is the width of the zone within which the middle half of our data lies, from Q1 to Q3. Because it is resistant to outliers, it is essential when analysing data. For instance, the IQR of a salary distribution ignores the extreme values at both ends, so it gives a faithful picture of the range within which most salaries fall.
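To see this resistance to outliers in action, here is a minimal sketch. The salary figures are made-up illustrative values, not data from this lesson, and it uses the `IQR()` function that we cover later on.

```r
# Hypothetical monthly salaries (illustrative values only)
salaries <- c(3200, 3400, 3500, 3700, 3900, 4100, 4300, 4600)
salaries_with_outlier <- c(salaries, 50000)  # add one extreme salary

# The full range (max - min) reacts strongly to the single extreme value
range_before <- diff(range(salaries))
range_after  <- diff(range(salaries_with_outlier))

# The IQR moves only slightly
iqr_before <- IQR(salaries)
iqr_after  <- IQR(salaries_with_outlier)

cat("Range:", range_before, "->", range_after, "\n")  # Range: 1400 -> 46800
cat("IQR:", iqr_before, "->", iqr_after, "\n")        # IQR: 675 -> 800
```

One extreme salary stretches the range more than thirtyfold, while the IQR barely changes.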
In R, we use the `quantile()` function to calculate quantiles. In a sorted data array, quantiles are derived at specific points. Q1 is the point below which 25% of the data falls, while Q3 is the point below which 75% of the data falls. Q2, or the median, is the mid-point of the data.
These critical values assist in identifying the spread or skewness in our dataset. Let's consider a dataset of student scores:
```r
scores <- c(76, 85, 67, 45, 89, 70, 92, 82)

# Calculate Q1, Q2 (the median) and Q3
Q1 <- quantile(scores, .25)
cat("Q1:", Q1, "\n")  # 69.25

Q2 <- quantile(scores, .5)
cat("Q2 (median):", Q2, "\n")  # 79

Q3 <- quantile(scores, .75)
cat("Q3:", Q3, "\n")  # 86
```
In this code, we use the `quantile()` function to calculate Q1, Q2, and Q3. `quantile(scores, .25)` returns Q1, the point below which 25% of the data lies. Similarly, `quantile(scores, .75)` gives us Q3. Note that Q2 is the data's median, because it divides the data into two equally sized parts.
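As a compact variation on the code above, `quantile()` also accepts a vector of probabilities through its `probs` argument, so all three quartiles can come from one call. Keep in mind that R interpolates between data points (its default is `type = 7`), so other tools or textbook methods may report slightly different quartiles for the same data.

```r
scores <- c(76, 85, 67, 45, 89, 70, 92, 82)

# Pass a vector of probabilities to get all three quartiles in one call
quartiles <- quantile(scores, probs = c(0.25, 0.50, 0.75))
print(quartiles)
#   25%   50%   75% 
# 69.25 79.00 86.00 

# median() agrees with the 50% quantile
median_score <- median(scores)
cat("Median:", median_score, "\n")  # Median: 79
```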
The Interquartile Range (IQR) is computed as `Q3 - Q1`. We can use the `IQR()` function to calculate this:
```r
math_scores <- data.frame(
  Name = c('Jerome', 'Jessica', 'Jeff', 'Jennifer', 'Jackie', 'Jimmy', 'Joshua', 'Julia'),
  Score = c(56, 13, 54, 48, 49, 100, 62, 55)
)

# IQR for scores
IQR_scores <- IQR(math_scores$Score)
cat("IQR Score:", IQR_scores, "\n")  # IQR Score: 8.75
```
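As a quick sanity check, we can confirm that `IQR()` matches `Q3 - Q1` computed by hand with `quantile()`. This is a small sketch on the same math scores; the variable names are chosen only for illustration.

```r
score <- c(56, 13, 54, 48, 49, 100, 62, 55)

# Q1 and Q3 of the math scores
q <- quantile(score, probs = c(0.25, 0.75))
cat("Q1:", q[1], "Q3:", q[2], "\n")  # Q1: 48.75 Q3: 57.5

# Manual IQR versus the built-in function
iqr_manual  <- unname(q[2] - q[1])
iqr_builtin <- IQR(score)
cat("Manual IQR:", iqr_manual, "\n")    # Manual IQR: 8.75
cat("Built-in IQR:", iqr_builtin, "\n") # Built-in IQR: 8.75
```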
The IQR represents the zone in which the middle half of the scores resides. It also gives us a simple rule for detecting potential outliers: these are values lying below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`. Multiplying the IQR by 1.5 sets a reasonable boundary that encapsulates about 99.3% of the data, assuming a normal distribution. Values outside this range are considered potential outliers.
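The 99.3% figure can be checked directly for a standard normal distribution. Here is a minimal sketch using base R's `qnorm()` and `pnorm()`; it holds only under the normality assumption, and real data may behave quite differently.

```r
# Quartiles of the standard normal distribution
q1  <- qnorm(0.25)  # about -0.674
q3  <- qnorm(0.75)  # about  0.674
iqr <- q3 - q1      # about  1.349

# Outlier fences from the 1.5 * IQR rule
lower_fence <- q1 - 1.5 * iqr  # about -2.698
upper_fence <- q3 + 1.5 * iqr  # about  2.698

# Probability mass that falls between the two fences
coverage <- pnorm(upper_fence) - pnorm(lower_fence)
cat("Share inside the fences:", round(coverage, 4), "\n")  # 0.993
```

So under a normal distribution, only about 0.7% of values would be flagged by this rule.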
Now, let's identify and print out all the outliers using the rule we've defined above. We will use the `filter()` function from the `dplyr` package:
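```r
library(dplyr)

# Q1 and Q3 must come from the same column we are filtering
Q1_math <- quantile(math_scores$Score, .25)  # 48.75
Q3_math <- quantile(math_scores$Score, .75)  # 57.5

# Keep only the rows that fall outside the 1.5 * IQR fences
outliers <- filter(math_scores,
                   Score < (Q1_math - 1.5 * IQR_scores) | Score > (Q3_math + 1.5 * IQR_scores))
print(outliers)

#      Name Score
# 1 Jessica    13
# 2   Jimmy   100
```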
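If you would rather not depend on `dplyr`, the same rule can be wrapped in a small base R function. Note that `iqr_outliers()` below is a hypothetical helper written for this illustration, not a function from base R or `dplyr`, and the multiplier `k` is an assumed parameter that defaults to the 1.5 used above.

```r
# Hypothetical helper: TRUE for values outside the k * IQR fences
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]  # the IQR
  x < (q[1] - k * spread) | x > (q[2] + k * spread)
}

math_scores <- data.frame(
  Name = c('Jerome', 'Jessica', 'Jeff', 'Jennifer', 'Jackie', 'Jimmy', 'Joshua', 'Julia'),
  Score = c(56, 13, 54, 48, 49, 100, 62, 55)
)

# Logical mask, then base R row subsetting (original row numbers are kept)
flagged <- iqr_outliers(math_scores$Score)
print(math_scores[flagged, ])
#      Name Score
# 2 Jessica    13
# 6   Jimmy   100
```

Because the function computes the quartiles from the vector it receives, the fences always belong to the data being checked.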
Kudos to you! You've grasped two crucial statistical measures, quantiles and the Interquartile Range, and learned how to compute them using R.
In the upcoming lesson, we will put these concepts into practice. Prepare for hands-on exercises to gain mastery of these vital concepts. So, let’s get started. Are you ready? Happy learning!