Welcome back! Our journey into Descriptive Statistics continues with Measures of Dispersion. These measures, which include the range, variance, and standard deviation, inform us about the extent to which our data is spread. R's built-in statistical functions offer all we need to thoroughly understand dispersion in our data. Let's dive right in!
Measures of Dispersion capture the spread within a dataset. For example, knowing the average test score (a Measure of Centrality) isn't enough. Understanding how individual scores vary around that average provides a fuller picture, because two classes can share the same average while one has far more scattered results than the other.
The graph below illustrates two normal distributions with different standard deviations. The standard deviation measures how far data points typically lie from the mean. Compare the widths of the two curves: the narrower blue curve corresponds to a smaller standard deviation, meaning most data points lie close to the mean. In contrast, the wider green curve corresponds to a larger standard deviation, meaning data points vary more widely around the mean.
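Since the comparison is easiest to see in a picture, here is a minimal sketch that draws two such curves in base R. The mean of 0 and the standard deviations of 1 and 2 are assumed values chosen for illustration; they are not taken from the original figure.

```r
# Assumed parameters, for illustration only
x <- seq(-8, 8, length.out = 400)
narrow <- dnorm(x, mean = 0, sd = 1)  # smaller standard deviation
wide   <- dnorm(x, mean = 0, sd = 2)  # larger standard deviation

# Narrow curve in blue, wide curve in green
plot(x, narrow, type = "l", col = "blue", ylab = "Density",
     main = "Normal distributions with different standard deviations")
lines(x, wide, col = "green")
legend("topright", legend = c("sd = 1", "sd = 2"),
       col = c("blue", "green"), lty = 1)
```

The blue curve is taller and narrower because the same total probability is concentrated closer to the mean.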
The range, simply the difference between the highest and lowest values, shows the spread between the extremes of our dataset. R's built-in range() function returns the minimum and maximum of a set of numbers, and wrapping it in diff() gives the difference between them. Here, we calculate the range of test scores for five students:
```r
# Test scores of five students
scores <- c(72, 88, 80, 96, 85)

# Calculate and print the Range (maximum minus minimum)
range_scores <- diff(range(scores))
print(paste("Range of scores:", range_scores)) # Range of scores: 24
```
The result "Range of scores: 24", derived from 96 - 72, indicates the extent of the spread between the extreme scores.
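One detail is easy to miss: range() on its own returns the minimum and maximum rather than a single number. The short sketch below reuses the same scores to show this, along with an equivalent max() minus min() calculation; the variable names are chosen for illustration.

```r
scores <- c(72, 88, 80, 96, 85)

# range() returns a vector of length two: minimum, then maximum
extremes <- range(scores)
print(extremes) # [1] 72 96

# Two equivalent ways to get the spread between the extremes
spread_diff <- diff(extremes)
spread_manual <- max(scores) - min(scores)
print(spread_diff == spread_manual) # [1] TRUE
```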
Variance, another Measure of Dispersion, quantifies the degree to which data values differ from the mean. High variance signifies that data points are spread out, while low variance indicates closeness.
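The formula for the sample variance, which is what R's var() computes, is the following:

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

This formula measures the average of the squared differences from the Mean ($\bar{x}$), dividing by $n - 1$ rather than $n$ because the data is treated as a sample. It quantifies how spread out the data points are.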
We can calculate the variance using R's built-in var() function:
```r
# Test scores of five students
scores <- c(72, 88, 80, 96, 85)

# Calculate and print the Variance
variance_scores <- var(scores)
print(paste("Variance of scores:", variance_scores)) # Variance of scores: 80.2
```
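The output, 80.2, summarizes how much the scores vary around the average. Because the differences are squared, the variance is expressed in squared units (here, points squared), which makes it harder to interpret directly.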
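To see what var() does, the calculation can be reproduced step by step. Note that var() computes the sample variance, dividing the sum of squared differences by n - 1 rather than n; the variable names below are chosen for illustration.

```r
scores <- c(72, 88, 80, 96, 85)

n <- length(scores)
mean_score <- mean(scores)                # 84.2
squared_diffs <- (scores - mean_score)^2  # squared distance of each score from the mean

# Sample variance: sum of squared differences divided by n - 1
manual_variance <- sum(squared_diffs) / (n - 1)
print(all.equal(manual_variance, var(scores))) # [1] TRUE

# Dividing by n instead gives the smaller population variance
population_variance <- sum(squared_diffs) / n
print(round(population_variance, 2)) # [1] 64.16
```

The gap between the two results shrinks as the number of observations grows, so the distinction matters most for small samples like this one.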
The Standard Deviation is the square root of the Variance. It measures how far data points typically lie from the mean. We can compute it using the sd() function available in R.
```r
# Test scores of five students
scores <- c(72, 88, 80, 96, 85)

# Calculate and print the Standard Deviation, rounded to two decimals
std_scores <- sd(scores)
print(paste("Standard deviation of scores:", round(std_scores, 2))) # Standard deviation of scores: 8.96
```
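The relationship between the two measures can be checked directly. This small sketch confirms that sd() matches the square root of var() for the same scores.

```r
scores <- c(72, 88, 80, 96, 85)

# Standard deviation is the square root of the (sample) variance
sd_from_var <- sqrt(var(scores))
print(all.equal(sd_from_var, sd(scores))) # [1] TRUE
print(round(sd_from_var, 2)) # [1] 8.96
```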
Why is the standard deviation important when we already have variance? Unlike the variance, the standard deviation is expressed in the same units as the data, making it easier to interpret. Additionally, the standard deviation is frequently used in statistical analysis because, for data that is approximately normally distributed, values within one standard deviation of the mean account for approximately 68% of the set, while values within two standard deviations cover around 95%. These percentages aid our understanding of data dispersion in a probability distribution. Therefore, while variance provides numerical insight into data spread, the standard deviation conveys these insights in a more comprehensible and applicable manner.
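The 68% and 95% figures come from the normal distribution, so they hold only approximately for real data that is roughly bell-shaped. As a sketch, the normal-distribution values can be obtained with pnorm(), and the same check can be run on the five scores, where such a small sample need not match those percentages.

```r
# Share of a normal distribution within 1 and 2 standard deviations of the mean
within_one <- pnorm(1) - pnorm(-1)
within_two <- pnorm(2) - pnorm(-2)
print(round(c(within_one, within_two), 4)) # [1] 0.6827 0.9545

# Same idea on the sample: fraction of scores within one sd of the mean
scores <- c(72, 88, 80, 96, 85)
share_within_one <- mean(abs(scores - mean(scores)) <= sd(scores))
print(share_within_one) # [1] 0.6
```

Here three of the five scores (88, 80, and 85) fall within one standard deviation of the mean of 84.2.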
Great job! You've just dived into Measures of Dispersion with R! These skills will assist you in better interpreting and visualizing data. Remember, hands-on practice solidifies learning. Stay tuned for some practice exercises. Now, let's delve further into exploring our data!