Descriptive Statistics with R: Understanding Measures of Central Tendency

Lesson 1

Introduction to Descriptive Statistics and R

Greetings, data enthusiast! Today, we're diving into descriptive statistics using R. We'll explore three measures of central tendency (the mean, median, and mode) using R's built-in functions and, for the mode, a small custom function.

Understanding Central Tendency

Central tendency describes a 'typical' value in a dataset. The three measures, the mean (average), the median (midpoint), and the mode (most frequently occurring value), each offer a different perspective on the center. For a set of students' scores, the mean indicates average performance, the median represents the performance of the middle student, and the mode shows the most common score.

Visualizing Central Tendency

The plot for this section marks the mean of a dataset, its 'average' or central location. Imagine a seesaw balanced at its center: the mean is the point at which the data balance. It is a crucial statistical concept because it summarizes where the data are centered, although extreme values can pull it away from where most of the data lie.
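The seesaw picture can be checked numerically: the distances of the values below the mean exactly offset the distances of the values above it. Here is a minimal sketch, assuming a small made-up vector `x` that is not part of the lesson's data:

```r
# Hypothetical values chosen only to illustrate the balance point
x <- c(1, 2, 3, 10)

balance_point <- mean(x)          # 4
deviations <- x - balance_point   # -3 -2 -1  6

# Negative and positive deviations cancel, so the total is zero
total_deviation <- sum(deviations)
print(total_deviation)            # 0
```

Note how the single large value (10) sits far from the balance point and is offset by three smaller values that sit closer to it.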

Setting up the Dataset

Our dataset is a vector of individuals' ages: c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23). Remember, understanding your data upfront is vital for meaningful analysis.
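One way to get to know the data before computing anything is to look at its size, its sorted values, and how often each age occurs. This is only a sketch of a quick first look; the variable names `n`, `sorted_data`, and `freq` are illustrative:

```r
data <- c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)

n <- length(data)           # number of observations: 11
sorted_data <- sort(data)   # 21 22 22 22 23 23 23 23 24 24 24
value_range <- range(data)  # smallest and largest age: 21 24

# Frequency of each distinct age
freq <- table(data)
print(freq)                 # counts: 21 -> 1, 22 -> 3, 23 -> 4, 24 -> 3
```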

Computing Mean using R

Calculating the mean involves adding all the numbers together and then dividing by the count. Here's how to compute it using R:

R
data <- c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)
mean_val <- mean(data)  # calculates the mean
cat("Mean: ", round(mean_val, 2))  # Mean:  22.82
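To connect the definition (sum divided by count) to the built-in function, the mean can also be computed by hand. The sketch below additionally shows `mean()`'s `na.rm` argument; the `with_missing` vector is a made-up example, since the lesson's data has no missing values:

```r
data <- c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)

# Mean by definition: total divided by number of values (251 / 11)
manual_mean <- sum(data) / length(data)
print(all.equal(manual_mean, mean(data)))  # TRUE

# Hypothetical case: one age is missing
with_missing <- c(data, NA)
mean_na <- mean(with_missing)                     # NA, because of the missing value
mean_clean <- mean(with_missing, na.rm = TRUE)    # NA dropped before averaging
```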

Computing Median using R

R calculates the median, the 'middle' value in an ordered dataset, using the built-in function median(). Here is how to do it:

R
data <- c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)
median_val <- median(data)  # calculates the median
cat("Median: ", median_val)  # Median:  23
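To see what 'middle value in an ordered dataset' means in practice, here is a hand-rolled sketch. The function name `manual_median` is hypothetical, and the code is not how `median()` is implemented internally. With an odd number of values it picks the middle one; with an even number it averages the two middle values.

```r
manual_median <- function(v) {
  s <- sort(v)        # order the values first
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                  # odd count: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2   # even count: average of the two middle values
  }
}

data <- c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)
odd_result <- manual_median(data)                         # 23
even_result <- manual_median(c(20, 21, 21, 23, 23, 24))   # 22
```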

Computing Mode using R

The mode represents the most frequently occurring number(s) in a dataset. R doesn't have a built-in function to calculate the statistical mode (its mode() function reports an object's storage type instead), so we'll have to create one:

R
getmode <- function(v) {
  uniqv <- unique(v)  # distinct values, in order of first appearance
  # count each distinct value and return the one with the highest count
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

data <- c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)
mode_val <- getmode(data)  # calculates the mode
cat("Mode: ", mode_val)  # Mode:  23

This custom function, getmode(), takes a vector as input and returns the mode of that vector, i.e., the value that appears most frequently. Unique values are extracted from the vector, and the function then counts how often each unique value appears. The unique value with the highest count (most frequently appearing) is returned as the mode.
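The steps described above can be traced one at a time. This sketch simply unpacks the body of `getmode()` into intermediate variables (`positions` and `counts` are names introduced here for illustration):

```r
v <- c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)

uniqv <- unique(v)              # 23 22 24 21 (order of first appearance)
positions <- match(v, uniqv)    # 1 2 2 1 3 3 1 2 4 3 1 (index of each value in uniqv)
counts <- tabulate(positions)   # 4 3 3 1 (how often each index occurs)

# which.max() gives the index of the largest count
mode_val <- uniqv[which.max(counts)]   # 23
```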

Handling Ties in Mode with R

Great job so far! Now let's delve into an interesting concept: how the mode function we've defined in R handles ties or duplicate modes. Consider this dataset: c(20, 21, 21, 23, 23, 24). In it, 21 and 23 both appear twice and are both modes.

To calculate the mode with our function:

R
data <- c(20, 21, 21, 23, 23, 24)
mode_val <- getmode(data)
cat("Mode: ", mode_val)  # Mode:  21

Even though 21 and 23 are both modes, our calculation only returned 21. Why is that?

In cases of ties, the function getmode() returns the first occurring value among the tied modes. As such, it picked 21 over 23 because 21 is encountered first when scanning the data from left to right.
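If you would rather see every tied mode, one possible variation keeps all values whose count equals the maximum. The function name `get_all_modes` is hypothetical, and this is a sketch rather than a standard R function; it assumes a non-empty input vector.

```r
get_all_modes <- function(v) {
  uniqv <- unique(v)
  counts <- tabulate(match(v, uniqv))
  uniqv[counts == max(counts)]   # keep every value tied for the highest count
}

tied_modes <- get_all_modes(c(20, 21, 21, 23, 23, 24))                        # 21 23
single_mode <- get_all_modes(c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23))   # 23
```

Returning a vector means callers must be ready for more than one value, which is the trade-off against the simpler single-value `getmode()`.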

Choice of Measure of Central Tendency

Your choice of measure of central tendency depends on the nature of your data. For numerical data, the mean is sensitive to outliers (extreme values), so the median is often preferable when outliers are present. The mode is not informative when no value repeats or when all values occur with equal frequency. For categorical data without a natural order, the mode is the only meaningful measure of the three.
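A short experiment illustrates both points. The outlier value of 95 and the `colors` vector below are made up for this sketch and are not part of the lesson's data:

```r
ages <- c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)

# Hypothetical outlier: one much older individual joins the group
ages_with_outlier <- c(ages, 95)

mean_before <- mean(ages)                   # about 22.82
mean_after <- mean(ages_with_outlier)       # about 28.83, pulled up by the outlier
median_before <- median(ages)               # 23
median_after <- median(ages_with_outlier)   # still 23

# For unordered categories, only the mode applies
colors <- c("red", "blue", "red", "green")
color_mode <- names(which.max(table(colors)))   # "red"
```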

Wrapping Up

Kudos! You've mastered the measures of central tendency and have learned how to compute them using R! Stay tuned for some hands-on exercises for deeper reinforcement. Onward!
