Greetings, data enthusiast! Today, we're diving into descriptive statistics using R. We'll explore three measures of central tendency (the mean, median, and mode) using R's built-in functions and one small helper of our own.
Central tendency finds a 'typical' value in a dataset. Our three measures, the mean (average), median (mid-point), and mode (most frequently occurring value), each offer a unique perspective on centrality. When interpreting students' scores, for example, the mean indicates average performance, the median represents the performance of the middle student, and the mode highlights the most common score.
This plot depicts the mean of a given dataset, that is, its central location, also called the 'average'. Imagine a seesaw balanced at its center: the mean of a dataset is the point where it balances. As a crucial statistical concept, it shows where the data is centered, although extreme values can pull it away from where most of the data lies.
Our dataset is a list of individuals' ages: `c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)`. Remember, understanding your data upfront is vital for meaningful analysis.
Calculating the mean involves adding all the numbers together and then dividing by the count. Here's how to compute it using R:
```r
data <- c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)
mean_val <- mean(data)  # calculates the mean
cat("Mean: ", round(mean_val, 2))  # Mean: 22.82
```
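As a quick sanity check, the definition (sum divided by count) can be reproduced by hand with `sum()` and `length()`. The sketch below also tests the seesaw picture: deviations from the mean should cancel out. The names `manual_mean` and `deviations` are illustrative and not part of the lesson's own code.

```r
data <- c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)

# Mean by definition: total divided by the number of observations
manual_mean <- sum(data) / length(data)

# Should agree with the built-in mean() up to floating-point rounding
print(all.equal(manual_mean, mean(data)))  # [1] TRUE

# Seesaw check: deviations from the mean sum to (approximately) zero
deviations <- data - mean(data)
print(abs(sum(deviations)) < 1e-9)         # [1] TRUE
```

The comparison uses a small tolerance rather than `==` because floating-point arithmetic rarely lands on exactly zero.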
R calculates the median, the 'middle' value in an ordered dataset, using the built-in function `median()`. Here is how to do it:
```r
data <- c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)
median_val <- median(data)  # calculates the median
cat("Median: ", median_val)  # Median: 23
```
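To see what "middle value in an ordered dataset" means in practice, here is a minimal sketch that sorts the ages and picks the central position by hand. The variable names (`sorted`, `manual_median`, `even_data`) are made up for illustration, and the shortcut `(n + 1) / 2` only works for an odd number of values.

```r
data <- c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)

sorted <- sort(data)                  # order the values from smallest to largest
n <- length(sorted)                   # 11 values, so the middle is position 6
manual_median <- sorted[(n + 1) / 2]
cat("Manual median:", manual_median, "\n")  # Manual median: 23

# With an even count there is no single middle value;
# median() averages the two central values instead.
even_data <- c(21, 22, 23, 24)
cat("Median of even-length data:", median(even_data), "\n")  # 22.5
```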
The mode represents the most frequently occurring number(s) in a dataset. R doesn't have a built-in function to calculate the statistical mode (the built-in `mode()` reports how an object is stored, not its most frequent value), so we'll have to create one:
```r
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

data <- c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)
mode_val <- getmode(data)  # calculates the mode
cat("Mode: ", mode_val)  # Mode: 23
```
This custom function, `getmode()`, takes a vector as input and returns the mode of that vector, i.e., the value that appears most frequently. It first extracts the unique values from the vector, then counts how often each one appears. The unique value with the highest count is returned as the mode.
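The one-line body of the function packs several steps together. As a rough illustration, the sketch below unpacks them on the ages dataset so each intermediate result is visible; `positions` and `counts` are names introduced here for readability only.

```r
data <- c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)

uniqv <- unique(data)            # distinct values in order of first appearance
print(uniqv)                     # [1] 23 22 24 21

positions <- match(data, uniqv)  # index into uniqv for every element of data
counts <- tabulate(positions)    # how many times each index occurs
print(counts)                    # [1] 4 3 3 1

# which.max() gives the position of the largest count (the first one on ties)
print(uniqv[which.max(counts)])  # [1] 23
```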
Great job so far! Now let's delve into an interesting concept: how the `getmode()` function we've defined handles ties, that is, multiple modes. Consider this dataset: `c(20, 21, 21, 23, 23, 24)`. In it, 21 and 23 both appear twice, so both are modes.
To calculate the mode with our function:
```r
data <- c(20, 21, 21, 23, 23, 24)
mode_val <- getmode(data)
cat("Mode: ", mode_val)  # Mode: 21
```
Even though 21 and 23 are both modes, our calculation only returned 21. Why is that?
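In cases of ties, `getmode()` returns the first occurring value among the tied modes. That is because `unique()` keeps values in order of first appearance and `which.max()` returns the position of the first maximum. As such, it picked 21 over 23 because 21 is encountered first when scanning the data from left to right.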
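If you would rather see every tied value, one possible approach is to keep all unique values whose count equals the maximum. The function `getmodes()` below is a hypothetical variant written for this illustration, not part of base R or of the lesson's original code.

```r
# Hypothetical variant of getmode() that keeps every tied value
getmodes <- function(v) {
  uniqv <- unique(v)
  counts <- tabulate(match(v, uniqv))
  uniqv[counts == max(counts)]   # all values that reach the highest count
}

data <- c(20, 21, 21, 23, 23, 24)
cat("Modes:", getmodes(data), "\n")  # Modes: 21 23
```

Note that when no value repeats, this sketch returns every value in the vector, so the caller has to decide whether that counts as "no mode".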
Your choice of measure of central tendency depends on the nature of your data. For numerical data, the mean is sensitive to outliers, i.e., extreme values, so the median is often the better choice when outliers are present. The mode is undefined, or at least uninformative, when no value repeats or when all values occur equally often. For categorical data with no natural order, the mode is the only meaningful measure.
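A small experiment can make these trade-offs concrete. The sketch below appends one made-up extreme age (95) to the dataset and compares the mean and median before and after, then finds the most common label in a made-up categorical vector using `table()`. Both the outlier and the `colors` vector are assumptions added purely for illustration.

```r
ages <- c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)
ages_with_outlier <- c(ages, 95)   # 95 is a made-up extreme value

# The mean shifts noticeably: 22.82 / 28.83
cat("Mean without / with outlier:",
    round(mean(ages), 2), "/", round(mean(ages_with_outlier), 2), "\n")

# The median stays put: 23 / 23
cat("Median without / with outlier:",
    median(ages), "/", median(ages_with_outlier), "\n")

# Categorical data: counting is still possible, averaging is not
colors <- c("red", "blue", "red", "green", "red", "blue")  # made-up labels
color_counts <- table(colors)
cat("Most common color:", names(color_counts)[which.max(color_counts)], "\n")  # red
```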
Kudos! You've mastered the measures of central tendency and have learned how to compute them using R! Stay tuned for some hands-on exercises for deeper reinforcement. Onward!