Lesson 1

Greetings, data enthusiast! Today, we're diving into **descriptive statistics** using `R`. We'll explore the measures of central tendency (the mean, median, and mode) using `R`'s built-in functions, plus one small custom helper for the mode.

Central tendency identifies a '*typical*' value in a dataset. Our three measures, the **mean** (average), **median** (mid-point), and **mode** (most frequently occurring value), each offer a different perspective on centrality. When interpreting students' scores, for example, the `mean` indicates average performance, the `median` represents the performance of the middle student, and the `mode` shows the most common score.

This plot depicts the `mean` of a given dataset, that is, its central location, also known as the 'average'. Imagine a seesaw balanced at its center: the `mean` of a dataset is where it finds balance. As a crucial statistical concept, it summarizes where the data is centered, although extreme values pull it toward the skewed side.

Our dataset is a list of individuals' ages: `c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)`. Remember, understanding your data upfront is vital for meaningful analysis.

Calculating the `mean` involves adding all the numbers together and then dividing by the count. Here's how to compute it using `R`:

```R
data <- c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)
mean_val <- mean(data) # calculates the mean
cat("Mean: ", round(mean_val, 2)) # Mean: 22.82
```
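To tie `mean()` back to the definition above, here is a minimal sketch that computes the same value by hand with `sum()` and `length()`. The variable name `manual_mean` is only illustrative and not part of the lesson's code.

```R
data <- c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)

# Add all the values, then divide by how many there are
manual_mean <- sum(data) / length(data)

# The hand-computed value should agree with the built-in mean()
cat("Manual mean:", round(manual_mean, 2), "\n")
cat("Matches mean():", isTRUE(all.equal(manual_mean, mean(data))), "\n")
```

Comparing with `all.equal()` rather than `==` is a common precaution when checking floating-point results.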

`R` calculates the `median`, the 'middle' value in an ordered dataset, using the built-in function `median()`. Here is how to do it:

```R
data <- c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)
median_val <- median(data) # calculates the median
cat("Median: ", median_val) # Median: 23
```
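To make the idea of the 'middle' value concrete, the following sketch sorts the data and picks the middle element by hand. `manual_median()` is a hypothetical helper written for illustration. It also covers the even-count case, where the usual convention (the one `median()` follows for numeric data) is to average the two middle values.

```R
# Hypothetical helper: median computed from the sorted values
manual_median <- function(v) {
  s <- sort(v)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                  # odd count: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2   # even count: average the two middle values
  }
}

data <- c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)
cat("Manual median:", manual_median(data), "\n")                     # 11 values, odd count
cat("Even-count example:", manual_median(c(20, 21, 23, 24)), "\n")   # 4 values, even count
```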

The `mode` represents the most frequently occurring number(s) in a dataset. `R` doesn't have a built-in function to calculate the statistical mode (its built-in `mode()` reports an object's storage type instead), so we'll have to create one:

```R
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

data <- c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)
mode_val <- getmode(data) # calculates the mode
cat("Mode: ", mode_val) # Mode: 23
```

This custom function, `getmode()`, takes a vector as input and returns the mode of that vector, i.e., the value that appears most frequently. Unique values are extracted from the vector, and the function then counts how often each unique value appears. The unique value with the highest count (most frequently appearing) is returned as the mode.
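The one-line body of `getmode()` packs several steps together. As a rough illustration, the sketch below unpacks them into intermediate variables so each stage can be inspected. The names `positions` and `counts` are introduced here for clarity and do not appear in the original function.

```R
data <- c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)

uniqv <- unique(data)            # distinct values, in order of first appearance
positions <- match(data, uniqv)  # for each element, its index within uniqv
counts <- tabulate(positions)    # how many times each index (unique value) occurs

print(uniqv)                     # 23 22 24 21
print(counts)                    # 4 3 3 1
print(uniqv[which.max(counts)])  # 23, the value with the largest count
```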

Great job so far! Now let's delve into an interesting concept: how the `getmode()` function we've defined in `R` handles *ties*, that is, datasets with more than one mode. Consider this dataset: `c(20, 21, 21, 23, 23, 24)`. In it, `21` and `23` both appear twice and are both modes.

To calculate the mode with our function:

```R
data <- c(20, 21, 21, 23, 23, 24)
mode_val <- getmode(data)
cat("Mode: ", mode_val) # Mode: 21
```

Even though `21` and `23` are both modes, our calculation only returned `21`. Why is that?

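In cases of ties, the function `getmode()` returns the first occurring value among the tied modes. `unique()` keeps values in their order of first appearance, and `which.max()` returns the position of the first maximum, so the function picked `21` over `23` because `21` is encountered first when scanning the data from left to right.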
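If every tied value is wanted rather than only the first, one possible variation is sketched below. `getmodes()` is a hypothetical helper, not part of the lesson. It reuses the same counting idea but keeps every value whose count equals the maximum.

```R
# Hypothetical helper: return every value that shares the highest count
getmodes <- function(v) {
  uniqv <- unique(v)
  counts <- tabulate(match(v, uniqv))
  uniqv[counts == max(counts)]
}

data <- c(20, 21, 21, 23, 23, 24)
cat("Modes:", getmodes(data), "\n")  # Modes: 21 23
```

Note that on a dataset where no value repeats, this sketch returns every value, which matches the view that such data has no meaningful mode.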

Your choice of measure of central tendency depends on the nature of your data. For numerical data, the `mean` is sensitive to outliers, i.e., extreme values, which makes the `median` preferable when outliers are present. The `mode` is undefined when no particular value repeats, or when all values repeat with equal frequency. For categorical data with no natural order, the `mode` is the only meaningful measure.
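A short sketch can illustrate both points. The outlier value `95` and the `fav_colors` vector are made-up examples, not data from the lesson. The first half appends one extreme age and compares how far the mean and median move. The second half finds the mode of categorical data with `table()`.

```R
ages <- c(23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23)
ages_with_outlier <- c(ages, 95)   # 95 is a made-up extreme value

# The mean shifts noticeably, while the median stays put
cat("Mean without / with outlier:",
    round(mean(ages), 2), round(mean(ages_with_outlier), 2), "\n")
cat("Median without / with outlier:",
    median(ages), median(ages_with_outlier), "\n")

# Categorical data: count each category and pick the most frequent one
fav_colors <- c("red", "blue", "red", "green", "red", "blue")  # made-up sample
color_counts <- table(fav_colors)
cat("Mode of colors:", names(color_counts)[which.max(color_counts)], "\n")
```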

Kudos! You've mastered the measures of central tendency and have learned how to compute them using `R`! Stay tuned for some hands-on exercises for deeper reinforcement. Onward!