Lesson 1

Greetings, data enthusiast! Today, we are diving into **descriptive statistics** using Python. We'll explore three measures of central tendency (mean, median, and mode) using the Python libraries `numpy` and `scipy`.

A measure of central tendency identifies a '*typical*' value in a dataset. The three measures covered here, the **mean** (average), **median** (mid-point), and **mode** (most frequently occurring value), each offer a different perspective on centrality. For a set of students' scores, for example, the mean indicates average performance, the median represents the middle student's performance, and the mode highlights the most common score.

The mean marks a dataset's center of balance, commonly called the 'average'. Imagine a seesaw balancing at its center: the mean of a dataset is the point where it balances out. It is a crucial statistical concept and helps identify the value around which the data is centered.

Our dataset is a list of individuals' ages: `[23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23]`. Remember, understanding your data upfront is key to conducting a meaningful analysis.

Calculating the mean involves adding all numbers together and then dividing by the count. Here's how you compute it in Python:

```python
import numpy as np

data = np.array([23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23])
mean = np.mean(data)  # calculates the mean
print("Mean: ", round(mean, 2))  # Mean: 22.82
```
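To connect the "add everything, then divide by the count" rule with the library call, here is a minimal sketch that computes the mean by hand and checks the seesaw picture: the deviations above and below the mean cancel out. The names `manual_mean` and `deviations` are illustrative only.

```python
import numpy as np

data = np.array([23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23])

# Mean by definition: add all values, then divide by how many there are
manual_mean = data.sum() / len(data)

# Balance property: deviations above and below the mean cancel out
deviations = data - manual_mean

print("Manual mean:", round(manual_mean, 2))
print("Matches np.mean:", np.isclose(manual_mean, np.mean(data)))
print("Deviations sum to ~0:", np.isclose(deviations.sum(), 0.0))
```

Because of floating-point rounding, the sum of deviations is compared against zero with `np.isclose` rather than `==`.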

The median is the 'middle' value in an ordered dataset. This is how it is computed in Python:

```python
import numpy as np

data = np.array([23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23])
median = np.median(data)  # calculates the median
print("Median: ", median)  # Median: 23.0
```
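To see what "middle value in an ordered dataset" means in practice, the sketch below sorts the values and picks the middle one, averaging the two middle values when the count is even. The helper `manual_median` and the even-length list are hypothetical, written only for illustration.

```python
import numpy as np

def manual_median(values):
    """Return the median by sorting and taking the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: exactly one middle value
        return float(ordered[mid])
    # Even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

ages = [23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23]  # 11 values (odd)
even_example = [20, 21, 21, 23, 23, 24]              # 6 values (even)

odd_result = manual_median(ages)
even_result = manual_median(even_example)

print("Odd-length median:", odd_result)    # 23.0
print("Even-length median:", even_result)  # 22.0
print("Agrees with NumPy:", odd_result == np.median(ages),
      even_result == np.median(even_example))
```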

The mode represents the most frequently occurring number(s) in a dataset. To compute it, we use the `mode` function from the `scipy` library:

```python
import numpy as np
from scipy import stats

data = np.array([23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23])
mode_age = stats.mode(data)  # calculates the mode
print("Mode: ", mode_age.mode)  # Mode: 23
```

Note that the calculated `mode_age` is a result object, not a plain number. To retrieve the actual value, we read its `.mode` attribute, which is why the code prints `mode_age.mode`.

`NumPy` doesn't have a dedicated function for calculating the mode, so we are using the `SciPy` library here. We will talk more about this library and its capabilities in future lessons.
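Although NumPy has no mode function, it can count how often each value occurs, which shows why `23` comes out as the mode. The following is a rough sketch using `np.unique` with `return_counts=True`; the variable names are illustrative.

```python
import numpy as np

data = np.array([23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23])

# Sorted unique values and how many times each one appears
values, counts = np.unique(data, return_counts=True)

for value, count in zip(values, counts):
    print(f"{value} appears {count} time(s)")

# The value with the highest count is the mode
most_common = values[np.argmax(counts)]
print("Most common value:", most_common)  # 23
```

Here `21` appears once, `22` and `24` three times each, and `23` four times.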

Great job so far! Now let's explore an interesting concept: how the `mode` function from `scipy.stats` handles *ties*, that is, multiple modes.

So, what's a tie in mode? Imagine two or more different numbers that appear the same number of times in our dataset, and more often than any other value. For instance, consider this dataset: `[20, 21, 21, 23, 23, 24]`. Here, `21` and `23` both appear twice and are therefore both modes.

Let's calculate the mode using `scipy.stats`:

```python
from scipy import stats
import numpy as np

data = np.array([20, 21, 21, 23, 23, 24])
mode = stats.mode(data)
print("Mode: ", mode.mode)  # Mode: 21
```

Although `21` and `23` are both modes, our calculation only returned `21`. Why is that?

In cases of ties, `scipy.stats.mode()` returns the **smallest** value among the tied modes. So in this case, it picked `21` over `23` because `21` is the smaller value.
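If you need every tied mode rather than only the smallest, one option (not part of this lesson's toolkit, shown here as a sketch) is `multimode` from Python's standard `statistics` module, available in Python 3.8 and later. The same result can be obtained with NumPy by keeping every value whose count equals the maximum count.

```python
from statistics import multimode

import numpy as np

data = [20, 21, 21, 23, 23, 24]

# Standard library: returns a list of all values tied for most frequent
all_modes = multimode(data)
print("All modes:", all_modes)  # [21, 23]

# NumPy alternative: keep every value whose count equals the maximum count
values, counts = np.unique(data, return_counts=True)
tied = values[counts == counts.max()]
print("Tied modes via NumPy:", tied.tolist())  # [21, 23]
```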

Your choice of measure of central tendency depends on the nature of your data. For numerical data, the mean is sensitive to outliers (extreme values), so the median is often preferable when outliers are present. The mode is not informative when no value repeats or when all values occur with equal frequency. For nominal categorical data, such as colors or names, the mode is the only meaningful measure.
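The sketch below illustrates both points under made-up assumptions: a single hypothetical outlier (`95`) is appended to the ages to show how the mean shifts while the median stays put, and a small hypothetical list of colors shows the mode applied to categorical data using the standard library's `statistics.mode`.

```python
from statistics import mode

import numpy as np

ages = [23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23]
ages_with_outlier = ages + [95]  # hypothetical extreme value

# The mean is pulled toward the outlier; the median barely reacts
mean_before, mean_after = np.mean(ages), np.mean(ages_with_outlier)
median_before, median_after = np.median(ages), np.median(ages_with_outlier)

print("Mean:  ", round(mean_before, 2), "->", round(mean_after, 2))  # 22.82 -> 28.83
print("Median:", median_before, "->", median_after)                  # 23.0 -> 23.0

# Categorical data: averaging makes no sense, but the mode still works
colors = ["red", "blue", "red", "green"]  # hypothetical sample
most_common_color = mode(colors)
print("Most common color:", most_common_color)  # red
```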

Kudos! You have mastered the measures of central tendency and have learned how to compute them using `Python`! Stay tuned for some hands-on exercises for deeper reinforcement. Onward!