Greetings, data enthusiast! Today, we are diving into descriptive statistics using Python. We'll explore three measures of central tendency (mean, median, and mode) using the Python libraries `numpy` and `pandas`.
A measure of central tendency identifies a 'typical' value in a dataset. Our three measures, the mean (average), the median (mid-point), and the mode (most frequently appearing value), each offer a different perspective on centrality. When analyzing students' scores, for example, the mean indicates average performance, the median represents the middle student's performance, and the mode highlights the most common score.
The mean is a dataset's center of balance, also known as the 'average'. Imagine a seesaw balancing at its center: the mean of a dataset is the point where it balances out. It is a crucial statistical concept and, when marked on a plot, helps show where the data is centered or which way it leans.
Our dataset is a list of individuals' ages: `[23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23]`. Remember, understanding your data upfront is key to conducting a meaningful analysis.
Calculating the mean involves adding all numbers together and then dividing by the count. Here's how you compute it in Python:
```python
import numpy as np

data = np.array([23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23])
mean = np.mean(data)  # calculates the mean
print("Mean: ", round(mean, 2))  # Mean: 22.82
```
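To see the arithmetic behind `np.mean`, here is a minimal plain-Python sketch of the "add everything, then divide by the count" rule on the same ages. The names `ages`, `total`, `count`, and `manual_mean` are illustrative choices, not part of the lesson's code.

```python
# Plain-Python check of the mean: sum the values, then divide by how many there are.
ages = [23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23]

total = sum(ages)            # 251
count = len(ages)            # 11
manual_mean = total / count  # 251 / 11

print("Sum:", total, "Count:", count)
print("Mean:", round(manual_mean, 2))  # Mean: 22.82
```

The result matches the value NumPy reports, which is a quick way to confirm you understand what the library call is doing.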
The median is the 'middle' value in an ordered dataset. This is how it is computed in Python:
```python
import numpy as np

data = np.array([23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23])
median = np.median(data)  # calculates the median
print("Median: ", median)  # Median: 23.0
```
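As a rough illustration of what "middle value in an ordered dataset" means, the sketch below sorts the data and picks the middle element by hand. The helper `median_by_sorting` is a hypothetical name introduced here. For an even number of values it averages the two middle ones, which is the same convention `np.median` follows.

```python
def median_by_sorting(values):
    """Hypothetical helper: return the middle value of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: exactly one middle element.
        return ordered[mid]
    # Even count: average the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2


ages = [23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23]
print(median_by_sorting(ages))                       # 23
print(median_by_sorting([20, 21, 21, 23, 23, 24]))   # 22.0
```

Note that this helper returns the integer `23` for the ages, whereas `np.median` returns the float `23.0`. The values are equal and only the type differs.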
The mode represents the most frequently occurring number(s) in a dataset. To compute it, we use the `mode` function from the `scipy` library:
```python
import numpy as np
from scipy import stats

data = np.array([23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23])
mode_age = stats.mode(data)  # calculates the mode
print("Mode: ", mode_age.mode)  # Mode: 23 (recent SciPy; older versions may print [23])
```
Note that the calculated `mode_age` is a result object, not a plain number. To retrieve the actual value, we use the object's `.mode` attribute, which is why the code prints `mode_age.mode`.
NumPy doesn't have a function for calculating the mode, so we are using the SciPy module here. We will talk more about this module and its capabilities in future lessons.
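Although NumPy has no dedicated mode function, the mode can still be derived from value counts. The following is a minimal sketch, not the lesson's method: it uses `np.unique` with `return_counts=True` and `np.argmax`, and the variable names are illustrative.

```python
import numpy as np

data = np.array([23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23])

# np.unique returns the sorted distinct values and how often each occurs.
values, counts = np.unique(data, return_counts=True)
# values -> [21 22 23 24], counts -> [1 3 4 3]

# np.argmax gives the index of the first largest count.
most_frequent = values[np.argmax(counts)]
print("Mode:", most_frequent)  # Mode: 23
```

Because `values` is sorted and `np.argmax` returns the first maximum, this sketch reports the smallest value when several values share the highest count.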
Great job so far! Now let's explore an interesting concept: how the `mode` function from `scipy.stats` handles ties, that is, multiple modes.
So, what's a tie in the mode? Imagine two or more different numbers that each appear the same, highest number of times in our dataset. For instance, consider this dataset: `[20, 21, 21, 23, 23, 24]`. Here, `21` and `23` both appear twice, more often than any other value, and are therefore both modes.
Let's calculate the mode using `scipy.stats`:
```python
from scipy import stats
import numpy as np

data = np.array([20, 21, 21, 23, 23, 24])
mode = stats.mode(data)
print("Mode: ", mode.mode)  # Mode: 21
```
Although `21` and `23` are both modes, our calculation only returned `21`. Why is that?
In cases of ties, `scipy.stats.mode()` returns the smallest value amongst the tied modes. So in this case, it picked `21` over `23` because `21` is the smaller value.
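If you need every tied mode rather than just the smallest, one possible approach is to count the values yourself. The sketch below uses `collections.Counter` from the standard library. It is an alternative technique, not something `scipy.stats.mode()` does, and the names `highest` and `all_modes` are illustrative.

```python
from collections import Counter

data = [20, 21, 21, 23, 23, 24]

counts = Counter(data)           # maps each value to its frequency
highest = max(counts.values())   # the top frequency (2 here)

# Keep every value whose frequency equals the top frequency.
all_modes = sorted(value for value, freq in counts.items() if freq == highest)
print("All modes:", all_modes)   # All modes: [21, 23]
```

Taking `all_modes[0]` reproduces the smallest-value result described above.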
Your choice of measure of central tendency depends on the nature of your data. For numerical data, the mean is sensitive to outliers (extreme values), so the median is often the better choice when outliers are present. The mode is undefined when no value repeats or when all values occur with equal frequency. For categorical data, the mode is the only meaningful measure.
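The sketch below illustrates both points using Python's built-in `statistics` module. The outlier value `95` and the `colors` list are made-up data for demonstration only and do not come from the lesson's dataset.

```python
import statistics

ages = [23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23]
ages_with_outlier = ages + [95]  # 95 is a hypothetical extreme value

# One outlier pulls the mean up noticeably, while the median does not move.
print(round(statistics.mean(ages), 2), statistics.median(ages))
# 22.82 23
print(round(statistics.mean(ages_with_outlier), 2), statistics.median(ages_with_outlier))
# 28.83 23.0

# Categorical data cannot be averaged or meaningfully ordered, but it has a mode.
colors = ["red", "blue", "red", "green"]  # hypothetical categorical data
print(statistics.mode(colors))  # red
```

A single extreme age shifts the mean by about six years but leaves the median at 23, which is why the median is often preferred for skewed data.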
Kudos! You have learned the measures of central tendency and how to compute them using Python! Stay tuned for some hands-on exercises to reinforce these ideas. Onward!