Welcome back! Our journey into Descriptive Statistics continues with Measures of Dispersion. These measures, including range, variance, and standard deviation, inform us about the extent to which our data is spread out. We'll use Python's `numpy` and `pandas` libraries to paint a comprehensive picture of our data's dispersion. Let's dive right in!
Measures of Dispersion capture the spread within a dataset. For example, the average test score (a Measure of Centrality) tells us where the scores are centered, but not whether they cluster tightly around that average or vary widely. Knowing how the scores vary around the average gives a fuller picture, which is vital in everyday data analysis.
This graph illustrates two normal distributions with different standard deviations. Standard deviation measures how far data points typically lie from the mean. Notice the width of each curve: the narrower blue curve reflects a smaller standard deviation, with most data points close to the mean. In contrast, the wider green curve reflects a larger standard deviation, with data points varying more widely around the mean.
The Range, simply the difference between the highest and lowest values, illustrates the spread between the extremes of our dataset. Python's `numpy` library has a function, `ptp()` (peak to peak), to calculate the range. Here are the test scores of five students:
```python
import numpy as np

# Test scores of five students
scores = np.array([72, 88, 80, 96, 85])

# Calculate and print the Range
range_scores = np.ptp(scores)
print(f"Range of scores: {range_scores}")  # Range of scores: 24
```
The result "Range of scores: 24", derived from `96 - 72`, tells us how widely the extreme scores are spread out.
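As a quick cross-check, the same range can be obtained by subtracting the minimum from the maximum. The sketch below reuses the five test scores; the variable names `manual_range` and `ptp_range` are illustrative and not part of the lesson.

```python
import numpy as np

# Test scores of five students (same data as above)
scores = np.array([72, 88, 80, 96, 85])

# Range by definition: highest value minus lowest value
manual_range = scores.max() - scores.min()

# np.ptp should agree with the manual calculation
ptp_range = np.ptp(scores)

print(f"Manual range: {manual_range}")  # Manual range: 24
print(f"np.ptp range: {ptp_range}")     # np.ptp range: 24
```

Because the range uses only the two extreme values, a single unusually high or low score can change it substantially.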
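Variance, another Measure of Dispersion, quantifies how far data values lie from the mean: it is the average of the squared differences between each value and the mean. High variance signifies that data points are spread out; conversely, low variance indicates that they lie close to the mean. We calculate the variance using `numpy`'s `var()` function, which by default computes the population variance (dividing by the number of values):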
```python
import numpy as np

# Test scores of five students
scores = np.array([72, 88, 80, 96, 85])

# Calculate and print the Variance (rounded to two decimals for display)
variance_scores = np.var(scores)
print(f"Variance of scores: {variance_scores:.2f}")  # Variance of scores: 64.16
```
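The output, 64.16, is the average squared deviation of the scores from their mean of 84.2. Because the deviations are squared, variance is expressed in squared units (here, points squared).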
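To see where this number comes from, the calculation can be reproduced step by step. This is a minimal sketch using the same five scores; the intermediate variable names are illustrative. It also shows the sample variance, which divides by `n - 1` instead of `n` and is requested in NumPy with `ddof=1`.

```python
import numpy as np

# Test scores of five students (same data as above)
scores = np.array([72, 88, 80, 96, 85])

# Step 1: the mean of the scores
mean_score = scores.mean()

# Step 2: squared deviation of each score from the mean
squared_deviations = (scores - mean_score) ** 2

# Step 3: population variance is the average of the squared deviations
population_variance = squared_deviations.mean()

# Sample variance divides by n - 1 instead of n (ddof=1 in NumPy)
sample_variance = np.var(scores, ddof=1)

print(f"Mean: {mean_score:.1f}")                          # Mean: 84.2
print(f"Population variance: {population_variance:.2f}")  # Population variance: 64.16
print(f"Sample variance: {sample_variance:.2f}")          # Sample variance: 80.20
```

The population form matches what `np.var(scores)` returns by default. The sample form is the usual choice when the five scores are treated as a sample drawn from a larger group.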
Standard Deviation is rooted in Variance: it is simply the square root of the Variance. It summarizes how far data points typically lie from the mean. We can compute it with the `std()` function available in `numpy`.
```python
import numpy as np

# Test scores of five students
scores = np.array([72, 88, 80, 96, 85])

# Calculate and print the Standard Deviation (rounded to two decimals for display)
std_scores = np.std(scores)
print(f"Standard deviation of scores: {std_scores:.2f}")  # Standard deviation of scores: 8.01
```
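The relationship to variance can be checked directly. The sketch below, with illustrative variable names, confirms that `np.std` equals the square root of `np.var` for the five scores, and then lists which scores fall within one standard deviation of the mean.

```python
import numpy as np

# Test scores of five students (same data as above)
scores = np.array([72, 88, 80, 96, 85])

mean_score = scores.mean()
std_scores = np.std(scores)

# Standard deviation is the square root of the variance
std_from_variance = np.sqrt(np.var(scores))

# Interval covering one standard deviation on either side of the mean
lower, upper = mean_score - std_scores, mean_score + std_scores
within_one_std = scores[(scores >= lower) & (scores <= upper)]

print(f"Std via np.std: {std_scores:.2f}")            # Std via np.std: 8.01
print(f"Std via sqrt(var): {std_from_variance:.2f}")  # Std via sqrt(var): 8.01
print(f"Interval: {lower:.2f} to {upper:.2f}")        # Interval: 76.19 to 92.21
print(f"Scores inside: {within_one_std}")             # Scores inside: [88 80 85]
```

Three of the five scores (80, 85, and 88) lie inside the interval, while the two extremes (72 and 96) fall outside it.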
Why is standard deviation important when we already have variance? Unlike variance, which is in squared units, standard deviation is expressed in the same units as the data, making it easier to interpret. Additionally, standard deviation is frequently used in statistical analysis because, for data that is approximately normally distributed, about 68% of the values fall within one standard deviation of the mean and about 95% fall within two. These percentages help us understand how data is dispersed in a probability distribution, but they do not hold for every dataset. Therefore, while variance provides numerical insight into data spread, standard deviation conveys that insight in a more comprehensible and applicable form.
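The 68% and 95% figures can be checked empirically on simulated data. The sketch below is an illustration under stated assumptions: the mean of 80, the standard deviation of 8, the sample size, and the random seed are hypothetical choices, not values from the lesson.

```python
import numpy as np

# Simulated, normally distributed scores (hypothetical data for illustration)
rng = np.random.default_rng(seed=0)
simulated = rng.normal(loc=80, scale=8, size=100_000)

mean = simulated.mean()
std = simulated.std()

# Fraction of values within one and two standard deviations of the mean
within_one = np.mean(np.abs(simulated - mean) <= std)
within_two = np.mean(np.abs(simulated - mean) <= 2 * std)

print(f"Within 1 std: {within_one:.1%}")
print(f"Within 2 std: {within_two:.1%}")
```

With a sample this large, the two printed fractions should land close to 68% and 95%. For data that is skewed or otherwise far from normal, the fractions can differ noticeably.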
Great job! You've just delved into Measures of Dispersion! These skills will assist you in better interpreting and visualizing data. Remember, hands-on practice solidifies learning. Stay tuned for some practice exercises. Now, let's dive further into exploring our data!