Ready for our next lesson? Today, we're looking at quantiles and the Interquartile Range (IQR). Quantiles divide sorted data into equal-sized parts, and the IQR shows where the middle half of the data lies. These tools help us understand the distribution of our data and identify outliers. Using Python's pandas and NumPy libraries, we'll explore how to calculate these measures.
Quantiles split sorted data into groups that each contain the same share of the observations. For example, dividing a group of student grades into four equal parts uses quartiles, which are three cut points: Q1 (25th percentile), Q2 (50th percentile, or median), and Q3 (75th percentile).
The Interquartile Range (IQR) shows where the middle half of our data lies. It's resistant to outliers: for instance, when analyzing salaries, the IQR ignores the extreme values at both ends and describes the range within which the middle 50% of salaries fall.
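To make the resistance to outliers concrete, here is a small sketch comparing the full range (maximum minus minimum) with the IQR before and after adding one extreme value. The salary figures are made up for illustration, and the helper names `data_range` and `iqr` are hypothetical.

```python
import numpy as np

# Hypothetical salaries in thousands (illustrative values only)
salaries = np.array([40, 42, 45, 47, 50, 52, 55, 58])
with_extreme = np.append(salaries, 500)  # add one extreme salary


def data_range(values):
    """Full range: largest value minus smallest value."""
    return values.max() - values.min()


def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1


print(data_range(salaries), iqr(salaries))          # 18 8.5
print(data_range(with_extreme), iqr(with_extreme))  # 460 10.0
```

A single extreme salary stretches the full range from 18 to 460, while the IQR only moves from 8.5 to 10.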
NumPy's `percentile()` function calculates quantiles.
Quantiles are essentially just cuts at specific points in your data when it's sorted in ascending order. The first quartile (Q1) is the point below which 25% of the data falls, while the third quartile (Q3) is the point below which 75% of the data falls. The second quartile or the median is the mid-point of the data when it's sorted in ascending order.
These values are important in identifying the spread and skewness of your data. Let's consider a dataset of student scores:
```python
import numpy as np

scores = np.array([76, 85, 67, 45, 89, 70, 92, 82])

# Calculate the median as the 50th percentile
median_w1 = np.percentile(scores, 50)
print(median_w1)  # Output: 79.0
# Check that it matches np.median
median_w2 = np.median(scores)
print(median_w2)  # Output: 79.0

# Calculate Q1 and Q3
Q1 = np.percentile(scores, 25)
print(Q1)  # Output: 69.25
Q3 = np.percentile(scores, 75)
print(Q3)  # Output: 86.0
```
Here, `percentile()` is used to calculate the first, second, and third quartiles. When we input 25, the function gives us the value below which 25% of the data lies, i.e., the first quartile Q1. Similarly, when we input 75, it gives the third quartile Q3. The 50th percentile is the median of the dataset.
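Note that 69.25 is not one of the scores: by default, NumPy interpolates linearly between neighbouring sorted values. As a rough sketch of that default behaviour (the helper name `linear_percentile` is hypothetical, not a NumPy function), the calculation can be reproduced by hand:

```python
import math

import numpy as np


def linear_percentile(values, p):
    """Percentile via linear interpolation between sorted neighbours."""
    ordered = sorted(values)
    # Fractional position of the p-th percentile among n sorted values
    position = (len(ordered) - 1) * p / 100
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    # Move `fraction` of the way from the lower to the upper neighbour
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


scores = [76, 85, 67, 45, 89, 70, 92, 82]
print(linear_percentile(scores, 25))  # 69.25
print(linear_percentile(scores, 75))  # 86.0

# Compare against NumPy's own result
print(np.isclose(linear_percentile(scores, 25), np.percentile(scores, 25)))  # True
```

For Q1, the position is (8 - 1) * 0.25 = 1.75, so the result lies three quarters of the way from the second sorted score (67) to the third (70).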
The Interquartile Range (IQR) is computed as `Q3 - Q1`.
```python
import pandas as pd
import numpy as np

math_scores = pd.DataFrame({
    'Name': ['Jerome', 'Jessica', 'Jeff', 'Jennifer', 'Jackie', 'Jimmy', 'Joshua', 'Julia'],
    'Score': [56, 13, 54, 48, 49, 100, 62, 55]
})

# IQR for scores
Q1 = np.percentile(math_scores['Score'], 25)
Q3 = np.percentile(math_scores['Score'], 75)
IQR = Q3 - Q1
print(IQR)  # Output: 8.75
```
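The same quartiles can also be computed directly in pandas. This sketch rebuilds the `math_scores` data and uses `Series.quantile()`, which takes fractions between 0 and 1 rather than percentages; with its default linear interpolation it should agree with `np.percentile()`.

```python
import pandas as pd

math_scores = pd.DataFrame({
    'Name': ['Jerome', 'Jessica', 'Jeff', 'Jennifer', 'Jackie', 'Jimmy', 'Joshua', 'Julia'],
    'Score': [56, 13, 54, 48, 49, 100, 62, 55]
})

# quantile() expects fractions (0.25), not percentages (25)
q1 = math_scores['Score'].quantile(0.25)
q3 = math_scores['Score'].quantile(0.75)
iqr = q3 - q1
print(q1, q3, iqr)  # 48.75 57.5 8.75
```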
The IQR represents the range within which the middle half of the scores fall. It also helps expose potential outliers, commonly defined as values that lie either below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`. For normally distributed data, these two bounds enclose roughly 99.3% of the values, so anything outside them can be treated as a potential outlier.
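The 99.3% figure can be checked with a short calculation on the standard normal distribution. This is a sketch using `NormalDist` from Python's standard `statistics` module (available from Python 3.8); it is a side check rather than part of the lesson's pandas and NumPy workflow.

```python
from statistics import NormalDist

normal = NormalDist(mu=0, sigma=1)  # standard normal distribution

# Quartiles of the standard normal
q1 = normal.inv_cdf(0.25)
q3 = normal.inv_cdf(0.75)
iqr = q3 - q1

# Outlier bounds from the 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Share of the distribution that falls between the bounds
coverage = normal.cdf(upper) - normal.cdf(lower)
print(round(upper, 2))           # 2.7
print(round(coverage * 100, 1))  # 99.3
```

In other words, for normal data the bounds sit about 2.7 standard deviations from the mean on each side.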
This boundary of 1.5 times the IQR is a generally accepted rule of thumb. It balances being overly sensitive to slight deviations in the data against not being sensitive enough to detect genuine anomalies. The rule is particularly useful when the data is large and complex and outliers are hard to spot by inspection.
Let's select and print all the outliers using the rule above. We'll use NumPy-style boolean indexing, which works just as well on a pandas Series:
```python
scores = math_scores['Score']  # to simplify the next expression
outliers_scores = scores[(scores < Q1 - 1.5 * IQR) | (scores > Q3 + 1.5 * IQR)]
print(outliers_scores)  # Outputs the scores 13 and 100 (with their index labels)
```
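If the rule is needed in more than one place, the bounds can be wrapped in a small helper. The function name `iqr_bounds` and its parameter `k` are hypothetical choices for this sketch, not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier bounds using the k * IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


scores = pd.Series([56, 13, 54, 48, 49, 100, 62, 55])
lower, upper = iqr_bounds(scores)
print(lower, upper)  # 35.625 70.625

# between() is inclusive on both ends; ~ inverts the mask
outliers = scores[~scores.between(lower, upper)]
print(outliers.tolist())  # [13, 100]
```

Keeping `k` as a parameter makes it easy to try a stricter or looser threshold than 1.5 without rewriting the filter.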
Congratulations! You've learned about two key statistical measures: quantiles and the Interquartile Range, as well as how to calculate them using Python.
In the next lesson, we'll put these concepts into practice with some hands-on exercises, since practice is the best way to master them. Happy learning!