Lesson 3

Ready for our next lesson? Today, we're delving into **quantiles** and the **Interquartile Range** (IQR). Quantiles divide sorted data into parts that each hold an equal share of the values, and the IQR shows where the middle half of the data lies. These tools help us understand how data is distributed and identify outliers. With Python's `pandas` and `NumPy` libraries, we'll explore how to calculate these measures.

Quantiles are cut points that split sorted data into groups containing equal numbers of values. For example, dividing a group of student grades into four equal parts uses the quartiles: Q1 (25th percentile), Q2 (50th percentile, the median), and Q3 (75th percentile).
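To make "four equal parts" concrete, here is a minimal sketch that assigns each grade to a quarter. The grades and the group labels are made up for illustration, and `pd.qcut` is just one convenient way to do the split:

```python
import pandas as pd

# Illustrative grades (hypothetical data).
grades = pd.Series([76, 85, 67, 45, 89, 70, 92, 82])

# qcut places the bin edges at the quartiles, so each group
# receives the same number of grades.
quarter = pd.qcut(
    grades,
    q=4,
    labels=["bottom 25%", "25-50%", "50-75%", "top 25%"],
)

# Count how many grades fall into each quarter.
counts = quarter.value_counts().sort_index()
print(counts)
```

With eight distinct grades, each quarter ends up holding two of them.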

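The **Interquartile Range** (IQR) is the distance between Q1 and Q3, so it spans the middle half of the data. Because it ignores the lowest and highest quarters, it's resistant to outliers; for instance, when analyzing salaries, a few extreme values barely affect the IQR, so it depicts the range where the central half of salaries fall.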
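The salary example can be sketched as follows. The salary figures are hypothetical, and `iqr` and `value_range` are helper names introduced only for this illustration:

```python
import numpy as np

# Hypothetical salaries in thousands (made-up numbers).
salaries = np.array([42, 45, 48, 50, 52, 55, 58, 60])
with_outlier = np.append(salaries, 500)  # add one extreme salary


def iqr(values):
    """Return Q3 - Q1 using NumPy's default linear interpolation."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1


def value_range(values):
    """Return the full range, max - min."""
    return np.max(values) - np.min(values)


print(iqr(salaries), value_range(salaries))          # 8.5 18
print(iqr(with_outlier), value_range(with_outlier))  # 10.0 458
```

One extreme salary moves the IQR from 8.5 to 10, while the full range jumps from 18 to 458.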

NumPy's `percentile()` function calculates quantiles.

Quantiles are essentially just cuts at specific points in your data when it's sorted in ascending order. The first quartile (Q1) is the point below which 25% of the data falls, while the third quartile (Q3) is the point below which 75% of the data falls. The second quartile or the median is the mid-point of the data when it's sorted in ascending order.

These values are important in identifying the spread and skewness of your data. Let's consider a dataset of student scores:

```python
import numpy as np

scores = np.array([76, 85, 67, 45, 89, 70, 92, 82])

# Calculate the median as the 50th percentile
median_w1 = np.percentile(scores, 50)
print(median_w1)  # Output: 79.0

# Check that it matches np.median
median_w2 = np.median(scores)
print(median_w2)  # Output: 79.0

# Calculate Q1 and Q3
Q1 = np.percentile(scores, 25)
print(Q1)  # Output: 69.25
Q3 = np.percentile(scores, 75)
print(Q3)  # Output: 86.0
```

Here, `percentile()` is used to calculate the 1st, 2nd and 3rd quartiles. When we input 25, the function gives us the value below which 25% of the data lies, i.e., the first quartile Q1. Similarly, when we input 75, it gives the third quartile Q3. The 50th percentile is the median of the dataset.
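As a small variation, all three quartiles can be requested in a single call, and `pandas` offers an equivalent `quantile()` method that takes fractions between 0 and 1 instead of percentages. This is a sketch that assumes the default (linear) interpolation in both libraries:

```python
import numpy as np
import pandas as pd

scores = np.array([76, 85, 67, 45, 89, 70, 92, 82])

# Passing a list of percentiles returns all three quartiles at once.
q1, q2, q3 = np.percentile(scores, [25, 50, 75])
print(q1, q2, q3)  # 69.25 79.0 86.0

# pandas expresses the same cut points as fractions between 0 and 1.
quartiles = pd.Series(scores).quantile([0.25, 0.5, 0.75])
print(quartiles.tolist())  # [69.25, 79.0, 86.0]
```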

The **Interquartile Range** (`IQR`) is computed as `Q3 - Q1`.

```python
import pandas as pd
import numpy as np

math_scores = pd.DataFrame({
    'Name': ['Jerome', 'Jessica', 'Jeff', 'Jennifer', 'Jackie', 'Jimmy', 'Joshua', 'Julia'],
    'Score': [56, 13, 54, 48, 49, 100, 62, 55]
})

# IQR for scores
Q1 = np.percentile(math_scores['Score'], 25)  # 48.75
Q3 = np.percentile(math_scores['Score'], 75)  # 57.5
IQR = Q3 - Q1
print(IQR)  # Output: 8.75
```

The IQR represents the range within which the middle half of the scores fall. It also helps expose potential outliers, defined as values that lie either below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`. For normally distributed data, these two boundaries enclose roughly `99.3`% of the values, so anything outside them can be viewed as a potential outlier.
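The 99.3% figure can be checked against a standard normal distribution. This is a minimal sketch using `statistics.NormalDist` from Python's standard library; the choice of mean 0 and standard deviation 1 is an assumption made for convenience, and the coverage is the same for any normal distribution:

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Theoretical quartiles and IQR of the distribution.
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

# Outlier boundaries from the 1.5 * IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Probability mass that lies between the two boundaries.
coverage = std_normal.cdf(upper_fence) - std_normal.cdf(lower_fence)

print(f"fences: {lower_fence:.3f} to {upper_fence:.3f}")  # fences: -2.698 to 2.698
print(f"coverage: {coverage:.4f}")                        # coverage: 0.9930
```

In other words, the boundaries sit about 2.7 standard deviations from the mean, which leaves roughly 0.7% of normally distributed values outside them.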

This boundary of `1.5` times the `IQR` is a generally accepted rule of thumb. It strikes a balance between being overly sensitive to slight deviations in the data and not being sensitive enough to detect genuine anomalies. The rule is particularly useful when a dataset is large or complex and outliers are hard to spot by observation alone.

Let's select and print all the outliers using the rule above. We will apply boolean indexing, a `NumPy` technique that works just as well on `pandas` Series:

```python
scores = math_scores['Score']  # to simplify the next expression
outliers_scores = scores[(scores < Q1 - 1.5 * IQR) | (scores > Q3 + 1.5 * IQR)]
print(outliers_scores)  # Outputs 13 and 100
```
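If the same check is needed for several columns, the rule can be wrapped in a small function. This is only a sketch: `iqr_outliers` and its `k` parameter are hypothetical names introduced here, not part of `pandas` or `NumPy`, and the DataFrame is repeated so the snippet runs on its own:

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries of `values` that fall outside the k * IQR boundaries."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    # Boolean mask: True where a value lies below the lower or above the upper boundary.
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]


math_scores = pd.DataFrame({
    'Name': ['Jerome', 'Jessica', 'Jeff', 'Jennifer', 'Jackie', 'Jimmy', 'Joshua', 'Julia'],
    'Score': [56, 13, 54, 48, 49, 100, 62, 55]
})

outliers = iqr_outliers(math_scores['Score'])
print(outliers.tolist())  # [13, 100]
```

Making `k` a parameter keeps the conventional 1.5 as the default while allowing a stricter or looser boundary when the data calls for it.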

Congratulations! You've learned about two key statistical measures: quantiles and the **Interquartile Range**, as well as how to calculate them using Python.

Next, we'll put these concepts into practice with some hands-on exercises, since practice is the best way to master them. Happy learning!