Understanding Your Data

Lesson 2

Introduction to Understanding Your Data

Understanding your data is like plotting key positions on a map before a journey. We do this with statistical quantities such as the mean, median, and mode, which summarize a dataset in a few numbers. Today, we will learn about these quantities and how to calculate them in pandas.

Basic Statistical Quantities

In data analysis, the mean, median, and mode describe the central tendency of our data. The mean is the average, the median is the middle value when the data is sorted, and the mode is the most frequent value. Variance and standard deviation measure how spread out the data is: variance is the average squared difference between each value and the mean, and standard deviation is the square root of the variance. Additionally, we use the minimum (min), maximum (max), and quantiles to understand the range and spread of the data.
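To make these definitions concrete, here is a minimal sketch that works them out by hand in plain Python. The five scores are illustrative values, and the variance divides by n - 1 (the sample variance), which is also what pandas does by default.

```python
import statistics

scores = [93, 89, 82, 88, 94]  # illustrative values

mean = statistics.mean(scores)      # sum of the values divided by their count
median = statistics.median(scores)  # middle value of the sorted list

# Sample variance: sum of squared deviations from the mean, divided by n - 1
deviations = [x - mean for x in scores]
variance = sum(d ** 2 for d in deviations) / (len(scores) - 1)

# Standard deviation: square root of the variance
std_dev = variance ** 0.5

print('Mean:', mean)
print('Median:', median)
print('Variance:', round(variance, 4))
print('Standard deviation:', round(std_dev, 4))
```

Because the standard deviation is expressed in the same units as the data, it is usually easier to interpret than the variance.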

As a reminder, quantiles are cut points that divide sorted data into groups holding equal shares of the values, which helps us understand its distribution. One example is quartiles, which divide data into 4 equally sized groups. For instance, the first quartile is the value below which roughly 25% of the data falls.
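As a small illustration of quartiles, the sketch below asks pandas for the three quartile cut points of a short Series. The values are illustrative. By default, pandas interpolates linearly when a quantile falls between two data points.

```python
import pandas as pd

values = pd.Series([82, 88, 89, 93, 94])  # illustrative values

# Passing a list of fractions returns one cut point per fraction
quartiles = values.quantile([0.25, 0.5, 0.75])
print(quartiles)

# The 0.5 quantile (second quartile) is the median
print(values.quantile(0.5) == values.median())
```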

Calculation of Statistical Quantities in Pandas

Knowing our destinations, let's see how to reach them using pandas! For a DataFrame or Series object named data, we compute these quantities using:

mean: data.mean()
median: data.median()
mode: data.mode()
standard deviation: data.std()
variance: data.var()
min: data.min()
max: data.max()
quantile: data.quantile(q), where q is a fraction between 0 and 1, such as 0.25 or 0.5
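The methods listed above work on a whole DataFrame as well as on a single column. The sketch below uses a small hypothetical DataFrame (the column names and values are made up for illustration). Depending on the pandas version, calling a method such as mean() on a DataFrame that also holds text columns may raise an error, so this example passes numeric_only=True to restrict the calculation to numeric columns.

```python
import pandas as pd

# Hypothetical data, for illustration only
people = pd.DataFrame({
    'name': ['Ann', 'Ben', 'Cy'],
    'height': [160, 170, 180],
    'weight': [55, 70, 70],
})

# On a DataFrame, each method returns one value per numeric column
col_means = people.mean(numeric_only=True)
col_medians = people.median(numeric_only=True)
print(col_means)
print(col_medians)

# On a single column (a Series), the same method returns one number
print(people['height'].mean())
```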

Let's calculate these for a sample DataFrame:

Python
import pandas as pd

# DataFrame creation
data = pd.DataFrame({
    'friends': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eve'],
    'scores': [93, 89, 82, 88, 94],
    'age': [20, 21, 20, 19, 21]
})

# Print statistics for the 'scores' column
print('Mean:', data['scores'].mean())  # 89.2
print('Median:', data['scores'].median())  # 89.0
# Every score is unique, so mode() returns all five values in sorted order;
# [0] simply picks the first (smallest) of them
print('Mode:', data['scores'].mode()[0])  # 82
print('Standard Deviation:', data['scores'].std())  # about 4.7645 (sample standard deviation)
print('Variance:', data['scores'].var())  # about 22.7 (sample variance)
print('Min:', data['scores'].min())  # 82
print('Max:', data['scores'].max())  # 94
print('25% Quantile:', data['scores'].quantile(0.25))  # 88.0
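One detail worth a closer look is that mode() returns a Series rather than a single number, because several values can tie for most frequent. The sketch below reuses the ages and scores from the sample DataFrame as standalone Series to show both cases.

```python
import pandas as pd

ages = pd.Series([20, 21, 20, 19, 21])
scores = pd.Series([93, 89, 82, 88, 94])

# 20 and 21 each appear twice, so both are reported as modes
age_modes = ages.mode()
print(age_modes.tolist())

# No score repeats, so every value ties and all are returned in sorted order
score_modes = scores.mode()
print(score_modes.tolist())
```

When a single representative value is needed, indexing with [0] takes the smallest of the tied modes, which may or may not be a meaningful choice for your data.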

DataFrame Describe Function

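pandas provides describe(), which computes most of these statistical quantities (count, mean, standard deviation, min, max, and the quartiles) for each numeric DataFrame column. By default, non-numeric columns such as friends are skipped. Here's how it looks: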

Python
# Describe function usage
print(data.describe())
'''
          scores       age
count   5.000000   5.00000
mean   89.200000  20.20000
std     4.764452   0.83666
min    82.000000  19.00000
25%    88.000000  20.00000
50%    89.000000  20.00000
75%    93.000000  21.00000
max    94.000000  21.00000
'''
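Since describe() returns a DataFrame, individual statistics can be looked up by row and column label. The sketch below rebuilds the numeric part of the sample data so that it runs on its own, then reads single values from the summary and requests custom percentiles through the percentiles parameter.

```python
import pandas as pd

data = pd.DataFrame({
    'scores': [93, 89, 82, 88, 94],
    'age': [20, 21, 20, 19, 21]
})

summary = data.describe()

# Rows are statistic names, columns are the original column names
mean_score = summary.loc['mean', 'scores']
max_age = summary.loc['max', 'age']
print(mean_score, max_age)

# Request the 10th and 90th percentiles instead of the default quartiles
custom = data.describe(percentiles=[0.1, 0.9])
print(custom.index.tolist())
```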

Lesson Summary and Practice

Today, you learned about the mean, median, mode, standard deviation, variance, min, max, and quantiles, and how to calculate them using pandas. You also saw describe() summarize a sample DataFrame in a single call.

However, learning doesn't stop at understanding but is solidified in practice. So, get set for some hands-on reinforcement through exercises. Happy exploring!
