Lesson 2

Understanding Your Data

Introduction to Understanding Your Data

Understanding your data is like plotting key positions on a map for a journey. How do we do it? Through statistical quantities like mean, median, and mode. These tell us more about our data. Today, we will learn about these quantities and how to calculate them in pandas.

Basic Statistical Quantities

In data analysis, mean, median, and mode help us understand the central tendency of our data. The mean is the average, the median is the middle value when the data is sorted, and the mode is the most frequent value. Variance and standard deviation tell us how widely the values are spread around the mean: variance is the average squared difference between each value and the mean, and standard deviation is its square root. Additionally, we use minimum (min), maximum (max), and quantiles to understand the range and spread of the data.
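
To make these definitions concrete, here is a minimal sketch that works them out step by step from the formulas. The five scores are illustrative values, and the variance divides by n - 1 (the sample variance), which is also the default that pandas' var() and std() use.

```python
import pandas as pd

# A small illustrative sample of five test scores
scores = pd.Series([93, 89, 82, 88, 94])
n = len(scores)

# Mean: sum of the values divided by how many there are
mean = scores.sum() / n

# Median: middle value of the sorted data (n is odd here)
median = scores.sort_values().iloc[n // 2]

# Sample variance: squared deviations from the mean, summed, divided by n - 1
variance = ((scores - mean) ** 2).sum() / (n - 1)

# Standard deviation: square root of the variance
std = variance ** 0.5

print(mean, median, variance, std)
```

Because the standard deviation is the square root of the variance, it is expressed in the same units as the data, which usually makes it the easier of the two to interpret.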

As a reminder, quantiles are cut points that divide sorted data into equal-sized groups and help us understand its distribution. Quartiles are one example: they divide the data into four equally sized groups. For instance, the first quartile is the value that is greater than roughly 25% of the data.
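
As a quick check of the first-quartile claim, the sketch below uses a hypothetical Series holding the integers 1 through 100 and pandas' quantile() method with its default linear interpolation. It then measures what share of the values falls below the first quartile.

```python
import pandas as pd

# Hypothetical data: the integers 1 through 100
values = pd.Series(range(1, 101))

# The three quartiles: 25%, 50% (the median), and 75%
q1 = values.quantile(0.25)
q2 = values.quantile(0.50)
q3 = values.quantile(0.75)
print(q1, q2, q3)

# Share of the values that lie strictly below the first quartile
share_below_q1 = (values < q1).mean()
print(share_below_q1)  # 0.25
```

With very small data sets the share will not be exactly 25%, because there are too few values to split evenly.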

Calculation of Statistical Quantities in Pandas

Knowing our destinations, let's see how to reach them using pandas! For a DataFrame or Series object named data, we compute these quantities using:

  • mean = data.mean(),
  • median = data.median(),
  • mode = data.mode(),
  • standard deviation = data.std(),
  • variance = data.var(),
  • min = data.min(),
  • max = data.max(),
  • quantile = data.quantile(q), where q is a fraction between 0 and 1, such as 0.25 or 0.5.
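
When called on a DataFrame rather than a Series, these methods work column by column and return one value per column. A minimal sketch, assuming a small hypothetical DataFrame that contains only numeric columns:

```python
import pandas as pd

# Hypothetical DataFrame with two numeric columns
numbers = pd.DataFrame({
    'scores': [93, 89, 82, 88, 94],
    'age': [20, 21, 20, 19, 21]
})

# Each call returns a Series indexed by column name
column_means = numbers.mean()
column_maxes = numbers.max()

print(column_means)
print(column_maxes)
```

If a DataFrame also holds text columns, selecting the numeric columns first (for example, df[['scores', 'age']]) keeps these calls well defined.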

Let's calculate these for a sample DataFrame:

Python
import pandas as pd

# DataFrame creation
data = pd.DataFrame({
    'friends': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eve'],
    'scores': [93, 89, 82, 88, 94],
    'age': [20, 21, 20, 19, 21]
})

# Print statistics
print('Mean:', data['scores'].mean())  # 89.2
print('Median:', data['scores'].median())  # 89.0
# Every score is unique, so mode() returns all of them in sorted order; [0] is the smallest
print('Mode:', data['scores'].mode()[0])  # 82
print('Standard Deviation:', data['scores'].std())  # approx. 4.764452
print('Variance:', data['scores'].var())  # approx. 22.7
print('Min:', data['scores'].min())  # 82
print('Max:', data['scores'].max())  # 94
print('25% Quantile:', data['scores'].quantile(0.25))  # 88.0

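One detail deserves a closer look: mode() returns a Series rather than a single number, because several values can tie for most frequent. The sketch below reuses the lesson's scores and age values as standalone Series to show both cases.

```python
import pandas as pd

scores = pd.Series([93, 89, 82, 88, 94])
ages = pd.Series([20, 21, 20, 19, 21])

# Every score occurs once, so all five values tie and are returned in sorted order
score_modes = scores.mode().tolist()
print(score_modes)  # [82, 88, 89, 93, 94]

# 20 and 21 each occur twice, so both are modes
age_modes = ages.mode().tolist()
print(age_modes)  # [20, 21]
```

This is why indexing with [0] only picks the smallest of the tied values; when ties matter, it is safer to inspect the whole result.
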
DataFrame Describe Function

pandas provides describe(), which computes these statistical quantities for each DataFrame column. Here's how it looks:

Python
# Describe function usage
print(data.describe())
'''
          scores       age
count   5.000000   5.00000
mean   89.200000  20.20000
std     4.764452   0.83666
min    82.000000  19.00000
25%    88.000000  20.00000
50%    89.000000  20.00000
75%    93.000000  21.00000
max    94.000000  21.00000
'''

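The result of describe() is itself a DataFrame, so individual statistics can be looked up by row and column label. The sketch below rebuilds the lesson's sample data and reads a few values back; note that by default describe() summarizes only the numeric columns of a mixed DataFrame, which is why friends does not appear.

```python
import pandas as pd

data = pd.DataFrame({
    'friends': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eve'],
    'scores': [93, 89, 82, 88, 94],
    'age': [20, 21, 20, 19, 21]
})

summary = data.describe()

# Only the numeric columns are summarized by default
print(list(summary.columns))  # ['scores', 'age']

# Look up single statistics by row label and column name
mean_score = summary.loc['mean', 'scores']
median_age = summary.loc['50%', 'age']
print(mean_score, median_age)
```
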
Lesson Summary and Practice

Today, you learned about mean, median, mode, standard deviation, variance, min, max, and quantiles, and how to calculate them using pandas. You also saw describe() applied to a sample DataFrame.

Learning doesn't stop at understanding, though; it is solidified through practice. So, get set for some hands-on reinforcement through exercises. Happy exploring!
