In this lesson, we'll keep exploring the power of the groupby
function in the Pandas library. Groupby
is a crucial tool for data analysis, allowing us to split data into different groups and then apply aggregates to those groups. This can be very useful in numerous real-life applications, such as summarizing sales data by product and region or understanding passenger statistics in a Titanic dataset.
Our goal today is to understand how to use the groupby
function in Pandas for more advanced, multi-level aggregations. We'll work through an example involving grouping by multiple columns and applying multiple aggregation functions to several fields.
Before diving into complex groupby
operations, let's review the basics. The groupby
function in Pandas is used to split the data into groups based on some criteria. You can then apply various aggregation functions to these groups.
Let's start with a basic example. Suppose we have a simple dataset about students and their scores.
Python1import pandas as pd 2 3# Simple dataset 4data = { 5 'student': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'], 6 'subject': ['Math', 'Math', 'Math', 'English', 'English', 'English'], 7 'score': [85, 90, 95, 88, 93, 97] 8} 9 10df = pd.DataFrame(data) 11 12# Basic groupby operation 13grouped = df.groupby('student')['score'].mean() 14print("\nAverage score per student:") 15print(grouped) 16# Output: 17# student 18# Alice 86.5 19# Bob 91.5 20# Charlie 96.0 21# Name: score, dtype: float64
In this example, we grouped the DataFrame by student
and calculated the mean score for each student. This is a fundamental operation that helps in summarizing the data efficiently.
Now that we understand the basics, let's move on to more complex groupby
operations. Sometimes, you might want to group data by multiple columns. For instance, in the Titanic dataset, you might want to analyze data based on both the class of the passenger and the town they embarked from.
Grouping by multiple columns allows for more detailed summaries and insights from the data. Consider the following example: We group the Titanic dataset by class
and embark_town
and then apply multiple aggregation functions to different columns.
Python1import seaborn as sns 2 3titanic = sns.load_dataset('titanic') 4 5# Detailed grouping with multiple aggregations 6grouped_details = titanic.groupby(['class', 'embark_town'], observed=True).agg({ 7 'fare': ['mean', 'max', 'min'], 8 'age': ['mean', 'std', 'count'] 9}) 10print(grouped_details)
Note the observed=True
parameter. By default, groupby
includes all possible combinations of the grouping columns, even if some combinations do not appear in the data. For example, imagine there are no passengers of the first class embarking from the "Queenstown". Though this combination is possible, it won't show up in the dataset.
Setting observed=True
ensures the result only includes the combinations observed in the data, which can make the output more concise and easier to interpret. Also, in the future versions of pandas, the observed
will be equal to True
by default.
Let's break down this example step-by-step:
-
Group by Multiple Columns:
titanic.groupby(['class', 'embark_town'])
- We first group the data by
class
andembark_town
. This means that we will have a separate group for each combination of class and embarkation town.
- We first group the data by
-
Apply Different Aggregations:
.agg({ ... })
- Inside the
agg
function, we specify the columns and the aggregation functions we want to apply. For thefare
column, we calculate the mean, maximum, and minimum values. For theage
column, we calculate the mean, standard deviation, and count.
- Inside the
This approach provides a detailed summary of the data, allowing us to understand various aspects of each group.
Here is the obtained output:
1 fare age 2 mean max min mean std count 3class embark_town 4First Cherbourg 104.718529 512.3292 26.5500 38.027027 14.243454 74 5 Queenstown 90.000000 90.0000 90.0000 38.500000 7.778175 2 6 Southampton 70.364862 263.0000 0.0000 38.152037 15.315584 108 7Second Cherbourg 25.358335 41.5792 12.0000 22.766667 10.192551 15 8 Queenstown 12.350000 12.3500 12.3500 43.500000 19.091883 2 9 Southampton 20.327439 73.5000 0.0000 30.386731 14.080001 156 10Third Cherbourg 11.214083 22.3583 4.0125 20.741951 11.712367 41 11 Queenstown 11.183393 29.1250 6.7500 25.937500 16.807938 24 12 Southampton 14.644083 69.5500 0.0000 25.696552 12.110906 290
The output shows fare
and age
statistics, grouped by class
and embark_town
. Each row represents a group, which is a unique combination of a class and an embark town. For example, the first row is the passengers of the First
class with embark town Cherbourg
. Columns show:
- Fare:
mean
,max
,min
- Age:
mean
,std
(standard deviation),count
Here are some examples of Insights we could obtain from this table:
- High-Cost Tickets: The maximum fare for First Class passengers from Cherbourg is significant (512.3292), indicating some very expensive tickets.
- Age Range: The age distribution for First Class passengers from Southampton has a high standard deviation (15.315584), suggesting a wide age range.
- Passenger Numbers: Most Third Class passengers embarked from Southampton (290), a higher count than from Cherbourg (41) or Queenstown (24).
Such detailed groupby
operations are useful in many real-life scenarios. For instance:
- Sales Analysis: Grouping sales data by region and product category to find average, maximum, and minimum sales along with the number of sales transactions.
- Customer Segmentation: Analyzing customer data by age group and region to understand spending patterns and customer distribution.
- Healthcare Data: Grouping patient data by disease type and hospital to find average treatment costs, maximum and minimum costs, and the number of patients treated.
By performing these complex groupby
operations, you can extract meaningful insights and make informed decisions based on the data.
In this lesson, we covered the following key points:
- The basics of
groupby
in Pandas. - How to perform complex
groupby
operations using multiple columns and applying multiple aggregation functions. - Practical use-cases where such detailed
groupby
operations are valuable. - Common pitfalls and tips for efficient and error-free coding.
By mastering these groupby
techniques, you will be able to perform more advanced data analysis and extract deeper insights from your datasets.
Now that you have a good understanding of complex groupby
operations, it's time to put theory into practice! In the upcoming practice session, you will apply these concepts to different datasets and tasks. This hands-on experience will reinforce your learning and help you become proficient in using groupby
for advanced data analysis. Let's get started with some exercises!