Lesson 2

Mastering DataFrame Grouping in Pandas

Introduction to Grouping in pandas

Hello! In this lesson, we will explore the concept of grouping in pandas. Grouping is a powerful tool that lets you split the rows of a DataFrame into groups based on the values in one or more columns, and then aggregate each group. For example, say you have a DataFrame representing all orders in an online store, where each row is an order. You could group the DataFrame by Customer_ID to analyze the orders of each shopper. Similarly, if you have a school database where each row corresponds to a student, grouping by Grade_Level lets you analyze each grade separately. We'll demonstrate this operation on a small DataFrame.

Sample Dataset

Let's work with a DataFrame of individuals, each characterized by attributes such as Name, Age, and City. Here is a simple example:

Python
import pandas as pd
df = pd.DataFrame({
    "Name": ["Alex", "Bob", "Chloe", "Charlie", "Alex", "Charlie"],
    "Age": [12, 15, 28, 55, 21, 35],
    "City": ["New York", "Los Angeles", "Chicago", "Los Angeles", "New York", "New York"]
})

Grouping

The groupby method in pandas is the basis of group operations. Imagine a pond filled with various types of fish. When differently colored foods are dropped into the pond, each type of fish is drawn to one specific color. After a while, the fish have sorted themselves into groups, one group per type.

We take the same approach with our DataFrame. Using the pandas groupby method, we will group our data by City. Here's how it's done:

Python
grouped = df.groupby("City")

Now, the grouped variable is a special pandas DataFrameGroupBy object that has divided our DataFrame into groups by city.
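To make the grouping object a little less abstract, here is a minimal sketch that inspects it. It rebuilds the lesson's sample DataFrame so it can run on its own; the variable names type_name, n_groups, and group_sizes are ours, chosen only for illustration.

```python
import pandas as pd

# Same sample data as in the lesson.
df = pd.DataFrame({
    "Name": ["Alex", "Bob", "Chloe", "Charlie", "Alex", "Charlie"],
    "Age": [12, 15, 28, 55, 21, 35],
    "City": ["New York", "Los Angeles", "Chicago", "Los Angeles", "New York", "New York"],
})

grouped = df.groupby("City")

# The result is a grouping object rather than a new DataFrame.
type_name = type(grouped).__name__
print(type_name)

# ngroups is the number of distinct values found in the City column.
n_groups = grouped.ngroups
print(n_groups)

# size() counts how many rows fell into each group.
group_sizes = grouped.size()
print(group_sizes)
```

With this data, we would expect three groups, one per city, with New York being the largest.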

Exploring DataFrameGroupBy Object

The DataFrameGroupBy object holds the groups created using groupby. It's like a dictionary: each key is a unique city from our City column, and the corresponding value for each key is a DataFrame comprising all rows with that city in the City column.
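The dictionary comparison is an analogy rather than a literal description, so it can help to see both sides of it. The sketch below, which assumes the same sample data, first looks at the groups attribute (it maps each key to the row labels in that group, not to a DataFrame) and then builds a real dictionary of DataFrames by iterating over the groups. The names group_labels and frames_by_city are illustrative.

```python
import pandas as pd

# Same sample data as in the lesson.
df = pd.DataFrame({
    "Name": ["Alex", "Bob", "Chloe", "Charlie", "Alex", "Charlie"],
    "Age": [12, 15, 28, 55, 21, 35],
    "City": ["New York", "Los Angeles", "Chicago", "Los Angeles", "New York", "New York"],
})

grouped = df.groupby("City")

# .groups maps each city to the index labels of the rows in that group.
group_labels = {city: list(labels) for city, labels in grouped.groups.items()}
print(group_labels)

# Iterating yields (key, DataFrame) pairs, so a dict comprehension
# gives us an actual dictionary of per-city DataFrames.
frames_by_city = {city: frame for city, frame in grouped}
print(frames_by_city["Los Angeles"])
```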

For instance, to view the data for "New York" only, pandas offers the get_group method:

Python
print(grouped.get_group("New York"))
'''Output:
      Name  Age      City
0     Alex   12  New York
4     Alex   21  New York
5  Charlie   35  New York
'''

And there you have it! The function returns a DataFrame containing only the individuals whose City is "New York".
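If get_group feels a bit magical, it may help to compare it with a plain boolean filter. The following sketch, again assuming the lesson's sample data, selects the New York rows both ways; the names via_get_group and via_mask are ours.

```python
import pandas as pd

# Same sample data as in the lesson.
df = pd.DataFrame({
    "Name": ["Alex", "Bob", "Chloe", "Charlie", "Alex", "Charlie"],
    "Age": [12, 15, 28, 55, 21, 35],
    "City": ["New York", "Los Angeles", "Chicago", "Los Angeles", "New York", "New York"],
})

# Select one group through the groupby object...
via_get_group = df.groupby("City").get_group("New York")

# ...and select the same rows with an ordinary boolean mask.
via_mask = df[df["City"] == "New York"]

print(via_get_group)
print(via_mask)

# Both selections should contain the same row labels.
print(via_get_group.index.tolist() == via_mask.index.tolist())
```

For a single lookup the mask is often all you need; get_group becomes convenient once you have already built the grouping for other purposes.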

Applying Aggregation Functions after Grouping

But we're not done yet! After dividing our DataFrame into groups, we can perform operations on each group separately. A typical group operation uses an aggregate function, which takes a group of values and reduces it to a single result. Common examples include the sum, the average (mean), the maximum value (max), and the minimum value (min) of a group.

After grouping, we can calculate the average age of inhabitants for each city:

Python
print(grouped['Age'].mean())
'''Output:
City
Chicago        28.000000
Los Angeles    35.000000
New York       22.666667
Name: Age, dtype: float64
'''

Note how pandas skillfully calculates the mean for each group separately!
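The other aggregate functions mentioned above work the same way, and you can request several at once. Here is a small sketch using the agg method with a list of function names on the lesson's sample data; the variable name age_stats is illustrative.

```python
import pandas as pd

# Same sample data as in the lesson.
df = pd.DataFrame({
    "Name": ["Alex", "Bob", "Chloe", "Charlie", "Alex", "Charlie"],
    "Age": [12, 15, 28, 55, 21, 35],
    "City": ["New York", "Los Angeles", "Chicago", "Los Angeles", "New York", "New York"],
})

# Compute several aggregates of Age per city in one call.
# The result is a DataFrame with one row per city and one column per function.
age_stats = df.groupby("City")["Age"].agg(["min", "max", "mean", "sum"])
print(age_stats)
```

Each row of the result summarizes one city, so for Los Angeles we would expect a minimum of 15, a maximum of 55, and a sum of 70.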

Iterating Through Groups

Imagine we want to do something more complex than simply calculating the mean value for each group. In this case, we might want to iterate through all the groups. A regular for loop does the job: each iteration yields the group's key together with a DataFrame holding that group's rows:

Python
for name, group in grouped:
    print("\nCity:", name)
    print("Number of people:", len(group))
    print("Average age:", group["Age"].mean())
'''Output:
City: Chicago
Number of people: 1
Average age: 28.0

City: Los Angeles
Number of people: 2
Average age: 35.0

City: New York
Number of people: 3
Average age: 22.666666666666668
'''
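
As one illustration of "something more complex", the loop can also collect a result per group instead of printing it. The sketch below, assuming the same sample data, records the oldest person in each city; oldest_by_city and oldest_row are names we made up for this example.

```python
import pandas as pd

# Same sample data as in the lesson.
df = pd.DataFrame({
    "Name": ["Alex", "Bob", "Chloe", "Charlie", "Alex", "Charlie"],
    "Age": [12, 15, 28, 55, 21, 35],
    "City": ["New York", "Los Angeles", "Chicago", "Los Angeles", "New York", "New York"],
})

oldest_by_city = {}
for city, group in df.groupby("City"):
    # idxmax() returns the row label of the largest Age within this group.
    oldest_row = group.loc[group["Age"].idxmax()]
    oldest_by_city[city] = (oldest_row["Name"], int(oldest_row["Age"]))

print(oldest_by_city)
```

Because each group is an ordinary DataFrame, anything you can do with a DataFrame is available inside the loop.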
Lesson Summary

Bravo! Now, you are adept at grouping rows in a pandas DataFrame and applying operations to the groups. You've undoubtedly enhanced your ability to simplify data and extract critical insights!

Next, we've arranged hands-on exercises for you to practice your newly developed grouping skills. Let's get started!
