Hello! In this lesson, we will explore grouping in pandas. Grouping the rows of a DataFrame is a powerful tool that lets you aggregate rows based on the values in one or more columns. For example, say you have a DataFrame representing all orders in an online store, where each row is an order. You could group the DataFrame by `Customer_ID` to glean data about particular shoppers. Similarly, if you have a school database where each row corresponds to a student, grouping by `Grade_Level` can streamline data analysis for each grade. We'll demonstrate this operation using a straightforward DataFrame.
Let's work with a DataFrame of individuals, each characterized by attributes such as `Name`, `Age`, and `City`. Here is a simple example:
```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alex", "Bob", "Chloe", "Charlie", "Alex", "Charlie"],
    "Age": [12, 15, 28, 55, 21, 35],
    "City": ["New York", "Los Angeles", "Chicago", "Los Angeles", "New York", "New York"]
})
```
The `groupby` method in pandas is the basis of group operations. Imagine a pond filled with various types of fish. When different colored foods are dropped into the pond, each type of fish is drawn to a specific color. After some time, the fish in your pond will have neatly sorted themselves into groups by type.
We take the same approach with our DataFrame. Using the pandas `groupby` method, we will group our data by `City`. Here's how it's done:
```python
grouped = df.groupby("City")
```
Now, the `grouped` variable is a special pandas `DataFrameGroupBy` object that has divided our DataFrame into groups by city.
The `DataFrameGroupBy` object holds the groups created using `groupby`. It's like a dictionary: each key is a unique city from our `City` column, and the corresponding value for each key is a DataFrame comprising all rows with that city in the `City` column.
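The dictionary comparison can be explored through the `groups` attribute, sketched below. One caveat worth noting: `groups` maps each key to the row labels belonging to that group rather than to a DataFrame, so treat the dictionary picture as an analogy. The names `city_keys` and `new_york_rows` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alex", "Bob", "Chloe", "Charlie", "Alex", "Charlie"],
    "Age": [12, 15, 28, 55, 21, 35],
    "City": ["New York", "Los Angeles", "Chicago", "Los Angeles", "New York", "New York"],
})

grouped = df.groupby("City")

# The keys are the unique values of the City column
city_keys = sorted(grouped.groups.keys())
print(city_keys)

# Each value holds the index labels of the rows in that group
new_york_rows = list(grouped.groups["New York"])
print(new_york_rows)
```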
For instance, to view data for "New York" only, pandas offers the `get_group` method:
```python
print(grouped.get_group("New York"))
'''Output:
      Name  Age      City
0     Alex   12  New York
4     Alex   21  New York
5  Charlie   35  New York
'''
```
And there you have it! The call returns a DataFrame containing only the individuals whose `City` is "New York".
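Since grouping can use more than one column, here is a small sketch of the same lookup with two grouping columns. Grouping by both `City` and `Name` is an illustrative choice, not something the lesson's dataset requires; in that case the group key is a tuple.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alex", "Bob", "Chloe", "Charlie", "Alex", "Charlie"],
    "Age": [12, 15, 28, 55, 21, 35],
    "City": ["New York", "Los Angeles", "Chicago", "Los Angeles", "New York", "New York"],
})

# Group by two columns at once: each group is a (City, Name) combination
grouped_two = df.groupby(["City", "Name"])

# With several grouping columns, the key passed to get_group is a tuple
alex_in_new_york = grouped_two.get_group(("New York", "Alex"))
print(alex_in_new_york)
```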
But we're not done yet! After dividing our DataFrame into groups, we can perform operations on each group separately. A typical group operation uses aggregate functions, which take a group of values and return a single result. Common examples include taking the sum (`sum`), average (`mean`), maximum value (`max`), or minimum value (`min`) of a group.
After grouping, we can calculate the average age of inhabitants for each city:
```python
print(grouped['Age'].mean())
'''Output:
City
Chicago        28.000000
Los Angeles    35.000000
New York       22.666667
'''
```
Note how pandas calculates the `mean` for each group separately!
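The other aggregate functions mentioned earlier work the same way. As a minimal sketch, the `agg` method can compute several of them in one call; the variable name `age_summary` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alex", "Bob", "Chloe", "Charlie", "Alex", "Charlie"],
    "Age": [12, 15, 28, 55, 21, 35],
    "City": ["New York", "Los Angeles", "Chicago", "Los Angeles", "New York", "New York"],
})

grouped = df.groupby("City")

# Compute several aggregates of Age per city in a single call.
# The result is a DataFrame with one row per city and one column per aggregate.
age_summary = grouped["Age"].agg(["min", "max", "mean", "sum"])
print(age_summary)
```

Passing a list of function names is convenient when you want a compact summary table instead of calling each aggregate separately.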
Imagine we want to do something more complex than simply calculating the `mean` value for each group. In this case, we might want to iterate through all the groups. This is easily done with a regular `for` loop:
```python
for name, group in grouped:
    print("\nCity:", name)
    print("Number of people:", len(group))
    print("Average age:", group["Age"].mean())
'''Output:
City: Chicago
Number of people: 1
Average age: 28.0

City: Los Angeles
Number of people: 2
Average age: 35.0

City: New York
Number of people: 3
Average age: 22.666666666666668
'''
```
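For a summary as simple as this one, a loop is not strictly necessary. The sketch below produces the same two figures per city with named aggregation; the output column names `people` and `average_age` are illustrative, and `count` here counts non-missing `Name` values, which matches the row count only because this dataset has no missing names.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alex", "Bob", "Chloe", "Charlie", "Alex", "Charlie"],
    "Age": [12, 15, 28, 55, 21, 35],
    "City": ["New York", "Los Angeles", "Chicago", "Los Angeles", "New York", "New York"],
})

# Named aggregation: each keyword becomes an output column,
# defined by a (source column, aggregate function) pair.
city_summary = df.groupby("City").agg(
    people=("Name", "count"),
    average_age=("Age", "mean"),
)
print(city_summary)
```

The loop remains the more flexible option when each group needs custom logic that does not reduce to a single aggregate.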
Bravo! You now know how to group rows in a pandas DataFrame and apply operations to the groups, which makes it much easier to summarize data and extract insights from it.
Next, we've arranged hands-on exercises for you to practice your newly developed grouping skills. Let's get started!