Lesson 3

In this lesson, we'll explore **data aggregation**, a powerful tool in data analysis. Aggregation helps you summarize and simplify large sets of data to gain insights quickly. By the end, you'll know how to use data aggregation techniques to find specific information about groups in your dataset.

Data aggregation involves combining, summarizing, or consolidating data points into a single representation. Imagine you have a large set of student test scores. Instead of looking at every individual score, you might want to know the average score for each class. This simplifies your data and helps you see the bigger picture.

Common functions used in data aggregation include:

- **Maximum (`max`)**: Finds the highest value in a group.
- **Mean (`mean`)**: Calculates the average value of a group.
- **Sum (`sum`)**: Calculates the total of all values in a group.
- **Standard Deviation (`std`)**: Measures the dispersion or spread of values in a group.

Let's start with a simple example:

```python
scores = [89, 76, 92, 54, 88]
print("Maximum score:", max(scores))  # Maximum score: 92
print("Average score:", sum(scores) / len(scores))  # Average score: 79.8
```

Here, `max` gives us the highest score, and calculating the average (mean) summarizes the test scores in a single number.
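The other two functions from the list, sum and standard deviation, can be tried on the same scores. The sketch below uses Python's standard `statistics` module rather than Pandas, which is an assumption made only to keep the example dependency-free. Note that there are two common definitions of standard deviation: the sample version (dividing by n - 1) and the population version (dividing by n).

```python
import statistics

scores = [89, 76, 92, 54, 88]

# Sum: the total of all values.
total = sum(scores)

# Standard deviation: how spread out the values are around the mean.
sample_std = statistics.stdev(scores)       # sample version, divides by n - 1
population_std = statistics.pstdev(scores)  # population version, divides by n

print("Total score:", total)  # Total score: 399
print("Sample standard deviation:", round(sample_std, 2))
print("Population standard deviation:", round(population_std, 2))
```

Pandas' `std` uses the sample version by default, so it corresponds to `statistics.stdev` here.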

**Pandas** offers an easy way to perform data aggregation using its `groupby` and `agg` methods. Here’s how you can use them:

- `groupby`: Splits the data into groups based on some criteria.
- `agg`: Applies one or more aggregation functions to these groups.
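To build intuition for these two steps, here is a rough plain-Python sketch of the same split-then-aggregate idea. The class names and scores are made up for illustration, and this is only a conceptual model, not how Pandas works internally.

```python
from collections import defaultdict

# Hypothetical (class, score) records, invented for this illustration.
records = [
    ("Class 1", 89),
    ("Class 1", 76),
    ("Class 2", 92),
    ("Class 2", 54),
    ("Class 2", 88),
]

# "groupby" step: split the scores into one list per class.
groups = defaultdict(list)
for class_name, score in records:
    groups[class_name].append(score)

# "agg" step: reduce each group's list to summary values.
summary = {
    class_name: {"max": max(scores), "mean": sum(scores) / len(scores)}
    for class_name, scores in groups.items()
}

for class_name, stats in summary.items():
    print(class_name, stats)
# Class 1 {'max': 89, 'mean': 82.5}
# Class 2 {'max': 92, 'mean': 78.0}
```

Pandas performs both steps for you and returns the result as a table.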

Let’s see this in action.

We'll use an example dataset containing information about different products sold in various stores. Here's a small sample:

```python
import pandas as pd

data = {
    'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C'],
    'product': ['Apples', 'Bananas', 'Apples', 'Bananas', 'Apples'],
    'units_sold': [30, 50, 40, 35, 90],
    'price': [1.20, 0.50, 1.00, 0.50, 1.30]
}

df = pd.DataFrame(data)
print(df)
```

```
     store  product  units_sold  price
0  Store A   Apples          30   1.20
1  Store A  Bananas          50   0.50
2  Store B   Apples          40   1.00
3  Store B  Bananas          35   0.50
4  Store C   Apples          90   1.30
```

Let’s find the maximum units sold and the average price of products by store using Pandas.

**Create the Aggregation Dictionary**:

- This maps column names to their aggregation functions.

```python
agg_funcs = {'units_sold': 'max', 'price': 'mean'}
```

**Group Data by Store and Apply the Aggregation Functions**:

```python
result = df.groupby('store').agg(agg_funcs)
print(result)
```

```
         units_sold  price
store
Store A          50   0.85
Store B          40   0.75
Store C          90   1.30
```

Breaking this down:

- `groupby('store')` groups rows by the `store` column.
- `agg(agg_funcs)` applies the specified functions (`max` for `units_sold`, `mean` for `price`) to each group.

This tells us that in Store A, the most units sold for any product was 50, and the average price of products in that store is $0.85.
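Since `agg` can apply more than one function, the same grouping can produce several summaries at once. The sketch below uses Pandas' named aggregation syntax on the same sample data; the output column names (`max_units`, `total_units`, `avg_price`) are chosen here for illustration and are not part of the lesson's dataset.

```python
import pandas as pd

# Same sample data as in the lesson.
df = pd.DataFrame({
    'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C'],
    'product': ['Apples', 'Bananas', 'Apples', 'Bananas', 'Apples'],
    'units_sold': [30, 50, 40, 35, 90],
    'price': [1.20, 0.50, 1.00, 0.50, 1.30],
})

# Named aggregation: each keyword becomes an output column,
# given as a tuple of (input column, aggregation function).
summary = df.groupby('store').agg(
    max_units=('units_sold', 'max'),
    total_units=('units_sold', 'sum'),
    avg_price=('price', 'mean'),
)
print(summary)
```

Naming the output columns this way keeps the result readable when one input column, such as `units_sold`, is summarized in more than one way.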

In this lesson, we've covered:

- What data aggregation is and why it's useful.
- Common aggregation functions like `max` and `mean`.
- How to use Python's `Pandas` library to aggregate data effectively.

By understanding and applying these concepts, you're now better equipped to summarize and analyze large datasets.

Now it's your turn to practice data aggregation. You'll get a chance to apply these techniques on different datasets to strengthen your understanding and skills. Good luck!