Lesson 3
Mastering Advanced Functions in Pandas: Groupby and Apply for Large-Scale Data Analysis
Introduction to Mastering Pandas: Advanced Functions

Welcome back to our journey toward mastering the advanced concepts in NumPy and Pandas! In previous lessons, we focused on Python basics, delved into matrix operations in NumPy, and introduced you to Pandas. In this lesson, we take a step further in our Pandas expedition.

Today, we focus on enhancing your Python skills by exploring two of the advanced functions that Pandas offers: the groupby and apply methods.

These tools are central to handling large-scale datasets and simplifying complex data analysis. To illustrate, consider a scenario in an eCommerce business: you want to find the total revenue for each product category. Here, the groupby function can split your large sales data into one group per product category, and the apply function can then calculate the revenue within each group. Such manipulations are pivotal for efficient data preprocessing, especially in areas like Machine Learning, where understanding the relationships between different data groups can provide valuable insights.
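As a preview of where this lesson is heading, the eCommerce scenario might look like the minimal sketch below. The orders table, its column names (category, price, quantity), and all values are hypothetical, invented purely for illustration.

```python
import pandas as pd

# Hypothetical order data; column names and values are invented for illustration
orders = pd.DataFrame({
    'category': ['Books', 'Toys', 'Books', 'Garden', 'Toys'],
    'price': [12.0, 25.0, 8.0, 40.0, 15.0],
    'quantity': [3, 1, 5, 2, 4],
})

# Revenue of each order line
orders['revenue'] = orders['price'] * orders['quantity']

# Group the rows by category, then total the revenue within each group
revenue_by_category = orders.groupby('category')['revenue'].sum()
print(revenue_by_category)
```

The rest of the lesson unpacks what the groupby call returns and how custom functions can replace the built-in sum.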

Our goal for today is threefold: to understand the functionalities of groupby and apply, to recognize their role in data transformation, and most importantly, to apply these tools to tackle complex data analysis problems.

Deep Dive into the groupby() Method in Pandas

The groupby method plays a crucial role in Pandas. It helps in grouping large data sets based on specified criteria by following a 'split-apply-combine' approach.

To clarify, imagine you are an instructor in a school and want to calculate the average score in each subject across your students. The 'split' phase divides the score records into groups by subject. The 'apply' phase calculates the average score within each group, and the 'combine' phase assembles these averages into a single result with one entry per subject.
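To make the three phases concrete, here is a minimal sketch that performs them by hand and then compares the result with a single groupby expression. The gradebook, its column names, and the scores are hypothetical.

```python
import pandas as pd

# Hypothetical gradebook; names and scores are invented for illustration
scores = pd.DataFrame({
    'student': ['Ana', 'Ben', 'Ana', 'Ben', 'Ana', 'Ben'],
    'subject': ['Math', 'Math', 'Art', 'Art', 'History', 'History'],
    'score': [90, 70, 85, 95, 60, 80],
})

# Split: one sub-DataFrame per subject, selected with a boolean mask
pieces = {subject: scores[scores['subject'] == subject]
          for subject in scores['subject'].unique()}

# Apply: compute the mean score inside each piece
means = {subject: rows['score'].mean() for subject, rows in pieces.items()}

# Combine: gather the per-subject results into one Series, sorted by subject
by_hand = pd.Series(means).sort_index()

# groupby performs the same three steps in one expression
by_groupby = scores.groupby('subject')['score'].mean()

print(by_hand)
print(by_groupby)
```

Both results hold the same values; groupby simply hides the bookkeeping and, by default, returns the groups sorted by key.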

In coding parlance, the splitting criterion is defined through keys. A key is most commonly a column label or a list of column labels, but it can also be an array of labels with the same length as the axis being grouped. Here's a simple demonstration of the groupby method:

```python
import pandas as pd

# Create a simple DataFrame
data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
        'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
        'Sales': [200, 120, 340, 124, 243, 350]}
df = pd.DataFrame(data)

# Apply groupby and print each group
df_grouped = df.groupby('Company')
for key, group in df_grouped:
    print("\nGroup Key: {}".format(key))
    print(group, "\n")
"""
Group Key: FB
  Company Person  Sales
4      FB   Carl    243
5      FB  Sarah    350


Group Key: GOOG
  Company   Person  Sales
0    GOOG      Sam    200
1    GOOG  Charlie    120


Group Key: MSFT
  Company   Person  Sales
2    MSFT      Amy    340
3    MSFT  Vanessa    124
"""
```

In the above example, groupby('Company') organizes the DataFrame by its Company column. Note, however, that printing df_grouped itself would not display a DataFrame, which is why we loop over the groups. This is because groupby returns a DataFrameGroupBy object rather than a DataFrame, and this object offers many useful methods for performing various operations on these groups. We will explore some of these in the next section.
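Before moving on, the following sketch shows a few ways to inspect the object that groupby returns, and an array used as the grouping key. The sales data is the same as above; the region labels are hypothetical and exist only to illustrate an array key.

```python
import numpy as np
import pandas as pd

# Same sales data as in the example above
df = pd.DataFrame({
    'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
    'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
    'Sales': [200, 120, 340, 124, 243, 350],
})

grouped = df.groupby('Company')

print(type(grouped).__name__)      # the class name of the groupby object
print(grouped.ngroups)             # number of distinct companies
print(grouped.size())              # number of rows in each group
print(grouped.get_group('MSFT'))   # the rows of one group, as a DataFrame

# A key can also be an array with one label per row.
# These region labels are hypothetical, not part of the original data.
region = np.array(['US', 'US', 'US', 'EU', 'EU', 'EU'])
sales_by_region = df.groupby(region)['Sales'].sum()
print(sales_by_region)
```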

Unraveling the groupby() Operations

The main benefit of the groupby method is the variety of operations we can perform on the groupby object. Aggregation functions like sum() and mean() condense the grouped data into more insightful summaries. Here's how we can use groupby to find the total sales for each company:

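```python
grouped = df.groupby('Company')
# numeric_only=True restricts the sum to numeric columns such as Sales
print(grouped.sum(numeric_only=True))
"""
         Sales
Company
FB         593
GOOG       320
MSFT       464
"""
```

This call returns the sum of every numeric column for each company in our grouped data. Without numeric_only=True, Pandas 2.x also "sums" text columns by concatenating their strings, so the Person column would come back with values like CarlSarah. With a single line, we condense our dataset into a more insightful summary.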
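When more than one summary is needed, the agg method accepts a list of aggregation names. The sketch below rebuilds the same sales data so that it runs on its own, and computes the total, average, and number of sales per company in one call.

```python
import pandas as pd

# Same sales data as in the earlier example
df = pd.DataFrame({
    'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
    'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
    'Sales': [200, 120, 340, 124, 243, 350],
})

# Select the Sales column first, then request several aggregations at once
summary = df.groupby('Company')['Sales'].agg(['sum', 'mean', 'count'])
print(summary)
```

Selecting the column before aggregating keeps text columns such as Person out of the calculation entirely.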

Introduction to the Apply Method in Pandas

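Once we've split our DataFrame into different groups, it is time to introduce apply(). On a DataFrame, apply() calls a function on each column (axis=0, the default) or on each row (axis=1); on a Series, it calls the function on each value. As we will see shortly, apply() is also available on groupby objects, where the function is called once per group, and combining groupby() and apply() lets us conduct intricate data manipulation tasks.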

Here's a simplified instance of the apply method:

```python
import numpy as np
import pandas as pd

# Create a dataframe
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})

# Define a function
def get_sum(row):
    return row.sum()

# Apply the function
df['sum'] = df[['C', 'D']].apply(get_sum, axis=1)

print(df)
"""
     A      B         C         D       sum
0  foo    one -0.343200  0.184665 -0.158535
1  bar    one  0.058870  1.835614  1.894484
2  foo    two  0.801743 -0.184409  0.617333
3  bar  three  0.935406  0.124109  1.059515
4  foo    two  0.782074  0.583470  1.365544
5  bar    two  0.138934  0.710407  0.849341
6  foo    one  0.364633  1.147963  1.512596
7  foo  three -1.364677  1.719538  0.354861
"""
```

In the example above, we've defined a function, get_sum(), and then used the apply method to apply this function to every row in the dataframe. This operation results in a new 'sum' column which is the sum of 'C' and 'D' for each row.
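Row-wise apply calls a Python function once per row, which can become slow on large DataFrames. For a simple sum, built-in vectorized operations give the same result. The sketch below compares the approaches on a small frame of seeded random numbers (the seed is an arbitrary choice so that runs are repeatable), and also shows apply with its default axis=0.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is repeatable; the seed value is arbitrary
rng = np.random.default_rng(0)
df = pd.DataFrame({'C': rng.standard_normal(8),
                   'D': rng.standard_normal(8)})

# Row-wise apply: the lambda is called once per row
via_apply = df[['C', 'D']].apply(lambda row: row.sum(), axis=1)

# Vectorized equivalents: same values without a per-row Python call
via_sum = df[['C', 'D']].sum(axis=1)
via_add = df['C'] + df['D']

# With the default axis=0, apply passes each column to the function instead
column_ranges = df[['C', 'D']].apply(lambda col: col.max() - col.min())

print(np.allclose(via_apply, via_sum), np.allclose(via_sum, via_add))
print(column_ranges)
```

A reasonable rule of thumb is to reach for apply when the logic cannot be expressed with built-in column operations.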

Leveraging the Power of Apply and Groupby

The apply method can be leveraged most effectively by combining it with groupby. This combination allows us to apply functions not just to each row or column of a DataFrame but also to each group of rows. For instance, let's find the maximum sales for each company:

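```python
# Rebuild the sales DataFrame, since df was reassigned in the previous example
df = pd.DataFrame({'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
                   'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
                   'Sales': [200, 120, 340, 124, 243, 350]})

print(df.groupby('Company').apply(lambda x: x['Sales'].max()))
"""
Company
FB      350
GOOG    200
MSFT    340
dtype: int64
"""
```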

In this example, groupby('Company') divides our DataFrame by the Company column. Then apply(lambda x: x['Sales'].max()) applies a lambda function to each group, returning the maximum 'Sales' for each company.
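For a task as simple as a maximum, a built-in aggregation on the selected column gives the same numbers, and apply becomes most useful when the per-group logic has no single built-in. The sketch below shows both on the same sales data; the "range" calculation is just an illustrative choice of custom logic.

```python
import pandas as pd

# Same sales data as in the earlier examples
df = pd.DataFrame({
    'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
    'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
    'Sales': [200, 120, 340, 124, 243, 350],
})

# Select the column once; each group is then a Series of sales figures
grouped_sales = df.groupby('Company')['Sales']

# Built-in aggregation: same values as the lambda-based apply
max_builtin = grouped_sales.max()

# Custom per-group logic: gap between the best and worst sale in each company
sales_range = grouped_sales.apply(lambda s: s.max() - s.min())

print(max_builtin)
print(sales_range)
```

Selecting the Sales column before calling apply also means the function receives only the data it needs rather than the whole group.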

Delving into the California Housing Dataset with Advanced Pandas

With the concepts of apply and groupby under our belt, let's dive into the California Housing dataset and extract valuable insights using these functions.

Here is how we import the California Housing dataset:

```python
from sklearn.datasets import fetch_california_housing
import pandas as pd

# Fetch the dataset; with as_frame=True, data.data is already a DataFrame
data = fetch_california_housing(as_frame=True)

# Copy the feature columns into a working DataFrame
housing_df = data.data.copy()
```

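In the above example, fetch_california_housing(as_frame=True) returns the dataset with its features as a DataFrame. Each row describes a California census block group, with features such as median income (MedInc), population (Population), and average occupancy (AveOccup). The median house value for each block group is the prediction target and is available separately as data.target.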

Advanced Data Analysis

Now, let's apply all our learning to solve a complex problem: calculating the average population for each income category. To do this, we first need to categorize incomes into different categories, which is where the function pd.cut() comes in. It segments and sorts data values into bins. Then groupby() will group our DataFrame by these income categories, and finally, apply() will calculate the average population for each group. Here's the code:

```python
import numpy as np  # needed for np.inf below

# Define income category
housing_df['income_cat'] = pd.cut(housing_df['MedInc'],
                                  bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                                  labels=[1, 2, 3, 4, 5])

# Group by income category and calculate the average population
average_population = housing_df.groupby('income_cat').apply(lambda x: x['Population'].mean())

print(average_population)
"""
income_cat
1    1105.806569
2    1418.232336
3    1448.062465
4    1488.974718
5    1389.890347
dtype: float64
"""
```

In this snippet, pd.cut() segments the median income into different categories, which are labeled from 1 to 5. groupby('income_cat') then groups the DataFrame by these income categories, and apply(lambda x: x['Population'].mean()) calculates the average population for each income category.
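To see exactly how pd.cut() assigns values to bins, it can help to run the same binning on a handful of rows. The toy values below are hypothetical and chosen to sit on and between the bin edges; they are not taken from the housing dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen to land on and between the bin edges
toy = pd.DataFrame({
    'MedInc': [1.0, 1.5, 2.0, 4.5, 5.0, 8.0],
    'Population': [100, 200, 300, 400, 500, 600],
})

# Bins are right-inclusive by default: 1.5 falls in (0, 1.5], so it gets label 1
toy['income_cat'] = pd.cut(toy['MedInc'],
                           bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                           labels=[1, 2, 3, 4, 5])
print(toy)

# observed=True limits the result to categories that actually occur in the data
avg_pop = toy.groupby('income_cat', observed=True)['Population'].mean()
print(avg_pop)
```

Because the bins are right-inclusive, a value equal to an edge belongs to the lower category, which matters when incomes cluster at round numbers.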

Lesson Summary

In this lesson, we've delved deeper into the powerful functionality of Pandas, focusing on the groupby and apply methods. We've explored their roles in transforming data, seen them in action, and applied them to solve complex data analysis problems.

Along the way, we worked through the California Housing dataset, showing how these data analysis skills can extract valuable insights from real data.

The knowledge and hands-on experience gained from manipulating a large dataset should help you use these tools to simplify and accelerate your data analysis and preprocessing tasks.

Ready for Practice?

We've covered the theory and worked through examples using these advanced Pandas functions. Now, it's time to dive deeper with hands-on practice exercises on CodeSignal. These exercises will give you firsthand experience solving real-world problems using these methods. So gear up, and remember, the path to success is paved with practice! Happy Learning!
