Welcome to today's lesson on grouping data frames and performing analyses. Real-world data is often large and messy, and grouping lets us break it into categories that we can analyze one at a time. Once data is grouped, summarizing it at a coarse or fine level becomes a breeze. Let's delve further into this.
Grouping data means analyzing it through the lens of certain categories. In R, `group_by()` from dplyr does this for us. Consider a dataset `sales_df` that contains sales information for different products. If we group it by `product_name`, every row for a given product is treated as one group, so we can compute figures per product instead of mixing all products together.
```r
library(dplyr)  # Loading the library

# Creating a sample sales dataset
sales_df <- data.frame(
  product_name = c('Widget A', 'Widget B', 'Gadget A', 'Gadget B',
                   'Widget A', 'Widget B', 'Gadget A', 'Gadget B'),
  qty_sold = c(15, 20, 10, 5, 25, 30, 5, 10),
  price = c(2.50, 3.00, 1.95, 2.25, 2.45, 3.10, 2.00, 2.50)
)

grouped_df <- group_by(sales_df, product_name)  # Grouping by product name
```
`grouped_df` is a grouped tibble: it holds the same rows and columns as `sales_df`, plus information about which rows belong to which group. If we print it, the data is the same as in the original `sales_df`; only the header gains a `# Groups:` line. The real difference is in the internal structure, which the `summarize()` function uses to compute one result per group.
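To see that the grouping lives in the object's structure rather than in its rows, here is a minimal sketch that inspects both objects with dplyr's helper functions `is_grouped_df()`, `group_vars()`, and `n_groups()`. It rebuilds the same sample data so it can be run on its own.

```r
library(dplyr)

# Same sample data as in the lesson
sales_df <- data.frame(
  product_name = c('Widget A', 'Widget B', 'Gadget A', 'Gadget B',
                   'Widget A', 'Widget B', 'Gadget A', 'Gadget B'),
  qty_sold = c(15, 20, 10, 5, 25, 30, 5, 10),
  price = c(2.50, 3.00, 1.95, 2.25, 2.45, 3.10, 2.00, 2.50)
)
grouped_df <- group_by(sales_df, product_name)

is_grouped_df(sales_df)    # FALSE: a plain data frame
is_grouped_df(grouped_df)  # TRUE: carries grouping information
group_vars(grouped_df)     # "product_name"
n_groups(grouped_df)       # 4, one group per distinct product

# The rows themselves are unchanged by grouping
nrow(grouped_df) == nrow(sales_df)  # TRUE
```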
Grouping data is only the first step. Once data is grouped, we can use the `summarize()` function to compute one value per group: sums, minimum and maximum values, means, medians, and so on. We chain `group_by()` and `summarize()` together using `%>%`.
The `%>%` operator, known as the pipe operator, passes the result of one function as the first argument to the next function. Instead of nesting functions inside each other, you write a sequence of operations in a linear order that reads top to bottom, which makes the code easier to read and to modify.
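To make the comparison concrete, here is a small sketch that writes the same summary twice, once with nested calls and once with the pipe. The names `nested` and `piped` are only illustrative; the sample data is rebuilt so the block runs on its own.

```r
library(dplyr)

# Same sample data as in the lesson
sales_df <- data.frame(
  product_name = c('Widget A', 'Widget B', 'Gadget A', 'Gadget B',
                   'Widget A', 'Widget B', 'Gadget A', 'Gadget B'),
  qty_sold = c(15, 20, 10, 5, 25, 30, 5, 10),
  price = c(2.50, 3.00, 1.95, 2.25, 2.45, 3.10, 2.00, 2.50)
)

# Nested form: has to be read from the inside out
nested <- summarize(group_by(sales_df, product_name),
                    total_sales = sum(qty_sold))

# Piped form: the same steps, read top to bottom
piped <- sales_df %>%
  group_by(product_name) %>%
  summarize(total_sales = sum(qty_sold))

# Both forms describe the same computation
isTRUE(all.equal(nested, piped))
```

The pipe changes only how the code is written, not what it computes.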
```r
# Summarizing `sales_df` with total quantity sold and average price per product
sales_summary <- sales_df %>%
  group_by(product_name) %>%
  summarize(total_sales = sum(qty_sold),                # Sum of qty sold for each product
            average_price = mean(price, na.rm = TRUE))  # Average price for each product
# na.rm = TRUE tells mean() to ignore NA values
```
The result is:
```
  product_name total_sales average_price
  <chr>              <dbl>         <dbl>
1 Gadget A              15          1.98
2 Gadget B              15          2.38
3 Widget A              40          2.48
4 Widget B              50          3.05
```
It calculates the total quantity sold and the average price for each product. Note how the pipe operator chains the `group_by()` and `summarize()` functions.
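The other summaries mentioned earlier (minimum, maximum, median) follow the same pattern. As a sketch, the block below computes a few of them per product; the result name `price_stats` and its column names are made up for this illustration, and `n()` is dplyr's helper for counting the rows in each group.

```r
library(dplyr)

# Same sample data as in the lesson
sales_df <- data.frame(
  product_name = c('Widget A', 'Widget B', 'Gadget A', 'Gadget B',
                   'Widget A', 'Widget B', 'Gadget A', 'Gadget B'),
  qty_sold = c(15, 20, 10, 5, 25, 30, 5, 10),
  price = c(2.50, 3.00, 1.95, 2.25, 2.45, 3.10, 2.00, 2.50)
)

price_stats <- sales_df %>%
  group_by(product_name) %>%
  summarize(
    n_rows     = n(),              # Number of rows in each group
    min_price  = min(price),       # Lowest price per product
    max_price  = max(price),       # Highest price per product
    median_qty = median(qty_sold)  # Median quantity sold per product
  )

print(price_stats)
```

Each argument to `summarize()` becomes one column in the result, and each group becomes one row.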
You have now learned how to group and analyze data with `group_by()` and `summarize()`, and how to chain functions in R with `%>%`. Now it's time for you to put these skills into practice. Happy learning!