Welcome back! In the previous lessons, we've explored how to select, rename, filter, slice, mutate, and relocate columns and rows in your data using the dplyr package. These techniques have provided you with a solid foundation for data manipulation. In this lesson, we will extend that knowledge by delving into two more powerful functions: summarize and group_by.
The summarize and group_by functions in dplyr enable you to transform your data into meaningful summaries and analyze trends within subgroups.
Here’s a taste of what you’ll be working with, using a sample data frame similar to past examples:
```r
suppressPackageStartupMessages(library(dplyr))

# Sample dataframe
sample_df <- tibble(
  ID = 1:5,
  Name = c("John", "Jane", "Alex", "Emily", "David"),
  Age = c(28, 22, 35, 29, 40),
  Salary = c(50000, 60000, 70000, 80000, 90000)
)

# Summarize to get average salary
avg_salary <- sample_df %>% summarize(AvgSalary = mean(Salary))
print("Average salary of the dataframe:")
print(avg_salary)

# Group by age > 30 and summarize average salary
grouped_summary <- sample_df %>% group_by(Age > 30) %>% summarize(AvgSalary = mean(Salary))
print("Average salary grouped by Age > 30:")
print(grouped_summary)
```
Note: Using multiple %>% operators is called "chaining", and we'll explore this concept in more detail in the next unit.
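To make the note on chaining a little more concrete, here is a minimal sketch that writes the same grouped summary two ways: once as separate function calls and once as a chain. The trimmed-down data frame and the variable names grouped_df, stepwise, and chained are introduced here for illustration only.

```r
suppressPackageStartupMessages(library(dplyr))

# Trimmed-down version of the sample data (illustrative)
sample_df <- tibble(
  Age = c(28, 22, 35, 29, 40),
  Salary = c(50000, 60000, 70000, 80000, 90000)
)

# Step by step: each result is stored and passed to the next function
grouped_df <- group_by(sample_df, Age > 30)
stepwise <- summarize(grouped_df, AvgSalary = mean(Salary))

# Chained: %>% passes the left-hand result in as the first argument
chained <- sample_df %>%
  group_by(Age > 30) %>%
  summarize(AvgSalary = mean(Salary))

# Compare the average salaries produced by the two versions
print(all.equal(stepwise$AvgSalary, chained$AvgSalary))
```

Both versions run the same two steps. The chained form simply avoids naming the intermediate result.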
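You will learn how to compute summary statistics with summarize, split data into groups with group_by, and combine the two to get a summary for each group.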
Being able to summarize and group data is fundamental for deriving insights and making data-driven decisions. By combining these functions, you can transform raw data into insightful summaries that are crucial for any data analysis task. These skills will be invaluable whether you’re working with small datasets or large-scale data projects.
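As one possible sketch of combining the two functions, the example below gives the grouping column an explicit name and computes several summaries per group in a single summarize call. The column names Over30, Count, AvgSalary, and MaxAge are chosen here for illustration, and the data mirrors the sample data frame used earlier.

```r
suppressPackageStartupMessages(library(dplyr))

# Same sample data as in the earlier example
sample_df <- tibble(
  ID = 1:5,
  Name = c("John", "Jane", "Alex", "Emily", "David"),
  Age = c(28, 22, 35, 29, 40),
  Salary = c(50000, 60000, 70000, 80000, 90000)
)

# Name the grouping column, then compute several summaries per group
group_stats <- sample_df %>%
  group_by(Over30 = Age > 30) %>%
  summarize(
    Count = n(),               # number of rows in each group
    AvgSalary = mean(Salary),  # average salary in each group
    MaxAge = max(Age)          # oldest person in each group
  )

print(group_stats)
```

With this data, the result has one row per group: three people fall in the Over30 = FALSE group and two in the Over30 = TRUE group.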
Excited to dive into summarizing and grouping data? Let’s start the practice section and put these powerful tools to work!