Welcome back! Before we dive into the specifics of manipulating and transforming your data in R, let's recap what we've covered so far. We started with essential R programming concepts, then moved on to acquiring and preparing data for analysis. Now, it's time to take those clean datasets and learn how to manipulate and transform them effectively to uncover insights.
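In this lesson, you will practice several data manipulation and transformation techniques using the `dplyr` library. Specifically, you'll learn how to subset columns with `select`, filter rows with `filter`, add or modify columns with `mutate`, and aggregate data by group with `group_by` and `summarize`.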
Here's a step-by-step breakdown of the techniques you'll be practicing:
We'll create a sample data frame to demonstrate various data manipulation techniques.
```r
# Create a sample data frame
df <- data.frame(
  ID = 1:5,
  Name = c("John", "Jane", "Doe", "Smith", "Emily"),
  Score = c(85, 90, 88, 77, 95)
)
```
Subsetting allows you to focus on specific columns or rows in your dataset. Here, we'll use the `dplyr` library to select the `Name` and `Score` columns.
```r
# Subsetting Data: Select Name and Score columns
selected_data <- dplyr::select(df, Name, Score)
print("Selected Data")
print(selected_data)
```
The `select` function from `dplyr` is used to extract specific columns from the data frame. In this case, we are selecting the `Name` and `Score` columns.
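Beyond naming the columns you want, `select` also lets you drop columns, pick a range, or rename while selecting. The following is a minimal sketch that rebuilds the lesson's sample data frame; the new column name `StudentName` is made up purely for illustration.

```r
library(dplyr)

# Same sample data frame as used throughout the lesson
df <- data.frame(
  ID = 1:5,
  Name = c("John", "Jane", "Doe", "Smith", "Emily"),
  Score = c(85, 90, 88, 77, 95)
)

# Drop a column by negating it: keeps every column except ID
without_id <- select(df, -ID)

# Select a contiguous range of columns, from Name through Score
name_to_score <- select(df, Name:Score)

# Rename while selecting (StudentName is a hypothetical name)
renamed <- select(df, StudentName = Name, Score)

print(names(without_id))
print(names(renamed))
```

Dropping with `-ID` tends to be handier than listing every remaining column when a data frame is wide.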
Syntax of the `select` function:
`select(data_frame, column1, column2, ...)`
`data_frame`: The data frame from which you want to select columns.
`column1, column2, ...`: The names of the columns you want to select.
Filtering helps you keep only the rows that meet certain conditions. In this example, we'll filter the rows where the `Score` is greater than 80.
```r
# Filtering Data: Keep rows where Score is greater than 80
filtered_data <- dplyr::filter(df, Score > 80)
print("Filtered Data")
print(filtered_data)
```
The `filter` function from `dplyr` is used to extract rows that meet a specified condition. We are keeping rows where the `Score` is greater than 80.
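A single condition is the simplest case, but `filter` accepts several conditions at once. The sketch below, which rebuilds the same sample data frame, shows comma-separated conditions (combined with AND), an OR condition using `|`, and membership testing with `%in%`. The particular thresholds and names are chosen only for illustration.

```r
library(dplyr)

# Same sample data frame as used throughout the lesson
df <- data.frame(
  ID = 1:5,
  Name = c("John", "Jane", "Doe", "Smith", "Emily"),
  Score = c(85, 90, 88, 77, 95)
)

# Comma-separated conditions must all be TRUE (logical AND)
high_and_late <- filter(df, Score > 80, ID >= 3)

# Use | when either condition is enough (logical OR)
extremes <- filter(df, Score < 80 | Score >= 95)

# Use %in% to keep rows whose value appears in a set
picked <- filter(df, Name %in% c("Jane", "Emily"))

print(high_and_late)
print(extremes)
print(picked)
```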
Syntax of the `filter` function:
`filter(data_frame, condition)`
`data_frame`: The data frame from which you want to filter rows.
`condition`: The condition that the rows must meet to be included in the output.
The `mutate` function allows you to add new columns or modify existing ones. Here, we'll add a new column indicating the `Score` increased by 10.
```r
# Mutating Data: Add a new column showing Score increased by 10
mutated_data <- dplyr::mutate(df, ScorePlus10 = Score + 10)
print("Mutated Data")
print(mutated_data)
```
The `mutate` function from `dplyr` is used to create new variables or modify existing ones. We are adding a new column `ScorePlus10`, which is the `Score` increased by 10.
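`mutate` can also create several columns in one call and overwrite an existing column. Here is a minimal sketch using the same sample data frame; the `Capped` column and the choice to upper-case `Name` are hypothetical additions for illustration, not part of the lesson's exercise.

```r
library(dplyr)

# Same sample data frame as used throughout the lesson
df <- data.frame(
  ID = 1:5,
  Name = c("John", "Jane", "Doe", "Smith", "Emily"),
  Score = c(85, 90, 88, 77, 95)
)

updated <- mutate(
  df,
  ScorePlus10 = Score + 10,          # new column, as in the lesson
  Capped = pmin(ScorePlus10, 100),   # can refer to the column created just above
  Name = toupper(Name)               # reusing an existing name overwrites that column
)

print(updated)
```

Expressions inside one `mutate` call are evaluated in order, which is why `Capped` can use `ScorePlus10` immediately.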
Syntax of the `mutate` function:
`mutate(data_frame, new_column_name = expression)`
`data_frame`: The data frame to which you want to add or modify columns.
`new_column_name`: The name of the new column.
`expression`: The expression used to compute the values of the new column.
Aggregation helps summarize data by categories. We'll calculate the mean `Score` for each `ID`.
```r
# Load dplyr so that the %>% pipe operator is available
library(dplyr)

# Aggregating Data: Calculate the mean Score for each ID
grouped_summary <- df %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize(mean_score = mean(Score))
print("Grouped Summary")
print(grouped_summary)
```
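The `group_by` function from `dplyr` creates groups within the data, and the `summarize` function calculates summary statistics for each group. Here, we are computing the mean `Score` for each `ID`. Because every `ID` appears exactly once in `df`, each group contains a single row, so each `mean_score` simply equals that row's `Score`. Grouping becomes more informative when the grouping column has repeated values.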
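To see aggregation combine several rows per group, it helps to group by a column with repeated values. The sketch below adds a `Team` column to the sample data. That column and its values are assumptions made up for this illustration and do not appear in the lesson's data frame.

```r
library(dplyr)

# Sample data plus a hypothetical Team column (not part of the lesson's df)
df_teams <- data.frame(
  ID = 1:5,
  Name = c("John", "Jane", "Doe", "Smith", "Emily"),
  Score = c(85, 90, 88, 77, 95),
  Team = c("A", "B", "A", "B", "A")
)

# One output row per team, with the mean score and the number of rows in the group
team_summary <- df_teams %>%
  group_by(Team) %>%
  summarize(
    mean_score = mean(Score),
    n_students = n()
  )

print(team_summary)
```

Here `n()` counts the rows in each group, which is a quick way to check how many observations each mean is based on.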
Syntax of the `group_by` and `summarize` functions:
`group_by(data_frame, grouping_column)`
`data_frame`: The data frame to group.
`grouping_column`: The column by which to group the data.
`summarize(grouped_data_frame, summary_column_name = summary_function(column_to_summarize))`
`grouped_data_frame`: The grouped data frame.
`summary_column_name`: The name of the new column containing the summary statistic.
`summary_function(column_to_summarize)`: The function and column used to compute the summary statistic (e.g., `mean(Score)`).
Notes on the `%>%` operator:
The `%>%` operator, also known as the pipe operator, comes from the `magrittr` package and is re-exported by `dplyr`, so it is available once `dplyr` is attached with `library(dplyr)`.
`%>%` is used to chain `group_by(ID)` and `summarize(mean_score = mean(Score))`, effectively creating a pipeline of operations on `df`.
Being proficient at data manipulation and transformation is crucial for any data scientist. Here’s why:
By the end of this lesson, you'll be able to manipulate and transform data confidently, setting the stage for advanced analysis and visualization. Ready to take the next step? Let’s start the practice section and transform our data!
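As a warm-up for the practice section, the verbs from this lesson can be combined into a single pipeline. The following is one possible sketch using the same sample data frame. The nested call is included only to show what the pipe is doing, and the variable names `result` and `nested` are arbitrary.

```r
library(dplyr)

# Same sample data frame as used throughout the lesson
df <- data.frame(
  ID = 1:5,
  Name = c("John", "Jane", "Doe", "Smith", "Emily"),
  Score = c(85, 90, 88, 77, 95)
)

# Piped form: each step receives the result of the previous one
result <- df %>%
  filter(Score > 80) %>%
  mutate(ScorePlus10 = Score + 10) %>%
  select(Name, ScorePlus10)

# Equivalent nested form: same operations, read from the inside out
nested <- select(
  mutate(filter(df, Score > 80), ScorePlus10 = Score + 10),
  Name, ScorePlus10
)

print(result)
```

Both forms apply the same operations in the same order; the piped version reads top to bottom, which is usually easier to follow as pipelines grow.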