Lesson 1

Merging Data Frames in R with dplyr

Introduction to Data Frames and dplyr

Welcome aboard our enlightening journey through merging data frames using dplyr in R! In the real world, data are rarely consolidated in one location. Often, they're spread across multiple sources, waiting to be collected, organized, and analyzed. Whether we're dealing with sales data from various regions, healthcare records from a multitude of facilities, or educational scores from several institutions, joining diverse chunks of data is a routine task in any data-driven field.

In this lesson, we will learn how to use this powerful tool to combine data frames, and discover various merge operations and their usage in different scenarios. With practical examples to guide you, get ready to master the art of merging data frames with R dplyr!

Basic Syntax for Merging Data Frames

We utilize the join() functions from dplyr to combine data frames. Here's a general example:

R
1library(dplyr) 2 3merged_df <- inner_join(df1, df2, by = "common_column")

In these examples, the abstract variables df1 and df2 are merged based on a shared or common column.

We shall look at specific examples and unpack the four types of merges: inner join, outer join, left join, and right join.

Dataset

For this lesson, we will use the following dataset, stored in two separate data frames:

R
1df_books <- data.frame( 2 "Book_ID" = c(1, 2, 3, 4, 5), 3 "Book_Title" = c('Gatsby', 'Mockingbird', '1984', 'Catcher', 'LOTR'), 4 "Author_ID" = c(101, 102, 103, NA, 112), 5 "Genre" = c('Fiction', 'Fiction', 'Fiction', 'Fiction', 'Fantasy') 6 ) 7 8# creating the DataFrame for Authors 9df_authors <- data.frame( 10 "Author_ID" = c(101, 102, 103, 104, 105), 11 "Author_Name" = c('F. Fitzgerald', 'H. Lee', 'G. Orwell', 'J. Salinger', 'J. Tolkien'), 12 "Nationality" = c('American', 'American', 'British', 'American', 'British') 13)

Two important things to note:

  1. The author with Author_ID=112 is missing in the df_authors data frame.
  2. The book named Catcher in the df_books data frame has missing info about its author.
Inner Join

An inner join includes rows where there is a match in both data frames. The following examples show how you can perform an inner join:

R
1# Merge the dataframes - inner merge 2merged_df <- inner_join(df_books, df_authors, by = "Author_ID") # With dplyr

The resultant data frame will include only rows with common Author_ID in both data frames, so books without author information will not be included.

1 Book_ID Book_Title Author_ID Genre Author_Name Nationality 21 1 Gatsby 101 Fiction F. Fitzgerald American 32 2 Mockingbird 102 Fiction H. Lee American 43 3 1984 103 Fiction G. Orwell British
Outer Join

An outer join includes all the rows from both data frames, filling in NA for missing values:

R
1# Merge - outer merge 2merged_df <- full_join(df_books, df_authors, by = "Author_ID")

This approach includes all data from both data frames. In the case of any missing data, it adds NA. Note how it includes all the books, even if they lack author info, and all authors, even if there is no book info for them.

1 Book_ID Book_Title Author_ID Genre Author_Name Nationality 21 1 Gatsby 101 Fiction F. Fitzgerald American 32 2 Mockingbird 102 Fiction H. Lee American 43 3 1984 103 Fiction G. Orwell British 54 4 Catcher NA Fiction <NA> <NA> 65 5 LOTR 112 Fantasy <NA> <NA> 76 NA <NA> 104 <NA> J. Salinger American 87 NA <NA> 105 <NA> J. Tolkien British
Left Join

A left join includes all rows from the first data frame and fills in NA for missing values in the second data frame:

R
1# Merge - left merge 2merged_df <- left_join(df_books, df_authors, by = "Author_ID")

After performing a left merge, the resultant data frame includes all books, even those missing the author's info.

1 Book_ID Book_Title Author_ID Genre Author_Name Nationality 21 1 Gatsby 101 Fiction F. Fitzgerald American 32 2 Mockingbird 102 Fiction H. Lee American 43 3 1984 103 Fiction G. Orwell British 54 4 Catcher NA Fiction <NA> <NA> 65 5 LOTR 112 Fantasy <NA> <NA>
Right Join

A right join includes all rows from the second data frame, similar to a left join in reverse:

R
1merged_df = right_join(df_books, df_authors, by = "Author_ID")

After a right merge, the resultant data frame includes all authors, even if there is no corresponding book information.

1 Book_ID Book_Title Author_ID Genre Author_Name Nationality 21 1 Gatsby 101 Fiction F. Fitzgerald American 32 2 Mockingbird 102 Fiction H. Lee American 43 3 1984 103 Fiction G. Orwell British 54 NA <NA> 104 <NA> J. Salinger American 65 NA <NA> 105 <NA> J. Tolkien British
Lesson Summary

Through this lesson, you have learned how to handle data-merging scenarios. This lesson has strengthened your foundation in a critical aspect of data analysis and feature engineering: combining multiple datasets.

Prepare yourself for more experiential learning in the following exercises, where you'll apply the concepts you've learned.

Without further ado, let's embark on our hands-on learning journey with exciting coding challenges — here's to happy coding!

Enjoy this lesson? Now it's time to practice with Cosmo!

Practice is how you turn knowledge into actual skills.