Lesson 1

Mastering DataFrame Merging in Python with Pandas

Introduction to DataFrame and Pandas

Welcome aboard on our enlightening journey through merging DataFrames using pandas in Python! In the real world, data is rarely consolidated in one location. More often, it's spread across several sources, waiting to be collected, organized, and analyzed. Whether dealing with sales data from different regions, healthcare records from multiple facilities, or educational scores from several institutions, joining diverse chunks of data is a daily routine in any data-driven field.

In this lesson, we will learn how to use this powerful tool to combine DataFrames and discover the various merge operations and their usage in different scenarios. With practical examples to guide you, get ready to master the art of merging DataFrames with Python Pandas!

Basic Syntax for Merging DataFrames

We use the merge() function provided by pandas to combine DataFrames. This function combines two DataFrames and returns a captured DataFrame based on a common or shared column. Here's a general example:

Python
1merged_df = df1.merge(df2, on="common_column", how="inner")

In this example, abstract df1 and df2 are merged based on a shared or common column. The argument how="inner" denotes this as an inner merge.

Let's look at specific examples and unpack the four types of merges: inner join, outer join, left join, and right join.

Dataset

For this lesson, we will use the following dataset, stored in two separated dataframes:

Python
1df_books = pd.DataFrame({ 2 "Book_ID": [1, 2, 3, 4, 5], 3 "Book_Title": ['Gatsby', 'Mockingbird', '1984', 'Catcher', 'LOTR'], 4 "Author_ID": [101, 102, 103, None, 112], 5 "Genre": ['Fiction', 'Fiction', 'Fiction', 'Fiction', 'Fantasy'] 6 }) 7 8# creating the DataFrame for Authors 9df_authors = pd.DataFrame({ 10 "Author_ID": [101, 102, 103, 104, 105], 11 "Author_Name": ['F. Fitzgerald', 'H. Lee', 'G. Orwell', 'J. Salinger', 'J. Tolkien'], 12 "Nationality": ['American', 'American', 'British', 'American', 'British'] 13})

Two important things to note: The author with Author_ID=112 is missing in the df_authors dataframe The book named Catcher in the df_books dataframe misses info about its author

Inner Join

An inner join includes rows where there is a match in both DataFrames. Here's how you can perform an inner join:

Python
1# Merge the dataframes - inner merge 2merged_df = df_books.merge(df_authors, on="Author_ID", how="inner") 3print(merged_df) 4'''Output: 5 Book_ID Book_Title Author_ID Genre Author_Name Nationality 60 1 Gatsby 101.0 Fiction F. Fitzgerald American 71 2 Mockingbird 102.0 Fiction H. Lee American 82 3 1984 103.0 Fiction G. Orwell British 9'''

The resultant DataFrame will have only rows with common Author_ID in both dataframes, so we don't include books where author information is missing or undefined.

Outer Join

An outer join includes all the rows from both DataFrames and fills NaN for missing values:

Python
1# Merge - outer merge 2merged_df = df_books.merge(df_authors, on="Author_ID", how="outer") 3print(merged_df) 4'''Output: 5 Book_ID Book_Title Author_ID Genre Author_Name Nationality 60 1.0 Gatsby 101.0 Fiction F. Fitzgerald American 71 2.0 Mockingbird 102.0 Fiction H. Lee American 82 3.0 1984 103.0 Fiction G. Orwell British 93 4.0 Catcher NaN Fiction NaN NaN 104 5.0 LOTR 112.0 Fantasy NaN NaN 115 NaN NaN 104.0 NaN J. Salinger American 126 NaN NaN 105.0 NaN J. Tolkien British 13'''

All data from both dataframes is included. For any missing data, it has NaN. Note how it includes all the books, even if they miss the author info, and all the authors, even if there is no book info for them.

Left Join

A left join includes all rows from the first DataFrame and fills NaN for missing values in the second DataFrame:

Python
1# Merge - left merge 2merged_df = df_books.merge(df_authors, on="Author_ID", how="left") 3print(merged_df) 4'''Output: 5 Book_ID Book_Title Author_ID Genre Author_Name Nationality 60 1 Gatsby 101.0 Fiction F. Fitzgerald American 71 2 Mockingbird 102.0 Fiction H. Lee American 82 3 1984 103.0 Fiction G. Orwell British 93 4 Catcher NaN Fiction NaN NaN 104 5 LOTR 112.0 Fantasy NaN NaN 11'''

After a left merge, the resultant DataFrame includes all books, even if they miss the author info.

Right Join

A right join includes all rows from the second DataFrame, in reverse to a left join. Here's an example of a right join:

Python
1merged_df = df_books.merge(df_authors, on="Author_ID", how="right") 2print(merged_df) 3'''Output: 4 Book_ID Book_Title Author_ID Genre Author_Name Nationality 50 1.0 Gatsby 101.0 Fiction F. Fitzgerald American 61 2.0 Mockingbird 102.0 Fiction H. Lee American 72 3.0 1984 103.0 Fiction G. Orwell British 83 NaN NaN 104.0 NaN J. Salinger American 94 NaN NaN 105.0 NaN J. Tolkien British 10'''

After a right merge, the resultant DataFrame includes all authors, even if there is no book information for them.

Lesson Summary

Through this lesson, you have learned how to handle data-merging scenarios. This lesson has fortified your foundation for a critical aspect in data analysis and feature engineering: combining multiple datasets.

Get ready for more experiential learning in the following exercises, where you will apply the concepts you learned.

Without further ado, let's embark on our hands-on learning journey with interesting coding challenges—here's to happy coding!

Enjoy this lesson? Now it's time to practice with Cosmo!

Practice is how you turn knowledge into actual skills.