Lesson 1
Merging Datasets
Lesson Introduction

Welcome! Today, we'll dive into merging datasets. Imagine you have two sets of information about the same group of people or things, and you want to combine them into one set. This is important in data analysis because it helps you enrich your data and discover more insights.

By the end of this lesson, you'll understand:

  1. How to merge datasets using the pd.merge() function.
  2. Different ways to merge datasets: left, right, inner, and outer joins.
  3. Practical examples to connect the theory with real-life scenarios.
Introduction to `pd.merge()`

Merging datasets is like combining two puzzles. In Python, we use the pd.merge() function from Pandas to do this. Let's create two simple datasets:

Python
1import pandas as pd 2 3# Dataset 1: Basic information about some students 4students = pd.DataFrame({ 5 'student_id': [1, 2, 3], 6 'name': ['Alice', 'Bob', 'Charlie'], 7 'age': [15, 16, 17] 8}) 9 10# Dataset 2: Additional performance information about the same students 11performance = pd.DataFrame({ 12 'student_id': [1, 3, 4], 13 'grade': ['A', 'B', 'A'], 14 'attendance': [95, 85, 100] 15})
Example with Two DataFrames

Now that we have two datasets, let's combine them using pd.merge(). We will merge these datasets based on the common column student_id.

Python
1# Merging the datasets on 'student_id' 2students_merged = pd.merge(students, performance, on='student_id', how='left') 3print(students_merged) 4# student_id name age grade attendance 5# 0 1 Alice 15 A 95.0 6# 1 2 Bob 16 NaN NaN 7# 2 3 Charlie 17 B 85.0

Explanation:

  • students: Contains basic information about students.
  • performance: Contains additional info like grades and attendance.
  • pd.merge(students, performance, on='student_id', how='left'): Merges the two DataFrames based on student_id. The how='left' parameter keeps all rows from the left DataFrame (students).
Different Types of Joins: Part 1

Merging datasets can be done in various ways, depending on how much and what kind of data you want to include. Here are the most common types:

  1. Left Join: All rows from the left DataFrame, matched rows from the right DataFrame.
  2. Right Join: All rows from the right DataFrame, matched rows from the left DataFrame.

Let's see examples of each type:

Python
1# Left Join 2left_join = pd.merge(students, performance, on='student_id', how='left') 3print(left_join) 4# student_id name age grade attendance 5# 0 1 Alice 15 A 95.0 6# 1 2 Bob 16 NaN NaN 7# 2 3 Charlie 17 B 85.0
Python
1# Right Join 2right_join = pd.merge(students, performance, on='student_id', how='right') 3print(right_join) 4# student_id name age grade attendance 5# 0 1 Alice 15.0 A 95 6# 1 3 Charlie 17.0 B 85 7# 2 4 NaN NaN A 100
Different Types of Joins: Part 2

Continuing from part 1, let's cover two more common join types:

  1. Inner Join: Only rows that match in both DataFrames.
  2. Outer Join: All rows from both DataFrames, matched where possible.
Python
1# Inner Join 2inner_join = pd.merge(students, performance, on='student_id', how='inner') 3print(inner_join) 4# student_id name age grade attendance 5# 0 1 Alice 15 A 95 6# 1 3 Charlie 17 B 85

As you can see, student ids 2 and 4 are not present, as they are missing from at least one of the datasets.

Python
1# Outer Join 2outer_join = pd.merge(students, performance, on='student_id', how='outer') 3print(outer_join) 4# student_id name age grade attendance 5# 0 1 Alice 15.0 A 95.0 6# 1 2 Bob 16.0 NaN NaN 7# 2 3 Charlie 17.0 B 85.0 8# 3 4 NaN NaN A 100.0

Now, all the student ids are present.

Lesson Summary

Great job! Today, we've learned how important merging datasets is when dealing with multiple sources of information. We covered:

  1. The basics of merging datasets using pd.merge().
  2. Types of joins (left, right, inner, outer).
  3. Practical examples to see how merging can be applied in real-life situations.

You're now ready to move to the practice part. In the exercises, you'll get hands-on experience with merging datasets. Combining different pieces of information can help you gain better insights and make more informed decisions. Let's get started!

Enjoy this lesson? Now it's time to practice with Cosmo!
Practice is how you turn knowledge into actual skills.