Welcome! Today, we'll dive into merging datasets. Imagine you have two sets of information about the same group of people or things, and you want to combine them into one set. This is important in data analysis because it helps you enrich your data and discover more insights.
By the end of this lesson, you'll understand:
pd.merge()
function.Merging datasets is like combining two puzzles. In Python, we use the pd.merge()
function from Pandas to do this. Let's create two simple datasets:
Python1import pandas as pd 2 3# Dataset 1: Basic information about some students 4students = pd.DataFrame({ 5 'student_id': [1, 2, 3], 6 'name': ['Alice', 'Bob', 'Charlie'], 7 'age': [15, 16, 17] 8}) 9 10# Dataset 2: Additional performance information about the same students 11performance = pd.DataFrame({ 12 'student_id': [1, 3, 4], 13 'grade': ['A', 'B', 'A'], 14 'attendance': [95, 85, 100] 15})
Now that we have two datasets, let's combine them using pd.merge()
. We will merge these datasets based on the common column student_id
.
Python1# Merging the datasets on 'student_id' 2students_merged = pd.merge(students, performance, on='student_id', how='left') 3print(students_merged) 4# student_id name age grade attendance 5# 0 1 Alice 15 A 95.0 6# 1 2 Bob 16 NaN NaN 7# 2 3 Charlie 17 B 85.0
Explanation:
students
: Contains basic information about students.performance
: Contains additional info like grades and attendance.pd.merge(students, performance, on='student_id', how='left')
: Merges the two DataFrames based on student_id
. The how='left'
parameter keeps all rows from the left DataFrame (students
).Merging datasets can be done in various ways, depending on how much and what kind of data you want to include. Here are the most common types:
Let's see examples of each type:
Python1# Left Join 2left_join = pd.merge(students, performance, on='student_id', how='left') 3print(left_join) 4# student_id name age grade attendance 5# 0 1 Alice 15 A 95.0 6# 1 2 Bob 16 NaN NaN 7# 2 3 Charlie 17 B 85.0
Python1# Right Join 2right_join = pd.merge(students, performance, on='student_id', how='right') 3print(right_join) 4# student_id name age grade attendance 5# 0 1 Alice 15.0 A 95 6# 1 3 Charlie 17.0 B 85 7# 2 4 NaN NaN A 100
Continuing from part 1, let's cover two more common join types:
Python1# Inner Join 2inner_join = pd.merge(students, performance, on='student_id', how='inner') 3print(inner_join) 4# student_id name age grade attendance 5# 0 1 Alice 15 A 95 6# 1 3 Charlie 17 B 85
As you can see, student ids 2
and 4
are not present, as they are missing from at least one of the datasets.
Python1# Outer Join 2outer_join = pd.merge(students, performance, on='student_id', how='outer') 3print(outer_join) 4# student_id name age grade attendance 5# 0 1 Alice 15.0 A 95.0 6# 1 2 Bob 16.0 NaN NaN 7# 2 3 Charlie 17.0 B 85.0 8# 3 4 NaN NaN A 100.0
Now, all the student ids are present.
Great job! Today, we've learned how important merging datasets is when dealing with multiple sources of information. We covered:
pd.merge()
.left
, right
, inner
, outer
).You're now ready to move to the practice part. In the exercises, you'll get hands-on experience with merging datasets. Combining different pieces of information can help you gain better insights and make more informed decisions. Let's get started!