Lesson 4
Filtering Data in Pandas
Introduction to Filtering Data

Hello, friend! Today's topic is Filtering Data. It's about focusing on the data that matters to us. We'll use pandas, a Python library, to help us with this.

The goal? Master data filtering in pandas. By the end, you'll be able to pick out the data you need from a large dataset.

Basics of Data Filtering

Filtering data in pandas is like finding your favorite outfit in a wardrobe. The most common way to filter is to keep only the rows whose value in some column meets a condition. Let's illustrate this using a DataFrame of students' details.

Python
import pandas as pd

# Data of students
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eve'],
    'age': [12, 13, 14, 13, 12],
    'grade_level': [6, 7, 8, 7, 6]
}

students_df = pd.DataFrame(data)

# Filter 7th grade students
grade_seven_students = students_df[students_df['grade_level'] == 7]

print(grade_seven_students)
# Outputs:
#    name  age  grade_level
# 1   Bob   13            7
# 3  Dave   13            7

The code above creates a DataFrame and selects only the rows where grade_level is 7. Now you have the data for the 7th-grade students. Note that this works the same way as NumPy's boolean indexing. Let's recall how it works under the hood.
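As a quick refresher, here is a minimal sketch of the same idea in plain NumPy. The array names grade_levels and names are made up for this illustration and are not part of the lesson's DataFrame.

```python
import numpy as np

# Hypothetical arrays holding the same five students, in the same order
grade_levels = np.array([6, 7, 8, 7, 6])
names = np.array(['Alice', 'Bob', 'Charlie', 'Dave', 'Eve'])

# Comparing an array to a value gives an array of True/False values
mask = grade_levels == 7
print(mask)
# Outputs: [False  True False  True False]

# Indexing with the mask keeps only the positions where it is True
print(names[mask])
# Outputs: ['Bob' 'Dave']
```

pandas follows the same two steps: build a True/False mask from a comparison, then index with it.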

Understanding Boolean Masking

One of the magic tricks of pandas is Boolean masking. A Boolean is a data type with only two values: True or False. "Mask" means to hide. A Boolean mask holds one True or False value per row: rows marked True are kept, and rows marked False are hidden.

We can create a Boolean Series, a pandas Series of True or False values with one entry per row, and use it for filtering.

Python
# Boolean Series for 7th grade
is_grade_seven = students_df['grade_level'] == 7
print(is_grade_seven)
# Outputs:
# 0    False
# 1     True
# 2    False
# 3     True
# 4    False

This code creates a Boolean Series that marks which rows have a grade_level of 7. We can then pass this Series to the DataFrame to filter it:

Python
# Filtering using Boolean Series
grade_seven_students = students_df[is_grade_seven]

print(grade_seven_students)
# Outputs:
#    name  age  grade_level
# 1   Bob   13            7
# 3  Dave   13            7

Note that only the rows where the Boolean Series is True were selected.
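Because a mask is an ordinary Series, it can be reused in other ways. The sketch below rebuilds the lesson's DataFrame so it runs on its own. The variable names num_grade_seven and grade_seven_names are chosen here for illustration.

```python
import pandas as pd

# Same student data as in the lesson
students_df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eve'],
    'age': [12, 13, 14, 13, 12],
    'grade_level': [6, 7, 8, 7, 6]
})

is_grade_seven = students_df['grade_level'] == 7

# True counts as 1 and False as 0, so sum() counts the matching rows
num_grade_seven = int(is_grade_seven.sum())
print(num_grade_seven)
# Outputs: 2

# .loc accepts the same mask and can pick a column at the same time
grade_seven_names = students_df.loc[is_grade_seven, 'name']
print(grade_seven_names.tolist())
# Outputs: ['Bob', 'Dave']
```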

Advanced Data Filtering

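Sometimes we need to filter data using multiple conditions. pandas lets us combine Boolean Series with element-wise operators: And (&), Or (|), and Not (~). Wrap each condition in parentheses, because these operators bind more tightly than comparisons such as ==. Python's own and, or, and not keywords do not work on a Series. Here is an example using &: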

Python
# Filter 7th grade students who are 13 years old
grade_seven_and_thirteen = students_df[(students_df['grade_level'] == 7) & (students_df['age'] == 13)]

print(grade_seven_and_thirteen)
# Outputs:
#    name  age  grade_level
# 1   Bob   13            7
# 3  Dave   13            7
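The Or (|) and Not (~) operators follow the same pattern. Here is a small sketch that rebuilds the lesson's DataFrame so it runs on its own. The variable names eighth_or_thirteen and not_grade_seven are chosen here for illustration.

```python
import pandas as pd

# Same student data as in the lesson
students_df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eve'],
    'age': [12, 13, 14, 13, 12],
    'grade_level': [6, 7, 8, 7, 6]
})

# Or (|): students who are in 8th grade OR are 13 years old
eighth_or_thirteen = students_df[
    (students_df['grade_level'] == 8) | (students_df['age'] == 13)
]
print(eighth_or_thirteen['name'].tolist())
# Outputs: ['Bob', 'Charlie', 'Dave']

# Not (~): everyone who is NOT in 7th grade
not_grade_seven = students_df[~(students_df['grade_level'] == 7)]
print(not_grade_seven['name'].tolist())
# Outputs: ['Alice', 'Charlie', 'Eve']
```

Note the parentheses around each comparison. Without them, Python would try to evaluate the | before the ==, which usually raises an error.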

The isin() method in pandas is another wonderful tool. It checks, element by element, whether each value in a Series appears in a given list of values, and returns a Boolean Series.

Python
# Filter students who are in 6th or 7th grade
middle_school_students = students_df[students_df['grade_level'].isin([6, 7])]

print(middle_school_students)
# Outputs:
#     name  age  grade_level
# 0  Alice   12            6
# 1    Bob   13            7
# 3   Dave   13            7
# 4    Eve   12            6
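To see what isin() saves us, the sketch below compares it with the equivalent chain of | conditions and then negates it with ~. It rebuilds the lesson's DataFrame so it runs on its own. The variable names with_isin, with_or, and not_six_or_seven are chosen here for illustration.

```python
import pandas as pd

# Same student data as in the lesson
students_df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eve'],
    'age': [12, 13, 14, 13, 12],
    'grade_level': [6, 7, 8, 7, 6]
})

# isin() with a list of values...
with_isin = students_df[students_df['grade_level'].isin([6, 7])]

# ...selects the same rows as joining one comparison per value with |
with_or = students_df[
    (students_df['grade_level'] == 6) | (students_df['grade_level'] == 7)
]
print(with_isin.equals(with_or))
# Outputs: True

# ~ flips the mask: students who are in neither 6th nor 7th grade
not_six_or_seven = students_df[~students_df['grade_level'].isin([6, 7])]
print(not_six_or_seven['name'].tolist())
# Outputs: ['Charlie']
```

The longer the list of allowed values, the more readable isin() becomes compared with a chain of | conditions.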

Fantastic! Now you know advanced data filtering techniques.

Lesson Summary

This lesson covered basic to advanced data filtering: Boolean masking, combining multiple conditions, and the isin() method. Keep practicing these skills on different datasets. Remember, practice makes perfect. Stay tuned for the next lesson!
