Hello, fellow explorer! Today, we are going to delve into another exciting segment of your data science expedition: Data Filtering and Sorting with Pandas. You'll learn how to narrow down your data to match certain criteria and arrange it in a particular order. This is a fundamental skill when handling data, enabling us to extract valuable information quickly and efficiently.
In the real world, data analysis rarely means working with an entire dataset at once; more often, you care about specific slices of it. For instance, in our Titanic dataset, you might be interested in passengers who survived or those within a certain age group. How about arranging the data based on `fare` or `age`? That's where data filtering and sorting come into play!
Without further ado, let's get into the practical side of things. We'll commence by introducing data filtering, a powerful tool that allows you to extract a subset of your data that meets certain conditions.
Suppose you're interested in data related to passengers who survived the Titanic disaster. How would you extract this data? With Pandas, you can do this using boolean indexing. Here's how it works:
```python
import seaborn as sns
import pandas as pd

# Load dataset
titanic_df = sns.load_dataset('titanic')

# Filter passengers who survived
survivors = titanic_df[titanic_df['survived'] == 1]
print(survivors.head())

"""
   survived  pclass     sex   age  ...  deck  embark_town  alive  alone
1         1       1  female  38.0  ...     C    Cherbourg    yes  False
2         1       3  female  26.0  ...   NaN  Southampton    yes   True
3         1       1  female  35.0  ...     C  Southampton    yes  False
8         1       3  female  27.0  ...   NaN  Southampton    yes  False
9         1       2  female  14.0  ...   NaN    Cherbourg    yes  False

[5 rows x 15 columns]
"""
```
In this code, `titanic_df['survived'] == 1` creates a boolean mask, a sequence of `True` and `False` values, where `True` corresponds to passengers who survived and `False` to those who didn't. When applied to the DataFrame, it returns only the rows where the mask is `True`, that is, the survivors' data.
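To make the mask itself visible, here is a minimal sketch on a small hand-made DataFrame. The rows in `toy_df` are invented for illustration and are not taken from the Titanic dataset. It builds the mask as a separate step and then applies it.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration (not real Titanic rows)
toy_df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cara', 'Dev'],
    'survived': [1, 0, 1, 0],
})

# Step 1: the comparison yields a boolean Series, one value per row
mask = toy_df['survived'] == 1
print(mask.tolist())  # [True, False, True, False]

# Step 2: indexing with the mask keeps only the rows marked True
toy_survivors = toy_df[mask]
print(toy_survivors['name'].tolist())  # ['Ann', 'Cara']

# The original row labels are kept, so the index is no longer 0..n-1
print(toy_survivors.index.tolist())  # [0, 2]
```

The last line also explains why the survivors printed earlier carry the row labels 1, 2, 3, 8, 9: filtering keeps each row's original index rather than renumbering.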
Once we have our filtered data, it's often useful to sort it based on a particular column. For example, we might want to order the survivors' data by age. To do this, we'll use the pandas `sort_values()` method:
```python
# Sort survivors by age
sorted_df = survivors.sort_values('age')
print(sorted_df.head())

"""
     survived  pclass     sex   age  ...  deck  embark_town  alive  alone
803         1       3    male  0.42  ...   NaN    Cherbourg    yes  False
755         1       2    male  0.67  ...   NaN  Southampton    yes  False
644         1       3  female  0.75  ...   NaN    Cherbourg    yes  False
469         1       3  female  0.75  ...   NaN    Cherbourg    yes  False
831         1       2    male  0.83  ...   NaN  Southampton    yes  False

[5 rows x 15 columns]
"""
```
The `sort_values()` method returns a copy of the DataFrame arranged in ascending order of the column passed to it as an argument. In our case, it's the `age` column. The `head()` method then displays the first 5 rows of the sorted DataFrame.
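The sketch below, again on invented toy data rather than the Titanic dataset, illustrates three behaviours of `sort_values()` worth knowing: the default ascending order, the `ascending=False` option, and the fact that the original DataFrame is left unchanged. One age is deliberately missing to show where pandas places missing values by default.

```python
import pandas as pd

# Hypothetical toy data; None stands in for a missing age
toy_df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cara', 'Dev'],
    'age': [38.0, None, 0.75, 26.0],
})

# Ascending sort (the default); the row with a missing age goes last
youngest_first = toy_df.sort_values('age')
print(youngest_first['name'].tolist())  # ['Cara', 'Dev', 'Ann', 'Bob']

# Descending sort; the missing age still goes last by default
oldest_first = toy_df.sort_values('age', ascending=False)
print(oldest_first['name'].tolist())  # ['Ann', 'Dev', 'Cara', 'Bob']

# sort_values returns a new DataFrame; the original order is unchanged
print(toy_df['name'].tolist())  # ['Ann', 'Bob', 'Cara', 'Dev']
```

Because the result is a new object, it needs to be assigned to a variable (as with `sorted_df` above) if you want to keep it.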
Sometimes, sorting by a single column isn't enough. For instance, what if you want to sort by class and then age within each class? That's where multiple-column sorting comes in. Let's sort our DataFrame first by class ('pclass') in descending order, then by age within each class in ascending order.
```python
# Sort survivors by class and age
sorted_df = survivors.sort_values(['pclass', 'age'], ascending=[False, True])
print(sorted_df.head())

"""
     survived  pclass     sex   age  ...  deck  embark_town  alive  alone
803         1       3    male  0.42  ...   NaN    Cherbourg    yes  False
469         1       3  female  0.75  ...   NaN    Cherbourg    yes  False
644         1       3  female  0.75  ...   NaN    Cherbourg    yes  False
172         1       3  female  1.00  ...   NaN  Southampton    yes  False
381         1       3  female  1.00  ...   NaN    Cherbourg    yes  False

[5 rows x 15 columns]
"""
```
In this case, we are passing a list of column names to the `sort_values()` method and defining the sort order for each column with `ascending=[False, True]`. The two lists line up position by position: pandas sorts by `pclass` in descending order (from third class to first class) and then sorts the rows within each class by `age` in ascending order (from youngest to oldest).
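A five-row example makes the two-level ordering easy to check by eye. This is a sketch on invented data (the names and values in `toy_df` are hypothetical), using the same `sort_values()` call as above.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration
toy_df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cara', 'Dev', 'Eli'],
    'pclass': [1, 3, 3, 2, 1],
    'age': [38.0, 4.0, 27.0, 14.0, 35.0],
})

# pclass descending (3, 2, 1), then age ascending within each class
by_class_then_age = toy_df.sort_values(['pclass', 'age'], ascending=[False, True])

print(by_class_then_age['name'].tolist())
# ['Bob', 'Cara', 'Dev', 'Eli', 'Ann']
print(by_class_then_age['pclass'].tolist())
# [3, 3, 2, 1, 1]
```

Within third class, Bob (4) comes before Cara (27); within first class, Eli (35) comes before Ann (38). The second column only decides the order among rows that tie on the first.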
However, real-world scenarios often require us to filter data using more complex conditions. For instance, you might want data on female passengers who survived. You can achieve this by combining conditions.
```python
# Filter female passengers who survived
female_survivors = titanic_df[
    (titanic_df['survived'] == 1) & (titanic_df['sex'] == 'female')
]
print(female_survivors.head())

"""
   survived  pclass     sex   age  ...  deck  embark_town  alive  alone
1         1       1  female  38.0  ...     C    Cherbourg    yes  False
2         1       3  female  26.0  ...   NaN  Southampton    yes   True
3         1       1  female  35.0  ...     C  Southampton    yes  False
8         1       3  female  27.0  ...   NaN  Southampton    yes  False
9         1       2  female  14.0  ...   NaN    Cherbourg    yes  False

[5 rows x 15 columns]
"""
```
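In this code snippet, `&` stands for the element-wise logical AND operator. Thus, the code keeps only the rows for passengers who survived (`titanic_df['survived'] == 1`) and who are female (`titanic_df['sex'] == 'female'`). Each condition is wrapped in its own parentheses so that the comparisons are evaluated before `&` combines them. The resulting DataFrame, `female_survivors`, contains information only about women who survived the tragedy.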
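The same pattern extends to OR (`|`) and NOT (`~`). The following sketch uses invented toy data (not Titanic rows) and names each mask first, which can make longer filters easier to read. The variable names here are illustrative choices, not part of the lesson's code.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration
toy_df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cara', 'Dev'],
    'survived': [1, 1, 0, 0],
    'sex': ['female', 'male', 'female', 'male'],
})

# Name each condition so the combined filter reads like a sentence
survived = toy_df['survived'] == 1
is_female = toy_df['sex'] == 'female'

# AND: both conditions must hold
both = toy_df[survived & is_female]['name'].tolist()
print(both)  # ['Ann']

# OR: at least one condition must hold
either = toy_df[survived | is_female]['name'].tolist()
print(either)  # ['Ann', 'Bob', 'Cara']

# NOT: invert a condition
not_survived = toy_df[~survived]['name'].tolist()
print(not_survived)  # ['Cara', 'Dev']

# Written inline, each comparison needs its own parentheses,
# because & binds more tightly than ==
inline = toy_df[(toy_df['survived'] == 1) & (toy_df['sex'] == 'female')]
print(inline['name'].tolist())  # ['Ann']
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the symbol operators are used here.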
And that's it for today's session! Give yourself a pat on the back—you've learned how to filter and sort data using pandas. With these skills, you can handle, manipulate, and retrieve data more proficiently.
We covered the basics of data filtering in Pandas using boolean indexing and sorting a DataFrame by a single column. We dove into how to sort by multiple columns and how filtering can employ multiple conditions, providing more flexibility in pinpointing the data you need.
As your next step, get ready for some hands-on practice to reinforce your understanding and gain confidence in applying these newly learned concepts!
Your journey in this vast realm of data manipulation is just beginning. In the next lesson, we'll cover more advanced topics in filtering and sorting, including working with null values.
Now, let's set sail into the practice exercises to reinforce your understanding. Remember, there's no substitute for practicing your coding skills. Happy coding!