Welcome to our lesson on handling categorical data with Pandas! We're diving into a critical aspect of data manipulation. Data comes in various types, and one of the crucial types is categorical data — data divided into specific categories.
By the end of this lesson, you'll understand how to convert columns in a DataFrame to categorical types, why it's important, and how to verify the conversion. We'll also see examples of encoding categorical data efficiently. Let's get started!
Categorical data can be divided into groups or categories. It's like sorting toys into different bins: one for cars, one for dolls, and one for blocks. In real-life data, examples include gender (male or female), class (first, second, third), or colors (red, blue, green).
In Pandas, categorical data can make computations faster and save memory. It's like organizing toys so you can find the one you need quickly!
Starting with this lesson, we will from time to time work with real data, not just toy examples. Meet the famous Titanic dataset, which contains information about the Titanic's passengers and whether they survived. It mainly comprises the passengers' demographics and travel details, which can be used to predict survival. For instance, it includes features like the ticket fare, the passenger's class, and the passenger's age.
This dataset has multiple categorical columns. The most straightforward example is the `sex` column, which contains either "male" or "female".
So why convert data to categorical types?
- Memory Efficiency: Categorical data takes up less memory than string data by storing only distinct values and using codes.
- Performance: Operations on categorical data are faster than on string data because comparisons use integer codes.
- Clarity: It indicates that a column contains specific categories rather than free text.
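The memory claim above can be checked with a small experiment. This is a minimal sketch on a made-up Series (a hypothetical stand-in, not the Titanic data); exact byte counts depend on your pandas version, so none are shown.

```python
import pandas as pd

# Hypothetical stand-in: a long string column with only two distinct values
sex_as_str = pd.Series(["male", "female"] * 50_000, dtype=object)
sex_as_cat = sex_as_str.astype("category")

# deep=True also counts the Python string objects, not just the pointers to them
str_bytes = sex_as_str.memory_usage(deep=True)
cat_bytes = sex_as_cat.memory_usage(deep=True)

print(f"object dtype:   {str_bytes:,} bytes")
print(f"category dtype: {cat_bytes:,} bytes")

# Each value is stored as a small integer code pointing at one of the categories
print("codes dtype:", sex_as_cat.cat.codes.dtype)
```

The saving comes from storing each distinct string once and keeping only a compact integer code per row, so the benefit shrinks as the number of distinct values approaches the number of rows.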
Let's see a practical example using the Titanic dataset, which contains passenger details like gender and class. By converting columns like `sex` and `class` to categorical types, we can make operations more efficient.
Let's convert DataFrame columns to categorical types using the Titanic dataset. We'll use the `.astype()` method in Pandas.
```python
import pandas as pd
import seaborn as sns

# Load Titanic dataset
titanic = sns.load_dataset('titanic')

# Before conversion (info() prints its report itself and returns None,
# so it is called on its own line rather than inside print())
print("Before Conversion:")
titanic.info()
# Before conversion output (first parts of the data only, for brevity)
# Data columns (total 15 columns):
#  #   Column    Non-Null Count  Dtype
# ---  ------    --------------  -----
#  0   survived  891 non-null    int64
#  1   pclass    891 non-null    int64
#  2   sex       891 non-null    object   # <- Important parts to observe for this lesson
#  3   age       714 non-null    float64
#  4   sibsp     891 non-null    int64
#  5   parch     891 non-null    int64
#  6   fare      891 non-null    float64
#  7   embarked  889 non-null    object
#  8   class     891 non-null    object   # <- Important parts to observe for this lesson
#  9   who       891 non-null    object
# ... (remaining columns omitted; dtypes differ per column)
```
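In this snippet, we load the Titanic dataset and print its summary. In this lesson, we will focus only on the `sex` and `class` columns, which contain each passenger's sex and ticket class, respectively. Depending on your seaborn version, `class` may already load as `category`, in which case converting it changes nothing.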
Now let's convert the `sex` and `class` columns and reprint the DataFrame information.
```python
# Convert 'sex' and 'class' columns to categorical types
titanic['sex'] = titanic['sex'].astype('category')
titanic['class'] = titanic['class'].astype('category')

# After conversion (info() prints directly, so no print() wrapper is needed)
print("After Conversion:")
titanic.info()
# After conversion output (first parts of the data only, for brevity)
# Data columns (total 15 columns):
#  #   Column    Non-Null Count  Dtype
# ---  ------    --------------  -----
#  0   survived  891 non-null    int64
#  1   pclass    891 non-null    int64
#  2   sex       891 non-null    category  # <- Changed type
#  3   age       714 non-null    float64
#  4   sibsp     891 non-null    int64
#  5   parch     891 non-null    int64
#  6   fare      891 non-null    float64
#  7   embarked  889 non-null    object
#  8   class     891 non-null    category  # <- Changed type
#  9   who       891 non-null    object
# ... (remaining columns omitted; dtypes differ per column)
```
Notice how `sex` and `class` changed from `object` to `category`. This confirms the conversion was successful. Pandas now treats these columns as categorical data, optimizing memory and performance.
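Besides reading the `info()` report, the conversion can also be verified programmatically. The sketch below uses a small hand-made DataFrame (a hypothetical stand-in for the Titanic columns) so it runs without downloading anything.

```python
import pandas as pd

# Small hand-made frame standing in for the Titanic columns (illustrative only)
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "class": ["Third", "First", "Third", "First"],
})

for col in ["sex", "class"]:
    df[col] = df[col].astype("category")

# Check the dtype directly instead of scanning printed output
is_cat = isinstance(df["sex"].dtype, pd.CategoricalDtype)
print("sex is categorical:", is_cat)

# The distinct categories pandas discovered in each column
print(list(df["sex"].cat.categories))
print(list(df["class"].cat.categories))
```

A check like this is handy inside a data pipeline, where nobody is looking at the printed summary.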
Sometimes, you must convert categorical data to numeric codes for machine learning models. Let's see how to encode the `sex` column with label encoding. It is the simplest encoding: it replaces each category with an integer, for example `female` with `0` and `male` with `1`.
```python
# Label encoding the 'sex' column
titanic['sex_code'] = titanic['sex'].cat.codes
print(titanic.head())
# Output:
#    survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
# 0         0       3    male  22.0      1      0   7.2500        S  Third
# 1         1       1  female  38.0      1      0  71.2833        C  First
# 2         1       3  female  26.0      0      0   7.9250        S  Third
# 3         1       1  female  35.0      1      0  53.1000        S  First
# 4         0       3    male  35.0      0      0   8.0500        S  Third
#      who  adult_male deck  embark_town alive  alone  sex_code
# 0    man        True  NaN  Southampton    no  False         1
# 1  woman       False    C    Cherbourg   yes  False         0
# 2  woman       False  NaN  Southampton   yes   True         0
# 3  woman       False    C  Southampton   yes  False         0
# 4    man        True  NaN  Southampton    no   True         1
```
`cat.codes` is available on a categorical Series through the `.cat` accessor and returns the integer code of each value's category. The codes follow the order of the categories, and `astype('category')` sorts string categories alphabetically. With categories `['female', 'male']`, `female` becomes `0` and `male` becomes `1`, which matches the `sex_code` column in the output above.
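To see which code belongs to which category, you can pair the codes with `cat.categories`. The following is a minimal sketch on a made-up Series (not the Titanic data); it also shows that a missing value receives the code `-1`.

```python
import pandas as pd

# Made-up example with one missing value
sex = pd.Series(["male", "female", None, "female"]).astype("category")

# Position in cat.categories is the code assigned to that category
code_to_label = dict(enumerate(sex.cat.categories))
print(code_to_label)           # {0: 'female', 1: 'male'}

# Missing values are not a category, so they get the sentinel code -1
print(sex.cat.codes.tolist())  # [1, 0, -1, 0]
```

Keeping such a mapping around makes it possible to translate model inputs or outputs back into readable labels.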
Now, let's see an example of one-hot encoding. This encoding will create a separate column for each category.
```python
# One-hot encoding the 'class' column
# dtype=int keeps the 0/1 values shown below; recent pandas versions default to True/False
titanic_class_dummies = pd.get_dummies(titanic['class'], prefix='class', dtype=int)
titanic = pd.concat([titanic, titanic_class_dummies], axis=1)
print(titanic.head())
# Output:
#    survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
# 0         0       3    male  22.0      1      0   7.2500        S  Third
# 1         1       1  female  38.0      1      0  71.2833        C  First
# 2         1       3  female  26.0      0      0   7.9250        S  Third
# 3         1       1  female  35.0      1      0  53.1000        S  First
# 4         0       3    male  35.0      0      0   8.0500        S  Third
#      who  adult_male deck  embark_town alive  alone  sex_code  class_First  \
# 0    man        True  NaN  Southampton    no  False         1            0
# 1  woman       False    C    Cherbourg   yes  False         0            1
# 2  woman       False  NaN  Southampton   yes   True         0            0
# 3  woman       False    C  Southampton   yes  False         0            1
# 4    man        True  NaN  Southampton    no   True         1            0
#    class_Second  class_Third
# 0             0            1
# 1             0            0
# 2             0            1
# 3             0            0
# 4             0            1
```
The `pd.get_dummies` function performs the one-hot encoding and returns a separate DataFrame with the encoded values. Next, we append this new DataFrame to the original one using the `concat` function with `axis=1`. One-hot encoding creates a new column for each category of `class` (`class_First`, `class_Second`, `class_Third`), with binary values indicating whether each record belongs to that category.
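As an alternative to encoding and concatenating in two steps, `pd.get_dummies` also accepts a whole DataFrame together with a `columns` argument. The sketch below uses a tiny made-up frame (the values are illustrative, not taken from the dataset); note that this variant replaces the original `class` column instead of keeping it.

```python
import pandas as pd

# Tiny made-up frame for illustration
df = pd.DataFrame({
    "fare": [7.25, 71.28, 13.0],
    "class": ["Third", "First", "Second"],
})

# Encode only the listed columns; the column name is used as the prefix
encoded = pd.get_dummies(df, columns=["class"], dtype=int)
print(encoded.columns.tolist())
# ['fare', 'class_First', 'class_Second', 'class_Third']

# Exactly one indicator is set per row
dummy_cols = ["class_First", "class_Second", "class_Third"]
print(encoded[dummy_cols].sum(axis=1).tolist())  # [1, 1, 1]
```

Which form to use is a matter of taste: the two-step version in the lesson keeps the original column, while this one-step version is more compact when the original is no longer needed.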
Today, we've learned:
- What categorical data is: Data divided into specific categories.
- Why it's beneficial to convert to categorical types: For memory efficiency and better performance.
- How to perform the conversion: Using the `astype('category')` method in Pandas.
- Encoding examples: Label encoding and one-hot encoding to convert categories into numeric forms.
Now it's time to get hands-on! In the upcoming practice tasks, you'll apply what you've learned. You'll convert columns to categorical types and practice encoding them. This practice will solidify your understanding and build confidence in handling categorical data in Pandas. Let's dive in!