Handling Categorical Data

Lesson 1

Lesson Introduction

Welcome to our lesson on handling categorical data with Pandas! We're diving into a critical aspect of data manipulation. Data comes in various types, and one of the crucial types is categorical data — data divided into specific categories.

By the end of this lesson, you'll understand how to convert columns in a DataFrame to categorical types, why it's important, and how to verify the conversion. We'll also see an example of encoding categorical data efficiently. Let's get started!

Understanding Categorical Data

Categorical data can be divided into groups or categories. It's like sorting toys into different bins: one for cars, one for dolls, and one for blocks. In real-life data, examples include gender (male or female), class (first, second, third), or colors (red, blue, green).

In Pandas, categorical data can make computations faster and save memory. It's like organizing toys so you can find the one you need quickly!

Starting with this lesson, we will occasionally work with real data, not just toy examples. Meet the famous Titanic dataset, which contains information about the Titanic's passengers and whether they survived. It mainly comprises passenger demographics and travel details, which can be used to predict passenger survival. For instance, it includes features such as the ticket fare, the passenger's class, and the passenger's age.

This dataset has multiple categorical columns. The most straightforward example is the 'sex' column, which contains either "male" or "female".

Why Convert to Categorical Data

So why convert data to categorical types?

Memory Efficiency: Categorical data takes up less memory than string data by storing only distinct values and using codes.
Performance: Operations such as comparisons and grouping are often faster on categorical data than on string data because they work on integer codes.
Clarity: It indicates that a column contains specific categories rather than free text.

Let's see a practical example using the Titanic dataset, which contains passenger details like gender and class. By converting columns like sex and class to categorical types, we can make operations more efficient.
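
As a rough check of the memory claim above, the following sketch compares the same values stored as plain strings and as a category. The column is synthetic (10,000 repeated labels, an assumption made for illustration rather than the Titanic data), and the exact byte counts will vary with your Pandas version.

```python
import pandas as pd

# Synthetic column (hypothetical data): 10,000 strings, only two distinct values
as_text = pd.Series(["male", "female"] * 5_000)
as_category = as_text.astype("category")

# deep=True also counts the memory used by the string values themselves
text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"string column:   {text_bytes:,} bytes")
print(f"category column: {category_bytes:,} bytes")
```

The saving comes from storing each distinct label once and keeping a small integer per row, so it shrinks as the number of distinct values approaches the number of rows.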

Identifying Categorical Data

Before converting anything, let's load the Titanic dataset and inspect the type of each column with the .info() method. The conversion itself, which uses the .astype() method in Pandas, follows in the next step.

Python
import pandas as pd
import seaborn as sns

# Load Titanic dataset
titanic = sns.load_dataset('titanic')

# Before conversion
# info() prints its report itself and returns None, so it is called
# on its own line rather than inside print()
print("Before Conversion:")
titanic.info()
# Before conversion output (first columns only, for brevity)
# Data columns (total 15 columns):
#  #   Column       Non-Null Count  Dtype
# ---  ------       --------------  -----
#  0   survived     891 non-null    int64
#  1   pclass       891 non-null    int64
#  2   sex          891 non-null    object  # <- Important for this lesson
#  3   age          714 non-null    float64
#  4   sibsp        891 non-null    int64
#  5   parch        891 non-null    int64
#  6   fare         891 non-null    float64
#  7   embarked     889 non-null    object
#  8   class        891 non-null    object  # <- Important for this lesson
#  9   who          891 non-null    object
#  ... (remaining columns omitted)

The code above loads the Titanic dataset and prints a summary of its columns. In this lesson, we will focus only on the sex and class columns, which contain each passenger's sex and ticket class, respectively.
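
One informal way to spot candidates for conversion is to look for non-numeric columns with only a few distinct values. The sketch below applies that idea to a tiny hand-made DataFrame. The sample rows and the 50% threshold are assumptions chosen for illustration, not a rule defined by Pandas.

```python
import pandas as pd

# Tiny hand-made sample (hypothetical rows, not the real dataset)
sample = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "class": ["Third", "First", "Third", "First"],
    "name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

# Count the distinct values in every non-numeric column
unique_counts = sample.select_dtypes(exclude="number").nunique()

# Illustrative heuristic: keep columns whose distinct values
# cover at most half of the rows
candidates = unique_counts[unique_counts <= 0.5 * len(sample)].index.tolist()
print(candidates)  # ['sex', 'class']
```

A column like name, where almost every value is unique, gains little from the category type, which is why the heuristic leaves it out.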

How to Convert Columns to Categorical Types

Now let's convert the sex and class columns and reprint the DataFrame information.

Python
# Convert 'sex' and 'class' columns to categorical types
titanic['sex'] = titanic['sex'].astype('category')
titanic['class'] = titanic['class'].astype('category')

# After conversion (info() prints its own report, so no print() around it)
print("After Conversion:")
titanic.info()
# After conversion output (first columns only, for brevity)
# Data columns (total 15 columns):
#  #   Column       Non-Null Count  Dtype
# ---  ------       --------------  -----
#  0   survived     891 non-null    int64
#  1   pclass       891 non-null    int64
#  2   sex          891 non-null    category  # <- Changed type
#  3   age          714 non-null    float64
#  4   sibsp        891 non-null    int64
#  5   parch        891 non-null    int64
#  6   fare         891 non-null    float64
#  7   embarked     889 non-null    object
#  8   class        891 non-null    category  # <- Changed type
#  9   who          891 non-null    object
#  ... (remaining columns omitted)

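Notice how sex and class changed from object to category. This confirms the conversion was successful: Pandas now treats these columns as categorical data, which can reduce memory use and speed up some operations. Note that some seaborn versions already load class as a category column; in that case, astype('category') simply leaves it unchanged.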
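
Besides reading the .info() report, the conversion can be verified for a single column in code. The following is a minimal sketch on a small stand-in Series (the four values are hypothetical, not taken from the dataset); the same checks apply to titanic['sex'].

```python
import pandas as pd

# Small stand-in for the 'sex' column (hypothetical values)
sex = pd.Series(["male", "female", "female", "male"]).astype("category")

# The dtype is reported as 'category'
print(sex.dtype)  # category

# A programmatic check that does not rely on comparing strings
is_categorical = isinstance(sex.dtype, pd.CategoricalDtype)
print(is_categorical)  # True

# The distinct categories Pandas stored, sorted by default
categories = sex.cat.categories.tolist()
print(categories)  # ['female', 'male']
```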

Encoding Examples: Label Encoding

Sometimes, you must convert categorical data to numeric codes for machine learning models. Let's see how to encode the sex column with label encoding. It is the simplest encoding: it replaces each category with an integer. Because Pandas sorts the categories alphabetically by default, female becomes 0 and male becomes 1.

Python
# Label encoding the 'sex' column
titanic['sex_code'] = titanic['sex'].cat.codes
print(titanic.head())
# Output:
#    survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
# 0         0       3    male  22.0      1      0   7.2500        S  Third
# 1         1       1  female  38.0      1      0  71.2833        C  First
# 2         1       3  female  26.0      0      0   7.9250        S  Third
# 3         1       1  female  35.0      1      0  53.1000        S  First
# 4         0       3    male  35.0      0      0   8.0500        S  Third
#
#      who  adult_male deck  embark_town alive  alone  sex_code
# 0    man        True  NaN  Southampton    no  False         1
# 1  woman       False    C    Cherbourg   yes  False         0
# 2  woman       False  NaN  Southampton   yes   True         0
# 3  woman       False    C  Southampton   yes  False         0
# 4    man        True  NaN  Southampton    no   True         1

cat.codes is an attribute of the .cat accessor, which is available on any categorical Series. It returns an integer code for each value, namely the position of that value's category in .cat.categories. Categories are sorted by default, so for the sex column they are ['female', 'male'], which gives female the code 0 and male the code 1, as in the output above. Missing values receive the code -1.
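
To make the link between codes and categories concrete, here is a small sketch on a hypothetical four-value Series that includes one missing entry. It also builds a lookup table for translating codes back into labels.

```python
import pandas as pd

# Hypothetical values, including one missing entry
sex = pd.Series(["male", "female", None, "male"], dtype="category")

# Each code is the position of the value's category; missing values get -1
codes = sex.cat.codes.tolist()
print(codes)  # [1, 0, -1, 1]

# Lookup table from code back to label
code_to_label = dict(enumerate(sex.cat.categories))
print(code_to_label)  # {0: 'female', 1: 'male'}
```

Keeping such a lookup table around is handy when a model's numeric output needs to be turned back into readable labels.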

Encoding Examples: One-Hot Encoding

Now, let's see an example of one-hot encoding. This encoding will create a separate column for each category.

Python
# One-hot encoding the 'class' column
# dtype=int yields 0/1 columns; pandas 2.0+ returns True/False by default
titanic_class_dummies = pd.get_dummies(titanic['class'], prefix='class', dtype=int)
titanic = pd.concat([titanic, titanic_class_dummies], axis=1)
print(titanic.head())
# Output:
#    survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
# 0         0       3    male  22.0      1      0   7.2500        S  Third
# 1         1       1  female  38.0      1      0  71.2833        C  First
# 2         1       3  female  26.0      0      0   7.9250        S  Third
# 3         1       1  female  35.0      1      0  53.1000        S  First
# 4         0       3    male  35.0      0      0   8.0500        S  Third
#
#      who  adult_male deck  embark_town alive  alone  sex_code  class_First  \
# 0    man        True  NaN  Southampton    no  False         1            0
# 1  woman       False    C    Cherbourg   yes  False         0            1
# 2  woman       False  NaN  Southampton   yes   True         0            0
# 3  woman       False    C  Southampton   yes  False         0            1
# 4    man        True  NaN  Southampton    no   True         1            0
#
#    class_Second  class_Third
# 0             0            1
# 1             0            0
# 2             0            1
# 3             0            0
# 4             0            1

The pd.get_dummies function creates a separate DataFrame with the encoded values, performing the one-hot encoding. Next, we append this new DataFrame to the original one column-wise using the pd.concat function with axis=1. One-hot encoding creates a new column for each category of class (class_First, class_Second, class_Third), with binary values indicating which category each record belongs to.
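
As an alternative sketch, pd.get_dummies can also take the whole DataFrame together with a columns argument, which encodes the listed columns and returns the result in one step without a separate concat call. The three-row DataFrame below is hypothetical and stands in for the Titanic data. Note that this form drops the original class column, unlike the approach above.

```python
import pandas as pd

# Hypothetical three-row stand-in for the Titanic data
passengers = pd.DataFrame({
    "fare": [7.25, 71.28, 13.00],
    "class": ["Third", "First", "Second"],
})

# Encode 'class' in place of the original column; dtype=int gives 0/1 values
encoded = pd.get_dummies(passengers, columns=["class"], prefix="class", dtype=int)
print(encoded.columns.tolist())
# ['fare', 'class_First', 'class_Second', 'class_Third']
```

Which form to prefer depends on whether the original column is still needed afterwards.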

Lesson Summary

Today, we've learned:

What categorical data is: Data divided into specific categories.
Why it's beneficial to convert to categorical types: For memory efficiency and better performance.
How to perform the conversion: Using the astype('category') method in Pandas.
Encoding examples: Label encoding and one-hot encoding to convert categories into numeric forms.

Now it's time to get hands-on! In the upcoming practice tasks, you'll apply what you've learned. You'll convert columns to categorical types and practice encoding them. This practice will solidify your understanding and build confidence in handling categorical data in Pandas. Let's dive in!
