Data Cleaning Techniques: Working with Categorical Data Encoding and Transformation

Intro to Data Cleaning and Preprocessing with Titanic: Lesson 2

Introduction to Encoding and Transforming Categorical Data

In this lesson, we look at how to encode and transform the categorical data in a dataset. Turning categories into numerical representations makes it possible to build models on datasets that contain categorical variables. The lesson introduces different types of categorical encoding, explains when each is appropriate, and shows how to apply them.

Understanding categorical variable encoding is essential for a wide array of machine-learning tasks. Most algorithms operate on numbers and cannot work directly with text labels such as 'male' or 'Southampton'. By converting these labels into numbers, we translate the data into a format that algorithms can process, and that is what we will cover in this lesson.

Any guesses on the effect that a passenger's gender or embarkation point might have had on their chance of survival? To let a model explore such questions, we use encoding techniques to convert the gender and embarkation point details into a form that a machine learning model can understand.

Gearing Up: Load Libraries and Dataset

The Pandas library provides simple and efficient functions for encoding categorical data, so we will use it throughout this lesson. Seaborn is used here only to load the dataset. Let's begin by loading our libraries and dataset.

Python
import pandas as pd
import seaborn as sns

# Load Titanic dataset
titanic_df = sns.load_dataset('titanic')

The above code will load the Titanic dataset and allow us to transform it using different techniques, shown in the following sections.

Handling Categorical Variables

In this session, we mainly consider two categorical variables from the Titanic dataset: sex and embark_town. Both columns store text labels, which most algorithms cannot use directly, so we convert them to numbers with the encoding techniques described below.

Python
# Display unique categories in 'sex' and 'embark_town'
print(titanic_df['sex'].unique()) # Output: ['male' 'female']
print(titanic_df['embark_town'].unique()) # Output: ['Southampton' 'Cherbourg' 'Queenstown' nan]

This prints all unique categories in the sex and embark_town columns. Note the nan in the second output: embark_town contains missing values. These categories can be encoded as numbers in a few ways, as shown in the following sections.
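
Before encoding, it can also help to see how often each category occurs and how many values are missing. The following is a minimal sketch on a small hand-made DataFrame; sample_df and its rows are a hypothetical stand-in for the Titanic data, not real records.

```python
import pandas as pd

# Hypothetical stand-in for a few Titanic rows (not the real data).
sample_df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "embark_town": ["Southampton", "Cherbourg", None, "Southampton"],
})

# Count each category; dropna=False keeps missing values as their own row.
town_counts = sample_df["embark_town"].value_counts(dropna=False)
print(town_counts)

# Number of missing entries in each column.
missing_per_column = sample_df.isna().sum()
print(missing_per_column)
```

The same two calls should work unchanged on titanic_df once it is loaded.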

Label Encoding with Pandas

Label encoding converts each category in the variable to an integer code. You can accomplish this using the factorize() function in Pandas.

Python
# Label Encoding for 'sex'
titanic_df['sex_encoded'] = pd.factorize(titanic_df['sex'])[0]
print(titanic_df[['sex', 'sex_encoded']].head())
"""
      sex  sex_encoded
0    male            0
1  female            1
2  female            1
3  female            1
4    male            0
"""

In this example, the factorize() function assigns an integer to each category in the sex column, numbering the categories in the order in which they first appear. A new column, sex_encoded, is then created to store these encoded values. If you print out the first few records of the sex and sex_encoded columns, you'll see the male and female categories transformed into 0 and 1, respectively.
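
Because factorize() numbers categories by first appearance, the codes depend on row order. If fixed codes are preferred, one alternative (not the method used in this lesson) is an explicit mapping with Series.map(). The sketch below uses a hypothetical three-row sample_df, and the sex_codes dictionary is an arbitrary choice made for illustration.

```python
import pandas as pd

# Hypothetical sample in which 'female' appears first.
sample_df = pd.DataFrame({"sex": ["female", "male", "female"]})

# factorize numbers categories by first appearance, so 'female' gets 0 here.
codes_by_appearance = pd.factorize(sample_df["sex"])[0]
print(codes_by_appearance.tolist())  # [0, 1, 0]

# An explicit mapping keeps the codes fixed regardless of row order.
sex_codes = {"male": 0, "female": 1}  # hypothetical choice of codes
sample_df["sex_encoded"] = sample_df["sex"].map(sex_codes)
print(sample_df["sex_encoded"].tolist())  # [1, 0, 1]
```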

It is important to note the use of [0] in the code. The factorize() function returns a tuple of two items: the first is an array containing the encoded labels (the actual numerical representation), and the second contains the unique values. By using [0], we take only the first item (the numerical labels) and ignore the unique values.
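
To make the two return values concrete, here is a small sketch that unpacks both on a hypothetical towns Series (hand-made values, not the real column). It also shows that, with default settings, factorize() marks missing values with the code -1.

```python
import pandas as pd

# Hypothetical values, including one missing entry.
towns = pd.Series(["Southampton", "Cherbourg", None, "Southampton"])

# Unpack both return values instead of indexing with [0].
codes, uniques = pd.factorize(towns)
print(codes.tolist())   # [0, 1, -1, 0]
print(list(uniques))    # ['Southampton', 'Cherbourg']

# The unique values let us map codes back to labels; -1 means "missing".
decoded = [uniques[c] if c != -1 else None for c in codes]
print(decoded)
```

Keeping uniques around is useful whenever the encoded column later needs to be translated back into readable labels.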

One-Hot Encoding with Pandas

One-hot encoding is another common method for encoding categorical variables. It creates a binary column for each category in the variable. This is especially useful when there is no ordinal relationship between the categories, just like in our embark_town example. Pandas' get_dummies() is used for this:

Python
# One-Hot Encoding for 'embark_town'
encoded_df = pd.get_dummies(titanic_df['embark_town'], prefix='town')
titanic_df = pd.concat([titanic_df, encoded_df], axis=1)
print(titanic_df.head())
"""
   survived  pclass     sex  ...  town_Cherbourg  town_Queenstown  town_Southampton
0         0       3    male  ...           False            False              True
1         1       1  female  ...            True            False             False
2         1       3  female  ...           False            False              True
3         1       1  female  ...           False            False              True
4         0       3    male  ...           False            False              True
"""

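This script creates three additional columns, town_Cherbourg, town_Queenstown, and town_Southampton, one for each category in embark_town. As the output shows, the new columns hold Boolean values: True marks the category a row belongs to and False marks the others, which algorithms can treat as 1 and 0. Rows where embark_town is missing get False in all three columns.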
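
If integer 0/1 columns or an explicit indicator for missing values are preferred, get_dummies() accepts the dtype and dummy_na parameters. The following is a minimal sketch on a hypothetical four-row sample_df (not the real data), offered as one possible variation rather than the lesson's method.

```python
import pandas as pd

# Hypothetical sample with one missing embarkation town.
sample_df = pd.DataFrame({
    "embark_town": ["Southampton", "Cherbourg", None, "Queenstown"],
})

# Default handling of missing values: the missing row is all zeros.
plain = pd.get_dummies(sample_df["embark_town"], prefix="town", dtype=int)
print(plain)

# dummy_na=True adds one extra indicator column for missing values.
with_na = pd.get_dummies(
    sample_df["embark_town"], prefix="town", dtype=int, dummy_na=True
)
print(with_na)
```

With dummy_na=True, every row has exactly one indicator set, so missing values are no longer indistinguishable from "none of the above".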

Wrapping Up the Lesson

Well done! In this lesson, you learned to encode categorical data using two schemes, Label Encoding and One-Hot Encoding, and when to apply each of them.

To summarize, we inspected the categorical columns of the Titanic dataset, converted sex to integer codes with factorize(), expanded embark_town into indicator columns with get_dummies(), and added the results to our Titanic dataset.

Ready to Practice?

Having completed this lesson, you are now equipped to transform categorical data. The upcoming practice exercises will allow you to apply these concepts. They are designed to solidify your understanding and prepare you for more complex tasks in data science and analysis. With sex and embark_town encoded, you are ready to investigate whether gender or embarkation point is associated with survival. Good luck!
