In this lesson, we look at how to encode and transform the categorical data in a dataset. Converting categories into numerical representations makes it possible to build models on datasets that contain categorical variables. The lesson introduces two common categorical encodings, explains when each is appropriate, and shows how to apply them.
Understanding categorical variable encoding is essential for a wide range of machine-learning tasks. Most machine-learning algorithms operate on numbers and cannot work with text labels directly. By converting text categories into numbers, we translate the data into a format that algorithms can process, and that is what this lesson covers.
What effect might a Titanic passenger's gender or embarkation point have on their survival rate? To let a model explore questions like this, we use encoding techniques to convert the gender and embarkation point details into a form that a machine learning model can understand.
You could encode categories by hand in plain Python, but the Pandas library offers efficient, simple functions for the job. Let's begin by loading our libraries and dataset.
```python
import pandas as pd
import seaborn as sns

# Load Titanic dataset
titanic_df = sns.load_dataset('titanic')
```
The above code will load the Titanic dataset and allow us to transform it using different techniques, shown in the following sections.
In this session, we focus on two categorical variables from the Titanic dataset: `sex` and `embark_town`. These columns hold text values, which most algorithms cannot use directly, so we apply encoding techniques to convert them.
```python
# Display unique categories in 'sex' and 'embark_town'
print(titanic_df['sex'].unique())          # Output: ['male' 'female']
print(titanic_df['embark_town'].unique())  # Output: ['Southampton' 'Cherbourg' 'Queenstown' nan]
```
This prints all unique categories in the `sex` and `embark_town` columns. Note that `embark_town` also contains missing values, shown as `nan`. These categories can be encoded as numbers in a few ways, as shown in the following sections.
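Before encoding, it can help to see how often each category occurs and how many values are missing. The following is a minimal sketch on a small hand-made Series; the values are illustrative stand-ins, not the real Titanic column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'embark_town' column (not the real data)
towns = pd.Series(
    ["Southampton", "Cherbourg", "Southampton", np.nan, "Queenstown"],
    name="embark_town",
)

# dropna=False keeps missing values in the tally instead of silently dropping them
counts = towns.value_counts(dropna=False)
print(counts)

# Count the missing entries explicitly
n_missing = towns.isna().sum()
print(n_missing)
```

Knowing whether a column has missing values matters because the encoders shown in this lesson each treat them in their own way.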
Label encoding converts each category in the variable to a numerical value. You can accomplish this using the `factorize()` function in Pandas.
```python
# Label Encoding for 'sex'
titanic_df['sex_encoded'] = pd.factorize(titanic_df['sex'])[0]
print(titanic_df[['sex', 'sex_encoded']].head())
"""
      sex  sex_encoded
0    male            0
1  female            1
2  female            1
3  female            1
4    male            0
"""
```
In this example, the `factorize()` function assigns a numerical value to each category in the `sex` column, numbering the categories in the order they first appear. A new column, `sex_encoded`, stores these encoded values. Printing the first few records of the `sex` and `sex_encoded` columns shows the male and female categories transformed into 0 and 1, respectively.
Note the use of `[0]` in the code. The `factorize()` function returns two items: the first is an array containing the encoded labels (the actual numerical representation), and the second contains the unique values. By using `[0]`, we keep only the first item (the numerical labels) and ignore the unique values.
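To make the two return values concrete, here is a minimal sketch that unpacks both instead of indexing with `[0]`. It uses a small hand-made Series as a stand-in for `embark_town` (the values are illustrative, not the real data), and the `decoded` list is a hypothetical helper added only to show the round trip.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'embark_town' column (not the real data)
towns = pd.Series(["Southampton", "Cherbourg", "Southampton", np.nan, "Queenstown"])

# factorize() returns a (codes, uniques) pair
codes, uniques = pd.factorize(towns)

print(codes)    # one integer code per row; missing values are coded as -1
print(uniques)  # the categories, in order of first appearance

# Map the codes back to their labels, leaving missing values as None
decoded = [uniques[c] if c >= 0 else None for c in codes]
print(decoded)
```

Keeping `uniques` around is useful when you later need to translate the codes back into readable labels. The `-1` code for missing values is also worth checking for before feeding the column to a model.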
One-hot encoding is another common method for encoding categorical variables. It creates a binary column for each category in the variable. This is especially useful when there is no ordinal relationship between the categories, as is the case for `embark_town`. Pandas' `get_dummies()` function is used for this:
```python
# One-Hot Encoding for 'embark_town'
encoded_df = pd.get_dummies(titanic_df['embark_town'], prefix='town')
titanic_df = pd.concat([titanic_df, encoded_df], axis=1)
print(titanic_df.head())
"""
   survived  pclass     sex  ...  town_Cherbourg  town_Queenstown  town_Southampton
0         0       3    male  ...           False            False              True
1         1       1  female  ...            True            False             False
2         1       3  female  ...           False            False              True
3         1       1  female  ...           False            False              True
4         0       3    male  ...           False            False              True
"""
```
This script creates three additional columns, `town_Cherbourg`, `town_Queenstown`, and `town_Southampton`, one for each of the three categories in `embark_town`. Each row gets `True` in the column for its category and `False` in the others (older Pandas versions show 1 and 0 instead), giving algorithms a representation they can work with.
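If you prefer 0/1 integers over booleans, or want missing values to get their own indicator column, `get_dummies()` accepts `dtype` and `dummy_na` arguments. The sketch below illustrates both on a small hand-made DataFrame; the data is an illustrative stand-in, not the real Titanic table.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic 'embark_town' column (not the real data)
df = pd.DataFrame({"embark_town": ["Southampton", "Cherbourg", np.nan, "Queenstown"]})

# dtype=int yields 0/1 instead of False/True;
# dummy_na=True adds one extra indicator column for missing values
dummies = pd.get_dummies(df["embark_town"], prefix="town", dtype=int, dummy_na=True)
print(dummies)
```

Without `dummy_na=True`, a row with a missing `embark_town` simply gets 0 (or `False`) in every town column, so whether to add the extra column depends on whether missingness is informative for your model.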
Well done! In this lesson, you have learned to encode and transform categorical data using different encoding schemes, such as Label Encoding and One-Hot Encoding. You have learned the how and when of applying these encoding strategies to facilitate effective data preprocessing.
To summarize, we explored how to handle categorical data, converted two categorical columns to numbers using Label Encoding and One-Hot Encoding, and updated our Titanic dataset with the results.
Having completed this lesson, you are now equipped to transform categorical data. The upcoming practice exercises will let you apply these concepts. They are designed to solidify your understanding and prepare you for more complex tasks in data science and analysis. With these columns encoded, you are ready to explore whether gender or embarkation point is associated with survival rate. Good luck!