Hello, Space Voyager! Today, we're venturing through a fascinating territory: Categorical Data Encoding! Categorical data consists of groups or traits such as "gender", "marital status", or "hometown". We convert these categories into numbers using the Label Encoding and One-Hot Encoding techniques for our machine-learning mates.
Label Encoding maps categories to integers from 0 through N-1, where N is the number of unique categories. It suits ordered data like "Small", "Medium", and "Large", because the numeric order can mirror the category order.
To illustrate, here is a Python list of shirt sizes:
```python
sizes = ["Small", "Medium", "Large"]
```
Python's Pandas library can be used to assign 0 to "Small", 1 to "Medium", and 2 to "Large":
```python
import pandas as pd


df = pd.DataFrame({
    'item_id': [1302, 1440, 1220, 2038, 1102],
    'sizes': ['Small', 'Medium', 'Large', 'Small', 'Medium']
})

size_mapping = {"Small": 0, "Medium": 1, "Large": 2}
df['sizes'] = df['sizes'].map(size_mapping)  # Apply mapping to the specified column
print(df)
'''Output:
   item_id  sizes
0     1302      0
1     1440      1
2     1220      2
3     2038      0
4     1102      1
'''
```
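In this example, we define the mapping in the most natural way for it: as a dictionary. Then, we apply this mapping using the column's .map method (each DataFrame column is a pandas Series). Any value missing from the dictionary would become NaN.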
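As an alternative to a hand-written dictionary, the same encoding can be sketched with an ordered categorical type. This is only one possible approach, not the lesson's method; the variable names below are illustrative. The category order is stated explicitly so that "Small" gets 0, "Medium" gets 1, and "Large" gets 2.

```python
import pandas as pd

sizes = pd.Series(["Small", "Medium", "Large", "Small", "Medium"])

# Declare the order explicitly; otherwise pandas would sort the labels
# alphabetically (Large, Medium, Small) and assign codes in that order.
size_dtype = pd.CategoricalDtype(categories=["Small", "Medium", "Large"], ordered=True)

# .cat.codes returns the integer position of each value in the category list.
codes = sizes.astype(size_dtype).cat.codes

print(codes.tolist())  # [0, 1, 2, 0, 1]
```

With this approach, a value that is not in the declared category list is encoded as -1 rather than NaN.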
One-Hot Encoding creates one additional column per category, placing a 1 in the column that matches the row's category and 0s elsewhere. It's preferred for nominal data, where order doesn't matter, such as "Red", "Green", and "Blue".
You can perform one-hot encoding with the pandas function pd.get_dummies():
```python
import pandas as pd

df = pd.DataFrame({
    'item_id': [1302, 1440, 1220, 2038, 1102],
    'colors': ['Red', 'Green', 'Blue', 'Red', 'Green']
})

df = pd.get_dummies(df, columns=['colors'])  # One-hot encode specified column
print(df)
'''Output:
   item_id  colors_Blue  colors_Green  colors_Red
0     1302        False         False        True
1     1440        False          True       False
2     1220         True         False       False
3     2038        False         False        True
4     1102        False          True       False
'''
```
Because One-Hot Encoding converts each category value into a new column and assigns a 1 or 0 (True/False) to that column, it does not impose an ordinal relationship among categories where there is none. This is often the case with labels like 'Red', 'Blue', and 'Green': each category is distinct, and there is no order among them. Converting these labels to numbers with Label Encoding would imply an order, while One-Hot Encoding does not. That matters for models that treat numeric magnitude as meaningful, such as linear models.
Finally, let's address the potential pitfalls of encoding. Label Encoding can create an unintended order, which may mislead our model. One-Hot Encoding adds one column per unique category, so it can slow down our model when a feature has many unique categories. Consider merging select categories or using different encoding techniques to combat these issues.
For instance, the 'Species' feature in an 'Animal Shelter' dataset can be restructured to address such problems. Instead of Label Encoding or One-Hot Encoding each unique species like 'Dog', 'Cat', 'Rabbit', and 'Bird', we can merge 'Dog' and 'Cat' into a new category 'Pet', and 'Rabbit' and 'Bird' into 'Other'. This technique reduces our feature's unique categories, making it more model-friendly.
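The merge described above can be sketched with the same .map approach used for Label Encoding. The DataFrame, its rows, and the column names here are hypothetical stand-ins for an animal-shelter dataset, not real data.

```python
import pandas as pd

# Hypothetical animal-shelter rows, for illustration only.
shelter = pd.DataFrame({
    'species': ['Dog', 'Cat', 'Rabbit', 'Bird', 'Dog', 'Cat']
})

# Map each original species to its merged group.
species_groups = {'Dog': 'Pet', 'Cat': 'Pet', 'Rabbit': 'Other', 'Bird': 'Other'}
shelter['species_group'] = shelter['species'].map(species_groups)

# One-hot encoding the merged feature yields 2 columns instead of 4.
encoded = pd.get_dummies(shelter['species_group'], prefix='species', dtype=int)

print(encoded.columns.tolist())  # ['species_Other', 'species_Pet']
```

The trade-off is that the model can no longer tell a 'Dog' from a 'Cat', so merging works best when the grouped categories behave similarly for the task at hand.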
Bravo, Voyager! You've navigated through the realm of Categorical Data Encoding, mastering Label Encoding and One-Hot Encoding, and gaining insights into pitfalls and best practices. These acquired skills are bona fide assets! Now, gear up for some hands-on work, where you'll practice your newly learned encoding skills on real-world datasets. See you there!