Welcome! Today, we're learning about Encoding Categorical Features. Have you ever thought about how computers understand things like colors, car brands, or animal types? These are categorical features. Computers are good at understanding numbers but not words, so we convert these words into numbers. This process is called encoding.
Our goal is to understand categorical features, why they need encoding, and how to use `OneHotEncoder` and `LabelEncoder` from scikit-learn to do this. By the end, you'll be able to transform categorical data into numerical data for machine learning.
First, let's understand categorical features. Think about categories you see daily, like different types of fruits (apple, banana, cherry) or car colors (red, blue, green). These are examples of categorical features because they represent groups. In machine learning, these features must be converted to numbers to be understood.
Why encode these features? Machine learning algorithms only work with numerical data. It's like translating a book to another language; we convert categorical features to numbers so our models can "read" the data.
If a dataset includes car colors like Red, Blue, and Green, our model won't understand these words. We transform them into numbers for the model to use.
One-hot encoding is a method to convert categorical data into a numerical format by creating binary columns for each category. Each column represents one category and contains a `1` if the category is present and a `0` if it is not. Let's look at an example for a better understanding. We will encode data with `OneHotEncoder` step by step.
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = {'Feature': ['A', 'B', 'C', 'A']}
df = pd.DataFrame(data)
```
We import Pandas and `OneHotEncoder` from scikit-learn. Pandas handles data, and `OneHotEncoder` converts categorical features to numbers.
Then, we create a small dataset with the letters `A`, `B`, `C`, and `A`, which will be our categories. Though this particular dataset is just an example, you may encounter something similar in real data. Imagine processing data about IT company offices, where each office is assigned a class: `A`, `B`, or `C`!
```python
encoder = OneHotEncoder(sparse_output=False)
```
We create an `encoder` object. The parameter `sparse_output=False` gives us a dense output, which is easier to read.
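To see what the default behavior looks like instead, here is a minimal sketch (reusing the same `df` as above): without `sparse_output=False`, `fit_transform` returns a SciPy sparse matrix, which you can densify with `.toarray()`.

```python
# Default behavior: sparse_output is True, so the result is a
# SciPy sparse matrix (CSR format) rather than a NumPy array
sparse_encoder = OneHotEncoder()
sparse_result = sparse_encoder.fit_transform(df)
print(type(sparse_result))      # a scipy.sparse CSR matrix
print(sparse_result.toarray())  # densify to a regular NumPy array
```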
```python
encoded_data = encoder.fit_transform(df)
```
We fit the encoder to our data and transform it. `fit` learns the categories, and `transform` converts the data into numbers.
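As a side note, `fit_transform` is just a shortcut: you can also call the two steps separately, which is handy when you fit on training data and later only transform new data. A minimal sketch:

```python
# Equivalent two-step version: fit learns the categories from the data,
# then transform applies that learned mapping
encoder.fit(df)
encoded_data = encoder.transform(df)
```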
```python
columns = encoder.get_feature_names_out(df.columns)
encoded_df = pd.DataFrame(encoded_data, columns=columns)
print(encoded_df)
```
This produces a DataFrame that looks like this:
```
   Feature_A  Feature_B  Feature_C
0        1.0        0.0        0.0
1        0.0        1.0        0.0
2        0.0        0.0        1.0
3        1.0        0.0        0.0
```
Each column represents one original category, and each row shows if that category was present.
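You can also inspect what `fit` learned through the `categories_` attribute, and reuse the fitted encoder on new rows. A small sketch (the `new_df` name is just for illustration); note that transforming an unseen category raises an error unless the encoder was created with `handle_unknown='ignore'`:

```python
print(encoder.categories_)  # [array(['A', 'B', 'C'], dtype=object)]

# Reuse the fitted encoder on new data (new_df is a hypothetical example)
new_df = pd.DataFrame({'Feature': ['B', 'A']})
print(encoder.transform(new_df))
# [[0. 1. 0.]
#  [1. 0. 0.]]

# To tolerate categories not seen during fit, you could instead create:
# OneHotEncoder(sparse_output=False, handle_unknown='ignore')
```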
In some cases, you might want to avoid generating a binary column for every category to prevent multicollinearity: the full set of one-hot columns always sums to 1, so any one of them is redundant (the so-called dummy variable trap). The `drop` parameter in `OneHotEncoder` helps with this by allowing you to specify which category to drop. Here's how to use the `drop` parameter with our existing example:
```python
encoder = OneHotEncoder(sparse_output=False, drop='first')
```
By setting `drop='first'`, we instruct the encoder to drop the first category (in this case, 'A') from the encoding. Let's see the result:
```python
encoded_data = encoder.fit_transform(df)
columns = encoder.get_feature_names_out(df.columns)
encoded_df = pd.DataFrame(encoded_data, columns=columns)
print(encoded_df)
```
The resulting DataFrame will look like this:
```
   Feature_B  Feature_C
0        0.0        0.0
1        1.0        0.0
2        0.0        1.0
3        0.0        0.0
```
Here, 'A' has been dropped, and only 'B' and 'C' are encoded; a row of all zeros now means the category was 'A'. This approach preserves the information while reducing redundancy in your dataset.
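If you want to double-check that no information was lost, a fitted `OneHotEncoder` also provides `inverse_transform`, which maps encoded rows back to the original labels even when a category was dropped. A quick sketch using the `encoder` and `encoded_data` from above:

```python
# Recover the original labels from the encoded rows;
# the all-zeros rows decode back to the dropped category 'A'
print(encoder.inverse_transform(encoded_data))
# [['A']
#  ['B']
#  ['C']
#  ['A']]
```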
Sometimes, you might have a dataset with multiple columns, but you only want to encode specific categorical columns. You can achieve this by directly accessing and transforming the specified columns.
To use `OneHotEncoder` on a specific column, you can fit and transform that column separately and then concatenate it back to the original DataFrame.
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Original dataset
data = {
    'Category': ['A', 'B', 'C', 'A'],
    'Value': [10, 20, 30, 40]
}
df = pd.DataFrame(data)

# Initializing the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the 'Category' column
encoded_category = encoder.fit_transform(df[['Category']])

# Create a DataFrame for the encoded columns
encoded_columns = encoder.get_feature_names_out(['Category'])
encoded_df = pd.DataFrame(encoded_category, columns=encoded_columns)

# Concatenate the encoded columns back to the original DataFrame
df_encoded = pd.concat([encoded_df, df.drop('Category', axis=1)], axis=1)
print(df_encoded)
```
This will produce a DataFrame that looks like:
```
   Category_A  Category_B  Category_C  Value
0         1.0         0.0         0.0     10
1         0.0         1.0         0.0     20
2         0.0         0.0         1.0     30
3         1.0         0.0         0.0     40
```
Notice that only the 'Category' column is encoded, while the 'Value' column remains unchanged.
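As an alternative worth knowing, scikit-learn's `ColumnTransformer` can handle this select-encode-concatenate pattern for you. A minimal sketch, reusing the `df` with 'Category' and 'Value' columns from above:

```python
from sklearn.compose import ColumnTransformer

# One-hot encode 'Category', pass 'Value' through unchanged
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(sparse_output=False), ['Category'])],
    remainder='passthrough'
)
transformed = ct.fit_transform(df)
print(transformed)
# [[ 1.  0.  0. 10.]
#  [ 0.  1.  0. 20.]
#  [ 0.  0.  1. 30.]
#  [ 1.  0.  0. 40.]]
```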
While `OneHotEncoder` is useful for many categories, sometimes you might want to use label encoding. This method assigns a unique number to each category, which can be simpler but may imply an order. We import it in the same way as `OneHotEncoder`:
```python
from sklearn.preprocessing import LabelEncoder
```
Working with it is very similar: it has the same `fit_transform` method:
```python
# Recreate the single-column DataFrame from the first example,
# since df was reassigned in the previous section
df = pd.DataFrame({'Feature': ['A', 'B', 'C', 'A']})

label_encoder = LabelEncoder()
label_encoded_data = label_encoder.fit_transform(df['Feature'])
print(label_encoded_data)  # [0 1 2 0]
```
This converts our categorical data into numbers: `'A'` is encoded as `0`, `'B'` as `1`, and `'C'` as `2`.
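A fitted `LabelEncoder` stores the learned mapping in its `classes_` attribute, and `inverse_transform` reverses the encoding. A quick sketch using the `label_encoder` from above:

```python
# The position of each class in classes_ is its encoded number
print(label_encoder.classes_)                         # ['A' 'B' 'C']
print(label_encoder.inverse_transform([0, 1, 2, 0]))  # ['A' 'B' 'C' 'A']
```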
`OneHotEncoder` is helpful when you have multiple categories, like movie genres (Action, Comedy, Drama), to avoid implying any order or importance. While `LabelEncoder` can be simpler, it may mislead the model by implying an order when there isn't one. However, it can be useful when dealing with ordinal data, where the categorical feature has a natural order (like ratings: bad, average, good). Additionally, `LabelEncoder` is more memory-efficient and computationally faster for algorithms that can handle numeric representations of the categories directly.
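One caveat: `LabelEncoder` assigns numbers in sorted (alphabetical) order, so it won't necessarily respect a natural order like bad < average < good. If you need an explicit order, scikit-learn's `OrdinalEncoder` lets you spell it out. A minimal sketch with a hypothetical ratings column:

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature with an explicitly specified order
ratings = pd.DataFrame({'Rating': ['bad', 'good', 'average', 'bad']})
ordinal_encoder = OrdinalEncoder(categories=[['bad', 'average', 'good']])
print(ordinal_encoder.fit_transform(ratings))
# [[0.]
#  [2.]
#  [1.]
#  [0.]]
```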
Today, we explored categorical features and why they need encoding for machine learning models. We learned about `OneHotEncoder` and `LabelEncoder` and saw examples of how to convert categorical data into numerical data. You now understand how to use both encoders to preprocess your data for machine learning models.
Now, it's time for practice! In the next part, you'll apply `OneHotEncoder` and `LabelEncoder` to different datasets to get hands-on experience. This practice will help solidify what you've learned and prepare you for working with real-world data. Good luck!