Welcome! Today, we're learning about Encoding Categorical Features. Have you ever thought about how computers understand things like colors, car brands, or animal types? These are categorical features. Computers are good at understanding numbers but not words, so we convert these words into numbers. This process is called encoding.
Our goal is to understand categorical features, why they need encoding, and how to use `OneHotEncoder` and `LabelEncoder` from scikit-learn to do this. By the end, you'll be able to transform categorical data into numerical data for machine learning.
First, let's understand categorical features. Think about categories you see daily, like different types of fruits (apple, banana, cherry) or car colors (red, blue, green). These are examples of categorical features because they represent groups. In machine learning, these features must be converted to numbers to be understood.
Why encode these features? Machine learning algorithms only work with numerical data. It's like translating a book to another language; we convert categorical features to numbers so our models can "read" the data.
If a dataset includes car colors like Red, Blue, and Green, our model won't understand these words. We transform them into numbers for the model to use.
One-hot encoding is a method to convert categorical data into a numerical format by creating binary columns for each category. Each column represents one category and contains a `1` if the category is present and a `0` if it is not. Let's look at an example for a better understanding. We will encode data with `OneHotEncoder` step by step.
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = {'Feature': ['A', 'B', 'C', 'A']}
df = pd.DataFrame(data)
```
We import Pandas and `OneHotEncoder` from scikit-learn. Pandas handles data, and `OneHotEncoder` converts categorical features to numbers.
Then, we create a small dataset with the letters `A`, `B`, `C`, and `A`, which will be our categories. Though this particular dataset is just an example, you may encounter something similar in real data. Imagine processing data about IT company offices, where each office is assigned a class: `A`, `B`, or `C`!
```python
encoder = OneHotEncoder(sparse_output=False)
```
We create an `encoder` object. The parameter `sparse_output=False` gives us a dense output, which is easier to read.
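To see what the default behavior looks like instead, here is a minimal sketch (reusing the same `df` as above): without `sparse_output=False`, `fit_transform` returns a SciPy sparse matrix, which you can densify with `.toarray()`.

```python
# Default behavior: sparse_output is True, so the result is a
# SciPy sparse matrix (CSR format) rather than a NumPy array
sparse_encoder = OneHotEncoder()
sparse_result = sparse_encoder.fit_transform(df)
print(type(sparse_result))      # a scipy.sparse CSR matrix
print(sparse_result.toarray())  # densify to a regular NumPy array
```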
```python
encoded_data = encoder.fit_transform(df)
```
We fit the encoder to our data and transform it. `fit` learns the categories, and `transform` converts the data into numbers.
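As a side note, `fit_transform` is just a shortcut: you can also call the two steps separately, which is handy when you fit on training data and later only transform new data. A minimal sketch:

```python
# Equivalent two-step version: fit learns the categories from the data,
# then transform applies that learned mapping
encoder.fit(df)
encoded_data = encoder.transform(df)
```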
```python
columns = encoder.get_feature_names_out(df.columns)
encoded_df = pd.DataFrame(encoded_data, columns=columns)
print(encoded_df)
```
This produces a DataFrame that looks like this:
```
   Feature_A  Feature_B  Feature_C
0        1.0        0.0        0.0
1        0.0        1.0        0.0
2        0.0        0.0        1.0
3        1.0        0.0        0.0
```
Each column represents one original category, and each row shows if that category was present.
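You can also inspect what `fit` learned through the `categories_` attribute, and reuse the fitted encoder on new rows. A small sketch (the `new_df` name is just for illustration); note that transforming an unseen category raises an error unless the encoder was created with `handle_unknown='ignore'`:

```python
print(encoder.categories_)  # [array(['A', 'B', 'C'], dtype=object)]

# Reuse the fitted encoder on new data (new_df is a hypothetical example)
new_df = pd.DataFrame({'Feature': ['B', 'A']})
print(encoder.transform(new_df))
# [[0. 1. 0.]
#  [1. 0. 0.]]

# To tolerate categories not seen during fit, you could instead create:
# OneHotEncoder(sparse_output=False, handle_unknown='ignore')
```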
In some cases, you might want to avoid generating a binary column for every category to prevent multicollinearity: the full set of one-hot columns always sums to 1, so any one of them is redundant (the so-called dummy variable trap). The `drop` parameter in `OneHotEncoder` helps with this by allowing you to specify which category to drop. Here's how to use the `drop` parameter with our existing example:
```python
encoder = OneHotEncoder(sparse_output=False, drop='first')
```
By setting `drop='first'`, we instruct the encoder to drop the first category (in this case, 'A') from the encoding. Let's see the result:
```python
encoded_data = encoder.fit_transform(df)
columns = encoder.get_feature_names_out(df.columns)
encoded_df = pd.DataFrame(encoded_data, columns=columns)
print(encoded_df)
```
The resulting DataFrame will look like this:
```
   Feature_B  Feature_C
0        0.0        0.0
1        1.0        0.0
2        0.0        1.0
3        0.0        0.0
```
Here, 'A' has been dropped, and only 'B' and 'C' are encoded; a row of all zeros now means the category was 'A'. This approach preserves the information while reducing redundancy in your dataset.
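If you want to double-check that no information was lost, a fitted `OneHotEncoder` also provides `inverse_transform`, which maps encoded rows back to the original labels even when a category was dropped. A quick sketch using the `encoder` and `encoded_data` from above:

```python
# Recover the original labels from the encoded rows;
# the all-zeros rows decode back to the dropped category 'A'
print(encoder.inverse_transform(encoded_data))
# [['A']
#  ['B']
#  ['C']
#  ['A']]
```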
Sometimes, you might have a dataset with multiple columns, but you only want to encode specific categorical columns. You can achieve this by directly accessing and transforming the specified columns.
To use `OneHotEncoder` on a specific column, you can fit and transform that column separately and then concatenate it back to the original DataFrame.
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Original dataset
data = {
    'Category': ['A', 'B', 'C', 'A'],
    'Value': [10, 20, 30, 40]
}
df = pd.DataFrame(data)

# Initializing the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the 'Category' column
encoded_category = encoder.fit_transform(df[['Category']])

# Create a DataFrame for the encoded columns
encoded_columns = encoder.get_feature_names_out(['Category'])
encoded_df = pd.DataFrame(encoded_category, columns=encoded_columns)

# Concatenate the encoded columns back to the original DataFrame
df_encoded = pd.concat([encoded_df, df.drop('Category', axis=1)], axis=1)
print(df_encoded)
```
This will produce a DataFrame that looks like:
```
   Category_A  Category_B  Category_C  Value
0         1.0         0.0         0.0     10
1         0.0         1.0         0.0     20
2         0.0         0.0         1.0     30
3         1.0         0.0         0.0     40
```
Notice that only the 'Category' column is encoded, while the 'Value' column remains unchanged.
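As an alternative worth knowing, scikit-learn's `ColumnTransformer` can handle this select-encode-concatenate pattern for you. A minimal sketch, reusing the `df` with 'Category' and 'Value' columns from above:

```python
from sklearn.compose import ColumnTransformer

# One-hot encode 'Category', pass 'Value' through unchanged
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(sparse_output=False), ['Category'])],
    remainder='passthrough'
)
transformed = ct.fit_transform(df)
print(transformed)
# [[ 1.  0.  0. 10.]
#  [ 0.  1.  0. 20.]
#  [ 0.  0.  1. 30.]
#  [ 1.  0.  0. 40.]]
```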
While `OneHotEncoder` is useful for many categories, sometimes you might want to use label encoding. This method assigns a unique number to each category, which can be simpler but may imply an order. We import it in the same way as `OneHotEncoder`:
```python
from sklearn.preprocessing import LabelEncoder
```
Working with it is very similar: it has the same `fit_transform` method:
```python
# Recreate the single-column DataFrame from the first example,
# since df was reassigned in the previous section
df = pd.DataFrame({'Feature': ['A', 'B', 'C', 'A']})

label_encoder = LabelEncoder()
label_encoded_data = label_encoder.fit_transform(df['Feature'])
print(label_encoded_data)  # [0 1 2 0]
```
This converts our categorical data into numbers: `'A'` is encoded as `0`, `'B'` as `1`, and `'C'` as `2`.
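A fitted `LabelEncoder` stores the learned mapping in its `classes_` attribute, and `inverse_transform` reverses the encoding. A quick sketch using the `label_encoder` from above:

```python
# The position of each class in classes_ is its encoded number
print(label_encoder.classes_)                         # ['A' 'B' 'C']
print(label_encoder.inverse_transform([0, 1, 2, 0]))  # ['A' 'B' 'C' 'A']
```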
`OneHotEncoder` is helpful when you have multiple categories, like movie genres (Action, Comedy, Drama), to avoid implying any order or importance. While `LabelEncoder` can be simpler, it may mislead the model by implying an order when there isn't one. However, it can be useful when dealing with ordinal data, where the categorical feature has a natural order (like ratings: bad, average, good). Additionally, `LabelEncoder` is more memory-efficient and computationally faster for algorithms that can handle numeric representations of the categories directly.
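One caveat: `LabelEncoder` assigns numbers in sorted (alphabetical) order, so it won't necessarily respect a natural order like bad < average < good. If you need an explicit order, scikit-learn's `OrdinalEncoder` lets you spell it out. A minimal sketch with a hypothetical ratings column:

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature with an explicitly specified order
ratings = pd.DataFrame({'Rating': ['bad', 'good', 'average', 'bad']})
ordinal_encoder = OrdinalEncoder(categories=[['bad', 'average', 'good']])
print(ordinal_encoder.fit_transform(ratings))
# [[0.]
#  [2.]
#  [1.]
#  [0.]]
```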
Today, we explored categorical features and why they need encoding for machine learning models. We learned about `OneHotEncoder` and `LabelEncoder` and saw examples of how to convert categorical data into numerical data. You now understand how to use both encoders to preprocess your data for machine learning models.
Now, it's time for practice! In the next part, you'll apply `OneHotEncoder` and `LabelEncoder` to different datasets to get hands-on experience. This practice will help solidify what you've learned and prepare you for working with real-world data. Good luck!