Hello and welcome! In this lesson, you will learn how to convert categorical data into ordered categorical types using the Diamonds dataset from the seaborn library. Once a column is ordered, pandas can sort and compare its values by rank rather than alphabetically, which makes analysis and visualization more meaningful.
Categorical data is data that can be divided into groups or categories. For example, the grades students receive (A, B, C, etc.), types of cars (SUV, Sedan, Truck), and the levels of satisfaction in a survey (Poor, Fair, Good, Very Good, Excellent) are all examples of categorical data.
In the Diamonds dataset, we have categorical columns such as `cut`, `color`, and `clarity`:
- `cut` describes the quality of the diamond cut (e.g., Fair, Good, Very Good, Premium, Ideal).
- `color` indicates the color grading of a diamond (e.g., D, E, F, G, H, I, J).
- `clarity` represents the clarity of the diamond (e.g., I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF).
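Before assigning any order, it can help to look at which labels a column actually contains. The snippet below is a minimal sketch that uses a small hand-made DataFrame named `sample` (a hypothetical stand-in, not the real Diamonds data) so that it runs without downloading anything.

```python
import pandas as pd

# Hypothetical five-row sample that mimics the three categorical columns;
# it is illustrative only and not taken from the real Diamonds dataset.
sample = pd.DataFrame({
    'cut': ['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'],
    'color': ['E', 'E', 'J', 'H', 'D'],
    'clarity': ['SI2', 'SI1', 'VS1', 'VVS2', 'IF'],
})

# Distinct labels in a column, in order of first appearance
cut_labels = list(sample['cut'].unique())
print(cut_labels)

# How often each label occurs
color_counts = sample['color'].value_counts()
print(color_counts)
```

Note that the labels appear in the order they occur in the data, which carries no information about quality.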
Converting categorical data to ordered types is essential for several reasons:
- Sorting: Ordered categorical data can be sorted meaningfully.
- Analysis: Rank-based comparisons (for example, selecting cuts better than Good) and plots with correctly ordered axes depend on the categories having a defined order.
- Representation: Ordered types provide a clear hierarchy or ranking for categorical variables.
For example, in the context of diamond quality:
- Cut: Fair < Good < Very Good < Premium < Ideal
- Color: J < I < H < G < F < E < D
- Clarity: I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF
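To see why the order matters for sorting, the sketch below compares plain strings with an ordered categorical on a few hand-picked `cut` labels. The variable names (`cuts`, `cut_order`) are illustrative assumptions; the conversion itself is covered in the steps that follow.

```python
import pandas as pd

# A few cut labels in arbitrary order (illustrative values only)
cuts = pd.Series(['Premium', 'Fair', 'Ideal', 'Good', 'Very Good'])

# Plain strings sort alphabetically, which ignores quality
alphabetical = cuts.sort_values().tolist()
print(alphabetical)  # ['Fair', 'Good', 'Ideal', 'Premium', 'Very Good']

# An ordered categorical sorts by the declared rank instead
cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
ordered_cuts = pd.Series(pd.Categorical(cuts, categories=cut_order, ordered=True))
ranked = ordered_cuts.sort_values().tolist()
print(ranked)  # ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
```

Alphabetical sorting places Ideal before Premium and Very Good last, whereas the ordered version follows the quality ranking.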
To convert the categorical columns in our dataset to ordered types, follow these steps:
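- Define the category order: With the dataset loaded, specify the order of the categories for `cut`, `color`, and `clarity`, listing each from lowest to highest quality.

  ```python
  import pandas as pd
  import seaborn as sns
  diamonds = sns.load_dataset('diamonds')  # load the Diamonds dataset

  cut_categories = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
  color_categories = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
  clarity_categories = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
  ```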
- Convert to categorical types: Use the `pd.Categorical` constructor from pandas with `ordered=True` to specify the order for each categorical column.

  ```python
  diamonds['cut'] = pd.Categorical(diamonds['cut'], categories=cut_categories, ordered=True)
  diamonds['color'] = pd.Categorical(diamonds['color'], categories=color_categories, ordered=True)
  diamonds['clarity'] = pd.Categorical(diamonds['clarity'], categories=clarity_categories, ordered=True)
  ```
- Verify the conversion: Print the `cat.ordered` attribute to confirm that the columns have been converted correctly. You can also confirm the order of the categories by accessing `cat.categories`, as shown in the code below.

  ```python
  # Confirm the conversions
  print(diamonds['cut'].cat.ordered)
  print(diamonds['color'].cat.ordered)
  print(diamonds['clarity'].cat.ordered)

  # Print the order
  print(diamonds['cut'].cat.categories)
  ```
The output of the above code will be:
```text
True
True
True
Index(['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'], dtype='object')
```
This output shows that the `ordered` flag is `True` for each converted column and that the categories of `cut` follow the order we defined. In other words, `cut`, `color`, and `clarity` have been successfully converted to ordered categorical types, which will allow for more meaningful sorting and analysis.
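Once a column is ordered, rank-aware operations become available. The sketch below illustrates a few of them on a small hand-made DataFrame (`df` and its values are hypothetical, not the real dataset), and also shows one pitfall worth checking for: any label that is missing from the category list is silently converted to a missing value.

```python
import pandas as pd

cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']

# Hypothetical mini-frame standing in for the Diamonds data
df = pd.DataFrame({'cut': ['Ideal', 'Good', 'Premium', 'Fair', 'Very Good']})
df['cut'] = pd.Categorical(df['cut'], categories=cut_order, ordered=True)

# Comparison against a label uses the declared rank, not the alphabet
at_least_premium = df[df['cut'] >= 'Premium']['cut'].tolist()
print(at_least_premium)  # ['Ideal', 'Premium']

# min() and max() also follow the category order
lowest, highest = df['cut'].min(), df['cut'].max()
print(lowest, highest)  # Fair Ideal

# Pitfall: a misspelled label ('Prenium') is not in cut_order and becomes NaN
with_typo = pd.Categorical(['Ideal', 'Prenium'], categories=cut_order, ordered=True)
n_missing = int(with_typo.isna().sum())
print(n_missing)  # 1
```

Counting missing values before and after the conversion is a simple way to catch category lists that are incomplete or misspelled.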
Great job! In this lesson, you learned how to convert categorical data to ordered types in the Diamonds dataset. This process is crucial for sorting, analysis, and better representation of categorical data. By defining the order of categories and passing it to the `pd.Categorical` constructor with `ordered=True`, you can ensure that your data is accurately represented.
Next, you'll practice this essential skill by applying the technique, reinforcing your understanding and improving your data preprocessing capabilities. By mastering this skill, you'll be better prepared for more advanced data analysis and machine learning tasks. Keep practicing and stay curious!