Welcome! Today, we are going to learn about the Decision Tree Classifier. It's one of the basic tools in machine learning that helps us make decisions like a flowchart. Imagine deciding whether to wear a coat. If it's cold, you wear it; if not, you don't. This is similar to how a Decision Tree works in predicting outcomes based on given data.
By the end of this lesson, you will know:

- how to load and split a dataset;
- what a Decision Tree Classifier is and how it works;
- how to train a Decision Tree Classifier and use it to make predictions.

Let's start by looking at each of these steps one by one.
In machine learning, data is very important. We will use the wine dataset from Scikit-Learn. As a reminder, this dataset contains measurements of different wines, and our goal is to predict the class of each wine.
Here's a quick reminder on how to load and split this dataset:
```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Load the dataset
X, y = load_wine(return_X_y=True)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
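If you want to double-check the split, you can inspect the resulting array shapes. This is a quick sketch; the exact counts below follow from the wine dataset's 178 samples and 13 features with a 20% test split.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The wine dataset has 178 samples and 13 features;
# a 20% test split leaves roughly 80% of rows for training.
print(X_train.shape, X_test.shape)  # (142, 13) (36, 13)
```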
A Decision Tree is a type of supervised learning model used for classification and regression tasks. It is a flowchart-like structure where:

- the root node asks the first question about the data;
- each decision node tests a feature and branches based on the answer;
- each leaf node holds a final prediction (a class label).
Here is an example:
Imagine a simple decision tree for classifying whether an animal is a mammal.
Root Node: Start with a feature, such as whether the animal has fur.
First Decision Node: If the animal has fur, check whether it gives birth to live young.
Leaf Nodes: If it does, classify the animal as a mammal; otherwise, it is not a mammal.
This decision-making process can be visualized as a flowchart: each question branches into further questions until a final answer is reached.
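As a rough illustration, the mammal example above behaves like a chain of if/else checks. This is a hypothetical hand-written sketch, not real classifier code:

```python
def classify_animal(has_fur: bool, gives_birth: bool) -> str:
    """Hypothetical hand-written 'decision tree' for the mammal example."""
    if has_fur:                   # root node: does the animal have fur?
        if gives_birth:           # decision node: does it give birth to live young?
            return "mammal"       # leaf node
        return "not a mammal"     # leaf node
    return "not a mammal"         # leaf node

print(classify_animal(has_fur=True, gives_birth=True))  # mammal
```

A trained decision tree encodes exactly this kind of nested questioning, except the questions are learned from data rather than written by hand.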
A decision tree is trained through a process called recursive partitioning, which involves the following steps:

1. Choose the feature and threshold that best separate the classes (for example, by minimizing an impurity measure such as Gini impurity).
2. Split the data into subsets based on that test.
3. Repeat the process recursively on each subset.
4. Stop when a subset is pure or a limit (such as maximum depth) is reached; that subset becomes a leaf.
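To make the idea of scoring a split concrete, here is a small sketch using Gini impurity, one common criterion. The helper function here is illustrative, not part of Scikit-Learn's public API:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# A pure node has impurity 0; a 50/50 node has impurity 0.5.
print(gini([0, 0, 0, 0]))  # 0.0
print(gini([0, 0, 1, 1]))  # 0.5

# A candidate split is scored by the weighted impurity of its children:
left, right = [0, 0, 0], [1, 1, 0]
n = len(left) + len(right)
weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
print(round(weighted, 3))  # 0.222
```

The training procedure picks, at each node, the split with the lowest weighted impurity, then recurses on each side.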
Now, let's train our Decision Tree Classifier. This is like building the described "decision flowchart" from our training data. Here's how to do it with Scikit-Learn:
```python
from sklearn.tree import DecisionTreeClassifier

# Create the classifier with some parameters
tree_clf = DecisionTreeClassifier(max_depth=5, min_samples_split=3)

# Train the classifier
tree_clf.fit(X_train, y_train)
```
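If you'd like to see the learned flowchart, Scikit-Learn's export_text function can print the tree's rules as indented text. This is a sketch; the exact splits depend on the trained tree, so only the general shape of the output is predictable:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

tree_clf = DecisionTreeClassifier(max_depth=5, min_samples_split=3)
tree_clf.fit(X_train, y_train)

# Print the learned decision rules, using the dataset's feature names
print(export_text(tree_clf, feature_names=list(data.feature_names)))
```

Each indented line is a decision node testing one feature, and each "class:" line is a leaf holding a prediction.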
To evaluate our trained Decision Tree Classifier, we will calculate its accuracy on the testing set. As a reminder, accuracy is the ratio of correctly predicted instances to the total instances in the dataset.
Here’s how you can do it:
```python
from sklearn.metrics import accuracy_score

# Predict the labels for the test set
y_pred = tree_clf.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy of the Decision Tree Classifier: {accuracy:.2f}")  # 0.94
```
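Since accuracy is just the fraction of correct predictions, you can also compute it directly as a sanity check. This small sketch uses made-up labels to show the arithmetic:

```python
import numpy as np

# Hypothetical true labels and predictions (4 of 5 are correct)
y_true = np.array([0, 1, 2, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0])

# Accuracy = correct predictions / total predictions
accuracy = (y_pred == y_true).mean()
print(accuracy)  # 0.8
```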
When creating a decision tree, you can adjust several parameters to control its complexity and performance:

- max_depth: the maximum number of levels the tree may grow;
- min_samples_split: the minimum number of samples a node must contain to be split;
- min_samples_leaf: the minimum number of samples allowed in a leaf;
- criterion: the impurity measure used to score splits ("gini" or "entropy").

As you can see, the first three parameters control how deep the tree will grow. Limiting them helps prevent overfitting by keeping the tree reasonably simple.
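To see how depth limits affect the model, you can compare a shallow tree with a fully grown one. This is a sketch; the exact accuracies depend on the data and random seed, so no specific numbers are claimed:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for depth in (1, 3, None):  # None lets the tree grow until leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"max_depth={depth}: test accuracy {acc:.2f}")
```

A depth of 1 usually underfits, while an unlimited tree may overfit the training data; a moderate limit often strikes the best balance.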
Let's recap:

- We loaded the wine dataset and split it into training and testing sets.
- We trained a Decision Tree Classifier using the fit method.
- We evaluated the model's accuracy on the testing set.

Now that you have learned the theory, it's time for hands-on practice. You will get to load data, split it, and train your own Decision Tree Classifier. This will help solidify what you've just learned. Let's get to it!