Understanding and Implementing Decision Tree Splits

Lesson 5

Introduction

Welcome to another exciting lesson! Today, we will unlock the mechanism behind the key operation of the Decision Tree algorithm: splitting. We will start with a glance at the structure of Decision Trees, understand the mechanics of splitting, and then dive into the application of the Gini Index, a measure of the quality of a split. Inspired by these insights, we will finish by creating a split function using Python and running it on a sample dataset. So, let's roll up our sleeves and dig into the world of Decision Trees!

Structuring a Decision Tree

A Decision Tree is a tree-like graph that models decisions and their potential consequences. It starts at a single node, called the root, which splits into branches. Each branch, in turn, splits into more branches, forming a hierarchical network. The final branches with no further splits are referred to as leaf nodes. Each split is determined by whether the data satisfies a specific condition.

For instance, if we build a Decision Tree to predict whether a patient will recover from a disease, the root could be temperature> 101F. The tree would then split into two branches - one for yes and another for no. Each branch could further split based on another attribute, such as cough present. This process of splitting continues until we conclude the leaf nodes, such as recovery probable or recovery doubtful. Isn't this a straightforward and intuitive way to make complex decisions?

Understanding the Gini Index With An Example

Let's better understand the Gini Index concept using a tangible example. Imagine we have a basket full of socks of different colors, say red and blue. The goal of a Decision Tree in this context would be to split these socks into separate baskets (or groups) based on their colors. This process is essentially what the Gini Index endeavors to quantify -- the disorder within these groups. A greater Gini Index score signifies more disorder.

The formula is the following: $G = 1 - \sum_{i=1}^{n} p_i^2$

Where:

$G$ is the Gini index or Gini coefficient,
$p_i$ is the proportion of individuals in the $i$ -th group, and the sum is taken over $n$ groups.

If we are successful in segregating the socks into two distinct piles, one with all red socks and the other with all blue socks, our groups are perfectly ordered, and our Gini Index is 0. However, in a case where red and blue socks are randomly mixed together, we have more disorder, resulting in a higher Gini Index.

Let's take a Pythonic approach to this.

First, assume that we have the following sample data of socks:

Python
1groups = [
2    [['Red'], ['Blue'], ['Red']],
3    [['Blue'], ['Red'], ['Blue'], ['Blue']],
4]
5classes = ['Red', 'Blue']

In Python, to calculate our Gini Index, we'll need to tackle several things. The first step is to count the total number of instances or socks:

Python
1n_instances = float(sum([len(group) for group in groups]))

Next, we look at each group (a basket filled with socks of a specific color) and calculate their scores based on the number of each type of sock (class values). The more mixed the colors in a group, the higher the score.

Python
1for group in groups:
2    size = len(group)
3    score = 0.0
4    for class_val in classes:
5        p = [row[-1] for row in group].count(class_val) / size
6        score += p * p

In the end, we adjust each group's score in relation to its size compared to the total number of socks. The final Gini Index is the accumulation of these weighted group scores.

Python
1gini += (1.0 - score) * (size / n_instances)

This is how we compute the Gini Index in Python. In this specific example, the resulting Gini index is 0.404. Remember, the lower the Gini Index, the better our split, just like how we'd prefer our socks sorted effectively by color!

Implementing Decision Tree Split

With our Gini Index, we can decide where to split our data. Imagine we're sorting socks not only by color but also by size, material, and brand. Our test_split function can help us break down the sock pile based on a specific characteristic.

Python
1def test_split(index, value, dataset):
2    left, right = list(), list()
3    for row in dataset:
4        if row[index] < value:
5            left.append(row)
6        else:
7            right.append(row)
8    return left, right

Integrating Gini Index and Split Function

Lastly, we merge gini_index and test_split to create the get_split function. This function scans through all the attributes of our socks and finds the best attribute to split them by – which attribute makes the most distinct piles of socks.

Python
1def get_split(dataset):
2    class_values = list(set(row[-1] for row in dataset))  # Find unique classes in the dataset
3    b_index, b_value, b_score, b_groups = 999, 999, 999, None
4    for index in range(len(dataset[0])-1):    # Exclude the last column which is the class
5        for row in dataset:
6            groups = test_split(index, row[index], dataset)
7            gini = gini_index(groups, class_values)
8            if gini < b_score:
9                b_index, b_value, b_score, b_groups = index, row[index], gini, groups
10    return {'index': b_index, 'value': b_value, 'groups': b_groups}

The code keeps track of the best split column, best split value, a score of the best split, and groups of the best split using variables b_index, b_value, b_score, and b_groups, respectively.

Testing Split Function

We can test our split function this way:

Python
1# Age, Movie Genre, Decision (watch or not)
2
3dataset = [
4    [18, 1, 0],
5    [20, 0, 1],
6    [23, 2, 1],
7    [25, 1, 1],
8    [30, 1, 0],
9]
10
11split = get_split(dataset)
12print('\nBest Split:')
13print('Column Index: %s, Value: %s' % ((split['index']), (split['value'])))
14# Output: Column Index: 0, Value: 20

The initial dataset consists of the user's age, the movie's genre, and a decision of whether the user will watch the film. The get_split function is then run on the dataset, providing the best attribute to split the data. The index represents the column in the dataset that provides the best split (either 0 for age or 1 for genre), and the value is the specific value at which the split occurs. This means that for optimal categorization, we should split our data based on the attribute given by the index at the value provided by the value.

Lesson Summary and Practice

Great work on diving into splits in Decision Trees! Today, we learned about the structure of a Decision Tree, understood the concept of data splits, and calculated the Gini Index to measure the quality of the splits. Most significantly, we have implemented our understanding by creating a split function in Python!

Up next, we have some hands-on exercises for you to apply these newly acquired skills!

Enjoy this lesson? Now it's time to practice with Cosmo!

Practice is how you turn knowledge into actual skills.