Lesson 5

Welcome to another exciting lesson! Today, we will unlock the mechanism behind the key operation of the Decision Tree algorithm: `splitting`

. We will start with a glance at the structure of Decision Trees, understand the mechanics of splitting, and then dive into the application of the **Gini Index**, a measure of the quality of a split. Inspired by these insights, we will finish by creating a `split`

function using Python and running it on a sample dataset. So, let's roll up our sleeves and dig into the world of Decision Trees!

A Decision Tree is a tree-like graph that models decisions and their potential consequences. It starts at a single node, called the `root`

, which splits into `branches`

. Each branch, in turn, splits into more branches, forming a hierarchical network. The final branches with no further splits are referred to as `leaf nodes`

. Each split is determined by whether the data satisfies a specific condition.

For instance, if we build a Decision Tree to predict whether a patient will recover from a disease, the root could be `temperature> 101F`

. The tree would then split into two branches - one for `yes`

and another for `no`

. Each branch could further split based on another attribute, such as `cough present`

. This process of splitting continues until we conclude the leaf nodes, such as `recovery probable`

or `recovery doubtful`

. Isn't this a straightforward and intuitive way to make complex decisions?

Let's better understand the Gini Index concept using a tangible example. Imagine we have a basket full of socks of different colors, say red and blue. The goal of a Decision Tree in this context would be to split these socks into separate baskets (or groups) based on their colors. This process is essentially what the Gini Index endeavors to quantify -- the disorder within these groups. A greater Gini Index score signifies more disorder.

The formula is the following: $G = 1 - \sum_{i=1}^{n} p_i^2$

Where:

- $G$ is the Gini index or Gini coefficient,
- $p_i$ is the proportion of individuals in the $i$-th group, and the sum is taken over $n$ groups.

If we are successful in segregating the socks into two distinct piles, one with all red socks and the other with all blue socks, our groups are perfectly ordered, and our Gini Index is `0`

. However, in a case where red and blue socks are randomly mixed together, we have more disorder, resulting in a higher Gini Index.

Let's take a Pythonic approach to this.

First, assume that we have the following sample data of socks:

Python`1groups = [ 2 [['Red'], ['Blue'], ['Red']], 3 [['Blue'], ['Red'], ['Blue'], ['Blue']], 4] 5classes = ['Red', 'Blue']`

In Python, to calculate our Gini Index, we'll need to tackle several things. The first step is to count the total number of instances or socks:

Python`1n_instances = float(sum([len(group) for group in groups]))`

Next, we look at each group (a basket filled with socks of a specific color) and calculate their scores based on the number of each type of sock (class values). The more mixed the colors in a group, the higher the score.

Python`1for group in groups: 2 size = len(group) 3 score = 0.0 4 for class_val in classes: 5 p = [row[-1] for row in group].count(class_val) / size 6 score += p * p`

In the end, we adjust each group's score in relation to its size compared to the total number of socks. The final Gini Index is the accumulation of these weighted group scores.

Python`1gini += (1.0 - score) * (size / n_instances)`

This is how we compute the Gini Index in Python. In this specific example, the resulting `Gini`

index is `0.404`

. Remember, the lower the Gini Index, the better our split, just like how we'd prefer our socks sorted effectively by color!

With our Gini Index, we can decide where to split our data. Imagine we're sorting socks not only by color but also by size, material, and brand. Our `test_split`

function can help us break down the sock pile based on a specific characteristic.

Python`1def test_split(index, value, dataset): 2 left, right = list(), list() 3 for row in dataset: 4 if row[index] < value: 5 left.append(row) 6 else: 7 right.append(row) 8 return left, right`

Lastly, we merge `gini_index`

and `test_split`

to create the `get_split`

function. This function scans through all the attributes of our socks and finds the best attribute to split them by – which attribute makes the most distinct piles of socks.

Python`1def get_split(dataset): 2 class_values = list(set(row[-1] for row in dataset)) # Find unique classes in the dataset 3 b_index, b_value, b_score, b_groups = 999, 999, 999, None 4 for index in range(len(dataset[0])-1): # Exclude the last column which is the class 5 for row in dataset: 6 groups = test_split(index, row[index], dataset) 7 gini = gini_index(groups, class_values) 8 if gini < b_score: 9 b_index, b_value, b_score, b_groups = index, row[index], gini, groups 10 return {'index': b_index, 'value': b_value, 'groups': b_groups}`

The code keeps track of the best split column, best split value, a score of the best split, and groups of the best split using variables `b_index`

, `b_value`

, `b_score`

, and `b_groups`

, respectively.

We can test our split function this way:

Python`1# Age, Movie Genre, Decision (watch or not) 2 3dataset = [ 4 [18, 1, 0], 5 [20, 0, 1], 6 [23, 2, 1], 7 [25, 1, 1], 8 [30, 1, 0], 9] 10 11split = get_split(dataset) 12print('\nBest Split:') 13print('Column Index: %s, Value: %s' % ((split['index']), (split['value']))) 14# Output: Column Index: 0, Value: 20`

The initial dataset consists of the user's age, the movie's genre, and a decision of whether the user will watch the film. The get_split function is then run on the dataset, providing the best attribute to split the data. The `index`

represents the column in the dataset that provides the best split (either 0 for age or 1 for genre), and the `value`

is the specific value at which the split occurs. This means that for optimal categorization, we should split our data based on the attribute given by the `index`

at the value provided by the `value`

.

Great work on diving into splits in Decision Trees! Today, we learned about the structure of a Decision Tree, understood the concept of data splits, and calculated the Gini Index to measure the quality of the splits. Most significantly, we have implemented our understanding by creating a `split`

function in Python!

Up next, we have some hands-on exercises for you to apply these newly acquired skills!