Building a Decision Tree from Scratch in Python

Lesson 6

Introduction

Greetings! In this session on Decision Trees, we aim to implement a full Decision Tree from scratch in Python. Decision Trees are a type of Supervised Machine Learning in which data is continuously split according to certain parameters.

Refreshing the Structure of Decision Tree

A Decision Tree has a tree-like structure with each internal node denoting a test on an attribute, each branch representing an outcome of the test, and each terminal node (leaf) holding a class label. Here are the parts of a Decision Tree:

Root Node: This houses the entire dataset.
Internal Nodes: These make decisions based on conditions.
Edges/branches: These connections implement decision rules.
Leaves: These are terminal nodes for making predictions.

Decisions on attributes depend on how well they help to purify the data.

The tree-building process begins with the full dataset at the root node, iteratively partitioning the data based on chosen attributes. Each child node becomes a new root that can be split further. This recursive process continues until predefined stopping criteria are met.

Stopping Criteria for Tree Building

Standard stopping criteria include:

Maximum Tree Depth: Limiting the maximum depth of the tree.
Minimum Node Records: No more partitioning if less than a threshold number of records.
Node Purity: Stop if all instances at a node belong to the same class.

These criteria ensure the model is consistent, which prevents overfitting.

Implementing Decision Tree Building in Python

Now, we'll use Python to build the decision tree. We'll rely on the existing get_best_split function from the previous lesson to find the optimal split for our data.

Here is how a terminal node is created:

Python
1def create_terminal(group):
2    outcomes = [row[-1] for row in group]
3    return max(set(outcomes), key=outcomes.count)

The create_terminal function determines the most common class value in a group of rows and assigns that value as the final decision for that subset of data.

Let's proceed to the actual tree building:

Python
1def build_tree(train, max_depth, min_size):
2    root = get_best_split(train)
3    recurse_split(root, max_depth, min_size, 1)
4    return root

This function begins the tree-building process.

Implementing the Recursive Split

The recurse_split function is responsible for creating children nodes:

Python
1def recurse_split(node, max_depth, min_size, depth):
2    # Split into left and right groups
3    left, right = node['groups']
4    del(node['groups'])
5
6    # If left or right groups are empty, create a terminal node
7    if not left or not right:
8        node['left'] = node['right'] = create_terminal(left + right)
9        return
10
11    # Check for max depth
12    if depth >= max_depth:
13        node['left'], node['right'] = create_terminal(left), create_terminal(right)
14        return
15
16    # Process the children nodes
17    if len(left) <= min_size:
18        node['left'] = create_terminal(left)
19    else:
20        node['left'] = get_best_split(left)
21        recurse_split(node['left'], max_depth, min_size, depth+1)
22
23    if len(right) <= min_size:
24        node['right'] = create_terminal(right)
25    else:
26        node['right'] = get_best_split(right)
27        recurse_split(node['right'], max_depth, min_size, depth+1)

Building and Printing a Tree

We can then build and print the decision tree based on the dataset and chosen parameters:

Python
1# Sample dataset
2dataset = [
3    [5, 3, 0], [6, 3, 0], [6, 4, 0], [10, 3, 1],
4    [11, 4, 1], [12, 8, 0], [5, 5, 0], [12, 4, 1]
5]
6
7max_depth = 2
8min_size = 1
9tree = build_tree(dataset, max_depth, min_size)
10
11# Print the tree
12def print_tree(node, depth=0):
13    if isinstance(node, dict):
14        print('%s[X%d < %.3f]' % ((depth*' ', (node['index']+1), node['value'])))
15        print_tree(node['left'], depth+1)
16        print_tree(node['right'], depth+1)
17    else:
18        print('%s[%s]' % ((depth*' ', node)))
19
20print_tree(tree)
21'''Output:
22[X1 < 10.000]
23 [X1 < 5.000]
24  [0]
25  [0]
26 [X2 < 8.000]
27  [1]
28  [0]
29'''

The print_tree output shows a decision tree. [X1 < 10.000] checks if feature X1 is less than 10.000. Left branch ([X1 < 5.000]) and its subsequent nodes ([0] and [0]) indicate the conditions and predictions if 'Yes'. The right branch ([X2 < 8.000]) and its nodes ([1] and [0]) cover the cases for 'No'. Indentations imply tree depth.

Lesson Summary

Congratulations! You've now built a decision tree from scratch in Python. Proceed to the practice to reinforce your understanding. Keep progressing and keep implementing! Happy tree building!

Enjoy this lesson? Now it's time to practice with Cosmo!

Practice is how you turn knowledge into actual skills.