Decision Trees: All about them!

Aamir Ahmad Ansari
Mar 13, 2022
If the Simpsons can then you can!

Abstract

From the moment we wake up in the morning until the time we go to sleep, we follow a pattern of binary decision making to complete our tasks. The sun has risen: wake up or not. The weather is cold: wear a sweater or not. In this article, we will explore a similar approach to solving machine learning problems: decision trees. This style of binary decision making corresponds to the CART (Classification and Regression Trees) family of decision tree inducers.

Introduction

A decision tree is an algorithm that works recursively to partition the feature space into mutually exclusive regions. A decision tree can be thought of as a binary tree: it starts spanning out from a root node, which initially splits the feature space into two regions (hence creating two child nodes), and continues this splitting until a stopping criterion is reached. Nodes that have a parent node as well as child nodes are known as internal nodes. The nodes at the bottom, which do not have any child nodes, are called leaf nodes. The structure of a decision tree is shown in Figure 1.

Figure 1: Structure of a Decision Tree

The above is an overview of the basic structure of a decision tree. The root node and each internal node represent a binary decision/condition (an event that leads to only two outcomes) upon which a split takes place: if the condition is not true we move towards the left child of the node, otherwise towards the right child. The binary decision is made on a feature, and a splitting point is defined for that feature to form the condition of the split. For example, suppose we want to predict whether it will rain on a particular day and take temperature as the independent variable: if the temperature on a particular day is greater than 20 degrees Celsius then it will not rain, otherwise it will rain. Notice that the feature here is temperature and the splitting point is 20 degrees Celsius. Temperature is a numerical feature, but the same logic can be applied to categorical features. A simple split on the temperature feature is shown in Figure 2.

Figure 2: Example of Splitting in a decision tree.

From the example shown in Figure 2, we can read off the rule directly: if the temperature on a given day is greater than 20 degrees Celsius the tree predicts no rain, otherwise it predicts rain.
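
As a tiny sketch in Python (the rule and the 20 degree threshold are just the hypothetical example above, not a trained model), this single split is nothing more than an if/else on one feature:

    def predict_rain(temperature_celsius: float) -> str:
        """One binary decision: the feature is temperature, the splitting point is 20."""
        if temperature_celsius > 20:
            return "no rain"  # condition true: move to the right child
        return "rain"         # condition false: move to the left child

    print(predict_rain(25.0))  # no rain
    print(predict_rain(12.0))  # rain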

This gives rise to two popular questions:

  1. How do we select the feature and the splitting point?
  2. Where do we stop the recursive splitting?

It's time to address these questions and get into further details.

Feature and Splitting Point Selection

The recursive splitting starts from the root node, by selecting the feature p and the corresponding splitting point s that optimize a splitting criterion the most. This is why the decision tree algorithm is known as a greedy algorithm: it hunts down the feature and splitting point that yield the best result for the current split and does not consider any further implications of that split on the tree down the road.

The splitting criterion is the metric with which we calculate the goodness of a split corresponding to a feature and its splitting point. The feature and the splitting point that best optimize this metric are chosen for the split. For a classification tree, we may consider the Gini index, entropy, classification error or information gain as the splitting criterion, whereas in the case of regression trees the mean squared error is widely chosen. We will now discuss these splitting criteria for classification trees:

A key point to note is that the majority class present in a leaf is taken as the output/prediction of that leaf.

1.) Classification Error

The classification error is the ratio of the number of minority-class (i.e. misclassified) observations in a leaf to the total number of observations in that leaf. The classification error for a region is given by:
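
    CE(R) = 1 - max_k p(k)

(Here p(k) denotes the proportion of observations of class k in region R; this is the same ratio described above, written for k classes.)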

Figure 3: Feature Space Split on Temperature

When we split the feature space on temperature with a splitting point of 20, it creates two regions, namely Region 1 and Region 2. The observations coloured green represent rain, whereas purple represents no rain. Consider Region 2: the majority class is green and the minority class is purple. The output of a decision tree based on this split, for any observation with a temperature less than 20, will therefore be rain. Notice there is some misclassification in the region: the classification error for Region 2 is 2/7, as it contains 7 observations in total, out of which 2 are misclassified. Similarly, the classification error for Region 1 is 1/6. Notice the imbalance of data points between the two regions; hence we cannot add these errors straight away but take a weighted average, defined as:
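
    CE(p, sp) = [N1 · CE(R1) + N2 · CE(R2)] / (N1 + N2)

For the split above this gives (6 · 1/6 + 7 · 2/7) / 13 = 3/13, i.e. 3 misclassified observations out of 13 in total.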

Here p is the feature, sp is the splitting point, N1 and N2 are the numbers of observations inside each region, and CE is the classification error of a given region. We select the feature and splitting point that minimise this weighted error.

2.) Gini Index

The Gini index measures the impurity of a node. A pure node is one where all the observations belong to the same class, whereas an impure node contains observations from several classes. The Gini index for a region is calculated as:
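
    Gini(R) = 1 - Σ_k p(k)²

(with p(k) again the proportion of observations of class k in region R)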

where p is the feature, sp is the splitting point and p(k) is the probability of finding an observation of the kth class in the region; it is the square of p(k) that enters the sum. For multiple regions we take the weighted average as before:
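
    Gini(p, sp) = [N1 · Gini(R1) + N2 · Gini(R2)] / (N1 + N2)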

The square term is used to find a softer maximum. If you rewrite the equation of the classification error for k classes, it becomes:
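
    CE(R) = 1 - max_k p(k)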

Hence the classification error takes the true/hard maximum, while the square term in the Gini impurity/index is affected most by the class with the largest probability, so we are effectively subtracting a softer maximum. In the case of having only two classes in our problem, a Gini index of 0 represents a pure node and a Gini index of 0.5 represents a maximally impure node with an equal number of observations belonging to each of the classes.

3.) Entropy

Entropy has its origins in information theory, where it was originally used to quantify the information carried by digital signals; the entropy of a random variable conveys the uncertainty related to it. The larger the entropy, the less predictable our random variable becomes; the smaller the entropy, the more predictable it becomes. In general, we like our leaf nodes to be pure, so more certainty, i.e. low entropy, is preferred. The entropy in a region with k classes created after a split can be calculated as:
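
    Entropy(R) = - Σ_k p(k) · log2 p(k)

(the sum runs over the k classes present in region R, with p(k) their proportions)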

The log uses base 2 because entropy was originally measured in bits; after all, digital signals travel in the form of bits, each of which can be either 0 or 1. An entropy value of 0 indicates that the node is pure. We take the weighted entropy to get a single value for both regions combined, that is:
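
    Entropy(p, sp) = [N1 · Entropy(R1) + N2 · Entropy(R2)] / (N1 + N2)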

We choose the predictor and splitting point that minimize the above quantity.
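
To make the three criteria concrete, here is a small sketch in Python. The per-region class counts (5 rain vs. 2 no rain, and 5 no rain vs. 1 rain) are read off from the errors 2/7 and 1/6 quoted for Figure 3; everything else is illustrative.

    from collections import Counter
    import math

    def proportions(labels):
        """Proportion p(k) of each class in a region."""
        counts = Counter(labels)
        return [c / len(labels) for c in counts.values()]

    def classification_error(labels):
        return 1 - max(proportions(labels))

    def gini(labels):
        return 1 - sum(p ** 2 for p in proportions(labels))

    def entropy(labels):
        return -sum(p * math.log2(p) for p in proportions(labels))

    def weighted(criterion, region1, region2):
        """Weighted average of a per-region criterion over the two child regions."""
        n1, n2 = len(region1), len(region2)
        return (n1 * criterion(region1) + n2 * criterion(region2)) / (n1 + n2)

    region1 = ["no rain"] * 5 + ["rain"]       # temperature >= 20, error 1/6
    region2 = ["rain"] * 5 + ["no rain"] * 2   # temperature < 20, error 2/7

    print(weighted(classification_error, region1, region2))  # 3/13, about 0.231
    print(weighted(gini, region1, region2))
    print(weighted(entropy, region1, region2))

Whichever criterion we pick, the candidate split with the lowest weighted value wins.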

In the case of regression trees, the mean squared error is used as the splitting criterion: each region predicts the mean of its observations, and the split is chosen to minimize the weighted squared deviation of the observations from those region means.

Stopping Criteria

We answered the first question in the previous section; this section answers the second. When do we stop the recursive splitting? The following are some of the indicators that you might want to stop:

  1. The node contains fewer than a specified minimum number of observations.
  2. The tree has grown to a specified maximum depth.
  3. Splitting a node does not yield an information gain greater than a predefined threshold, i.e. the difference between the impurity of the parent/internal node and the (weighted) impurity of the child/leaf nodes is less than a given threshold. These rules map onto common hyperparameters, as sketched below.
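
As a rough mapping (this refers to one particular library, scikit-learn, and is not part of the original discussion), the three stopping rules above correspond approximately to the min_samples_leaf, max_depth and min_impurity_decrease parameters of its tree estimators:

    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(
        min_samples_leaf=5,          # 1. every leaf must keep at least 5 observations
        max_depth=4,                 # 2. stop once the tree reaches depth 4
        min_impurity_decrease=0.01,  # 3. split only if the weighted impurity drops by at least 0.01
        criterion="gini",            # splitting criterion: "gini" or "entropy"
    )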

A quick recap before we move on: we learnt how a CART-style decision tree works. It recursively splits the nodes of the tree, using a splitting criterion to choose which feature to split on and the corresponding splitting point; the pair that best optimizes the splitting criterion at each step is chosen. We stop this recursive process once a stopping criterion is reached. Predictions are made using the mode (majority) class in a leaf node for a classification problem, and using the mean of the observations present in the leaf node for a regression problem.

Overfitting & Regularization

Like any other machine learning algorithm, a decision tree is prone to overfitting. This usually happens when the tree grows so deep that it starts fitting noise. To solve this problem we have a regularization technique called pruning: we let the tree grow as far as it can until it reaches a stopping criterion, and then cut out some of the branches so that the generalization error decreases.

We use the cost-complexity pruning method to prune/regularize our decision trees. The technique is similar to lasso regularization in that it adds a penalty term, here proportional to the number of leaf nodes in the tree. The following equation, the penalized loss of a regression tree, will provide better insight:
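
    Loss_alpha(T) = Σ_{m=1..|T|} Σ_{xi in R_m} (yi - ŷ_Rm)² + alpha · |T|

(Here T is a candidate subtree of the fully grown tree T0, the outer sum runs over the |T| leaf nodes of T, R_m is the region belonging to the m-th leaf and ŷ_Rm is the mean response of the training observations in that region. This is the standard cost-complexity form, consistent with the description below.)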

where |T| is the number of leaf nodes in the subtree T and T0 denotes the fully grown tree. If alpha is equal to zero we keep the tree as it is and simply calculate the training loss. As the value of alpha increases, the branches of the tree start getting pruned, and we obtain the subtree that minimizes the penalized loss and, with it, a lower generalization error. Notice that we do not want to try each and every possible subtree, which would just blow up the time complexity; instead we find the best value of alpha using k-fold cross-validation.
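
In practice you rarely enumerate subtrees yourself. As a hedged sketch with scikit-learn (the dataset here is just a placeholder), cost_complexity_pruning_path proposes candidate values of alpha from the fully grown tree, and k-fold cross-validation picks the one that generalizes best:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Candidate alphas come from the cost-complexity pruning path of the full tree.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
    alphas = np.unique(path.ccp_alphas)

    # 5-fold cross-validation selects the alpha with the best estimated generalization.
    search = GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_grid={"ccp_alpha": alphas},
        cv=5,
    )
    search.fit(X, y)
    print(search.best_params_)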

This article was meant to make you familiar with decision trees. In the next article we shall solve a problem based on decision trees by hand as well as in Python.
