
Getting Started with Machine Learning Algorithms: Decision Trees

In previous articles of this series, we discussed how linear and logistic regression work. In this article, we will discuss decision tree algorithms. These algorithms do not involve the same level of mathematics as linear and logistic regression; simply put, decision tree algorithms are based on splitting data to reach a final decision or prediction. They are mathematically simple and easy to interpret, which makes them one of the most used algorithms for data modelling, irrespective of the type of problem (classification or regression). In this article, we will look at the following points:

Table of contents

  • What are Decision Tree Algorithms?
  1. Terminology
  2. Example of decision tree
  • Entropy
  1. How is entropy used in a decision tree
  • Gini index
  • Information gain
  • When to stop splitting
  • Pruning

Let’s start by understanding the decision tree algorithm.

What are Decision Tree Algorithms?

As already discussed, these machine learning algorithms can be used in several areas of data science. In the real world, we mainly use them for classification and regression problems. These algorithms use a tree-like structure to split the data, and the final prediction can be thought of as the result of a series of feature-based splits. Like a tree, a decision tree starts with a root node, goes through multiple branch splits, and ends with decision leaves. The image below represents a decision tree:

Terminology

Here, we need to understand the terminology used with decision trees so that we can follow the algorithm much better. The same terminology also applies to random forest algorithms, which can be thought of as extended versions of decision tree algorithms.

Root node: as the name suggests, this is the node from which the decision tree starts. From the data point of view, it is the point at which the data population starts dividing according to various features.

Decision nodes: as shown in the picture above, these are the nodes that come after the root node and are responsible for splitting the population into further nodes.

Leaf nodes: These are the final nodes of the decision tree, after which further splitting of the data population stops.

Sub-tree: a branch that splits off from the root node can be considered a sub-tree, within which part of the population gets split further.

Pruning: this is the process of trimming some branches and nodes, which we use to avoid overfitting.

Here we have covered most of the terminology we use with decision trees.

First, let’s look at an example to see how this algorithm works.

Example of Decision Tree

Just take a look at the data/table below:

In the above data, we have information about feathers, fins and flying status, using which we are required to identify the animal. In such cases, the decision tree works upside down: the root node (feather) is at the top, and the leaves (names of the animals) are at the bottom, as in the image below.
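Conceptually, such a tree is just a chain of feature checks. A minimal sketch of the idea in Python, where the feature names and animal labels are hypothetical examples rather than values taken from the table above:

```python
def classify_animal(has_feathers: bool, has_fins: bool, can_fly: bool) -> str:
    """A toy decision tree written as nested if/else checks (illustrative only)."""
    if has_feathers:            # root node: feather
        if can_fly:             # decision node: flying status
            return "eagle"      # leaf node
        return "penguin"        # leaf node
    if has_fins:                # decision node: fins
        return "shark"          # leaf node
    return "dog"                # leaf node


print(classify_animal(has_feathers=True, has_fins=False, can_fly=False))  # penguin
```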

The above example is simple, but when we go deeper into the algorithm, we come across concepts like information gain, entropy and the Gini index. These concepts are helpful in reducing uncertainty, or disorder, in the model.

For example, with large datasets, we may be unsure which feature to take as the root node or as a decision node, or we may ask where we should stop splitting. This modelling uncertainty can be reduced by knowing these concepts. Let’s start by understanding entropy in decision trees.

Entropy

Above, we have seen that a decision tree can be thought of as a bunch of if-else statements, and with large datasets the confusion between these if-else conditions becomes huge. We can compare this confusion with uncertainty, and this uncertainty in the data is called entropy. We can also define entropy as a measure of disorder.

Let’s look at an example of an object placed far away from 9 viewers. The viewers are asked to identify the object: 4 out of 9 identify it as a table, and the others identify it as a bed. In this situation, it becomes difficult to tell what the distant object is, and this situation can be called randomness.

The formula of entropy is as follows:

E(S) = -(p+)log2(p+) - (p–)log2(p–)

Where

p+ = probability of the positive class

p– = probability of the negative class

S = the training sample.
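As a quick sketch of this formula in Python (assuming a binary classification setting, as in the definition above):

```python
import math

def entropy(p_plus: float) -> float:
    """Entropy of a binary split: E(S) = -(p+)log2(p+) - (p-)log2(p-)."""
    p_minus = 1.0 - p_plus
    if p_plus in (0.0, 1.0):   # a pure node has zero entropy by convention
        return 0.0
    return -p_plus * math.log2(p_plus) - p_minus * math.log2(p_minus)

print(entropy(0.5))  # 1.0 -> maximum disorder
print(entropy(0.9))  # ~0.47 -> mostly one class, low disorder
```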

Let’s see how entropy is used in decision trees.

How is entropy used in decision trees?

We can also think of entropy as a measure of impurity, where impurity is the degree of randomness in a node. A sub-split can be called pure if it contains only yes values or only no values.

Let’s say that for a feature, we have nine yes and six no in the root node, and after the first split, we have one decision node with six yes and four no, and another decision node with three yes and two no.

The above situation cannot be considered a pure sub-split because both decision nodes contain a mix of yes and no values. So if we need to make a decision tree using such data, we are required to calculate the impurity of each split, and once we obtain 100% purity, we can call that node a leaf node: the final step after which no further split is required.

To calculate the impurity, we use the formula of entropy.

The above image tells us that our entropy for decision node 1 will be:

E = -(6/10)log2(6/10) - (4/10)log2(4/10)

E = -(0.6 * -0.74) - (0.4 * -1.32)

E = 0.444 + 0.528

E = 0.972

Entropy for decision node two will be:

E = -(3/5)log2(3/5) - (2/5)log2(2/5)

E = -(0.6 * -0.74) - (0.4 * -1.32)

E = 0.444 + 0.528

E = 0.972

In the above calculation, we can see that both nodes have the same entropy, close to one, which means they are equally impure: as entropy increases, impurity increases. Therefore, while deciding on a split, we focus on decreasing the entropy (impurity) as much as possible.
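A quick numeric check of these two node entropies in Python, using the class counts from the split described above:

```python
import math

def node_entropy(yes: int, no: int) -> float:
    """Entropy of a node computed from its class counts."""
    total = yes + no
    e = 0.0
    for count in (yes, no):
        p = count / total
        if p > 0:              # skip empty classes; p*log2(p) tends to 0 as p -> 0
            e -= p * math.log2(p)
    return e

print(round(node_entropy(6, 4), 3))  # ~0.971 for decision node 1
print(round(node_entropy(3, 2), 3))  # ~0.971 for decision node 2
```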

Gini Index

When we talk about the Gini index, we find that its purpose is similar to that of entropy: both help find the best feature for a split in the decision tree. The only difference between the two is the formula they use. The impurity of a feature after splitting, called the Gini index or Gini impurity, can be calculated by the formula below:

Gini = 1 - Σ (pi)², where pi is the proportion of class i in the node.

Using this formula, the Gini index measures the impurity involved in randomly choosing an element, and it is preferable to split on attributes with a lower Gini index or Gini impurity.

Let’s take a look at the data/table below:

In the above data, we can see that we have an equal proportion of negative and positive values. Using the Gini index, we choose a split value for each attribute to categorise the records. Let’s say these values are as follows:

  • Var_1: 6
  • Var_2: 2
  • Var_3: 2
  • Var_4: 9

According to these values, for Var_1:

  • If Var_1 <= 6 then result = positive: 5/6
  • If Var_1 <= 6 then result = negative: 1/6

Gini(5,1) = 1 - [(5/6)² + (1/6)²] ≈ 0.28

  • If Var_1 > 6 then result = positive: 0/4
  • If Var_1 > 6 then result = negative: 4/4

Gini(0,4) = 1 - [(0/4)² + (4/4)²] = 0

Let’s calculate the weighted Gini index for Var_1 using the above formula:

Gini(target, Var_1) = (6/10)*0.28 + (4/10)*0 ≈ 0.17
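A short sketch of this calculation in Python, using the class counts assumed above for the two sides of the Var_1 split:

```python
def gini(counts):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

left = [5, 1]    # Var_1 <= 6: 5 positive, 1 negative
right = [0, 4]   # Var_1 >  6: 0 positive, 4 negative

n = sum(left) + sum(right)
weighted = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)

print(round(gini(left), 2))   # ~0.28
print(round(gini(right), 2))  # 0.0 -> a pure node
print(round(weighted, 2))     # ~0.17 for the Var_1 split
```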

Similarly, the Gini index for the other variables is as follows:

Gini(target, Var_2) = 0.375

Gini(target, Var_3) = 0.375

Gini(target, Var_4) = 0

With these values we can design a decision tree as follows:

In the above, we can see that we start splitting with variable 4 because it has the lowest Gini index, and the population then gets split further using the features with increasingly higher Gini index values.

But this is not the end, because so far we have only measured the impurity of individual nodes. To decide on further splits, we must also relate the impurity of a parent node to that of its child nodes. This is where information gain helps us. Let’s take a look at this concept.

Information gain

We mainly use information gain to measure the reduction in uncertainty obtained when a feature is used for splitting. It also helps in deciding which attribute should be selected as the decision or root node.

The formula for information gain is as follows:

Information gain = E(parent) - E(parent|feature)

where E(parent) is the entropy of the parent node and E(parent|feature) is the weighted average entropy of the child nodes produced by splitting on that feature.

Here, E(parent) can be thought of as the entropy of the whole dataset. Let’s take an example of 60 people who are asked whether they will go and watch a movie: 32 out of 60 are going to watch the movie, and 28 of them are not.

So to measure or predict this, we have two main features

  • Vehicle: yes or no
  • Class: below middle class, lower middle class and upper middle class.

Using these two features, we can make our decision tree as follows:

Let’s calculate the entropy:

E(parent) = -(32/60)log2(32/60) - (28/60)log2(28/60) = 0.99

E(parent|Vehicle = 'Yes') = -(24/26)log2(24/26) - (2/26)log2(2/26) = 0.39

E(parent|Vehicle = 'No') = -(8/34)log2(8/34) - (26/34)log2(26/34) = 0.79

Let’s calculate the weighted entropy of the child nodes:

E(parent|Vehicle) = (26/60)*0.39 + (34/60)*0.79 = 0.62

Accordingly, the information gain will be:

Information gain = E(parent) - E(parent|Vehicle)

Information gain = 0.99 - 0.62

Information gain = 0.37

In the above calculation, we can see that the parent entropy is near 1 (0.99), and taking the vehicle feature as the root node reduces the weighted entropy of the dataset to 0.62. If we follow the same procedure with the class feature, then using the below image:

We can say,

E(parent) = 0.99

E(parent|Class = 'below middle class') = -(14/16)log2(14/16) - (2/16)log2(2/16) = 0.54

E(parent|Class = 'lower middle class') = -(8/20)log2(8/20) - (12/20)log2(12/20) = 0.97

E(parent|Class = 'upper middle class') = -(10/24)log2(10/24) - (14/24)log2(14/24) = 0.98

Now let’s calculate the weighted average entropy of the child nodes:

E(parent|Class) = (16/60)*0.54 + (20/60)*0.97 + (24/60)*0.98 = 0.86

Here the information gain will be:

Information gain = E(parent) - E(parent|Class)

Information gain = 0.99 - 0.86

Information gain = 0.13

Here we can see that the entropy reduction from the class feature is 0.13, while the reduction from the vehicle feature is 0.37. Hence we select the vehicle feature for splitting, because it gives the maximum information gain.
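A compact sketch of the whole information gain calculation in Python, using the counts from the example above:

```python
import math

def entropy(counts):
    """Entropy of a node from its class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the weighted average entropy of the child nodes."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [32, 28]                            # 32 going to the movie, 28 not

vehicle_split = [[24, 2], [8, 26]]           # Vehicle = 'Yes', Vehicle = 'No'
class_split = [[14, 2], [8, 12], [10, 14]]   # below / lower / upper middle class

print(round(information_gain(parent, vehicle_split), 2))  # ~0.38 (0.37 with the rounded values above)
print(round(information_gain(parent, class_split), 2))    # ~0.14 (0.13 with the rounded values above)
```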

According to the above process, if we select the vehicle feature for the root node, we can say there is a higher chance that people who have a vehicle will go to watch the movie. After the root node, we can use the class feature to split the population further.

When to Stop Splitting?

The most basic question raised by beginners is when to stop the growth of a decision tree. Before answering it, we need to understand that as the tree size increases, the chance of overfitting also increases. Hence there should be an optimal depth for a decision tree.

In real-world problems, we are often required to deal with huge datasets that would allow a huge number of splits, and there are several ways to control this, such as hyperparameter tuning and cross-validation (for example with GridSearchCV). In most of the libraries used for building decision trees, we find parameters such as max_depth, min_samples_split, min_samples_leaf and max_features. These parameters apply constraints to the decision tree so that it stops growing once a pre-defined size is reached.
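As a minimal sketch of these constraints (assuming scikit-learn and its built-in iris dataset, which are not part of the example above), the parameters can be set directly on the estimator and tuned with GridSearchCV:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Constrain tree growth with pre-defined limits (pre-pruning).
tree = DecisionTreeClassifier(
    max_depth=3,           # maximum depth of the tree
    min_samples_split=4,   # minimum samples needed to split a node
    min_samples_leaf=2,    # minimum samples required at a leaf
    max_features=None,     # features considered at each split
    random_state=42,
)

# Tune the constraints with cross-validation.
grid = GridSearchCV(
    tree,
    param_grid={"max_depth": [2, 3, 4, 5], "min_samples_leaf": [1, 2, 5]},
    cv=5,
)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))
```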

Pruning

Pruning is also a way to avoid overfitting, specific to decision tree algorithms. In this procedure, we remove some of the nodes and sub-nodes from the tree. Non-significant nodes and sub-nodes increase the chances of overfitting, and removing them helps avoid overfitting and can enhance the model’s accuracy. There are two main ways of pruning:

  1. Pre-pruning: this process cuts back the tree while it is still in the growing stage, for example using the size constraints discussed above.
  2. Post-pruning: this process cuts the tree back after it has finished splitting. This is generally more effective than pre-pruning; a sketch of it follows below.
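As an example of post-pruning, scikit-learn (an assumption here, since the article does not name a specific library) supports cost-complexity pruning through the ccp_alpha parameter:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Grow a full tree, then compute the candidate pruning strengths (alphas).
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit with increasing ccp_alpha: larger values prune away more branches.
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test accuracy={pruned.score(X_test, y_test):.3f}")
```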

Final words

In this article, we have discussed the decision tree, which requires very little mathematics yet remains one of the most reliable algorithms in the field of predictive analytics. Since it differs from other algorithms, it has its own terminology. It works by repeatedly splitting the data into a tree-like structure, moving from the root node down to the leaf nodes.

In the following article, we will discuss the implementation of decision trees. To check our different articles, please refer to this link.