Decision trees are one of the most basic and widely used machine learning algorithms and fall under the umbrella of supervised learning. They can handle both regression and classification tasks, so learning decision trees is a must for anyone who aspires to be a data scientist.
In this article, we will learn about decision trees, how to work with decision trees, and how to implement decision tree classification in R. We will also discuss the applications of decision trees along with their advantages and disadvantages.
Parametric vs non-parametric algorithms
Parametric algorithms make strong assumptions about the form of the underlying function. They learn a fixed number of parameters during training, and once the model is trained, its complexity does not grow with more data.
For example, linear regression assumes a linear relationship:
y=mx+b
Linear regression requires only two parameters: m (the slope) and b (the intercept).
Non-parametric algorithms do not assume a fixed form for the function. They adapt to the complexity of the data, and the number of parameters grows with the data. These algorithms are flexible and can capture complex patterns.
For example: decision trees.
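To make this distinction concrete, here is a minimal sketch on synthetic data (the variable names and data are made up for illustration, and it assumes the rpart package used later in this article is installed): a linear regression always learns exactly two parameters, while the size of a decision tree depends on the data it sees.
# Synthetic data: a noisy non-linear relationship
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)
# Parametric: linear regression learns a fixed number of parameters
linear_model <- lm(y ~ x)
length(coef(linear_model))   # always 2: intercept and slope
# Non-parametric: the tree's size depends on the data
library(rpart)
tree_model <- rpart(y ~ x)
nrow(tree_model$frame)       # number of nodes, which changes as the data changes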
What is a decision tree and how does it work?
Decision trees are a non-parametric form of supervised machine learning algorithm used for both classification and regression. As the name suggests, they work by asking a series of Boolean (yes/no) questions and, based on each answer, branching further down a tree-like structure, hence the name decision tree. The model keeps asking questions until a prediction is made.
The goal of decision trees is to form a model that can predict the value of a target variable by learning through simple questions inferred from the data features.
Decision trees work like a chain of nested if-else statements: the tree grows through Boolean questions and stops only when a prediction is reached, as the sketch after the list below illustrates.
Three main components of a decision tree
- Root node (the parent node at the top of the tree, where the first question is asked)
- Conditional (decision) nodes (internal child nodes that ask further questions)
- Leaf nodes (terminal nodes with no children, where the prediction is made)
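As a rough sketch of the if-else analogy (the feature names, thresholds, and labels below are purely hypothetical), a small tree can be written out as nested if-else logic, with the three node types marked in the comments:
# A hypothetical hand-written "tree" expressed as nested if-else statements
predict_risk <- function(age, smokes) {
  if (age > 60) {              # root node: the first question asked
    if (smokes) {              # conditional (child) node: a follow-up question
      return("High risk")      # leaf node: a prediction, no further questions
    } else {
      return("Medium risk")    # leaf node
    }
  } else {
    return("Low risk")         # leaf node
  }
}
predict_risk(age = 65, smokes = TRUE)   # "High risk"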

Various types of Decision tree algorithms in R
There are several types of decision tree algorithms in R, which differ in the way they grow the tree and make predictions. The four main forms of decision tree algorithms are ID3, C4.5, C5.0, and CART.
ID3
ID3, also known as Iterative Dichotomiser 3 and developed by Ross Quinlan in 1986, is the earliest form of the decision tree algorithm. It creates a multiway tree and works in a greedy manner.
At each node:
- The categorical feature that provides the highest Information Gain is selected.
It is specifically designed for problems with categorical target variables.
The tree is grown to its maximum size, meaning it continues splitting until it perfectly classifies the training data.
Pruning is later applied to:
- Reduce overfitting
- Improve generalization on unseen/test data.
C4.5
C4.5 is the successor to ID3 and removes the restriction that features must be categorical. It does this by dynamically defining a discrete attribute that partitions the continuous attribute values into a discrete set of intervals.
For example, it may split “Age” at 18, creating rules like “Age ≤ 18 → Pass” and “Age > 18 → Fail” based on information gain.
The C4.5 model converts the trained trees into sets of if-then rules. The accuracy of each rule is evaluated, which determines the order in which the rules are applied. Pruning is performed by removing a rule's precondition and checking whether the rule's accuracy improves without it.
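The rpart package used later in this article implements CART rather than C4.5, but the idea of thresholding a continuous attribute is the same. Here is a hedged sketch on made-up data (the Age values, the Pass/Fail labels, and the exact threshold that gets printed are all illustrative):
# Synthetic exam data where the outcome is fully determined by Age <= 18
library(rpart)
set.seed(42)
exam <- data.frame(Age = sample(10:30, 100, replace = TRUE))
exam$Result <- factor(ifelse(exam$Age <= 18, "Pass", "Fail"))
# The tree finds a numeric threshold on Age (printed as something like Age < 18.5)
small_tree <- rpart(Result ~ Age, data = exam, method = "class")
print(small_tree)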
C5.0
C5.0 is a modification to C4.5 and uses comparatively less memory to build smaller rulesets which are more accurate than C4.5.
CART
The logic of CART (Classification and Regression Trees) is similar to C4.5; it differs in that it does not compute rule sets and it supports numerical target variables (regression). CART constructs binary trees using the feature and threshold that yield the largest information gain at each node.
The depth of the tree, and how each split is chosen while building it, are important factors, and they are handled through the concepts of entropy and information gain.
Entropy
Entropy measures the uncertainty or impurity within a dataset and determines how the dataset is split in a decision tree. Entropy can also be defined as the measure of disorder within the dataset. The mathematical formula for entropy is:
H(S) = − Σ pi log2(pi)
Entropy is sometimes also denoted through ‘H’.
In the equation above, ‘pi’ is the probability of a class ‘i’ in the dataset.
In a binary classification problem, ‘i’ ranges over the positive and negative classes.
Entropy is usually a value between 0 and 1, where:
- A value closer to 0 means the dataset is pure and contains few impurities.
- A value closer to 1 means the dataset is impure and has a high level of disorder.
- With more than two classes, entropy can exceed 1, which indicates a very high level of disorder; for simplicity, in the binary case we can think of entropy as lying between 0 and 1.
- If a dataset has two classes with an equal number of samples in each, the entropy is 1.
- If a dataset contains samples of only one class, the entropy is 0. (These cases are illustrated in the short R sketch below.)
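Here is a minimal sketch of the entropy formula in R (the entropy() helper name and the example class counts are just for illustration):
# Entropy of a set, given the number of samples in each class
entropy <- function(counts) {
  p <- counts / sum(counts)   # class probabilities
  p <- p[p > 0]               # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}
entropy(c(4, 4))      # two balanced classes   -> 1 (maximum disorder for two classes)
entropy(c(8, 0))      # a single class         -> 0 (a pure set)
entropy(c(3, 4))      # three vs four samples  -> ~0.985 (slightly less disordered than 50/50)
entropy(c(3, 3, 3))   # three balanced classes -> ~1.585 (entropy can exceed 1)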
Information Gain
Having measured the uncertainty in the dataset, we next need to measure how much each available feature reduces the uncertainty in the target class.
Information gain is used for measuring the amount of information a feature provides about a class and helps in determining the order of attributes.

Information gain, which is closely related to the Kullback-Leibler divergence, is denoted IG(S, A), where S denotes the dataset and A a particular attribute; it is the effective change in entropy after the dataset is split on that attribute. In other words, information gain measures the relative change in entropy with respect to the chosen independent variable. The equation for information gain is given by:
IG(S, A) = H(S) − H(S, A)
This equation can be further extended as:
IG(S, A) = H(S) − Σ P(x) × H(x)
In this equation, IG(S, A) is the information gain for the set S, H(S) is the entropy of the whole set, and the second term is the entropy remaining after feature A is applied: P(x) is the proportion of samples for which attribute A takes the value x, and H(x) is the entropy of that subset.
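A minimal sketch of this calculation in R follows (the info_gain() and entropy_from_labels() helper names and the toy vectors are made up for illustration, and the entropy helper is included so the snippet runs on its own):
# Entropy of a vector of class labels
entropy_from_labels <- function(labels) {
  p <- table(labels) / length(labels)   # observed class proportions
  p <- p[p > 0]                         # drop zero-count levels to avoid log2(0)
  -sum(p * log2(p))
}
# Information gain of splitting the target on a given attribute
info_gain <- function(target, feature) {
  h_s <- entropy_from_labels(target)    # H(S): entropy before the split
  weighted <- sum(sapply(split(target, feature), function(subset) {
    (length(subset) / length(target)) * entropy_from_labels(subset)   # P(x) * H(x)
  }))
  h_s - weighted                        # IG(S, A) = H(S) - sum of P(x) * H(x)
}
# Toy usage: a feature that separates the target perfectly removes all uncertainty
target  <- c("Yes", "Yes", "No", "No")
feature <- c("A", "A", "B", "B")
info_gain(target, feature)   # 1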
How to calculate Entropy and Information gain in Decision trees
Let us now take an example to understand how we can calculate entropy and information gain using the formula that we discussed earlier.
Given a dataset for which a decision tree needs to be drawn, the very first things to understand are how many attributes there are, what the target class is, and whether it is a binary or a multi-valued classification. The dataset is given here:
Work Type | Smoking status | Gender | Stroke |
---|---|---|---|
Private | never smoked | Male | Yes |
Self-employed | smokes | Male | No |
Self-employed | never smoked | Female | No |
Self-employed | smokes | Other | No |
Private | smokes | Female | Yes |
Other | smokes | Female | Yes |
Other | never smoked | Male | No |
These are the steps we are going to follow in order to build the decision tree:
- Calculate Entropy of the Target (Stroke)
- Calculate Entropy for each feature (Work Type, Smoking status, Gender)
- Calculate Information Gain
- Choose the feature with the highest Information Gain
Entropy of the Target Attribute (Stroke)
We calculate the entropy of the target variable (Stroke) before any splits:
Target distribution:
- Yes = 3
- No = 4
H(Stroke) = −(3/7) log2(3/7) − (4/7) log2(4/7) ≈ 0.9852
Entropy for Each Feature
Work Type:
- Private (2 samples): Yes = 2, No = 0 → entropy = 0
- Self-employed (3 samples): Yes = 0, No = 3 → entropy = 0
- Other (2 samples): Yes = 1, No = 1 → entropy = 1
- Weighted entropy = (2/7)(0) + (3/7)(0) + (2/7)(1) ≈ 0.2857, so IG(Work Type) ≈ 0.9852 − 0.2857 = 0.6995
Smoking status:
- never smoked (3 samples): Yes = 1, No = 2 → entropy ≈ 0.9183
- smokes (4 samples): Yes = 2, No = 2 → entropy = 1
- Weighted entropy = (3/7)(0.9183) + (4/7)(1) ≈ 0.9650, so IG(Smoking status) ≈ 0.9852 − 0.9650 = 0.0202
Gender:
- Male (3 samples): Yes = 1, No = 2 → entropy ≈ 0.9183
- Female (3 samples): Yes = 2, No = 1 → entropy ≈ 0.9183
- Other (1 sample): Yes = 0, No = 1 → entropy = 0
- Weighted entropy = (6/7)(0.9183) ≈ 0.7871, so IG(Gender) ≈ 0.9852 − 0.7871 = 0.1981
Best Feature to Split
Feature | Info Gain |
---|---|
Work Type | 0.6995 |
Smoking Status | 0.0202 |
Gender | 0.1981 |
➡️ The best feature to split on first is Work Type
because it gives the highest information gain. The next levels of the tree are then built by repeating the same calculation on each resulting subset.
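As a sanity check, the same numbers can be reproduced in R. The sketch below re-creates the toy table and uses the same illustrative helpers as before (redefined here so the snippet runs on its own):
# Re-create the toy Stroke dataset from the table above
stroke_data <- data.frame(
  WorkType = c("Private", "Self-employed", "Self-employed", "Self-employed",
               "Private", "Other", "Other"),
  Smoking  = c("never smoked", "smokes", "never smoked", "smokes",
               "smokes", "smokes", "never smoked"),
  Gender   = c("Male", "Male", "Female", "Other", "Female", "Female", "Male"),
  Stroke   = c("Yes", "No", "No", "No", "Yes", "Yes", "No")
)
entropy_from_labels <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}
info_gain <- function(target, feature) {
  entropy_from_labels(target) -
    sum(sapply(split(target, feature),
               function(s) length(s) / length(target) * entropy_from_labels(s)))
}
# Information gain of each candidate feature with respect to Stroke
sapply(stroke_data[, c("WorkType", "Smoking", "Gender")],
       function(col) round(info_gain(stroke_data$Stroke, col), 4))
# WorkType ~0.6995, Smoking ~0.0202, Gender ~0.1981 -> split on Work Type first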
Implementation of Decision tree classification in R
There are various ways to implement the decision tree algorithm. One of the ways to implement a decision tree is using the rpart library.
The dataset is taken from the Kaggle website. It contains information about patients that is used to predict the ten-year risk of coronary heart disease (the TenYearCHD column).
The link to the dataset is https://www.kaggle.com/code/captainozlem/framingham-chd-preprossing-data/data
Prerequisites to run this code
You have to install three packages from the CRAN repository.
- install.packages("rpart") – used to implement the decision tree
- install.packages("rpart.plot") – used to plot the decision tree graph
- install.packages("caTools") – used to split the dataset into training and test sets
Import the required libraries
library(rpart)
library(rpart.plot)
library(caTools)
Load the CSV dataset
data <- read.csv('CHD_preprocessed.csv')
Basic data exploration
# check the shape (rows and columns) of the data
dim(data)
# check the structure and columns of the dataset
str(data)
# summary statistics of the dataset
summary(data)
Splitting the dataset
set.seed(123)
#split the dataset into training and testing
split = sample.split(data$TenYearCHD, SplitRatio = 0.75)
training_set = subset(data, split == TRUE)
test_set = subset(data, split == FALSE)
Applying the decision tree algorithm
tree <- rpart(TenYearCHD ~ ., data = training_set, method = 'class')
rpart.plot(tree)

Evaluating the model on the test set
prediction <- predict(tree, newdata = test_set, type = 'class')
cm <- table(test_set$TenYearCHD, prediction)
cm
accuracy <- sum(diag(cm)) / sum(cm)
print(paste('Accuracy on test data is ', accuracy))
Output: the accuracy on the test data is about 86%.
Applications of Decision tree
Decision tree algorithms are suitable for both regression and classification purposes and therefore have many applications in the real world. Some of the applications of decision trees are:
- Decision trees are easy to understand and implement, which makes them one of the best starting points for a machine learning career; the fact that they require relatively little data preparation also makes them well suited to people who are new to data science.
- Decision trees can assess future growth opportunities as they can use historical data to evaluate the growth opportunities of several businesses.
- Decision trees can effectively use demographic data to find prospective clients, which helps streamline a marketing budget and supports informed decisions that are appropriate for the growth of the business.
- Decision trees can be used as an excellent support tool in various fields. For example, lenders can use them to predict the probability that a customer will repay a loan, by building a predictive model from historical data about the characteristics of previous customers.
Advantages and disadvantages of decision trees
Some advantages of implementing decision trees are:
- Decision trees are simple to visualize and can be interpreted and understood easily.
- Decision trees require less data preparation than many other machine learning algorithms; note, however, that many decision tree implementations do not handle missing values.
- The cost of making a prediction with a decision tree is logarithmic in the number of data points used to train it.
- Decision trees are able to handle data in both the numerical and categorical forms, with exceptions in some of the implementations.
- Decision trees can work with problems concerning multi-output.
- Decision trees use a white-box model: the conditions that lead to a prediction can be read directly from the tree.
- Decision trees can be validated using statistical tests, which makes it possible to account for the reliability of the model.
- Decision trees perform reasonably well even if their assumptions are somewhat violated by the true model from which the data were generated.
Some disadvantages of implementing decision trees are:
- Decision trees are prone to overfitting when the learner creates an over-complex tree that does not generalize well to new data. To avoid overfitting, mechanisms such as pruning, setting a maximum depth for the tree, or setting a minimum number of samples at a leaf node need to be applied (a short R sketch after this list shows these controls in rpart).
- Decision trees are not good at extrapolation: their predictions are piecewise constant, neither continuous nor smooth.
- Decision trees are unstable as a small variation in the data set might result in the generation of a completely different tree. This problem can be dealt with through the ensemble technique of decision trees.
- Decision trees can create biased trees if some classes dominate the dataset; it is therefore recommended to balance the dataset before fitting a decision tree.
- There are concepts, such as XOR, parity, and multiplexer problems, that decision trees cannot easily express, which makes them hard for decision trees to learn.
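As referenced in the overfitting point above, here is a hedged sketch of those controls in rpart. It assumes the tree, training_set, and TenYearCHD objects from the implementation section, and the specific parameter values are arbitrary examples rather than recommendations:
# Constrain tree growth up front with rpart.control (values are illustrative)
library(rpart)
shallow_tree <- rpart(TenYearCHD ~ ., data = training_set, method = 'class',
                      control = rpart.control(maxdepth = 4,    # maximum tree depth
                                              minbucket = 20,  # minimum samples per leaf
                                              cp = 0.01))      # complexity penalty per split
# Or grow the full tree first and prune it back using the cross-validated cp table
printcp(tree)                                                  # cp table of the earlier fit
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(tree, cp = best_cp)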
Conclusion
So in this blog, we have learned what a decision tree is and how to implement it in the R programming language. We have also seen how to calculate entropy and information gain through a worked example.