
Implementing a decision tree using Python and R

Table of Contents

Implementation of Decision Tree using the Python Programming Language

  1. Data splitting
  2. Importing and Fitting the Decision Tree Model
  3. Model Evaluation

Implementation of Decision Tree using the R Programming Language

  1. Data splitting
  2. Importing and Fitting the Decision Tree Model
  3. Model Evaluation

Implementation of a decision tree using the Python programming language

For this walkthrough we will use the scikit-learn (sklearn) Python library, which provides both the decision tree model and the iris dataset that we will fit it on.

from sklearn import datasets

data = datasets.load_iris()

X = data.data

y = data.target

print('independent variables name \n', data.feature_names)

print('shape of independent variables \n', X.shape)

print('class names in target variables \n', data.target_names)

print('shape of target variables \n', y.shape)
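Before splitting, it can help to confirm how the classes are distributed. The following is a minimal sketch, assuming NumPy is available alongside sklearn; the variable name class_counts is introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets

data = datasets.load_iris()
X, y = data.data, data.target

# Count how many samples carry each integer class label (0, 1, 2)
labels, counts = np.unique(y, return_counts=True)

# Map the integer labels back to their class names for readability
class_counts = {
    str(data.target_names[label]): int(count)
    for label, count in zip(labels, counts)
}
print(class_counts)
```

For iris this shows 50 samples in each of the three classes, so the data is balanced.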

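The data has 150 rows, four independent variables (sepal length, sepal width, petal length and petal width, all in cm) and three target classes (setosa, versicolor and virginica).

With the data loaded, the remaining steps are:

  • Data splitting
  • Importing and fitting the decision tree model
  • Model evaluation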

Data splitting

This step creates two sets of data: a train set and a test set. We will use the train set to fit the decision tree model and the test set to evaluate it. Let’s split the data.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

Let’s check the shapes of the split sets.

print("shape of train data", X_train.shape, y_train.shape)

print("shape of test data", X_test.shape, y_test.shape)
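train_test_split also accepts test_size and stratify arguments. As a sketch of one possible variation (not the split used in the rest of this article), the snippet below holds out 20% of the rows and keeps the class proportions the same in both sets. Counter is used here only to display the class counts, and the names train_counts and test_counts are illustrative.

```python
from collections import Counter

from sklearn import datasets
from sklearn.model_selection import train_test_split

data = datasets.load_iris()
X, y = data.data, data.target

# Hold out 20% of the rows; stratify=y keeps each class equally represented
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Count the samples per class in each set
train_counts = Counter(y_train.tolist())
test_counts = Counter(y_test.tolist())
print("train:", X_train.shape, dict(train_counts))
print("test:", X_test.shape, dict(test_counts))
```

Because iris has 50 samples per class, a stratified 80/20 split leaves 40 samples of each class for training and 10 for testing.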

With 150 samples and the default test_size of 0.25, the train set has 112 rows and the test set has 38.

Importing and Fitting the Decision Tree Model

This step shows how to fit the decision tree model to the data. scikit-learn estimators accept array-like inputs such as NumPy arrays, and the iris data loaded from sklearn is already a NumPy array, so no conversion is needed and we can fit the model directly on the split data. Let’s import and train the model, then plot the fitted tree.

from sklearn import tree

clf = tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 10))

tree.plot_tree(clf, feature_names= data.feature_names)

plt.show()
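The fitted tree can also be inspected without a plot. The sketch below, which repeats the split and fit from above so that it runs on its own, uses sklearn's export_text to print the learned rules and then reports the size of the tree. The variable name rules is illustrative.

```python
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split

data = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)
clf = tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Render the decision rules as indented plain text, one line per node
rules = tree.export_text(clf, feature_names=list(data.feature_names))
print(rules)

# Summarise how large the tree has grown
print("depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())
```

A text dump like this is convenient for logs or for comparing trees fitted with different settings.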

We can now use the trained model to predict the classes of the test set.

prediction = clf.predict(X_test)

Model Evaluation

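This section uses the accuracy score, the F1 score and the confusion matrix to evaluate the model. In brief:

  • Accuracy score: the fraction of test samples whose predicted class matches the true class.
  • F1 score: the harmonic mean of precision and recall. For a multi-class problem it is averaged over the classes, and the average parameter selects how.
  • Confusion matrix: a table of counts in which, for sklearn, each row is a true class and each column is a predicted class, so correct predictions lie on the diagonal.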

Let’s calculate the above-defined scores and matrix.

from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

print('confusion matrix \n', confusion_matrix(y_test, prediction))

print('accuracy score of our model \n', accuracy_score(y_test, prediction))

print('f1 score of our model \n', f1_score(y_test, prediction, average='micro'))
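To make the link between these metrics concrete, the sketch below recomputes accuracy from the confusion matrix and compares it with the micro-averaged F1 score. For a single-label multi-class problem such as this one, micro-averaged F1 equals accuracy; the macro average is shown for contrast. The names manual_accuracy, micro_f1 and macro_f1 are illustrative.

```python
import numpy as np
from sklearn import datasets, tree
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

data = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)
clf = tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
prediction = clf.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, prediction)

# Correct predictions sit on the diagonal, so accuracy = trace / total
manual_accuracy = np.trace(cm) / cm.sum()

# Micro averaging pools all classes; macro averages the per-class F1 scores
micro_f1 = f1_score(y_test, prediction, average="micro")
macro_f1 = f1_score(y_test, prediction, average="macro")

print("accuracy from confusion matrix:", manual_accuracy)
print("micro F1:", micro_f1)
print("macro F1:", macro_f1)
```

If per-class performance matters more than the overall hit rate, the macro average (or the per-class scores from average=None) is usually the more informative choice.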


Implementation of a Decision Tree using the R programming language

To work with the same data in R, we can use the datasets package. The code below loads the iris data and shows its first rows.

library(datasets)

data(iris)

head(iris)


Data splitting

To complete this step, we will use the caTools library.

library(caTools)

# Split on the label vector so that rows (not columns) are sampled, stratified by species

sample_data <- sample.split(iris$Species, SplitRatio = 0.8)

train_data <- subset(iris, sample_data == TRUE)

test_data <- subset(iris, sample_data == FALSE)

Importing and Fitting the Decision Tree Model

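To complete this step, we will use the rpart package, which fits decision trees to tabular data. The code below loads the package and trains the model. Here cp = 0 turns off complexity-based pruning of splits, so the tree grows as far as rpart's other stopping rules (for example, minsplit) allow, and split = "information" selects information gain as the splitting criterion.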

library(rpart)

clf <- rpart(formula = Species ~ ., data = train_data,
             method = "class",
             control = rpart.control(cp = 0),
             parms = list(split = "information"))
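Information gain measures how much a split reduces the entropy of the class labels. As a rough illustration of that quantity (written in Python with NumPy, and not a reproduction of rpart's internals), the sketch below scores one candidate split of iris. The helper functions entropy and information_gain and the 2.45 cm threshold are assumptions made for this example.

```python
import numpy as np
from sklearn import datasets


def entropy(labels):
    """Shannon entropy, in bits, of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def information_gain(feature, labels, threshold):
    """Entropy reduction from splitting at feature <= threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    # Weight each child's entropy by its share of the samples
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


data = datasets.load_iris()
petal_length = data.data[:, 2]  # third column is petal length (cm)

gain = information_gain(petal_length, data.target, threshold=2.45)
print("parent entropy:", round(entropy(data.target), 3))
print("information gain:", round(gain, 3))
```

Splitting at a petal length of 2.45 cm isolates every setosa sample, so the parent entropy of log2(3) ≈ 1.585 bits drops to a weighted 0.667 bits, a gain of about 0.918 bits.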

Let’s check the model by plotting it.

library(rpart.plot)

prp(clf, extra = 1, faclen = 0, nn = TRUE,
    box.col = c("green", "red"))



We can also use the caret package to check how important each feature (variable) is to the fitted model.

library(caret)

importances <- varImp(clf)

importances
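For comparison, the scikit-learn tree from the Python half of this article exposes a similar measure through its feature_importances_ attribute (impurity-based importances that sum to 1). This sketch refits that model so it runs on its own; the scores are on a different scale from caret's varImp and are not expected to match it numerically.

```python
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split

data = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)
clf = tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pair each feature name with its impurity-based importance score
importances = dict(zip(data.feature_names, clf.feature_importances_))

# Print the features from most to least important
for name, score in sorted(importances.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```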

Now we can predict the species of the test data.

prediction <- predict(clf, newdata = test_data, type = "class")

prediction


Model Evaluation

With a single line of code, we can evaluate our model against various metrics.

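# caret expects the predictions first (data) and the true labels second (reference)

confusionMatrix(data = prediction, reference = test_data$Species)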


Final words

The decision tree is an excellent introduction to the tree-based model family. It is also the building block of ensembles such as random forest and gradient boosting, and it serves as a common baseline against which those models are compared.
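As a sketch of what using the decision tree as a baseline might look like, the snippet below compares it with sklearn's RandomForestClassifier and GradientBoostingClassifier using 5-fold cross-validation on iris. The model settings are illustrative defaults, not tuned choices, and the names models and mean_scores are introduced only for this example.

```python
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)

# A single tree as the baseline, plus two tree-based ensembles
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each model
mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: {mean_scores[name]:.3f}")
```

On a dataset as small and clean as iris the three scores tend to be close, so the benefit of ensembles is usually clearer on larger, noisier data.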

