In an earlier article, we discussed the basic concepts behind decision trees, including their definition and the core terminology used with the algorithm.
Having covered the theory, we now turn to using the model in practice. This article extends the earlier one and walks through implementing a decision tree in the Python and R programming languages. It covers the following topics:
Table of Contents
Implementation of Decision Tree using the Python Programming Language
- Data splitting
- Importing and Fitting the Decision Tree Model
- Model Evaluation
Implementation of Decision Tree using the R Programming Language
- Data splitting
- Importing and Fitting the Decision Tree Model
- Model Evaluation
Implementation of a Decision Tree using the Python programming language
For this, we will use the sklearn Python library, which provides both the decision tree model and the iris data.
The iris data has four continuous variables: sepal length, sepal width, petal length, and petal width of iris flowers. Based on these features, the flowers are separated into three categories: Iris Setosa, Iris Versicolour, and Iris Virginica. Let’s import the data set.
from sklearn import datasets
data = datasets.load_iris()
X = data.data
y = data.target
print('independent variables name \n', data.feature_names)
print('shape of independent variables \n', X.shape)
print('class names in target variables \n', data.target_names)
print('shape of target variables \n', y.shape)
Output:
In the data, we get 150 data points and four variables as discussed above.
Now to model this data using a decision tree, we will use the following steps:
- Data splitting
- Importing and fitting the decision tree model
- Model evaluation
Let’s start with data splitting.
Data splitting
This step makes two sets of data (train and test). Using the train set, we will train a decision tree model, and using the test set, we will evaluate the trained model. Let’s split the data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
Let’s check the shape of the split sets.
print("shape of train data", X_train.shape, y_train.shape)
print("shape of test data", X_test.shape, y_test.shape)
Output:
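As a minimal sketch of the same splitting step, the snippet below writes out the 75/25 ratio that train_test_split uses by default and additionally passes stratify=y. The stratify argument is not used in the article's code; it is added here as an assumption to show how class proportions can be kept similar in both sets.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the iris features and labels as NumPy arrays
X, y = datasets.load_iris(return_X_y=True)

# test_size=0.25 matches the default; stratify=y is an optional addition
# that keeps the three classes balanced across the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print("train shape:", X_train.shape, "test shape:", X_test.shape)
print("train class counts:", np.bincount(y_train))
print("test class counts:", np.bincount(y_test))
```

With 150 samples, this gives 112 rows for training and 38 for testing, the same sizes as the unstratified split above.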
Importing and Fitting the Decision Tree Model
This step shows how to fit the decision tree model on the data. Note that sklearn models accept data as NumPy arrays, and the data sets loaded from sklearn already come as NumPy arrays, so no transformation is required. We can fit the model directly on the split data. Let’s import and train the model.
from sklearn import tree
clf = tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
The above code has called and trained the model using the train data. We can plot this tree to see how the splits worked with the iris data.
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 10))
tree.plot_tree(clf, feature_names= data.feature_names)
plt.show()
Output:
Here we can see that at the root node of the decision tree, if the petal width is less than or equal to 0.8 cm, the flower is assigned to a single class, and 37 samples in the train data fall on this branch. If the petal width is larger than 0.8 cm, the flower belongs to one of the other classes, which the deeper splits separate.
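When a plot is hard to read, the same splits can also be inspected as text. The sketch below refits the tree with the settings used above and prints its rules with sklearn's export_text function; the variable name rules is only illustrative.

```python
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split

# Rebuild the same split and model as in the steps above
data = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)
clf = tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# export_text returns the learned splits as an indented, rule-like string
rules = tree.export_text(clf, feature_names=list(data.feature_names))
print(rules)
```

Each indented line is one split threshold, and each leaf line ends with the class index predicted there. plot_tree also accepts a class_names argument if class labels are wanted in the figure.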
Let’s make predictions using the test data.
prediction = clf.predict(X_test)
Here in the prediction variable, we have values predicted by the model for the test data. Now, we can evaluate our model using the prediction set against the true values.
Model Evaluation
This section uses the accuracy score, F1 score, and confusion matrix to evaluate the model. First, their definitions:
Accuracy score: the fraction of predictions that match the true labels.
F1 score: the harmonic mean of precision and recall. Precision is the fraction of samples predicted as positive that actually belong to the positive class, and recall is the fraction of all actual positive samples that the model predicted as positive. Mathematically,
F1 = 2 * (precision * recall) / (precision + recall)
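Confusion matrix: a table of prediction counts in which, in sklearn, each row corresponds to a true class and each column to a predicted class. The diagonal entries are correct predictions, and the off-diagonal entries are errors.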
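To make these definitions concrete, here is a small worked sketch on made-up labels (y_true and y_pred are illustrative values, not iris results). It builds the confusion matrix and then derives per-class precision, recall, and F1 from it using the formula above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Toy labels for three classes (hypothetical values for illustration)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

tp = np.diag(cm)                 # correct predictions per class
precision = tp / cm.sum(axis=0)  # column sums: samples predicted as each class
recall = tp / cm.sum(axis=1)     # row sums: samples truly in each class
f1 = 2 * (precision * recall) / (precision + recall)
accuracy = tp.sum() / cm.sum()

print("precision per class:", precision)
print("recall per class:", recall)
print("F1 per class:", f1)
print("accuracy:", accuracy)
```

The per-class F1 values computed by hand agree with f1_score(y_true, y_pred, average=None).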
Let’s calculate the above-defined scores and matrix.
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
print('confusion matrix \n', confusion_matrix(y_test, prediction))
print('accuracy score of our model \n', accuracy_score(y_test, prediction))
print('f1 score of our model \n', f1_score(y_test, prediction, average='micro'))
Output:
Here we can see that the model has predicted only one value wrong and has achieved about 97% accuracy, with the same F1 score, since the micro-averaged F1 equals accuracy for single-label multiclass problems. This completes the Python implementation. In the next section, we will perform the same operations using the R programming language.
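The choice of average in f1_score affects what the number means. The sketch below refits the same model and compares accuracy with the micro-averaged and macro-averaged F1 scores; the macro average is not part of the article's evaluation and is shown only for contrast.

```python
from sklearn import datasets, tree
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Same data, split, and model as in the steps above
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)

acc = accuracy_score(y_test, pred)
f1_micro = f1_score(y_test, pred, average="micro")  # pools all classes together
f1_macro = f1_score(y_test, pred, average="macro")  # unweighted mean of per-class F1

print("accuracy:", acc)
print("micro F1:", f1_micro)
print("macro F1:", f1_macro)
```

Micro averaging counts every sample equally, which is why it coincides with accuracy here, while macro averaging gives each class equal weight regardless of its size.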
Implementation of a Decision Tree using the R programming language
To work with the same data in the R programming language, we can use the datasets library. Using the code below, we can get the iris data.
library(datasets)
data(iris)
head(iris)
Output:
Here we can see exactly what our data looks like. Now we will follow the same steps as we did with the Python programming language.
Data splitting
To complete this step, we will use the caTools library.
library(caTools)
# sample.split expects the vector of labels, so we pass the Species column
sample_data <- sample.split(iris$Species, SplitRatio = 0.8)
train_data <- subset(iris, sample_data == TRUE)
test_data <- subset(iris, sample_data == FALSE)
Here we have split the data in an 80/20 ratio, where 80% of the data is for training and 20% is for testing the model.
Importing and Fitting the Decision Tree Model
To complete this step, we will use the rpart library, which allows us to fit a decision tree to the data. Using the code below, we can call and train the model.
library(rpart)
clf <- rpart(formula = Species ~ ., data = train_data,
             method = "class",
             control = rpart.control(cp = 0),
             parms = list(split = "information"))
Let’s check the model by plotting it.
library(rpart.plot)
prp(clf, extra = 1, faclen = 0, nn = TRUE,
    box.col = c("green", "red"))
Output:
We can also use the caret library to check how important each feature of our data is to the model.
library(caret)
importances <- varImp(clf)
importances
Output:
Here we can see that the petal width is the most important variable in the training of the decision tree model.
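For comparison with the Python section, the sklearn tree exposes a similar summary through its feature_importances_ attribute. The sketch below refits the Python model from earlier and ranks the features; the scores are computed differently from caret's varImp, so the numbers are not expected to match the R output.

```python
import numpy as np
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split

# Same data, split, and model as in the Python section
data = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)
clf = tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Impurity-based importances, normalised so that they sum to 1
importances = clf.feature_importances_
order = np.argsort(importances)[::-1]  # indices from most to least important
for idx in order:
    print(data.feature_names[idx], round(float(importances[idx]), 3))
```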
Let’s make predictions from the model.
prediction <- predict(clf, newdata = test_data, type = "class")
prediction
Output:
This is how our model has predicted on the test data.
Model Evaluation
Using only one line of code, we can evaluate our model against various metrics. The confusionMatrix function from caret takes the predictions first and the true labels second.
confusionMatrix(prediction, test_data$Species)
Output:
Here we get most of the statistics that can be used to evaluate the model. We can also see that the model has predicted only one value wrong and that its accuracy is around 97%.
Final words
The decision tree is an excellent introduction to the family of tree-based models. It is also the building block of ensemble methods such as random forest and gradient boosting.
This article has looked at how we can implement a decision tree model using the Python and R programming languages, and at how we can plot and evaluate the model. In upcoming articles, we will cover more such models and concepts of machine learning and data science. To get all information, you can keep yourself connected to this link.
About DSW
Data Science Wizards (DSW) is an Artificial Intelligence and Data Science start-up that primarily offers platforms, solutions, and services for making use of data as a strategy through AI and data analytics solutions and consulting services to help enterprises in data-driven decisions.
DSW’s flagship platform UnifyAI is an end-to-end AI-enabled platform for enterprise customers to build, deploy, manage, and publish their AI models. UnifyAI helps you to build your business use case by leveraging AI capabilities and improving analytics outcomes.
Connect with us at contact@datasciencewizards.ai and visit us at www.datasciencewizards.ai