
End-to-End Decision Tree Modelling

  • EDA
  • Data Preprocessing
  • Modelling

Importing data

In this implementation, we are going to use the pumpkin seed classification data that is available at this link. The dataset contains morphological measurements of seeds, using which each seed is classified into one of two categories: Çerçevelik and Ürgüp Sivrisi. Let's start our implementation by importing some useful Python libraries and the data.

Importing libraries

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import sklearn

import plotly.express as px

import seaborn as sns

Importing data

data = pd.read_excel('/content/Pumpkin_Seeds_Dataset.xlsx')

print('few lines of data \n', data.head())

Output:

EDA

In this section, we will try to draw some insights from the data. To do so, let's first check the shape of the data.

print('shape of data \n', data.shape)


Output:

The data has 13 columns, of which one is the target variable and 12 are independent variables.
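As a quick check, here is a minimal sketch (reusing the data frame loaded above) that lists the column names and separates the target column Class from the feature columns:

# List all columns and separate the target column from the feature columns
columns = data.columns.tolist()
print('all columns:', columns)

target_col = 'Class'
feature_cols = [c for c in columns if c != target_col]
print('number of features:', len(feature_cols))

Now let's check the description of the data.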

print('description of data \n', data.describe())

Output:

Let’s check for the null values in the data.

print('Null values in data \n', data.isnull().sum())


Output:

There are no null values in the data, so we don't need to worry about null value handling.

# data.info() prints its summary directly, so it does not need to be wrapped in print()
print('datatypes in data:')

data.info()


Output:

Here we can see that all independent variables are either in integer or float format. Let's check the target variable distribution in the data.

data['Class'].value_counts().plot(kind='pie')

Output:

fig_size = (15,8)

plt.figure(figsize=fig_size)

sns.histplot(data=data, x='Area', hue='Class', multiple='dodge').set(title='Area Distribution')

plt.show()

Output:

Here is the distribution of the area of the pumpkin seeds. Looking at the above visualization, we can see that the peak of the area distribution is different for the two seed types, which indicates a basic difference between the areas of the two types of seeds.

The area of Çerçevelik seeds is higher than that of the Ürgüp Sivrisi seeds. Similar observations can be made for all the variables. Let's take a look:

Perimeter Distribution

fig_size = (15,8)

plt.figure(figsize=fig_size)

sns.histplot(data=data, x='Perimeter', hue='Class', multiple='dodge').set(title='Perimeter Distribution')

plt.show()

Output:

As expected, the results are the same as for the area distribution. Let's check the distributions of the major and minor axis lengths.

fig_size = (15,8)

plt.figure(figsize=fig_size)

plt.subplot(1, 2, 1)

sns.histplot(data=data, x='Major_Axis_Length', hue='Class', multiple='dodge').set(title='Major Axis Distribution')

plt.subplot(1, 2, 2)

sns.histplot(data=data, x='Minor_Axis_Length', hue='Class', multiple='dodge').set(title='Minor Axis Distribution')

plt.show()

Output:

Again the observation is the same for the lengths of both axes, where the Çerçevelik seeds' axis lengths are higher. Now we can draw the same plot for the convex areas.

fig_size = (15,8)

plt.figure(figsize=fig_size)

sns.histplot(data=data, x='Convex_Area', hue='Class', multiple='dodge').set(title='Convex Area Distribution')

plt.show()

Output:

Let's draw the distributions of eccentricity, solidity, extent and roundness.

fig_size = (15,8)

plt.figure(figsize=fig_size)

plt.subplot(2, 2, 1)

sns.histplot(data=data, x='Eccentricity', hue='Class', multiple='dodge').set(title='Eccentricity Distribution')

plt.subplot(2, 2, 2)

sns.histplot(data=data, x='Solidity', hue='Class', multiple='dodge').set(title='Solidity Distribution')

plt.subplot(2, 2, 3)

sns.histplot(data=data, x='Extent', hue='Class', multiple='dodge').set(title='Extent Distribution')

plt.subplot(2, 2, 4)

sns.histplot(data=data, x='Roundness', hue='Class', multiple='dodge').set(title='Roundness Distribution')

plt.tight_layout()

plt.show()


Output:

Now the observations change: the eccentricity and solidity distributions are higher for the Ürgüp Sivrisi seeds, while the remaining variables behave like the earlier ones.

fig_size = (15,8)

plt.figure(figsize=fig_size)

plt.subplot(1, 2, 1)

sns.histplot(data=data, x='Aspect_Ration', hue='Class', multiple='dodge').set(title='Aspect Ratio Distribution')

plt.subplot(1, 2, 2)

sns.histplot(data=data, x='Compactness', hue='Class', multiple='dodge').set(title='Compactness Distribution')

plt.show()

Output:

Here again, aspect ratio and compactness distributions are higher for the Çerçevelik seeds. Let’s move toward the correlation analysis.

With decision tree algorithms, we don't strictly need to perform correlation analysis, but it is still useful for understanding the data.

# use only numeric columns; the 'Class' column is still a string at this point
fig = px.imshow(data.corr(numeric_only=True))

fig.show()

Output:

The above plot represents how the continuous variables are correlated with each other, but the visualization is not very clear here. So let's drop the lower correlation values.

corr = data.corr(numeric_only=True).abs()

# keep only strong correlations (absolute value >= 0.5); weaker values become NaN
strong_corr = corr[corr >= .5]

fig = px.imshow(strong_corr)

fig.show()

Output:

Here we can easily see the highly correlated pairs of variables. For example, Area and Convex_Area are highly correlated.
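To put exact numbers behind the plot, here is a minimal sketch (reusing the data frame above) that prints, for each feature, its most correlated other feature:

# For each numeric column, find its most correlated partner column and the correlation value
corr = data.corr(numeric_only=True).abs()

for col in corr.columns:
    partner = corr[col].drop(col).idxmax()
    print('{:20s} <-> {:20s} : {:.3f}'.format(col, partner, corr.loc[col, partner]))

This is sufficient data analysis to understand the data, so we can move toward our next step.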

Data Preprocessing

Since the data is not complicated, we only need two sub-steps to complete this stage:

  • Label encoding
  • Data Splitting

Label Encoding: here we will label-encode our class variable using the LabelEncoder class of sklearn in the following way:

from sklearn.preprocessing import LabelEncoder

label_encode = LabelEncoder()

data['Class'] = label_encode.fit_transform(data['Class'])

data['Class']

Output:

Here we can see that class labels are converted into integer format.
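If we want to know which integer corresponds to which class name, the fitted encoder stores this mapping in its classes_ attribute; a quick sketch:

# The encoded label of each class is its index in label_encode.classes_
for code, name in enumerate(label_encode.classes_):
    print(code, '->', name)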

Data splitting: here we will split the data into two subsets for training and testing purposes. Using train_test_split, we will split the data in a 70:30 ratio.

from sklearn.model_selection import train_test_split

X = data.iloc[:,:-1]

y = data.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 42)

Here our dataset is split into two subsets.
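To verify the 70:30 split, we can simply print the shapes of the resulting subsets:

# Roughly 70% of the rows should be in the training set and 30% in the test set
print('train:', X_train.shape, y_train.shape)
print('test :', X_test.shape, y_test.shape)

Let's move toward the next step.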

Modelling

Training

Above, we made train and test sets of the data, and here we need to fit a decision tree model on the training data. This can be done using the below lines of code:

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=42)

clf.fit(X_train, y_train)

Let's plot the tree that the clf object has learnt.

from sklearn.tree import plot_tree

plt.figure(figsize =(40,20))

plot_tree(clf, feature_names=X_train.columns, max_depth=3, filled=True)


Output:

We need to zoom in on this plot to understand how the dataset is split at the root node and at the other nodes. The model we used is a simple decision tree that has taken the aspect ratio as its root node.
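Besides reading the plot, we can also ask the fitted tree which features it relied on most, using scikit-learn's feature_importances_ attribute; a small sketch reusing the clf object fitted above:

# Impurity-based importance of each feature in the fitted tree
importances = pd.Series(clf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))

Now let's check the performance of the model.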

pred = clf.predict(X_test)

from sklearn.metrics import classification_report, f1_score, precision_score, recall_score, confusion_matrix

# Metrics

report = classification_report(y_test, pred)

cm = confusion_matrix(y_test, pred)

rfc_f1 = f1_score(y_test, pred)

rfc_Precision = precision_score(y_test, pred)

rfc_Recall = recall_score(y_test, pred)

# Show

print(report)

print('Confusion Matrix : \n{}\n'.format(cm))

print('F1 Score : {:.5}%\n'.format(rfc_f1*100))

print('Precision Score : {:.5}%\n'.format(rfc_Precision*100))

print('Recall Score : {:.5}%'.format(rfc_Recall*100))


Output:

Here we can see that, without any modification to the base model, the model has performed well, which we can understand by looking at the accuracy, confusion matrix and F1 score.
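The classification report above already includes accuracy, but if we want it as a standalone number, scikit-learn's accuracy_score gives it directly (a small sketch reusing the predictions above):

from sklearn.metrics import accuracy_score

# Fraction of test seeds classified correctly by the base model
base_accuracy = accuracy_score(y_test, pred)
print('Accuracy : {:.5}%'.format(base_accuracy * 100))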

Let's try to improve the model's performance using a grid search approach, where we provide a set of candidate parameter values. This approach will try all possible combinations and tell us the best-fitting set of parameters for the model.

params = {

'max_depth': [2, 3, 5, 10, 20],

'min_samples_leaf': [5, 10, 20, 50, 100],

'criterion': ['gini', 'entropy']

}

from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(estimator=clf,

param_grid=params,

cv=4, n_jobs=-1, verbose=1, scoring="accuracy")

grid_search.fit(X_train, y_train)

Output:

Here, we can see that we have performed 200 fits: 5 x 5 x 2 = 50 parameter combinations, each evaluated with 4-fold cross-validation. Let's take a look at the scores of all the models.

score_df = pd.DataFrame(grid_search.cv_results_)

print(score_df.head())

Output:
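If we want to see the best-performing combinations at a glance, we can sort the cross-validation results by their rank; a minimal sketch reusing the score_df frame above:

# Show the five best parameter combinations by mean cross-validated accuracy
top = score_df.sort_values('rank_test_score').head()
print(top[['params', 'mean_test_score', 'rank_test_score']])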

Here we can see some results of the grid search approach. Using this data, we could select any of these models, but here we will use only the best-fitting set of parameters to fit the model. Using the code below, we can do so.

grid_search.best_estimator_

Output:

 

Here, we get the values of the best fit model according to the grid search approach. Now let’s make a model using these parameters.

best_clf = DecisionTreeClassifier(criterion='entropy', max_depth=5, min_samples_leaf=5,

random_state=42)

best_clf.fit(X_train, y_train)

plt.figure(figsize =(40,20))

plot_tree(best_clf, feature_names=X_train.columns, max_depth=3, filled=True)

Output:

After zooming in on the plot, we can see that this time the decision tree has taken the compactness of the seed as its root node. Let's check how much the performance of the model has improved this time.

pred = best_clf.predict(X_test)

# Metrics

report = classification_report(y_test, pred)

cm = confusion_matrix(y_test, pred)

rfc_f1 = f1_score(y_test, pred)

rfc_Precision = precision_score(y_test, pred)

rfc_Recall = recall_score(y_test, pred)

# Show

print(report)

print('Confusion Matrix : \n{}'.format(cm))

print('F1 Score : {:.5}%\n'.format(rfc_f1*100))

print('Precision Score : {:.5}%\n'.format(rfc_Precision*100))

print('Recall Score : {:.5}%'.format(rfc_Recall*100))

Output:

Here we can clearly see that our model's accuracy and F1 score have increased by about 4%, and the confusion matrix shows more correct predictions.
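To compare the two trees directly, here is a minimal sketch that recomputes accuracy and F1 score for both the base model and the tuned model on the same test set (reusing the clf and best_clf objects fitted above):

from sklearn.metrics import accuracy_score, f1_score

# Side-by-side comparison of the base tree and the tuned tree
for name, model in [('base tree', clf), ('tuned tree', best_clf)]:
    p = model.predict(X_test)
    print('{:10s} accuracy = {:.4f}, f1 = {:.4f}'.format(name, accuracy_score(y_test, p), f1_score(y_test, p)))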

Final word

In this article, we looked at an example of working with a decision tree model to see how we can classify pumpkin seeds into two categories using different measurements of the seeds. Along the way, we performed an EDA step to understand the data. Decision tree algorithms are among the most used algorithms in real-life use cases and are also the base model for many higher-level models such as random forest and XGBoost, so understanding them helps in understanding many higher-level models.

You can find our other articles at this link, where we talk about different algorithms, trends and uses of data science and artificial intelligence in real life. The code, data and reference article links are given below for reference.

References

About DSW

Data Science Wizards (DSW) is an Artificial Intelligence and Data Science start-up that primarily offers platforms, solutions and services for using data strategically, through AI and data analytics solutions and consulting services that help enterprises make data-driven decisions.

DSW’s flagship platform UnifyAI is an end-to-end AI-enabled platform for enterprise customers to build, deploy, manage, and publish their AI models. UnifyAI helps you to build your business use case by leveraging AI capabilities and improving analytics outcomes.

Connect us at contact@datasciencewizards.ai and visit us at www.datasciencewizards.ai