End-to-End Random Forest Modelling

Table of Contents

  • The Dataset
  • Exploratory Data Analysis
  • Data processing
  • Data Modelling
  • Model Evaluation
  • To-Do List

The Dataset

To explore the subject in depth, we work with the health insurance cross-sell prediction data, which we can find here. The data records whether each of more than 3.5 lakh (350,000) customers accepted a vehicle insurance offer. Alongside this response, we get demographic and policy information about each customer (gender, age, region, vehicle age, annual premium, etc.).

import pandas as pd

import numpy as np

train_data = pd.read_csv('/content/drive/MyDrive/articles/12-2022/17-12-2022 to 24-12-2022/train.csv')

train_data.head()

Output:

Exploratory Data Analysis

This step gives us insights into the vehicle insurance data, so let's start by looking at the information the data contains.

train_data.info()

Output:

train_data['Response'].value_counts().plot(kind='bar')

Output:
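The bar chart shows the class balance visually; the same information can be read as proportions. The following is a minimal sketch on a made-up `Response` column (the values are illustrative, not taken from the dataset), using the `normalize=True` option of `value_counts`.

```python
import pandas as pd

# Toy stand-in for train_data['Response']; the values are made up for illustration.
response = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 0, 1], name="Response")

# normalize=True returns each class's share of rows instead of raw counts.
class_share = response.value_counts(normalize=True)
print(class_share)

# Share of positive responses (class 1).
positive_share = class_share.loc[1]
print(f"Positive share: {positive_share:.0%}")
```

On the real data, the same call would be `train_data['Response'].value_counts(normalize=True)`.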

train_data[['Gender', 'Response']].value_counts().plot(kind='bar', stacked=True)

train_data['Age'].describe()

# Bin ages into decades: category k covers ages in (10*k, 10*(k+1)]
bins = np.arange(1, 10) * 10

train_data['category'] = np.digitize(train_data.Age, bins, right=True)

counts = train_data.groupby(['category', 'Response']).Age.count().unstack()

print(counts)

counts.plot(kind='bar', stacked=True)
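To make the age binning above easier to follow, here is a small sketch of how `np.digitize` assigns categories with the same bin edges. The ages in the array are illustrative and not taken from the data.

```python
import numpy as np

bins = np.arange(1, 10) * 10            # [10, 20, ..., 90]
ages = np.array([20, 21, 30, 45, 85])   # illustrative ages, not from the dataset

# With right=True, an age x gets index i such that bins[i-1] < x <= bins[i],
# so category k covers ages in (10*k, 10*(k+1)].
categories = np.digitize(ages, bins, right=True)
print(categories)  # [1 2 2 4 8]
```

So an age of exactly 20 falls in category 1, while 21 to 30 fall in category 2.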

train_data[['Driving_License', 'Response']].value_counts().plot(kind='bar')

Output:

Here we can see that only a few records belong to customers with no driving license, and those customers also responded no, which is to be expected.

Response with Region

counts = train_data.groupby(['Region_Code', 'Response']).Gender.count().unstack()

counts.plot(kind='bar', stacked=True, figsize=(35, 10))

Output:

counts = train_data.groupby(['Previously_Insured', 'Response']).Gender.count().unstack()

print(counts)

counts.plot(kind='bar', stacked=True)

Output:

counts = train_data.groupby(['Vehicle_Age', 'Response']).Gender.count().unstack()

print(counts)

counts.plot(kind='bar', stacked=True)

counts = train_data.groupby(['Vehicle_Damage', 'Response']).Gender.count().unstack()

print(counts)

counts.plot(kind='bar', stacked=True)

Output:

train_data['Annual_Premium'].describe()

train_data['Annual_Premium'].plot(kind='kde')


train_data['Vintage'].describe()

train_data['Vintage'].plot(kind='kde')

Output:

Data processing

To model the data, we are going to use the scikit-learn library, whose estimators expect numerical inputs. Our data contains several string-valued columns, so we first convert them into numbers using label encoding.

train_data['Gender'] = train_data['Gender'].replace({'Male': 1, 'Female': 0})

train_data['Vehicle_Age'] = train_data['Vehicle_Age'].replace({'< 1 Year': 0, '1-2 Year': 1, '> 2 Years': 2})

train_data['Vehicle_Damage'] = train_data['Vehicle_Damage'].replace({'Yes': 1, 'No': 0})

train_data.head()

Output:
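One thing to watch with dictionary-based label encoding is that `replace` leaves any label missing from the dictionary untouched, so a typo in a key can go unnoticed. The sketch below shows one possible way to guard against that using `Series.map`, which returns NaN for unmapped labels. The toy frame and the `encodings` dictionary are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Toy frame with the same string columns as the article's data; values are illustrative.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Vehicle_Age": ["< 1 Year", "1-2 Year", "> 2 Years"],
    "Vehicle_Damage": ["Yes", "No", "Yes"],
})

# Hypothetical helper dictionary collecting the article's mappings in one place.
encodings = {
    "Gender": {"Male": 1, "Female": 0},
    "Vehicle_Age": {"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2},
    "Vehicle_Damage": {"Yes": 1, "No": 0},
}

for column, mapping in encodings.items():
    encoded = df[column].map(mapping)
    # map() yields NaN for any label missing from the mapping, so a typo in a
    # key is caught here instead of silently leaving strings in the column.
    assert encoded.notna().all(), f"unmapped labels in {column}"
    df[column] = encoded.astype(int)

print(df)
```

If a column still contains strings after encoding, scikit-learn will raise an error at fit time, so checking early makes the cause easier to find.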

from sklearn.model_selection import train_test_split

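# Select the target by name: 'category' (added during EDA) is now the last column,
# so positional slicing would pick the wrong target and leak 'Response' into X.
X = train_data.drop(columns=['Response', 'category'])

y = train_data['Response']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=4)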
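Because positive responses are rare in this data, a plain random split can leave the two parts with slightly different class proportions. The sketch below shows the optional `stratify` argument of `train_test_split` on synthetic data; the article's split does not use it, so treat this as an alternative to consider rather than what was run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target with 10% positives. Illustrative only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)

# stratify=y_demo keeps the positive rate (almost) identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=4, stratify=y_demo
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```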

Data Modelling

Using the lines of code below, we can train a random forest model on our processed data.

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()

model.fit(X_train,y_train)

Let's make predictions with the model and plot their distribution to get a first look at whether the model is behaving sensibly.

y_pred = model.predict(X_test)

y_prediction = pd.DataFrame(y_pred, columns=['predictions'])

y_prediction['predictions'].value_counts().plot(kind='bar')

Output:

Model Evaluation

Above, we modelled the data using the random forest algorithm. Now we need to evaluate the model to assess its reliability and performance. Using the lines of code below, we can measure the performance of our model.

from sklearn.metrics import mean_absolute_error, mean_squared_error, confusion_matrix, r2_score, accuracy_score, classification_report

print("Classification Report:\n", classification_report(y_test, y_pred))

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# model.score is computed on the held-out test split here, so this is a test score
print("Test Score:\n", model.score(X_test, y_test) * 100)

print("Mean Squared Error:\n", mean_squared_error(y_test, y_pred))

print("R2 score is:\n", r2_score(y_test, y_pred))

print('model parameters \n', model.get_params())

print('model accuracy \n', accuracy_score(y_test, y_pred) * 100)

Output:
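When reading these metrics on imbalanced data, it helps to know what a trivial model would score. The sketch below uses made-up labels (90 negatives, 10 positives) and a degenerate predictor that always answers no, to show how accuracy can look high while recall for the positive class is zero. It also illustrates that, for 0/1 labels, the mean squared error is simply one minus the accuracy.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    mean_squared_error,
    recall_score,
)

# Illustrative labels: 90 negatives and 10 positives.
y_true = np.array([0] * 90 + [1] * 10)
# A degenerate "model" that always predicts the majority class.
y_always_no = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_no)
rec = recall_score(y_true, y_always_no)
bal = balanced_accuracy_score(y_true, y_always_no)
mse = mean_squared_error(y_true, y_always_no)

print("accuracy:", acc)            # 0.9
print("recall (class 1):", rec)    # 0.0
print("balanced accuracy:", bal)   # 0.5
print("mean squared error:", mse)  # equals 1 - accuracy for 0/1 labels
```

For this reason, the per-class rows of the classification report and the confusion matrix are usually more informative here than the overall accuracy alone.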

To-Do List

In this procedure, we have performed every basic step of a data modelling workflow. Below are the more advanced steps we will perform to improve its results:

  • SMOTE: in the data visualisation part, we saw that there are far fewer records with a positive response than with a negative one, which can bias the model towards the majority class. In the next article, we will see whether oversampling the minority class with SMOTE improves performance.
  • Cross-validation: the results from the modelling above look good enough, but they come from a single train/test split. Cross-validation evaluates the model on several splits, which makes the estimated score more reliable.
  • GridSearchCV: a method for searching over combinations of hyperparameter values to find the best-performing model. Random forest has many hyperparameters that can be tuned this way.
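As a preview of the last two items, here is a minimal sketch of cross-validation and grid search with scikit-learn. It runs on synthetic data from `make_classification` rather than the insurance data, and the grid values, fold count, and F1 scoring are arbitrary choices for illustration, not tuned recommendations. SMOTE is not shown because it lives in the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for the insurance features (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.88, 0.12], random_state=0
)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Cross-validation: one score per fold instead of a single train/test split.
base = RandomForestClassifier(n_estimators=50, random_state=0)
fold_scores = cross_val_score(base, X_demo, y_demo, cv=cv, scoring="f1")
print("F1 per fold:", fold_scores)

# Grid search: fits every combination in param_grid using the same folds.
param_grid = {"max_depth": [4, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=cv,
    scoring="f1",
)
search.fit(X_demo, y_demo)
print("Best parameters:", search.best_params_)
print("Best mean F1:", search.best_score_)
```

The spread of the per-fold scores gives a rough sense of how much a single-split result could vary.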
