In an earlier article, we discussed the basics of random forests: how they work by ensembling many decision trees, their important features and hyperparameters, and their pros and cons. This article shows how a random forest algorithm works with a real-life dataset. We will cover the following subtopics:
Table of Contents
- The Dataset
- Exploratory Data Analysis
- Data processing
- Data Modelling
- Model Evaluation
- To-Do List
Let’s start with understanding the data.
The Dataset
To look deeper into the subject, we work with the health insurance cross-sell prediction data, which we can find here. The data holds the vehicle insurance acceptance records of more than 3.5 lakh (350,000) customers. Alongside each record, we get the customer’s demographic and policy information (gender, age, region, vehicle age, annual premium, etc.).
Using this dataset, our task is to build a model that can tell us which customers will be interested in buying vehicle insurance, based on this information. The features cover demographics (gender, age, region code), vehicles (vehicle age, damage) and policy (premium, sourcing channel). Let’s check the head of the data.
import pandas as pd
import numpy as np
train_data = pd.read_csv('/content/drive/MyDrive/articles/12-2022/17-12-2022 to 24-12-2022/train.csv')
train_data.head()
Output:
Here we can see the columns available for training a random forest model. Response is our target variable, where 1 means the customer is interested and 0 means the customer is not interested. Now, let’s move to the first step of the modelling procedure, which is exploratory data analysis.
Exploratory Data Analysis
This step gives us insights into the vehicle insurance data, so let’s start with the information this data contains.
train_data.info()
Output:
Looking at the above output, we can say that this data has 9 categorical variables. With this in mind, we can start plotting the values to get more information out of the data.
Let’s begin with our target variable.
Response variable
train_data['Response'].value_counts().plot(kind='bar')
Output:
Here we can see that the response 0 dominates, which means the majority of customers are not interested in buying vehicle insurance. Our goal now is to understand how this response variable depends on the other information in the data.
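To put a number on this imbalance, we can look at the share of each class instead of the raw counts. The sketch below uses a small made-up `Response` column, so the values are purely illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for train_data['Response']; the values are made up.
response = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 0, 1], name="Response")

# normalize=True returns the fraction of rows per class instead of counts.
class_share = response.value_counts(normalize=True)
print(class_share)
```

On the real data, the same call on `train_data['Response']` would show how small the positive class is.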
Response with Gender
train_data[['Gender', 'Response']].value_counts().plot(kind='bar', stacked=True)
Looking at this chart, there are more records from men than from women, and the same holds for the positive responses.
Response with Age
train_data['Age'].describe()
Here we can see that the minimum customer age is 20 and the maximum is 85 years. For better visualisation, we separate the age values into ranges of 10 years so that we can see how customers in different age ranges respond.
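# Bin edges at 10, 20, ..., 90; with right=True, bin k covers (10*k, 10*(k+1)]
bins = np.arange(1, 10) * 10
train_data['category'] = np.digitize(train_data.Age, bins, right=True)
# Count records per age bin and response, with one column per response value
counts = train_data.groupby(['category', 'Response']).Age.count().unstack()
print(counts)
counts.plot(kind='bar', stacked=True)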
Here we can see that most of the records in the data are for the customer age range of 30–40 years, but proportion-wise, customers over 40 years old are more interested in buying vehicle insurance.
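The proportion-wise comparison can be made explicit by computing the share of positive responses per age band. This is a minimal sketch on a made-up sample: the frame `df`, the `age_band` column and the labels are hypothetical names, and `pd.cut` is used here in place of `np.digitize` only because it produces readable band labels.

```python
import pandas as pd

# Hypothetical sample; on the real data this would be train_data.
df = pd.DataFrame({
    "Age":      [22, 25, 28, 35, 38, 45, 47, 52, 61, 66],
    "Response": [0,  0,  0,  0,  1,  1,  0,  1,  0,  0],
})

# Edges at 10, 20, ..., 90; right=True makes each band (lower, upper].
edges = list(range(10, 100, 10))
labels = [f"{lo + 1}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]
df["age_band"] = pd.cut(df["Age"], bins=edges, labels=labels, right=True)

# The mean of a 0/1 column is the share of positive responses in each band.
rate_by_band = df.groupby("age_band", observed=True)["Response"].mean()
print(rate_by_band)
```

Reading rates next to counts avoids mistaking a large age band for an interested one.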
Response with Driving License
train_data[['Driving_License', 'Response']].value_counts().plot(kind='bar')
Output:
Here we can see that there are few records of customers with no driving license, and they also responded with no, which is fair enough.
Response with Region
counts = train_data.groupby(['Region_Code', 'Response']).Gender.count().unstack()
counts.plot(kind='bar', stacked=True, figsize=(35, 10))
Output:
Here we can see the distribution of customer responses by region, and by zooming in, we can see that region 28 holds the largest number of records.
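Instead of zooming in to a wide chart, the region with the most records can also be read directly from the counts. The sketch below uses a made-up `Region_Code` column; the codes are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for train_data['Region_Code']; the values are made up.
region = pd.Series([28.0, 8.0, 28.0, 46.0, 28.0, 8.0], name="Region_Code")

# value_counts() tallies records per region; idxmax() returns the most frequent code.
region_counts = region.value_counts()
top_region = region_counts.idxmax()
print(top_region, region_counts.max())
```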
Previously Insured and Response
counts = train_data.groupby(['Previously_Insured', 'Response']).Gender.count().unstack()
print(counts)
counts.plot(kind='bar', stacked=True)
Output:
Here we can see that most of the positive responses came from customers who were not previously insured.
Response with Vehicle Age
With this variable, we can extract information about the most positive responses with respect to vehicle age.
counts = train_data.groupby(['Vehicle_Age', 'Response']).Gender.count().unstack()
print(counts)
counts.plot(kind='bar', stacked=True)
With this output, we can see that most of the data is covered by vehicles aged 0 to 2 years, and most positive responses come from customers whose vehicles are 1 to 2 years old.
Response with Vehicle Damage
Here we take a look at how customers choose to buy insurance when their vehicle is damaged.
counts = train_data.groupby(['Vehicle_Damage', 'Response']).Gender.count().unstack()
print(counts)
counts.plot(kind='bar', stacked=True)
Output:
Here we can see that most of the customers who are ready to buy insurance are those whose vehicle has already been damaged.
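Stacked counts show where the positive responses sit, but not how likely a customer in each group is to respond positively. A row-normalised cross-tabulation gives that rate directly. This is a sketch on made-up rows, and the frame `df` is a hypothetical stand-in for `train_data`.

```python
import pandas as pd

# Hypothetical sample with the same column names as the dataset.
df = pd.DataFrame({
    "Vehicle_Damage": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "Response":       [1, 1, 0, 0, 0, 0, 0, 0],
})

# normalize="index" turns each row into shares that sum to 1.
rates = pd.crosstab(df["Vehicle_Damage"], df["Response"], normalize="index")
print(rates)
```

The same call works for the other categorical columns, such as `Previously_Insured` or `Vehicle_Age`.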
Annual Premium
Since this is a continuous value, we can draw a density plot of the annual premium and see its description to know its minimum, maximum and average value.
train_data['Annual_Premium'].describe()
train_data['Annual_Premium'].plot(kind='kde')
Here we can see that the minimum annual premium is 2630 and the maximum is 540165, while the average value is around 30564.
Vintage
This column represents the number of days a customer has been associated with the organisation.
train_data['Vintage'].describe()
train_data['Vintage'].plot(kind='kde')
Output:
Here we can see that the minimum customer association is 10 days and the maximum is 299 days, while the average is 154 days. In other words, a typical customer has been associated with the organisation for around 154 days.
Now that we have completed a basic exploratory data analysis, we will prepare it for the data modelling procedure.
Data processing
To model the data, we are going to use the scikit-learn library, which works only with numerical values. As we know, the data has several string values, so we need to convert them into numerical data, which we can do by label encoding.
Label encoding
Looking at the data, we know that three variables hold categorical values in the form of strings, so let’s convert these values.
train_data['Gender'] = train_data['Gender'].replace({'Male': 1, 'Female': 0})
train_data['Vehicle_Age'] = train_data['Vehicle_Age'].replace({'< 1 Year': 0, '1-2 Year': 1, '> 2 Years': 2})
train_data['Vehicle_Damage'] = train_data['Vehicle_Damage'].replace({'Yes': 1, 'No': 0})
train_data.head()
Output:
Here we can see that all the values are now in numerical format. We converted the binary string values into the integers 0 and 1, and for the vehicle age variable we used 0, 1 and 2 as numerical categories.
Now we need to split this data into train and test sets so that we can evaluate the fitted model properly.
Data splitting
from sklearn.model_selection import train_test_split
# Select the target by name: the EDA step appended a helper 'category' column,
# so the last column is no longer 'Response'
X = train_data.drop(columns=['Response', 'category'], errors='ignore')
y = train_data['Response']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=4)
Here we split the data 75:25 so that we can train a model using 75% of the data and evaluate it with the remaining 25%. Next, let’s move to the data modelling procedure.
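Because positive responses are rare, a plain random split can leave the train and test sets with slightly different class ratios. One common option, shown here as a sketch and not as the split used in this article, is the `stratify` argument of `train_test_split`. The arrays `X_demo` and `y_demo` are made-up stand-ins for the real features and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 negatives and 20 positives.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=4, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```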
Data Modelling
Using the code below, we can train a random forest model on our processed data.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
Let’s make predictions from the model and plot them once to see whether the model is working well or not.
y_pred = model.predict(X_test)
y_prediction = pd.DataFrame(y_pred, columns=['predictions'])
y_prediction['predictions'].value_counts().plot(kind='bar')
Output:
Here we can see that the model is making predictions for both categories. So now, let’s evaluate the model.
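A fitted random forest also exposes which inputs it relied on most, through the `feature_importances_` attribute. The sketch below illustrates this on synthetic data from `make_classification`, so the feature names and the resulting ranking are hypothetical and say nothing about the insurance data. Fixing `random_state` is an addition made here for repeatable results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 5 features, of which 3 carry signal (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(5)]

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances are normalised to sum to 1.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

On the article’s model, the same attribute is available as `model.feature_importances_`, with `X_train.columns` as the index.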
Model Evaluation
Above, we modelled the data using the random forest algorithm. Now we need to evaluate the model to learn about its reliability and performance. Using the lines of code below, we can measure the performance of our model.
from sklearn.metrics import mean_squared_error, confusion_matrix, r2_score, accuracy_score, classification_report
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
# model.score on the held-out split gives the test accuracy, not a training score
print("Test Score:\n", model.score(X_test, y_test) * 100)
print("Mean Squared Error:\n", mean_squared_error(y_test, y_pred))
print("R2 score is:\n", r2_score(y_test, y_pred))
print('model parameters \n', model.get_params())
print('model accuracy \n', accuracy_score(y_test, y_pred) * 100)
Output:
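Here we get most of the metrics that can be used for model evaluation in the final report, and looking at the report, we can say that our model is performing well on such a large dataset. Because positive responses are rare, however, accuracy alone can overstate performance, so the per-class precision and recall in the classification report deserve a closer look. We can also make many improvements to the model, which we discuss in the to-do list below.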
To learn about the evaluation metrics, we can go through this article, where we explain every critical metric we use in real life to evaluate such models.
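To see why accuracy needs to be read together with recall on imbalanced data, consider a made-up example in which a "model" always predicts the majority class. The labels below are hypothetical and are not results from the article’s model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: 8 negatives and 2 positives.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A trivial predictor that always outputs the majority class.
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)   # share of correct predictions
rec = recall_score(y_true, y_majority)     # share of positives that were found
cm = confusion_matrix(y_true, y_majority)  # rows: true class, columns: predicted class
print(acc, rec)
print(cm)
```

The accuracy here is 0.8 even though not a single interested customer is found, which is why the confusion matrix and per-class recall matter for this task.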
To-Do List
In this procedure, we have performed every basic step that a data modelling procedure needs to go through. Below are the advanced steps we will perform to improve the results:
- More EDA: in this article we used only pandas for data visualisation, so in the next article we will use more visualisation libraries to perform a more thorough EDA.
- SMOTE analysis: in the data visualisation part, we saw that there are very few records with a positive response, which can lead to biased modelling, so in the next article we will see whether we can improve performance using SMOTE (Synthetic Minority Over-sampling Technique).
- Cross Validation: we got good enough results from this first model, and cross-validation will give us a more reliable estimate of its performance than a single train/test split.
- GridSearchCV: a method for finding the best hyperparameter combination by trying every value in a given grid; random forest is one of the models whose behaviour can be tuned by changing parameters.
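As a preview of the last two items, here is a minimal sketch of cross-validation and grid search with scikit-learn. It runs on synthetic imbalanced data from `make_classification`, and the parameter grid, the `f1` scoring choice and the fold counts are assumptions chosen for illustration, not tuned settings for the insurance data. SMOTE is left out because it needs the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic data with roughly 80% negatives and 20% positives (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, weights=[0.8, 0.2], random_state=0
)

# Cross-validation: one F1 score per fold instead of a single train/test estimate.
forest = RandomForestClassifier(n_estimators=30, random_state=0)
cv_scores = cross_val_score(forest, X_demo, y_demo, cv=5, scoring="f1")

# Grid search: try every combination in the grid, scoring each with 3-fold CV.
param_grid = {"n_estimators": [20, 40], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="f1"
)
search.fit(X_demo, y_demo)

print(cv_scores.mean())
print(search.best_params_)
```

After fitting, `search.best_estimator_` holds a model refitted on all the data with the best combination found.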
References
- Getting Started with Machine Learning Algorithms: Random Forest
- Evaluation Metrics for Machine Learning or Data Models
- Link to the codes
About DSW
Data Science Wizards (DSW) aims to democratize the power of AI and Data Science to empower customers with insight discovery and informed decision making.
We are working towards nurturing the AI ecosystem with data-driven, open-source technologies and AI solutions and platforms to benefit businesses, customers, communities, and stakeholders.
Through our industry-agnostic flagship platform, UnifyAI, we are working towards creating a holistic approach to data engineering and AI to enable companies to accelerate growth, enhance operational efficiency and help their businesses leverage AI capabilities end-to-end.
Our niche expertise and positioning at the confluence of AI, data science, and open-source technologies helps us to empower customers with seamless and informed decision-making capabilities.
DSW’s key purpose is to empower enterprises and communities with ease of using AI, making AI accessible for everyone to solve problems through data insights.
Connect with us at contact@datasciencewizards.ai and visit us at www.datasciencewizards.ai