In machine learning, logistic regression is one of the basic models with which beginners start learning classification modelling, and it is particularly useful for binary classification problems. In one of our articles, we have already discussed how the algorithm works and how to apply it to synthetic data. In this article, we are going to use the algorithm with a real-life dataset, covering the following topics:
- Data Exploration
- Assumptions
- Modelling
- Data processing
- Forward feature selection
- Evaluation
- Confusion matrix
- Evaluation statistics
- ROC Curve
- Area Under the Curve
Let's start by gathering the data. For this article, we will use a heart disease dataset that records how different factors relate to a person being diagnosed with heart disease. Let's import this dataset as a pandas DataFrame.
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/TarekDib03/Analytics/master/Week3%20-%20Logistic%20Regression/Data/framingham.csv')
data.head()
Output:
Here we can see different factors that contribute to heart disease. Some of them are categorical, and some are continuous. Below are the meanings of these variables:
- Demographic variables:
- Male: Boolean or categorical data
- Age: Numerical and continuous data
- Behavioural variables:
- currentSmoker: Current smoking status (boolean)
- cigsPerDay: Average number of cigarettes consumed by a person in a day.
- Medical (history):
- BPMeds: Whether a person takes blood pressure (BP) medication (boolean)
- prevalentStroke: Previous stroke record (boolean)
- prevalentHyp: Prevalent hypertension history (boolean)
- diabetes: Diabetes status (boolean)
- totChol: Total cholesterol level (Continuous)
- sysBP: Systolic blood pressure (Continuous)
- diaBP: Diastolic blood pressure (Continuous)
- BMI: Body Mass Index (Continuous)
- heartRate: Heart rate (Continuous)
- glucose: Glucose level (Continuous)
- Target variable:
- TenYearCHD: Ten-year risk of coronary heart disease (boolean)
Above, we can see that the column male holds the values 0 and 1. We will treat 1 as male and 0 as female and change the data accordingly using the following lines of code:
data['male'] = data['male'].map({1: 'male', 0: 'female'})
data.rename(columns={'male': 'Sex'}, inplace=True)
Let's perform some basic EDA on the data before modelling it.
Data Exploration
Let's begin by checking the shape of the data.
print('shape of the data', data.shape)
Output:
In the data, we have 4240 data points and 16 variables, out of which 15 are independent, and 1 is dependent.
Describing data
data.describe()
Output:
Here we can see the distribution of the data, but it will be easier to understand with visualisations because the dataset is large. Before the visualisation part, we should check for null values.
import missingno as ms
print(data.isna().sum())
ms.bar(data)
Output:
We can see that there are null values present in multiple variables. Let's check what the null values amount to as a percentage of the data.
print('null values percentage', round(data.isna().sum().sum()/len(data.index)*100))
Output:
The null values amount to only about 15% relative to the number of rows, so instead of imputing them, we can simply drop the rows that contain them.
data.dropna(axis=0,inplace=True)
data.shape
Output:
Here we have completed the basic EDA of our dataset. Next, let's understand the data better using some visualisations.
Gender distribution
import plotly.express as px
fig = px.histogram(data, x="Sex", color="Sex", barmode='group', title='gender distribution', width=600, height=400)
fig.show()
Age distribution
import matplotlib.pyplot as plt
import seaborn as sns
fig, axs = plt.subplots(nrows=2, figsize=(8, 10))
sns.boxplot(x=data['age'], ax=axs[0]).set(title='age distribution-boxplot')
# distplot is deprecated in recent seaborn; histplot with kde=True is its replacement
sns.histplot(data['age'], kde=True, ax=axs[1]).set(title='age distribution-histogram')
plt.show()
Output:
In the above visualisation, we can see that most records are for ages between roughly 30 and 70, and the average age in the data is around 45 to 50.
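If you want the exact numbers behind this reading, a quick describe() on the age column (a small addition, not part of the original walkthrough) confirms it:
# Numeric summary of the age column to back up the visual reading above.
print(data['age'].describe())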
Let’s see how the age variable is distributed according to gender.
fig = px.histogram(data, x="age", color="Sex", title='Age distribution according to sex', nbins=60, width=600, height=400)
fig.show()
This visualisation shows how age is distributed for each gender; looking at it, we can see that there are more females than males in the 45 to 50 age range. Now let's see how many males and females smoke.
fig = px.histogram(data, x="Sex", color="currentSmoker", barmode='group', title='Smoker according to gender', width=600, height=400)
fig.show()
Output:
Here we can see that more of the males smoke than the females. Now, instead of plotting every feature separately, we will combine them in one figure and try to understand how these features relate to each other.
fig, axis = plt.subplots(3,5,figsize=(20, 20))
data.hist(ax=axis)
Output:
In the figure above, if we compare each plot with the plot of TenYearCHD, we can see that the distribution of BPMeds resembles that of TenYearCHD, which suggests that people taking blood pressure medication may be at higher risk of coronary heart disease. The diabetes plot is also similar to the CHD plot, and prevalentStroke and prevalentHyp look nearly identical to each other, so these features may have a significant effect on TenYearCHD. A quick numeric check of these observations follows.
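The snippet below is an optional addition (not part of the original walkthrough) that puts numbers behind the visual comparison; it assumes the boolean columns are named BPMeds, diabetes, prevalentStroke and prevalentHyp, as in the Framingham CSV.
# CHD rate within each level of the boolean risk factors discussed above.
for col in ['BPMeds', 'diabetes', 'prevalentStroke', 'prevalentHyp']:
    print(data.groupby(col)['TenYearCHD'].mean(), '\n')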
Relationships among the continuous features are easier to see using a correlation plot.
corrmat = data.corr(numeric_only=True)  # Sex is still a string column here, so restrict to numeric features
fig = px.imshow(corrmat,text_auto=True,width = 700, height= 1000)
fig.show()
Output:
Here we can see that several features are highly correlated with each other, and most of these correlations are positive; the blood pressure measures sysBP and diaBP, for example, are strongly correlated.
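To see exactly which pairs drive this, the short optional snippet below (an addition to the original walkthrough) ranks the strongest correlations in corrmat:
import numpy as np
# Rank the strongest pairwise correlations; each pair appears twice because the matrix is symmetric.
mask = ~np.eye(len(corrmat), dtype=bool)  # drop the diagonal
corr_pairs = corrmat.where(mask).stack().abs().sort_values(ascending=False)
print(corr_pairs.head(10))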
This largely completes our data exploration, so let's move on to the assumptions we consider in classification modelling.
Assumptions
In our last article, we discussed five assumptions to consider while using logistic regression for data modelling.
- Dependent variable with binary data points: In the above, we have already seen that TenYearCHD is binary data.
- Independent data points: Since collected data points are of different persons, there is no dependability between the data points.
- Linearity between the independent variables and their log odds: This assumption can be validated with a regression plot of each continuous independent variable. We can do that using the seaborn library:
fig, axs = plt.subplots(ncols=2, nrows=4, figsize=(8, 10))
sns.regplot(x='age', y='TenYearCHD', data=data, ax=axs[0, 0], logistic=True).set_title("age Log Odds Linear Plot")
sns.regplot(x='cigsPerDay', y='TenYearCHD', data=data, ax=axs[0, 1], logistic=True).set_title("cigsPerDay Log Odds Linear Plot")
sns.regplot(x='totChol', y='TenYearCHD', data=data, ax=axs[1, 0], logistic=True).set_title("totChol Log Odds Linear Plot")
sns.regplot(x='sysBP', y='TenYearCHD', data=data, ax=axs[1, 1], logistic=True).set_title("sysBP Log Odds Linear Plot")
sns.regplot(x='diaBP', y='TenYearCHD', data=data, ax=axs[2, 0], logistic=True).set_title("diaBP Log Odds Linear Plot")
sns.regplot(x='BMI', y='TenYearCHD', data=data, ax=axs[2, 1], logistic=True).set_title("BMI Log Odds Linear Plot")
sns.regplot(x='heartRate', y='TenYearCHD', data=data, ax=axs[3, 0], logistic=True).set_title("heartRate Log Odds Linear Plot")
sns.regplot(x='glucose', y='TenYearCHD', data=data, ax=axs[3, 1], logistic=True).set_title("glucose Log Odds Linear Plot")

# set the spacing between subplots
plt.subplots_adjust(left=0.1, bottom=0.0, right=0.9, top=0.9, wspace=0.4, hspace=0.4)
fig.show()
Output:
- Here we need to look at the shape of the plots. If a plot resembles an S-shaped curve, we can assume the continuous independent variable is linearly related to its log odds. If any plot appears U-shaped, we would need to treat that variable before modelling. In the above, every plot resembles an S-shaped or linear curve, so we don't need any further data handling here.
- No multicollinearity: In the above, we have already seen substantial multicollinearity between the independent continuous variables. We will deal with this multicollinearity later in the article (a quick VIF check is sketched after this list).
- Enough data: At the beginning of the data exploration, we saw that more than 4,000 data points are available, which is quite enough for logistic regression.
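Since multicollinearity comes up again later, here is the quick VIF check referenced in the assumption above. It is an optional sketch rather than part of the original walkthrough; it assumes the statsmodels package is available and uses the continuous column names from the data description.
# Optional sketch: variance inflation factors (VIF) for the continuous features.
# Values well above 5-10 indicate strong multicollinearity.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

continuous_cols = ['age', 'cigsPerDay', 'totChol', 'sysBP', 'diaBP', 'BMI', 'heartRate', 'glucose']
X_cont = sm.add_constant(data[continuous_cols])
vif = pd.DataFrame({'feature': X_cont.columns,
                    'VIF': [variance_inflation_factor(X_cont.values, i) for i in range(X_cont.shape[1])]})
print(vif)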
Let's move on to the next section.
Modelling
In the above sections, we explored the data and looked at the assumptions we need to consider while modelling it with logistic regression. In this section, we will use a technique called forward feature selection with a logistic regression model to help us select the right features. But before performing forward feature selection, we need to do some data processing, so let's start there.
Data preprocessing
Converting the categorical values into numerical values
data['Sex'] = data['Sex'].map({'male': 1, 'female': 0})
Splitting the data into test and train sets.
from sklearn.model_selection import train_test_split
X = data.iloc[:,:-1]
y = data.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)
Here we have put 75% of the data into the training set and the rest into the test set. Now let's move towards forward feature selection; before going into the implementation, let's understand the technique.
Forward Feature Selection
As the name suggests, it is a feature selection technique that works iteratively. It starts with zero features in the model and, at each iteration, adds whichever remaining feature improves the model the most; in the end, having tried the features greedily in this way, it returns the best-performing subset.
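Before using a library selector, here is a minimal hand-rolled sketch of the idea, included purely for illustration (it is not the code used in the rest of the article) and assuming a feature DataFrame and target like the X_train and y_train created above.
# Minimal sketch of greedy forward selection: repeatedly add the feature that most
# improves cross-validated accuracy, and stop when no remaining feature helps.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, estimator=None, cv=5):
    estimator = estimator or LogisticRegression(max_iter=1000)
    remaining, selected, best_score = list(X.columns), [], 0.0
    while remaining:
        scores = {f: cross_val_score(estimator, X[selected + [f]], y,
                                     cv=cv, scoring='accuracy').mean()
                  for f in remaining}
        best_feature, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best_score:  # no improvement, so stop
            break
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = score
    return selected, best_score

# Example usage: selected_features, cv_accuracy = forward_select(X_train, y_train)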
Now let's implement forward feature selection with a library selector.
# the selector API used below (k_features, floating, k_feature_names_) is the one from the mlxtend library
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

sfs = SequentialFeatureSelector(LogisticRegression(n_jobs=-1),
                                k_features=(1, 14),
                                forward=True,
                                floating=False,
                                verbose=2,
                                scoring='accuracy',
                                cv=5).fit(X_train, y_train)
Output:
Above, we can see the history of the feature selection process. We can now extract more information from the fitted selector, such as the selected features and their cross-validation scores. Let's check them.
Indices of the selected features
sfs.k_feature_idx_
Output:
Names of the selected features
sfs.k_feature_names_
Output:
The score of the optimised model
sfs.k_score_
Output:
We chose this method because our data has multicollinearity and we want to extract the features with the most predictive power. We can also look at the whole forward feature selection procedure.
pd.DataFrame.from_dict(sfs.get_metric_dict()).T
Output:
Here we can see the accuracy of the different feature subsets in the avg_score column. Let's fit a logistic regression model with the optimised features.
Extracting the features from the main data
new_features = data[['BPMeds', 'totChol', 'diaBP', 'heartRate', 'TenYearCHD']]
X=new_features.iloc[:,:-1]
y=new_features.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)
Fitting the model and generating predictions based on the test set.
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
After making predictions, we need to evaluate the fitted model. A logistic regression model can be evaluated using four methods. Let's take a look at them.
Evaluation
Here we are going to evaluate our model using 4 methods:
- Confusion matrix
- Evaluation statistics
- ROC curve (receiver operating characteristic curve)
- AUC(area under the curve)
Let's start with the first method.
Confusion matrix
This matrix lets us know how many right and wrong predictions the model makes.
from sklearn.metrics import ConfusionMatrixDisplay  # replaces the removed plot_confusion_matrix
ConfusionMatrixDisplay.from_estimator(logreg, X_test, y_test)
Output:
Here we can see that 765 (764 + 1) of the 915 test values are predicted correctly, and 150 (148 + 2) are predicted incorrectly. Based on the confusion matrix, we can compute the other evaluation statistics.
Evaluation statistics
import sklearn.metrics

cm = sklearn.metrics.confusion_matrix(y_test, y_pred)
TN = cm[0, 0]
TP = cm[1, 1]
FN = cm[1, 0]
FP = cm[0, 1]
sensitivity = TP / float(TP + FN)
specificity = TN / float(TN + FP)
print('The accuracy of the model', (TP + TN) * 100 / float(TP + TN + FP + FN), '% \n',
      'The misclassification percentage', 100 - ((TP + TN) * 100 / float(TP + TN + FP + FN)), '% \n',
      'Sensitivity or True Positive Rate', TP * 100 / float(TP + FN), '% \n',
      'Specificity or True Negative Rate', TN * 100 / float(TN + FP), '% \n',
      'Positive Predictive Value', TP * 100 / float(TP + FP), '% \n',
      'Negative Predictive Value', TN * 100 / float(TN + FN), '% \n',
      'Positive Likelihood Ratio', sensitivity / (1 - specificity), '\n',
      'Negative Likelihood Ratio', (1 - sensitivity) / specificity)
Output:
Above are the evaluation statistics, which tell us about the model's accuracy, misclassification rate, positive and negative predictive values and other ratios. One thing to notice is that the model is far more specific than sensitive: it is much better at identifying negatives than positives. So let's look at the probabilities the model uses to make its predictions.
pred_prob = logreg.predict_proba(X_test)[:, :]
pred_prob_df = pd.DataFrame(data=pred_prob, columns=['Prob of predicting negative (0)', 'Prob of predicting positive (1)'])
pred_prob_df.head()
Output:
By default, the model uses a threshold of 0.5. Using the model as-is is not advisable because it is biased toward the majority class, so we can iterate over other threshold values to look for a better trade-off.
pred_proba_df = pd.DataFrame(logreg.predict_proba(X_test))
threshold_list = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5,
                  0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.99]
for i in threshold_list:
    print('\n******** For i = {} ******'.format(i))
    y_test_pred = pred_proba_df.applymap(lambda x: 1 if x > i else 0)
    test_accuracy = sklearn.metrics.accuracy_score(y_test.to_numpy().reshape(-1, 1),
                                                   y_test_pred.iloc[:, 1].to_numpy().reshape(-1, 1))
    print('Our testing accuracy is {}'.format(test_accuracy))
    cm2 = sklearn.metrics.confusion_matrix(y_test.to_numpy().reshape(-1, 1),
                                           y_test_pred.iloc[:, 1].to_numpy().reshape(-1, 1))
    print('With', i, 'threshold the confusion matrix is', '\n', cm2, '\n',
          'with', cm2[0, 0] + cm2[1, 1], 'correct predictions and', cm2[1, 0], 'Type II errors (false negatives)', '\n\n',
          'Sensitivity:', cm2[1, 1] / float(cm2[1, 1] + cm2[1, 0]), 'Specificity:', cm2[0, 0] / float(cm2[0, 0] + cm2[0, 1]), '\n\n\n')
Output:
Part 1:
Part 2:
Here we can see that as the threshold gets lower, sensitivity increases, until beyond a certain point it levels off while accuracy and specificity fall. So an ROC curve can provide a more complete picture.
ROC curve
This curve shows the performance of a classification model across all classification thresholds. Fundamentally, it shows the trade-off between the model's sensitivity and specificity by plotting the true positive rate against the false positive rate.
y_pred2 = logreg.predict_proba(X_test)
fpr, tpr, thresholds = sklearn.metrics.roc_curve(y_test, y_pred2[:, 1])
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for classifier')
plt.xlabel('False positive rate (1-Specificity)')
plt.ylabel('True positive rate (Sensitivity)')
plt.grid(True)
Output:
Looking at the above, we can say that the optimal position on the ROC curve is the top-left corner, because a good classifier achieves a high true positive rate while keeping the false positive rate low. Let's check the area under the curve.
Area Under the Curve(AUC)
The area under the curve can be considered a summary of the model's discriminative power: a higher area means greater separation between the true positive and false positive rates and a higher capability of the model to distinguish the classes. An AUC of 0.5 corresponds to random guessing, and values closer to 1 represent better classification.
print('Area Under The Curve (AUC): ', sklearn.metrics.roc_auc_score(y_test, y_pred2[:, 1]))
Output:
Looking at the AUC, we can say our model is a moderate classifier.
Final words
In the above, we have seen how to perform classification modelling using logistic regression. During the process, we looked at the data's insights and then used forward feature selection to choose the right features and improve accuracy. We also went through some of the assumptions we need to take care of during modelling and some of the techniques we can use to evaluate the model correctly. In the end, we can draw the following conclusions:
- Forward feature selection helped find significant features that can impact heart disease prediction.
- The data analysis showed that men in this dataset smoke more and appear more susceptible to heart disease.
- Total cholesterol is an important predictor of heart disease.
- The model we made is more specific than sensitive.
- The AUC is around 0.62, which indicates only a moderate classifier.
- The model needs more data from the positive class, or another way of addressing the class imbalance problem (one possible remedy is sketched below).
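As one possible next step (a hedged sketch, not something done in the article), the class imbalance could be tackled without collecting new data by reweighting the classes inside the logistic regression; the snippet assumes the train/test split from the modelling section above.
# Optional sketch: re-fit the logistic regression with balanced class weights so the
# minority (positive CHD) class contributes more to the loss, then compare AUC.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

logreg_balanced = LogisticRegression(class_weight='balanced', max_iter=1000)
logreg_balanced.fit(X_train, y_train)
proba = logreg_balanced.predict_proba(X_test)[:, 1]
print('Balanced-model AUC:', roc_auc_score(y_test, proba))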