
End-to-End Logistic Regression Modelling

In machine learning, logistic regression is one of the basic models with which beginners start learning classification modelling, and it is particularly useful for modelling binary classification data. In one of our articles, we have already discussed how this algorithm works and how to apply it to synthetic data. In this article, we are going to use the algorithm on a real-life dataset and cover the following topics:

  • Data Exploration
  • Assumptions
  • Modelling
  1. Data processing
  2. Forward feature selection
  • Evaluation
  1. Confusion matrix
  2. Evaluation statistics
  3. ROC Curve
  4. Area Under the Curve

Let's start with gathering data. For this article, we will use a heart disease dataset that relates different factors to whether a person is at risk of heart disease. Let's import this dataset as a pandas DataFrame.

import warnings
warnings.filterwarnings('ignore')

import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/TarekDib03/Analytics/master/Week3%20-%20Logistic%20Regression/Data/framingham.csv')
data.head()

Output:

Here we can see different factors associated with heart disease. Some of them are categorical and some are continuous. The meanings of these variables are listed below (a quick dtype check follows the list):

  • Demographic variables:
  1. male: sex of the person (boolean/categorical)
  2. age: age of the person (numerical, continuous)
  • Behavioural variables:
  1. currentSmoker: current smoking status (boolean)
  2. cigsPerDay: average number of cigarettes the person smokes per day (continuous)
  • Medical history:
  1. BPMeds: whether the person is on blood pressure (BP) medication (boolean)
  2. prevalentStroke: previous stroke record (boolean)
  3. prevalentHyp: prevalent hypertension (boolean)
  4. diabetes: diabetes status (boolean)
  5. totChol: total cholesterol level (continuous)
  6. sysBP: systolic blood pressure (continuous)
  7. diaBP: diastolic blood pressure (continuous)
  8. BMI: body mass index (continuous)
  9. heartRate: heart rate (continuous)
  10. glucose: glucose level (continuous)
  • Target variable:
  1. TenYearCHD: ten-year risk of coronary heart disease (boolean)
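
To double-check these types programmatically, a quick look at the column dtypes and non-null counts can help; this is just a sanity check on the DataFrame loaded above.

data.info()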

In the column male, the values are 0 and 1. We treat 1 as male and 0 as female, recode the column accordingly and rename it to Sex using the following lines of code:

data['male'] = data['male'].map({1: 'male', 0: 'female'})
data.rename(columns={'male': 'Sex'}, inplace=True)

Let's perform basic EDA on the data before modelling it.

Data Exploration

Let's begin by checking the shape of the data.

print('shape of the data', data.shape)

Output:

 

In the data, we have 4240 data points and 16 variables, out of which 15 are independent, and 1 is dependent.

Describing data

data.describe()

Output:

Here we can see the distribution of each variable, but since the dataset is large, visualisation will give a clearer picture. Before that, we should check for null values.

import missingno as ms

print(data.isna().sum())

ms.bar(data)

Output:

We can see that several variables contain null values. Let's check what share of the data they amount to and then drop them.

print('null values percentage', round(data.isna().sum().sum() / len(data.index) * 100))

Output:

data.dropna(axis=0,inplace=True)

data.shape

Output:

Here we have completed the basic EDA of our dataset. Next, let's understand the data better using some visualisations.

Gender distribution

import plotly.express as px

fig = px.histogram(data, x="Sex", color="Sex", barmode='group', title='gender distribution', width=600, height=400)
fig.show()

Age distribution

import matplotlib.pyplot as plt
import seaborn as sns

fig, axs = plt.subplots(nrows=2, figsize=(8, 10))
sns.boxplot(data['age'], ax=axs[0]).set(title='age distribution-boxplot')
sns.distplot(data['age'], ax=axs[1]).set(title='age distribution-histogram')
plt.show()

Output:

In the above visualisation, we can see that most records are between the ages of 30 and 70, and the average age in the data is nearly 45 to 50.

Let’s see how the age variable is distributed according to gender.

fig = px.histogram(data, x="age", color="Sex", title='Age distribution according to sex', nbins=60, width=600, height=400)
fig.show()

This visualisation shows how age is distributed for each gender; there are more females than males in the 45 to 50 range. Now let's see how many males and females smoke.

fig = px.histogram(data, x="Sex", color="currentSmoker", barmode='group', title='Smoker according to gender', width=600, height=400)
fig.show()

Output:

Here we can see that more males than females are smokers. Instead of plotting every feature separately, we will now put them all in one grid of histograms to understand how the features relate to each other.

fig, axis = plt.subplots(3,5,figsize=(20, 20))

data.hist(ax=axis)

Output:

If we compare these histograms with the histogram of TenYearCHD, we can see that the distribution of BPMeds is similar to that of TenYearCHD, which suggests that people taking blood pressure medication are more likely to be at risk of coronary heart disease.

The distributions of diabetes, prevalentStroke and prevalentHyp are also similar to TenYearCHD, so these variables may have a significant effect on the target.
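
As a quick, optional numerical check of these observations (not part of the original notebook; the column names are assumed to match those shown by data.head()), a normalised crosstab shows the share of at-risk people within each category:

# Share of TenYearCHD = 1 within each level of a few boolean features (optional check).
for col in ['BPMeds', 'prevalentStroke', 'prevalentHyp', 'diabetes']:
    print(pd.crosstab(data[col], data['TenYearCHD'], normalize='index'), '\n')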

The relationships among the continuous features are easier to see with a correlation plot.

corrmat = data.corr(numeric_only=True)  # only numeric columns; Sex is a string column now
fig = px.imshow(corrmat, text_auto=True, width=700, height=1000)
fig.show()

Output:

Here we can see that several features are positively correlated with one another, and some pairs, such as the two blood pressure measures, are strongly correlated.

With this, our data exploration is mostly complete, and we can move on to the assumptions we need to consider for classification modelling.

Assumptions

In our last article, we have discussed five assumptions that we consider while using logistic regression for data modelling.

  1. Dependent variable with binary data points: we have already seen that TenYearCHD is a binary variable.
  2. Independent data points: since the records are collected from different people, there is no dependency between the data points.
  3. Linearity between the independent variables and their log odds: this assumption can be validated with logistic regression plots of each continuous independent variable against the target, which we can draw using the seaborn library:

fig, axs = plt.subplots(ncols=2, nrows=4, figsize=(8, 10))

sns.regplot(x='age', y='TenYearCHD', data=data, ax=axs[0, 0], logistic=True).set_title("age Log Odds Linear Plot")
sns.regplot(x='cigsPerDay', y='TenYearCHD', data=data, ax=axs[0, 1], logistic=True).set_title("cigsPerDay Log Odds Linear Plot")
sns.regplot(x='totChol', y='TenYearCHD', data=data, ax=axs[1, 0], logistic=True).set_title("totChol Log Odds Linear Plot")
sns.regplot(x='sysBP', y='TenYearCHD', data=data, ax=axs[1, 1], logistic=True).set_title("sysBP Log Odds Linear Plot")
sns.regplot(x='diaBP', y='TenYearCHD', data=data, ax=axs[2, 0], logistic=True).set_title("diaBP Log Odds Linear Plot")
sns.regplot(x='BMI', y='TenYearCHD', data=data, ax=axs[2, 1], logistic=True).set_title("BMI Log Odds Linear Plot")
sns.regplot(x='heartRate', y='TenYearCHD', data=data, ax=axs[3, 0], logistic=True).set_title("heartRate Log Odds Linear Plot")
sns.regplot(x='glucose', y='TenYearCHD', data=data, ax=axs[3, 1], logistic=True).set_title("glucose Log Odds Linear Plot")

# set the spacing between subplots
plt.subplots_adjust(left=0.1, bottom=0.0, right=0.9, top=0.9, wspace=0.4, hspace=0.4)

fig.show()

Output:

Here we need to look at the shape of the plots. If a plot resembles an s-shaped (or roughly linear) curve, we can assume that the continuous independent variable is linearly related to its log odds; if a plot looks u-shaped, the variable needs further handling. In the plots above, every variable resembles an s-shaped or linear curve, so no extra data treatment is required here.

  4. No multicollinearity: we have already seen substantial multicollinearity between the continuous independent variables; we will deal with it later in the article. A quick way to quantify it is shown right after this list.
  5. Enough data: at the beginning of data exploration we saw that more than 4,000 data points are available, which is enough for logistic regression.
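
As an optional, hedged check that is not part of the original article, the variance inflation factor (VIF) from statsmodels can quantify how strong this multicollinearity is among the continuous features (the column names are assumed to match data.head()):

# Optional multicollinearity check with VIF (assumes statsmodels is installed).
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

cont_cols = ['age', 'cigsPerDay', 'totChol', 'sysBP', 'diaBP', 'BMI', 'heartRate', 'glucose']
exog = np.column_stack([np.ones(len(data)), data[cont_cols].values])  # constant column + features
vif = pd.Series([variance_inflation_factor(exog, i + 1) for i in range(len(cont_cols))], index=cont_cols)
print(vif.sort_values(ascending=False))  # values above roughly 5-10 suggest strong multicollinearity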

Let's move on to the next section.

Modelling

In the sections above, we explored the data and looked at the assumptions to consider when modelling it with logistic regression. In this section, we will use a technique called forward feature selection together with a logistic regression model to help select the right features for the final model. Before performing forward feature selection, we need to do some data processing, so let's start there.

Data preprocessing

Converting the categorical values into numerical values

data['Sex'] = data['Sex'].map({'male': 1, 'female': 0})

Splitting the data into test and train sets.

from sklearn.model_selection import train_test_split

X = data.iloc[:,:-1]

y = data.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)

Here we have put 75% of the data into the training set and the rest into the test set. Now let's move on to forward feature selection; before the implementation, let's understand the technique.

Forward Feature Selection

As the name suggests, it is an iterative feature selection technique. It starts with no features in the model and, in each iteration, adds the feature that improves the model the most, continuing until no further improvement is found. After evaluating the candidate feature sets, it returns the best-performing one. The image below illustrates forward feature selection.

Image source

Let's implement forward feature selection using mlxtend's SequentialFeatureSelector.

from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

sfs = SequentialFeatureSelector(LogisticRegression(n_jobs=-1),
                                k_features=(1, 14),
                                forward=True,
                                floating=False,
                                verbose=2,
                                scoring='accuracy',
                                cv=5).fit(X_train, y_train)

Output:

Above, we can see the history of the selection process. We can now extract more information from the fitted selector, such as the indices and names of the selected features, the best cross-validated score, and a table of all the evaluated subsets. Let's check them.

Index of selected features

sfs.k_feature_idx_

Output:

Names of the selected features

sfs.k_feature_names_

Output:

The score of the optimised model

sfs.k_score_

Output:

 

We chose this method because the data has multicollinearity and we want to keep only the features that predict well. We can also look at the whole forward selection procedure:

pd.DataFrame.from_dict(sfs.get_metric_dict()).T

Output:

Here we can see the accuracy of different models in the avg_score column. Let’s fit the logistic regression model with optimised features.

Extracting the features from the main data

new_features = data[['BPMeds', 'totChol', 'diaBP', 'heartRate', 'TenYearCHD']]
X = new_features.iloc[:, :-1]
y = new_features.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

Fitting the model and generating predictions based on the test set.

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

After making predictions, we need to evaluate the fitted model. We will evaluate the logistic regression model with four methods; let's take a look at them.

Evaluation

Here we are going to evaluate our model using 4 methods:

  • Confusion matrix
  • Evaluation statistics
  • ROC curve (receiver operating characteristic curve)
  • AUC(area under the curve)

Let's start with the first method.

Confusion matrix

This matrix lets us know how many right and wrong predictions the model makes.

import sklearn.metrics

# plot_confusion_matrix was removed in recent scikit-learn versions; ConfusionMatrixDisplay replaces it.
cm = sklearn.metrics.confusion_matrix(y_test, y_pred)
sklearn.metrics.ConfusionMatrixDisplay(confusion_matrix=cm).plot()

Output:

Here we can see that 765 (764 + 1) of the 915 test samples are predicted correctly and 150 (148 + 2) are wrong. Based on the confusion matrix, we can compute further evaluation statistics.

Evaluation statistics

TN = cm[0, 0]
TP = cm[1, 1]
FN = cm[1, 0]
FP = cm[0, 1]

sensitivity = TP / float(TP + FN)
specificity = TN / float(TN + FP)

print('The accuracy of the model', (TP + TN) * 100 / float(TP + TN + FP + FN), '% \n',
      'The misclassification percentage', 100 - ((TP + TN) * 100 / float(TP + TN + FP + FN)), '% \n',
      'Sensitivity or True Positive Rate', TP * 100 / float(TP + FN), '% \n',
      'Specificity or True Negative Rate', TN * 100 / float(TN + FP), '% \n',
      'Positive Predictive Value', TP * 100 / float(TP + FP), '% \n',
      'Negative Predictive Value', TN * 100 / float(TN + FN), '% \n',
      'Positive Likelihood Ratio', sensitivity / (1 - specificity), '\n',
      'Negative Likelihood Ratio', (1 - sensitivity) / specificity)

Output:

These statistics tell us about the model's accuracy, misclassification rate, positive and negative predictive values and likelihood ratios. One thing to observe is that the model is far more specific than sensitive: it is much better at identifying negative cases than positive ones. So let's check the probabilities the model uses to make its predictions.

pred_prob = logreg.predict_proba(X_test)[:, :]
pred_prob_df = pd.DataFrame(data=pred_prob, columns=['Prob of predicting negative (0)', 'Prob of predicting positive (1)'])
pred_prob_df.head()

Output:

The threshold used by the model above is 0.5 (the default). Using the model as-is may not be advisable because it is biased towards the negative class, so let's iterate over other threshold values to look for a better trade-off.

pred_proba_df = pd.DataFrame(logreg.predict_proba(X_test))
threshold_list = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5,
                  0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.99]

for i in threshold_list:
    print('\n******** For i = {} ******'.format(i))
    y_test_pred = pred_proba_df.applymap(lambda x: 1 if x > i else 0)
    test_accuracy = sklearn.metrics.accuracy_score(y_test.to_numpy().reshape(y_test.to_numpy().size, 1),
                                                   y_test_pred.iloc[:, 1].to_numpy().reshape(y_test_pred.iloc[:, 1].to_numpy().size, 1))
    print('Our testing accuracy is {}'.format(test_accuracy))
    cm2 = sklearn.metrics.confusion_matrix(y_test.to_numpy().reshape(y_test.to_numpy().size, 1),
                                           y_test_pred.iloc[:, 1].to_numpy().reshape(y_test_pred.iloc[:, 1].to_numpy().size, 1))
    print('With', i, 'threshold the Confusion Matrix is', '\n', cm2, '\n',
          'with', cm2[0, 0] + cm2[1, 1], 'correct predictions and', cm2[1, 0], 'Type II errors (False Negatives)', '\n\n',
          'Sensitivity:', cm2[1, 1] / (float(cm2[1, 1] + cm2[1, 0])), 'Specificity:', cm2[0, 0] / (float(cm2[0, 0] + cm2[0, 1])), '\n\n\n')

Output:

Part 1:

Part 2:

Here we can see that lowering the threshold increases sensitivity, while raising it past a certain value makes sensitivity flatten out or drop. An ROC curve can summarise this trade-off more clearly.

ROC curve

This curve summarises the performance of a classification model across all classification thresholds. Fundamentally, it shows the trade-off between the model's sensitivity and specificity.
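
Concretely, each point on the curve corresponds to one threshold and plots the true positive rate against the false positive rate (standard definitions, added here for reference):

True positive rate (sensitivity) = TP / (TP + FN)

False positive rate = FP / (FP + TN) = 1 - specificity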

 

 

y_pred2 = logreg.predict_proba(X_test)
fpr, tpr, thresholds = sklearn.metrics.roc_curve(y_test, y_pred2[:, 1])

plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for classifier')
plt.xlabel('False positive rate (1-Specificity)')
plt.ylabel('True positive rate (Sensitivity)')
plt.grid(True)

Output:

 

Looking at the curve, the ideal position is the top-left corner, because a good classifier produces a high true positive rate at a low false positive rate. Let's check the area under the curve.

Area Under the Curve(AUC)

The area under the curve summarises the model's discriminative ability: a larger area means a better separation between true positives and false positives. An AUC of 0.5 corresponds to random guessing, and values closer to 1 indicate better classification.

print('Area Under The Curve (AUC): ', sklearn.metrics.roc_auc_score(y_test, y_pred2[:, 1]))

Output:

Looking at the AUC, we can say our model is a moderate classifier.

Final words

In this article, we have seen how to perform classification modelling with logistic regression. During the process, we explored the data for insights and then used forward feature selection to choose the right features and improve accuracy. We have also gone through the assumptions we need to take care of during modelling and the techniques we can use to evaluate the model properly. In the end, we can draw the following conclusions:

  • Forward feature selection helped find significant features that impact heart disease prediction.
  • Data analysis showed that men smoke more and are more susceptible to heart disease.
  • Total cholesterol is an important predictor of heart disease.
  • The model we built is more specific than sensitive.
  • AUC is around 0.62, which makes the model a moderate classifier.
  • The model needs more data for the positive class so that the class imbalance problem can be addressed; one hedged mitigation is sketched below.
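
As an optional, hedged experiment that is not part of the original article, scikit-learn's class_weight='balanced' option re-weights the minority class during training and is one way to probe the imbalance issue without collecting new data (variable names follow the splits created above):

# Hypothetical follow-up: re-fit with balanced class weights to counter the class imbalance.
logreg_bal = LogisticRegression(class_weight='balanced')
logreg_bal.fit(X_train, y_train)
print(sklearn.metrics.classification_report(y_test, logreg_bal.predict(X_test)))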


End-to-End Linear Regression Modelling

Linear regression is one of the most basic techniques we use when the target variable is continuous. On a basic level, it is the point from which regression modelling starts, and working through it makes the procedures of regression modelling easy to understand.

In one of our articles, we have already demystified linear regression and how it works using synthetic data. We have also discussed the assumptions we need to consider before modelling and how to evaluate the model using different metrics. This article will discuss how we can model real-world data with a linear regression model.

To do this, we have chosen the medical cost personal dataset and will cover the following points:

  • Data Exploration
  • Assumption
  • Data Processing
  1. Data-Encoding
  2. Log Transformation
  3. Data Splitting
  • Model Building
  • Model Evaluation

Data Exploration

Before modelling any data, understanding it is very important, and data exploration is the step that tells us the most about it. So let's start by loading the data.

import pandas as pd

data = pd.read_csv('/content/insurance.csv')

Checking the data.

data.head(10)

Output:

We can see that there are seven variables in the data; three of them are categorical and four are numerical. Below is an explanation of these variables:

  1. Age: age of the person.
  2. Sex: gender of the person.
  3. BMI: body mass index.
  4. Children: number of children the person has.
  5. Smoker: smoking habit (yes/no).
  6. Region: region the person belongs to.
  7. Charges: medical cost.

In this dataset, 'charges' is the dependent variable and the other variables are independent. Let's get a clearer view of the data.

print('shape of the data', data.shape)

data.describe()

Output:

So the data has 1338 rows and 7 variables, and the describe output also shows the distribution of the numerical variables through basic statistics.

Let’s check for the null values.

import missingno as ms

ms.bar(data)

Output:

There are no null values in the data. Here I have used a Python package named missingno to visualise missing values; it provides several functions for visualising and dealing with them. Let's move to the next section.
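
For instance, besides the bar chart, missingno can also draw a nullity matrix (an optional illustration, not from the original article; with no missing values it simply appears as a solid block):

ms.matrix(data)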

From this point, we will try to understand the data by visualising it. Let's start with the distribution of the 'age' variable.

import warnings
warnings.filterwarnings('ignore')

import seaborn as sns
import matplotlib.pyplot as plt

fig, axs = plt.subplots(nrows=2, figsize=(8, 10))
sns.boxplot(data['age'], ax=axs[0]).set(title='age distribution-boxplot')
sns.distplot(data['age'], ax=axs[1]).set(title='age distribution-histogram')
plt.show()

Output:

Here we can see that the majority of people are in the 30 to 50 age range, and from the describe output above we already know that the minimum age in the data is 18 and the maximum is 64. Let's check the age distribution according to gender.

import plotly.express as px

fig = px.histogram(data, x="age", color="sex", title='Age distribution according to sex', nbins=60, width=600, height=400)
fig.show()

Output:

 

From the above graph, we can observe that the ages of both genders are spread throughout the data, with more than 60 people in the 18 to 20 age range alone. Let's move on to the next variable, BMI.

The BMI distribution according to gender.

fig = px.histogram(data, x="bmi", color="sex", title='bmi distribution according to sex')
fig.show()

Output:

fig = px.histogram(data, x="sex", color='children', barmode='group', title='sex vs children count', width=600, height=400)
fig.show()

Output:

From the above visualisation, we can observe that most people don't have children, which fits with the earlier observation that a large share of the data is from people aged 18 to 20. Let's move on to our target variable.

fig, axs = plt.subplots(nrows=2, figsize=(15, 15))
sns.boxplot(data['charges'], ax=axs[0]).set(title='charges distribution-boxplot')
sns.distplot(data['charges'], ax=axs[1]).set(title='charges distribution-histogram')
plt.show()

Output:

Looking at the above visualisation, we can say that the dependent variable is not normally distributed; it is right-skewed.

fig = px.box(data, x=['charges'], y=data['sex'], color='sex', title='charges vs sex', width=600, height=400)
fig.show()

Output:

By looking at the above visualisation we can say that men’s charges are generally higher than women’s. Now let’s see how it changes with smoking habits.

fig = px.box(data, x=['charges'], y=data['smoker'], color='sex', title='charges vs smoker', width=600, height=400)
fig.show()

Output: 

It is now visible that medical charges for smokers are higher than for non-smokers. Let's analyse the charges by region.

fig = px.box(data, x=['charges'], y=data['region'], color='sex', title='charges vs region', width=600, height=400)
fig.show()

Output:

It shows the same pattern of men having more charges than women. Let’s check the charges with BMI.

px.scatter(data, y='charges', x='bmi', color='charges', title='charges vs bmi')

Output:

Here we can see that there is a little correlation between BMI and charges. One thing still pending is a proper correlation check; since the linear regression assumptions are partly based on correlation, we will do that in the next section.

Assumption

In the last article, we discussed that there are mainly four assumptions associated with any linear regression model:

  • Linearity
  • Homoscedasticity
  • Independence
  • Normality

By looking at how the variables were collected, we can say the observations are independent. For linearity, however, we need to check the correlation matrix, so let's inspect it with a heat-map visualisation of the data.

corrmat = data.corr(numeric_only=True)  # only numeric columns
fig = px.imshow(corrmat, text_auto=True)
fig.show()

Output:

No numeric variables are highly correlated with each other, and, notably, the dependent variable is not strongly correlated with the numeric independent variables either. However, from the box plots above we have seen that the sex and smoker variables are related to the charges variable.

So here, three assumptions (linearity, normality and independence) are checked. Homoscedasticity is assessed at modelling time, so we will validate it later. Now we are ready to preprocess the data.

Data Preprocessing

This section preprocesses the data according to the requirements of scikit-learn's linear regression model. The first step is converting non-numerical data into numerical data.

Data-Encoding

There are various ways to do this, but in this article we will use the dummy variable method for converting the categorical data. Let's see how:

encoded_data = pd.get_dummies(data=data, prefix='OHE', prefix_sep='_',
                              columns=['sex', 'children', 'smoker', 'region'],
                              drop_first=True,
                              dtype='int8')
encoded_data

Output:

Now we have 13 columns in the encoded dataset: with drop_first=True, each categorical column with k categories becomes k-1 dummy columns (1 for sex, 5 for children, 1 for smoker and 3 for region), which together with age, bmi and charges gives 13 columns.

Log transformation

Now let's get back to the dependent variable. In the visualisations above we saw that it is strongly right-skewed. In regression analysis, we prefer the dependent variable to be approximately bell-curve distributed so that the model is not dominated by extreme values.

To improve the data distribution, we can take the help of log transformation. Let’s see how it works.

import numpy as np

fig, axs = plt.subplots(nrows=2, figsize=(8, 10))
sns.distplot(data['charges'], ax=axs[0]).set(title='charges distribution-histogram')
encoded_data['charges'] = np.log(encoded_data['charges'])
sns.distplot(encoded_data['charges'], ax=axs[1]).set(title='log-transformed charges distribution-histogram')
fig.show()

Output: 

Above, we can see that the dependent variable is no longer skewed as it was before; after the log transformation it is much closer to a normal distribution. Let's split the data.

Splitting

Data splitting lets us train and then test the model on separate subsets. We will use scikit-learn's train_test_split function to split the final data into train and test sets.

from sklearn.model_selection import train_test_split

train, test = train_test_split(encoded_data, test_size=0.2, random_state=42)

X_train = train.drop('charges', axis=1)
X_test = test.drop('charges', axis=1)
y_train = train['charges']
y_test = test['charges']

Let’s check the shape of split datasets.

print('shape of train data', X_train.shape, y_train.shape)
print('shape of test data', X_test.shape, y_test.shape)

Output:  

Now we are ready to model our dataset using the linear regression algorithm.

Model Building

In the section above, we processed our data. Using this processed data, we now need to find the optimal values of the parameters of the linear regression hypothesis function. As discussed in the last article, the hypothesis function for this dataset looks as follows:

h(x_i) = θ0 + θ1·age + θ2·sex + θ3·bmi + θ4·children + θ5·smoker + θ6·region

Let’s have a look at the data where i = 1.

data.head(1)

If these are the values of the variables, we can write the hypothesis function for this row as:

h(x_1) = θ0 + θ1·19 + θ2·female + θ3·27.9 + θ4·0 + θ5·yes + θ6·southwest

Here we just need to find the right values of all the parameters (intercept and slopes), which are represented as θ in the equation.

In Python, we have many options for fitting a linear regression model; in this article, we will fit it using the sklearn library.

The LinearRegression estimator is available under sklearn's linear_model package. Let's fit it on our data.

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()

lin_reg.fit(X_train,y_train)

Output:

Let’s see the values of intercept and slope we get.

parameter = ['theta_' + str(i) for i in range(encoded_data.shape[1])]
X = encoded_data.drop('charges', axis=1)
columns = ['intercept: x_0=1'] + list(X.columns.values)
skl_theta = [lin_reg.intercept_] + list(lin_reg.coef_)
parameter_df = pd.DataFrame({'Parameter': parameter, 'Columns': columns, 'theta': skl_theta})
parameter_df

Output:

The above DataFrame lists the intercept and all the coefficients of our multiple linear regression model. Next, we need to evaluate the model; a small sanity check follows below.
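
As an optional sanity check that is not part of the original article, we can reproduce one prediction by hand from these parameters; the result should match lin_reg.predict for the same row:

# Hand-compute h(x) = intercept + coef · x for the first test row (optional check).
x_row = X_test.iloc[0].values
manual_pred = lin_reg.intercept_ + np.dot(lin_reg.coef_, x_row)
print(manual_pred, lin_reg.predict(X_test.iloc[[0]])[0])  # the two values should agree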

Model Evaluation

To evaluate the model, we first generate predictions on the test data.

y_pred = lin_reg.predict(X_test)

len(y_pred)

Output:

To evaluate the model, we can use two metrics (standard definitions are given after this list):

  1. MSE (mean squared error)
  2. R-squared (coefficient of determination)
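
For reference, these are the standard definitions, where ŷ_i is the prediction for row i, ȳ is the mean of the observed values and n is the number of test samples:

MSE = (1/n) Σ (y_i - ŷ_i)²

R² = 1 - Σ (y_i - ŷ_i)² / Σ (y_i - ȳ)²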

from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)
print('The Mean Square Error (MSE) is: ', mse)

Output: 

This error metric should be as close to zero as possible, and here we see a very low MSE value (note that it is computed on the log-transformed charges). Now let's check the model with R-squared, the coefficient of determination.

R_square = lin_reg.score(X_test, y_test)
print('R square obtained with the scikit-learn library is:', R_square)

Output:

The obtained R-squared value is in a good range. As a rough rule of thumb, an R-squared above 0.7 is considered good for this kind of data, and ours is around 0.8, so the model fits this data well.

As discussed above, one primary assumption is left to validate. In statistics, homoscedasticity means that the variance of the errors in the dependent variable is the same across all levels of the independent variables; we can think of it as homogeneity of variance. To validate it, we can draw a scatter plot of the residuals against the fitted values: if heteroscedasticity is present, the plot will exhibit a funnel shape.

# Check for linearity
f = plt.figure(figsize=(14, 5))
ax = f.add_subplot(121)
sns.scatterplot(x=y_test, y=y_pred, ax=ax, color='r')
ax.set_title('Actual Vs Predicted value \n linearity test')

# Check for residual normality and mean
ax = f.add_subplot(122)
sns.distplot((y_test - y_pred), ax=ax, color='b')
ax.axvline((y_test - y_pred).mean(), color='k', linestyle='--')
ax.set_title('Residual error \n normality test');

Output:

Linear regression also assumes that the residuals are (multivariate) normally distributed, which is usually validated with a Q-Q plot.

# Quantile-Quantile plot
from scipy import stats

f, ax = plt.subplots(1, 2, figsize=(14, 6))
_, (_, _, r) = stats.probplot((y_test - y_pred), fit=True, plot=ax[0])
ax[0].set_title('Check for Multivariate Normality: \nQ-Q Plot')

# Check for homoscedasticity
sns.scatterplot(y=(y_test - y_pred), x=y_pred, ax=ax[1], color='r')
ax[1].set_title('Check for Homoscedasticity: \nResidual Vs Predicted');

Output:

Here we can see that the predicted and actual values follow a similar trend, the residual distribution is slightly right-skewed, and the residual-versus-predicted plot shows some heteroscedastic behaviour, so beyond a certain point the errors grow.

Final word

In this article, we have performed multiple linear regression on the medical cost personal dataset. We explored the data to understand it and preprocessed it so that it could be used with scikit-learn's LinearRegression estimator. Finally, we checked the assumptions of a linear regression model and evaluated it using important metrics and visualisation techniques.
