
Getting Started with Machine Learning Models: Polynomial Regression

In earlier articles in this series, we discussed how linear and logistic regression work. In this article, we will discuss the polynomial regression model. It is quite similar to linear regression because we use it for regression modelling in the same way; what makes it different is the flexibility of the regression line: rather than fitting a straight line, it uses a curve to model the data points.

There are various cases in real life where linear regression is not useful because the data does not have a linear relationship between its variables but a non-linear one. In such cases, the polynomial regression algorithm can be a valuable tool for modelling the data.

In this article, we are going to discuss the following points on polynomial regression:

  • What is polynomial regression?
  • Why do we use polynomial regression?
  • Linear regression vs polynomial regression

What is Polynomial Regression?

We can consider polynomial regression a special case of the linear regression model: it models the relationship between the variables as an nth-degree polynomial, so it can fit non-linear data points while remaining linear in its coefficients.

To understand how it works, take a dataset with two variables, where X is the independent variable and Y is the dependent variable. When we feed the data into the model, we transform the input variable into polynomial terms of degree n, i.e. X⁰, X¹, X², …, Xⁿ, and these polynomial terms capture the non-linear relationship between the variables. The mathematics behind the polynomial regression model is similar to that of linear regression. Mathematically, the polynomial regression equation can be represented as

Y = β₀ + β₁X + β₂X² + … + βₙXⁿ + ε

where β₀, …, βₙ are the coefficients and ε is the error term.
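As a quick illustration (a minimal sketch using sklearn's PolynomialFeatures, the same class used in the implementation section below), this is how a single input column is expanded into polynomial terms:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One feature column with three sample values
X = np.array([[2.0], [3.0], [4.0]])

# Expand into [X^0, X^1, X^2, X^3], i.e. degree n = 3
poly = PolynomialFeatures(degree=3, include_bias=True)
print(poly.fit_transform(X))
# [[ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]
#  [ 1.  4. 16. 64.]]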

The degree of the polynomial is a hyperparameter, and we need to choose it wisely when modelling. One major problem with this model is that a polynomial of too high a degree tends to overfit, while one of too low a degree tends to underfit, so finding the optimal degree of the polynomial is the main challenge of such modelling.

The Right Degree of Polynomial can be Found in Two Ways:

  • Forward selection: using this method, we increase the degree until we find the best-fit or optimal model (a sketch of this idea follows the list below).
  • Backward selection: using this method, we start from a high degree and decrease it until we find the best-fit or optimal model.
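As a sketch of the forward-selection idea (the maximum degree, 5-fold cross-validation, and R² scoring here are illustrative assumptions, not the only reasonable choices), we can compare cross-validated scores across degrees and stop once the score stops improving:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def forward_select_degree(X, y, max_degree=10):
    # Increase the degree until cross-validated R-squared stops improving
    best_degree, best_score = 1, -np.inf
    for degree in range(1, max_degree + 1):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
        if score <= best_score:
            break  # no improvement: keep the previous degree
        best_degree, best_score = degree, score
    return best_degree, best_score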

Before applying the polynomial regression model, we need to consider some of its assumptions, stated below:

Assumptions of Polynomial Regression

  • The behaviour of the dependent variable can be explained by a linear, curvilinear, or additive relationship between the dependent and independent variables.
  • Independent variables need to be independent of each other.
  • Errors should be independent and normally distributed with constant variance and mean zero.

Note: when the degree of the polynomial in polynomial regression is one, the model is equivalent to the simple linear regression model.
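We can verify this note with a toy sketch (assuming sklearn is available; the data here is made exactly linear, so both fits recover the same slope and intercept):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(10).reshape(-1, 1)
y = 3 * X.ravel() + 2  # exactly linear data

linear = LinearRegression().fit(X, y)
poly1 = LinearRegression().fit(PolynomialFeatures(degree=1).fit_transform(X), y)

# Both models recover slope 3 and intercept 2; the degree-1 transform only
# prepends a bias column, whose coefficient ends up at zero
print(linear.coef_, linear.intercept_)
print(poly1.coef_, poly1.intercept_)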

Why do we use Polynomial Regression?

Looking back at the last article, we saw that linear regression can be applied to datasets where the values follow a linear relationship, as given in the image below.

When we take examples of real-life datasets, we often find that the values do not follow a linear relationship. For example, data collected on the salaries of employees from different departments shows uneven variance. In such a situation, the dataset can look like the following:

Let’s say that, using these data points, we have drawn a regression model as given below:

Using this model, we predicted the salary of an employee with 6.5 years of experience. The model predicts it to be somewhere between 40,000 and 60,000, but looking at the data points, we can easily say that it is around 20,000. To bridge this gap in prediction, we use polynomial regression models.

So whenever we perform regression on data with a non-linear relationship between the variables, we can use polynomial regression. Now let’s check how we can implement polynomial regression in Python while comparing it with linear regression.

Linear Regression vs Polynomial Regression

In the above sections, we discussed polynomial regression and where we can use it. In this section, we will compare the effect of applying the linear regression model and the polynomial regression model to non-linear data. To do so, we will generate synthetic data using the NumPy library. Let’s start by creating the data:

import numpy as np

# A single feature column and a sine-shaped target
X = np.arange(0, 13, 0.1).reshape(-1, 1)
Y = np.sin(X).ravel()

print(X)

Output:

Here we have made the data in sine form so that the two variables follow the relationship below.

Y = sin(X)

These variables don’t have a linear relationship. Let’s plot the data to verify this.

import matplotlib.pyplot as plt

# Scatter plot of the generated data
plt.figure(1)
plt.axhline(y=0, color='k')
plt.grid(True)
plt.scatter(X, Y, color='blue')
plt.show()


Output:

Here we can see that the data we have generated follows the sine waveform. Now, let’s split the data into train and test sets.

from sklearn.model_selection import train_test_split

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

Now let’s fit the linear regression model and check the performance.

from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

# Fit plain linear regression as a baseline
model1 = LinearRegression().fit(X_train, y_train)
y_pred = model1.predict(X_test)
print(r2_score(y_test, y_pred))

Output:

# Fitted line (green) against the actual test points (blue dots)
plt.plot(X_test, y_pred, color='g')
plt.plot(X_test, y_test, "b.")

Output:

In the plot above, we can clearly see how badly linear regression performs on our data.

Now, let’s model the data using polynomial regression.

In Python, there are various ways to implement polynomial regression. In this article, we will use sklearn’s PolynomialFeatures class to transform the data into polynomial form, and then we will model the transformed data using the linear regression model.

Let’s start the procedure by transforming the data.

from sklearn.preprocessing import PolynomialFeatures

# Expand the single feature into terms of degree 0 through 5
poly = PolynomialFeatures(degree=5, include_bias=True)
X_train_transf = poly.fit_transform(X_train)
X_test_transf = poly.transform(X_test)

Let’s fit the linear regression model on transformed data and check the performance.

from sklearn.metrics import mean_squared_error

# Fit linear regression on the polynomial features
model2 = LinearRegression().fit(X_train_transf, y_train)
y_pred = model2.predict(X_test_transf)
print('R-square score=>', r2_score(y_test, y_pred))
print('RMSE for Polynomial Regression=>', np.sqrt(mean_squared_error(y_test, y_pred)))

Output:

Here we can see that the R-squared value has improved, and the RMSE is low for this regression. Let’s plot the results.

plt.plot(X_test, y_pred, "r.", linewidth=2)
plt.plot(X_test, y_test, "b.")
plt.xlabel('X', fontsize=12)
plt.ylabel('Target', fontsize=12)
plt.legend(['predictions', 'original'])

Output:

Here, we are getting predictions much closer to the original values. To verify further, we can generate another dataset and predict on it.

Let’s create another set of sine-wave data and use the polynomial regression model trained above to predict new values.

# New sine-wave data over a different range
X_new = np.arange(0, 8, 0.1).reshape(-1, 1)
Y_new = np.sin(X_new).ravel()

# Reuse the degree-5 transformer already fitted above
X_new_trans = poly.transform(X_new)
y_pred_new = model2.predict(X_new_trans)

Let’s plot the predicted and actual values again.

plt.plot(X_new, y_pred_new, "r.", linewidth=2)
plt.plot(X_new, Y_new, "b.")
plt.xlabel('X', fontsize=12)
plt.ylabel('Target', fontsize=12)
plt.legend(['predictions', 'original'])

Output:

Looking at the above scenarios, we can say that our polynomial model is a clear improvement over the plain linear regression model.

Final Words

In this article, we introduced the polynomial regression model. We can think of it as an improvement on the simple linear regression model, used when the dependent and independent variables are not linearly but curvilinearly related. To check its capabilities, we used sine-wave data and tried to model it with both linear and polynomial regression models. We found that at polynomial degree 5, we got a better-fit model than the linear regression model.
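As a closing note, the transform-then-fit steps above can also be chained with sklearn’s Pipeline, which avoids transforming the train and test sets by hand. A minimal sketch, assuming the same X_train/y_train and X_test/y_test split as above:

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same idea as model2 above: degree-5 expansion followed by a linear fit
poly_model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
poly_model.fit(X_train, y_train)
print(poly_model.score(X_test, y_test))  # R-squared on the held-out data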
