Polynomial Regression with K-fold Cross-Validation
Hey, I'm back, and I hope you're all doing well. I have an interesting topic today: we will not only learn to develop a new model but also how to choose the right one.
We’ll cover the following topics today:
- What’s Polynomial Regression?
- Introduction to k-fold Cross-Validation
- Use case and Implementation in Python
Let’s get started!
1. What’s Polynomial Regression?
Polynomial Regression is a statistical technique for predicting a continuous (response) variable by taking into account higher powers of the predictor variable when the relationship between the predictor and the response is not linear. It is a special case of Multi-Linear Regression. Wait, what? How can it be part of linear regression when it considers higher powers of the predictor variable? The answer comes down to a very basic fact:
The term linear in linear regression applies to the parameters (beta0, beta1, etc.), not to the predictors.
Also, each higher power of the predictor is treated as a new predictor: if we have a predictor x and a response y whose relationship is not exactly linear, and we wish to develop a model with higher powers of x, then, for instance, x² will be treated as another predictor with its own regression coefficient. Let's look at the equation and get a clearer picture.
We are only considering one predictor x here, and each higher power we include forms a new predictor:

yᵢ = β₀ + β₁xᵢ + β₂xᵢ² + … + βₖxᵢᵏ + εᵢ
Note: i represents the ith observation here and k represents the kth power of the polynomial.
Now is the right time to answer: how is it a special case of Multi-Linear Regression? Every new higher-degree term added to the polynomial makes a new predictor, so we have multiple predictors, and we can substitute the higher powers with dummy variables so the problem becomes exactly a multi-linear regression. This will make it clear: if we set z₁ = x, z₂ = x², …, zₖ = xᵏ, the model becomes

y = β₀ + β₁z₁ + β₂z₂ + … + βₖzₖ + ε

which is an ordinary multi-linear regression in the predictors z₁, …, zₖ.
The beta values can be calculated the same way as in multi-linear regression, and I hope this is clear now. Any doubts, feel free to ask in the comment section.
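To make this concrete, here is a minimal sketch of my own (the known quadratic and variable names are just for illustration) showing that fitting a polynomial really is just multi-linear regression on the power columns:

import numpy as np
from sklearn.linear_model import LinearRegression

x=np.linspace(0,2,50)
y=1+2*x-3*x**2  # a known quadratic, no noise

# Build the "dummy" predictors by hand: one column for x, one for x^2
X=np.column_stack([x,x**2])

# An ordinary multi-linear regression on these columns recovers
# the polynomial coefficients beta0, beta1, beta2
model=LinearRegression().fit(X,y)
print(model.intercept_,model.coef_)  # approx 1.0 and [2.0, -3.0]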
2. Introduction to k-fold Cross-Validation
k-fold Cross-Validation is a technique for model selection where the training data set is divided into k equal groups (folds). The first group is used as the validation set and the model is fit on the remaining k-1 groups. This process is repeated for the other k-1 folds, so each fold takes a turn as the validation set while the remaining k-1 folds act as training data. In each iteration the validation MSE is calculated, and after k iterations the Cross-Validation MSE is given as the average over the folds:

CV(k) = (1/k) × (MSE₁ + MSE₂ + … + MSEₖ)
This validation MSE is our estimate of the test MSE. You might have noticed that in the definition of k-fold cross-validation I said we split the training data set; that is because validation is only for model selection, while the test data set is what we use to measure the final performance of our model, so keep that in mind. You can keep as little as 10% of your data as a test set and leave an ample amount of data for training.
I will discuss model selection with cross-validation further in my next article; for now, this introduction is enough.
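As a quick illustration (a sketch of my own, with made-up data, not code from this article), the k folds can be built explicitly with scikit-learn's KFold, averaging the validation MSE over the k iterations:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng=np.random.default_rng(0)
X=rng.uniform(0,2,size=(100,1))
y=3*X[:,0]+rng.normal(0,0.5,size=100)

kf=KFold(n_splits=5,shuffle=True,random_state=0)
fold_mse=[]
for train_idx,val_idx in kf.split(X):
    # fit on k-1 folds, validate on the held-out fold
    model=LinearRegression().fit(X[train_idx],y[train_idx])
    fold_mse.append(mean_squared_error(y[val_idx],model.predict(X[val_idx])))

cv_mse=np.mean(fold_mse)  # the cross-validation MSE
print(cv_mse)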
3. Use case and Implementation in Python
We'll not work on an external data set today; we'll generate one. I'll tell you how, but first things first: importing the necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
Great! Now let's generate our data. To do so we'll use NumPy to generate a random signal; if you want, you can use a data set or your own random signal:
x=np.linspace(0,2,300)
x=x+np.random.normal(0,.3,x.shape)
y=np.cos(x)+2*np.sin(x)+3*np.cos(x*2)+np.random.normal(0,1,x.shape)
Notice that I have also added some white Gaussian noise here to scatter the points; otherwise it's just a perfect curve. So what are we waiting for? Let's quickly plot our data and have a sneak peek:
fig,ax=plt.subplots(figsize=(6,6))
ax.scatter(x,y)
ax.set_xlabel('X Values',fontsize=20)
ax.set_ylabel('cos(x)+2sin(x)+3cos(2x)',fontsize=20)
ax.set_title('Scatter Plot',fontsize=25)
Hmmm, looks a bit like a duck, but okay!
Let's make an initial estimate: what do you think would be the right degree for the polynomial? Comment it down below. I would choose 3, as the curve has two inflections; that'd be my estimate. Now let's code it and find the real one out, eh?
Now we split our data, keeping 20% for the test set and the rest as training data, and we will use cross-validation on the training portion for model selection. But before that, reshape your signal, as the regression function expects a 2D array for the predictors.
x=x.reshape((300,1))
x_train,x_test,y_train,y_test=train_test_split(x,y,train_size=0.8)
Awesome, Now let’s find out which model to choose:
## Let's find out the model we need to select
maxdegree=7  # we will test polynomial degrees 1 through maxdegree-1
training_error=[]
cross_validation_error=[]
for d in range(1,maxdegree):
    # Generate the power features x, x^2, ..., x^d (plus a bias column)
    x_poly_train=PolynomialFeatures(degree=d).fit_transform(x_train)
    x_poly_test=PolynomialFeatures(degree=d).fit_transform(x_test)
    # fit_intercept=False because PolynomialFeatures already adds the bias column
    lr=LinearRegression(fit_intercept=False)
    model=lr.fit(x_poly_train,y_train)
    y_train_pred=model.predict(x_poly_train)
    mse_train=mean_squared_error(y_train,y_train_pred)
    # 5-fold cross-validation on the training data only
    cve=cross_validate(lr,x_poly_train,y_train,scoring='neg_mean_squared_error',cv=5,return_train_score=True)
    training_error.append(mse_train)
    cross_validation_error.append(np.mean(np.absolute(cve['test_score'])))

fig,ax=plt.subplots(figsize=(6,6))
ax.plot(range(1,maxdegree),cross_validation_error)
ax.set_xlabel('Degree',fontsize=20)
ax.set_ylabel('MSE',fontsize=20)
ax.set_title('MSE VS Degree',fontsize=25)
Here, note that the PolynomialFeatures function is used to automatically generate the power features up to degree d.
The cross_validate function is used to measure our cross-validation scores; the cv parameter here is just the k value, the number of folds or groups we want to divide our data into. I have taken it as 5, and I will tell you why in the next article. This function returns a dictionary with test and train scores; we select 'test_score' for our requirement, and we have to manually average its values, as it holds one score per fold.
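For example (a quick sketch of my own, run after the loop above), you can inspect what cross_validate returns:

print(cve.keys())          # includes 'fit_time', 'score_time', 'test_score', 'train_score'
print(cve['test_score'])   # five negative MSE values, one per fold
print(np.mean(np.absolute(cve['test_score'])))  # the cross-validation MSE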
So clearly the lowest MSE is for the degree 5 polynomial, and our initial guess of 3 was wrong. Now let's check the performance of our model, taking a degree 5 polynomial and building it the same way we built the models above.
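Here is a minimal sketch of that final model, reusing the variables from above (the degree is the one cross-validation picked, and the evaluation step is my own suggestion using the metrics we already imported):

# Fit the selected degree-5 polynomial on the full training set
x_poly_train=PolynomialFeatures(degree=5).fit_transform(x_train)
x_poly_test=PolynomialFeatures(degree=5).fit_transform(x_test)
lr=LinearRegression(fit_intercept=False)
model=lr.fit(x_poly_train,y_train)

# Evaluate on the held-out test set
y_test_pred=model.predict(x_poly_test)
print(mean_squared_error(y_test,y_test_pred))  # test MSE
print(r2_score(y_test,y_test_pred))            # R^2 on the test set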
We can say that our model is pretty good. And if you compare the test MSE with the validation error at degree five, you'll see they are very close, so the cross-validation error was a good estimate.
I am a beginner in Machine Learning, so if there are any mistakes or if you have feedback, please feel free to share it.