Multi-Linear Regression
Hello, Today we will talk about multi-linear regression and we will see a better way to implement our model. The following topics will be making up for the article:
- Introduction to Multiple Linear Regression
- Implementation in Python
Let’s Begin!
1. Introduction to Multi-Linear Regression
In linear regression we used only 1 predictor and found it’s relationship with our response variable. what if there are more than one predictors available to us, how do we find the relationship between them and the response y ?
We solve this problem using Multi-Linear Regression. It uses multiple predictor variables to predict y and is an extension to simple linear regression. In real world scenarios, you’ll come across problems associated to it more than simple linear regression.
It makes the following assumptions on the data:
- Linear relationships between independent variables and dependent variable. Note that in real world it may not be 100% linear.
- Multivarate Normality, the data should come from a normal distribution and the residuals should be normally distributed.
- Indepedent Observations, observations should be independent from one another.
1.1 Calculations
The equation for multi-linear regression is given below:
To estimate beta values, let’s take a break from Linear Algebra and get in to matrices. Assume we have Y as a column matrix and X a row matrix, where:
Here, j is the number of predictors and n is the sample size. This will give you the optimal values of betas.
Now let’s move on to the implementation.
2. Implementation in Python
We will use the Boston Housing dataset, it is already available in sklearn and I will tell you how to get that. Let’s get started then.
What do first, Libraries? Yeah !
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import seaborn as sns
from sklearn.datasets import load_boston
import scipy.stats as stats
%matplotlib inline
Yes, the data set now:
df=load_boston()
x=df.data
y=df.target
Here load_boston() returns a json with predictors as data and response variable as target. We should convert it into a data frame first for simplicity as x and y are both numpy arrays for now.
#Create Column Names and Pandas Data Frame X for our simplicfication
cols=[]
for i in range(13):
cols.append(f'X{i}')
print(f'The column Names are {cols}')
X=pd.DataFrame(data=x,
columns=cols)
X.head()
Congrats, Data Frame successfully created and we have also given dummy names to our columns.
Now let’s see if our assumptions on the data are correct or not for a better model development which will lead us to select important features only.
2.1 Univarate Normality Analysis
We’ll check for skewness(measure of symmetry) and kurtosis(a measure to tell if the data is heavy tailed or lightly tailed) in the data, Typically if the skewness coefficient is in range[-1,1] that data is considered to be normally distributed and kurtosis coeff between [-3,3] that says the data is neither to flattened and nor too prone to outliers.
skew_kurt=[]
skewness=stats.skew(X)
kurtosis=stats.kurtosis(X)
for i,j in enumerate(X):
skew_kurt.append([j,skewness[i],kurtosis[i]])
skew_kurt=np.array(skew_kurt)
print(skew_kurt)
Great, we’ve completed our univaratenormality analysis and now we will only take the predictors which are in the above given ranges and hence we will drop X0, X1, X3. X7, X8 and X11.
2.2 Multivarate Normality Analysis
We do it by using sns and building pair plots to view correlation between predictors and we also get to see a univarate normality analysis using histograms. Let’s check it out:
X.drop(['X0','X1','X3','X7','X8','X11'],inplace=True,axis=1)
sns.pairplot(X)
Here you can see the distribution plots on the diagonals and correlations between each predictor. Let me know you inferences in the comments.
But to see correlation more closely we’ll create a matrix with the response variable,
X['target']=y
X.corr()
Now this more clear, eh? We will take consider strong correlation between two variables if it correlation coeff is greater than 0.7. Clearly from the above matrix we have a strong correlation between:
X2 and X4
X2 and X9
X4 and X6
Also in all the variables X5 is best positively correlated with our target variable. But we wish to eliminate the correlation between the predictors and hence we will drop column X4 and X9 of the above pairs from X. An important think to note is when you cannot decide which column to drop between to strongly correlated ones, drop the one with higher variance.
Now that we have done the normality tests, let’s check if the selected predictors have a linear relationship with our response variable or not and we will do it with Regression Analysis.
2.3 Regression Analysis
It is carried out by performing Simple Linear Regression; regressing our response variable on each of the predictor variable and observe the relationship.
x=X.drop(['target','X9','X4'],axis=1)
fig,ax=plt.subplots(nrows=3,ncols=2, figsize=(15,15))
index=[]
t=0
for i in range(3):
for j in range(2):
index.append((i,j))
print(index)
for i in x:
x_train,x_test,y_train,y_test=train_test_split(x[[i]],y,train_size=0.7)
lr=LinearRegression()
model=lr.fit(x_train,y_train)
y_train_pred=model.predict(x_train)
# print(f'Train R2 Score{i}',r2_score(y_train,y_train_pred))
y_test_pred = model.predict(x_test)
# print(f'Test Error for {i}',mean_squared_error(y_test,y_test_pred))
# print(f'Train R2 Score for {i}',r2_score(y_test,y_test_pred))residuals=abs(y_test-y_test_pred)
if(t<8):
ax[index[t][0],index[t][1]].scatter(x[i],y,label='Original')
ax[index[t][0],index[t][1]].plot(x_test,y_test_pred,label='Prediction')
ax[index[t][0],index[t][1]].set_xlabel(f'{i}',fontsize=20)
ax[index[t][0],index[t][1]].set_ylabel('target',fontsize=20)
t+=1
plt.tight_layout()
In the code cell, I have also provided code to compute r2 and mse for each regression, if you would like to try it.
We can see the linear relationship above between the predictors and the response hence our this assumption is also true for the data.
2.4 Model Developement
So now that we our all ready, let’s do it.
If we you followed my last two articles, you’d know that it’s time to split that data:
x_train,x_test,y_train,y_test=train_test_split(x,y,train_size=0.7)
Awesome, Let’s build our model now:
lr=LinearRegression()
model=lr.fit(x_train,y_train)
y_train_pred=model.predict(x_train)
print('Train Error',mean_squared_error(y_train,y_train_pred))
print('Train R2 Score',r2_score(y_train,y_train_pred))
y_test_pred = model.predict(x_test)
print('Test Error',mean_squared_error(y_test,y_test_pred))
print('Train R2 Score',r2_score(y_test,y_test_pred))
residuals=abs(y_test-y_test_pred)
sns.displot(residuals,bins=20)
We good an appreciable R² score for this model, eh? and our residuals are also screwed rightly but acceptable. Try not using X2 in your model, and the model will be a better one.
I am a beginner in Machine Learning and if there are any mistakes or feedback you want to add please feel free to add.