Logistic Regression in Machine Learning

Aamir Ahmad Ansari
8 min read · Oct 12, 2021

Hello, today we will be discussing logistic regression. These are the topics we will cover:

1. Introduction to Logistic Regression

2. Detailed Implementation in Python

Let’s get started!!!!

Updated Code for T-test on 13/10/2021 at 12:44 pm

1. Introduction to Logistic Regression

1.1 Introduction

Logistic regression is a regression technique in machine learning that is used for binary (0/1) classification. You can visualise it as a combinational digital logic circuit that combines different logic operators (think of them as predictors) but produces a binary (0/1) output. Logistic regression is widely used when the response variable to be predicted has two classes, for example classifying a person as male or female, or classifying images of pets as cats or dogs. Notice that here we can label cats as 0 and dogs as 1.

Logistic regression calculates the conditional probability P(Yi | X=Xi), where Yi is the response (0 or 1) and Xi is the predictor vector for the ith observation; by default the probability is calculated for Yi=1. How does it do that? It uses the sigmoid function, whose output lies in (0, 1), which makes it a natural fit for mapping probabilities, since the probability of any event also lies between 0 and 1. But it was supposed to be a classifier, so what's with the probabilities?

Logistic regression classifies our observations by comparing their probabilities to a threshold or critical value: if P(Yi=1 | X=Xi) ≥ critical value, the predicted response is 1, else 0.

The statistical setup is more or less similar to linear regression: a weight is associated with each predictor, there is a bias term, and the assumptions are the same as in linear regression, except that we no longer assume a linear relationship between the predictors and the response. Let's have a look at the function that we use for logistic regression, i.e. the sigmoid function.

Sigmoid function for logistic regression: P(Yi=1 | X=Xi) = e^(beta0 + beta1*Xi) / (1 + e^(beta0 + beta1*Xi))

The above equation is written for a single predictor; you can add more predictors just as in linear regression.
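
To make this concrete, here is a minimal NumPy sketch (the coefficients and predictor values below are made up purely for illustration) of how the sigmoid maps a linear combination of predictors to a probability and how the critical value of 0.5 turns that probability into a class label:

import numpy as np

def sigmoid(z):
    # maps any real number into the (0, 1) interval
    return 1 / (1 + np.exp(-z))

# made-up coefficients and predictor values, just for illustration
beta0, beta1 = -1.5, 0.8
x = np.array([0.0, 1.0, 2.0, 3.0])

p = sigmoid(beta0 + beta1 * x)    # P(Yi=1 | X=Xi)
y_hat = (p >= 0.5).astype(int)    # compare to the critical value 0.5
print(p)                          # approximately [0.18 0.33 0.52 0.71]
print(y_hat)                      # [0 0 1 1]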

1.2 Interpretation of Beta Values

The interpretation of beta values is different from linear regression, where a unit increase in a predictor increases the response by the corresponding beta value. To understand the interpretation better, let's first rearrange the equation above and take the log; the result is the log-odds:

log( P(Yi=1 | X=Xi) / (1 - P(Yi=1 | X=Xi)) ) = beta0 + beta1*Xi

Clearly, from the above you can see the log-odds are linearly related to our predictors, and a unit increase in a predictor multiplies the odds by the exponential of the corresponding beta value, e^beta. Note that the change in P(Yi=1 | X=Xi) per unit change in X is not constant; it depends on the current value of X. The sign of the coefficient, in this case beta1, also plays a role: if the coefficient is positive, increasing X increases the probability, and if it is negative, increasing X decreases the probability.
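
As a quick worked example (the coefficient value here is made up), the odds multiplier for a one-unit increase in a predictor is just the exponential of its coefficient:

import numpy as np

beta1 = 0.8                  # made-up coefficient, for illustration only
odds_ratio = np.exp(beta1)   # odds of Y=1 are multiplied by this per unit increase in X
print(round(odds_ratio, 3))  # ~2.226, i.e. the odds roughly double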

2. Detailed Implementation in Python

This is a detailed implementation of logistic regression in Python. We will start by exploring our data and then build a model. The dataset is the Heart Disease dataset, which can be downloaded by clicking here. Go on, download the dataset and let's do this exercise together. We will do this exercise in two parts:

a) Exploration and Feature Selection

b) Model Development

The link for the whole code will be provided soon at the end of the article.

Before moving on we shall first get all the libraries we will need, yeah?

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split
%matplotlib inline

Let’s explore these little things XD.

2.1 Exploration and Feature Selection

To explore a dataset we will, of course, need a dataset, so let's load it.

df=pd.read_csv('A:/Datasets/heart.csv')
df.head()
Heart Disease DataFrame

Nice! Remember the target column is the response variable.

a) The next step is to find which variables are categorical and which ones are continuous. The good news is we do not have to do anything fancy: we will use the nunique() function of pandas DataFrames to count the unique values in each column, and the columns with 5 or fewer unique values will be treated as categorical and the rest as continuous.

#Find Categorical Data, I am considering Data having 5 or less than 5 classes as categorical
print(df.nunique())
print('The length of data is',len(df))
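
If you prefer to build these two lists programmatically instead of reading them off the output, a small sketch using the same 5-unique-values rule could look like this:

# 5 or fewer unique values -> treat as categorical, otherwise continuous
cat_cols = [c for c in df.columns if df[c].nunique() <= 5]
cont_cols = [c for c in df.columns if df[c].nunique() > 5]
print('Categorical:', cat_cols)
print('Continuous:', cont_cols)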

We deduce from the above that sex, cp, fbs, restecg, exang, slope, ca, thal and target are categorical, and age, trestbps, chol, thalach and oldpeak are continuous.

b) Let's move ahead now by finding NaN and null values in our dataset. We can do that with the pandas isna() and isnull() functions.

#Find null and na values in data set 
print('The nan values in dataset are \n',df.isna().sum())
print('The null values in dataset are \n',df.isnull().sum())

You'll notice the same output in both cases: none of the columns contains NaN or null values. That is a bit of a shame, as I couldn't show you how to deal with them, but when we deal with outliers you will get the idea.

c) Time to find the Outliers

Outliers are unusual values in your dataset that break the overall pattern of the data; for normally distributed data, think of values 2 or 3 standard deviations away from the mean. These values often lead to incorrect coefficient estimates and reduced model accuracy.

We’ll find the outliers first visually by plotting the box plots for all the continuous variables. Let’s first plot the box plot and then I’ll explain how to understand it.

#boxplot for continious variables
t=plt.boxplot(df[['age', 'trestbps', 'chol', 'thalach', 'oldpeak']])

A box plot draws a box whose length is the Inter Quartile Range (IQR, the 75th percentile minus the 25th percentile) of the sorted values. The whisker extremes are calculated as the lower fence Q1 (1st quartile, or 25th percentile) - 1.5*IQR and the upper fence Q3 (3rd quartile, or 75th percentile) + 1.5*IQR. Anything beyond these extreme points is an outlier, represented by the bubbles in the plot.
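
For reference, here is a small sketch of the same IQR rule done numerically, so you can see exactly which fences the boxplot is drawing (using chol as the example column):

# IQR fences for one column, matching what the boxplot whiskers show
q1, q3 = np.percentile(df['chol'], [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df['chol'][(df['chol'] < lower) | (df['chol'] > upper)]
print('Fences:', lower, upper, '| outliers found:', len(outliers))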

d) Deal with the outliers. It's not good practice to simply delete outliers from your dataset, as that reduces the sample size, so what do we do? We replace them with a standard value. We'll replace all values greater than the 90th percentile of the data with the 90th percentile, and the ones less than the 10th percentile with the 10th percentile.

cont_cols=['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
df_cont=df[cont_cols].values
#Cap outliers at the 10th and 90th percentiles, one continuous column at a time
for i in range(5):
    tenth_percentile = np.percentile(df_cont[:,i], 10)
    ninetieth_percentile = np.percentile(df_cont[:,i], 90)
    df_cont[:,i] = np.where(df_cont[:,i]<tenth_percentile, tenth_percentile, df_cont[:,i])
    df_cont[:,i] = np.where(df_cont[:,i]>ninetieth_percentile, ninetieth_percentile, df_cont[:,i])
#Keep a DataFrame version of the treated continuous features for later steps
df1=pd.DataFrame(df_cont,columns=cont_cols)

Great! We took care of the outliers. Let’s check if we did that right.

#boxplot for continious variables
fig,ax=plt.subplots(figsize=(10,10))
t=plt.boxplot(df_cont,labels=['age', 'trestbps', 'chol', 'thalach', 'oldpeak'])
plt.title('Dataset without Outliers',fontsize=20)
plt.xlabel('Predictor',fontsize=18)
plt.ylabel('Predictors values',fontsize=18)
plt.savefig('S:/AI&ML/DatasetwithoutOutliers.jpg')

Not a single one left, well done Avengers!!! Enough exploration for now, let's move on to feature selection.

e) T-test (between a continuous feature and the categorical response) & Chi2 score (between a categorical feature and the categorical response)

The t-test is a statistical test that compares the means of two samples to tell whether the samples are similar. A large t-statistic for a feature, computed between the two response classes, indicates that the feature's distribution differs across the classes and that the feature should be considered for prediction.

Chi2 scores are calculated to check the independence of two variables. For both statistics we take the null hypothesis to be that the response variable does not depend on the feature/predictor, and the alternate hypothesis to be that the response variable does depend on the feature/predictor.

Let’s get this party going!!

#Feature Selection using the T-statistic (two-sample t-test between the target classes)
df_1=df[df['target']==1][['age', 'trestbps', 'chol', 'thalach', 'oldpeak']]
df_0=df[df['target']==0][['age', 'trestbps', 'chol', 'thalach', 'oldpeak']]
mean_1=np.mean(df_1,axis=0)
mean_0=np.mean(df_0,axis=0)
#squared standard errors: variance divided by the sample size of each class
se2_0=np.std(df_0,axis=0)**2/len(df_0)
se2_1=np.std(df_1,axis=0)**2/len(df_1)
t_val=(mean_1-mean_0)/(se2_0+se2_1)**0.5

Note: feature importance decreases as the absolute value of the t-statistic decreases.

Let’s get the p-values

def get_pvalues(tt,n):
    #two-sided p-value from a t-distribution with n-1 degrees of freedom
    p_val=stats.t.sf(abs(tt),n-1)*2
    return p_val
p_val=get_pvalues(t_val,len(df1))
confidence=1-p_val
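
If you want to cross-check the hand-rolled statistics and p-values, scipy ships a ready-made two-sample t-test; the sketch below uses Welch's version (equal_var=False), so its numbers may differ slightly from the manual calculation above:

# cross-check with scipy's two-sample t-test (Welch's version)
for col in df_1.columns:
    t_stat, p = stats.ttest_ind(df_1[col], df_0[col], equal_var=False)
    print(col, 't =', round(t_stat, 3), 'p =', round(p, 4))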

Remember the critical value for p-values is 0.05, which corresponds to 0.95 for 1 - p-value. So if any feature has 1 - p-value (confidence) ≥ 0.95, the null hypothesis is rejected and the feature is selected.

fig,ax=plt.subplots(figsize=(10,10))
plt.barh(df1.columns,confidence,label='1-p_values',color='orange')
plt.axvline(0.95,color='blue',linestyle='--',label='95% Chance')
plt.legend()

Wooh, very competitive: every continuous feature rejects the null hypothesis and is selected.
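
The same selection rule can also be applied in code rather than read off the bar chart; a tiny sketch using the confidence array computed above could be:

# keep continuous features whose confidence (1 - p-value) is at least 0.95
selected_cont = [col for col, c in zip(df1.columns, confidence) if c >= 0.95]
print(selected_cont)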

Next, for the categorical variables, the Chi2 score. Note that the chi2 function in sklearn returns the p-values too, so there is no need to calculate them ourselves.

df_cat=df.drop(df1.columns,axis=1)
df_cat.drop('target',axis=1,inplace=True)
chi2_scores=chi2(df_cat,df['target'])
###### Sorting and maintaining labels for graph plotting ######
chi_zip=zip(df_cat.columns,chi2_scores[0])
chi_zip=sorted(chi_zip,key=lambda x:x[1])
chi_p_zip=zip(df_cat.columns,chi2_scores[1])
chi_p_zip=sorted(chi_p_zip,key=lambda x:-x[1])
cols,chi2_val=zip(*chi_zip)
cols1,chi2_p_val=zip(*chi_p_zip)
chi2_p_val=list(map(lambda x:1-x,chi2_p_val))
plt.barh(cols,chi2_val,label='CHI2 Scores',color='orange')
# plt.axvline(0.95,color='blue',linestyle='--',label='95% Chance')
plt.legend()

These were the Chi2 scores; now let's look at the p-values.

We notice here that every feature other than restecg and fbs rejects the null hypothesis, so we'll drop restecg and fbs and keep the rest in our model.
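
If you'd rather select these programmatically than read them off the plot, a short sketch that keeps only the categorical features whose chi2 p-value is below 0.05 (which, per the plot above, should drop fbs and restecg) could be:

# keep categorical features whose chi2 p-value rejects the null at the 5% level
chi2_stats, p_values = chi2(df_cat, df['target'])
keep_cat = [col for col, p in zip(df_cat.columns, p_values) if p < 0.05]
drop_cat = [col for col, p in zip(df_cat.columns, p_values) if p >= 0.05]
print('Keep:', keep_cat)
print('Drop:', drop_cat)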

2.2 Model Development

Now that we have explored, corrected and chosen our data, we will build our model.

We start by merging the selected continuous and categorical variables into our feature matrix X and assigning the target variable to y.

df_cat.drop(['fbs','restecg'],inplace=True,axis=1)
X=pd.concat([df1,df_cat],axis=1)
y=df['target']

Yes, the data split

x_train,x_val,y_train,y_val=train_test_split(X,y,train_size=0.8)

Here comes the model,

lr=LogisticRegression(C=1000,max_iter=5000,random_state=1)
model=lr.fit(x_train,y_train)
y_train_pred=model.predict(x_train)
y_val_pred=model.predict(x_val)
print(accuracy_score(y_train,y_train_pred))
print(accuracy_score(y_val,y_val_pred))

C in logistic regression is 1/lambda, the inverse of the regularization strength (lambda). What this means is: if you give C a large value, you are saying you trust that your dataset is a faithful reflection of real-world data, so the regularization is weak and the regression coefficients can be large; if you assign a small value to C, you are less confident that your data mimics the real world, lambda is large, and the regression coefficients are shrunk towards zero. max_iter is the maximum number of iterations the solver behind the scenes is allowed to take to converge.
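
To see the effect of C for yourself, here is a quick sketch (reusing the same training split as above) that compares the total size of the learned coefficients for a large and a small C:

# larger C -> weaker regularization -> typically larger coefficients
for c in [1000, 0.01]:
    m = LogisticRegression(C=c, max_iter=5000, random_state=1).fit(x_train, y_train)
    print('C =', c, '| sum of |coefficients| =', round(np.abs(m.coef_).sum(), 3))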

Find out why random_state is being used in logistic regression, and if you can't, ask me in the comments.

Just FYI, train accuracy: 85.5% and validation accuracy: 83.6%. A pretty good model, isn't it?

Thank you, that's it for today. I hope you found it useful.

