Bootstrapping & Confidence Intervals in Machine Learning

Aamir Ahmad Ansari
5 min read · Oct 8, 2021


‘How you doin’?’ a legend once said, and now I ask you the same. I hope everyone’s doing well.

Today we will be talking about bootstrapping and confidence intervals by covering the following topics:

  1. Introduction to Bootstrapping
  2. What Are Confidence Intervals?
  3. Implementation in Python

Let’s get started!

1. Introduction to Bootstrapping

Bootstrapping in machine learning is a resampling technique that estimates the statistical properties of the population from a dataset by sampling it with replacement. “With replacement” means that when you draw an observation from the original data into a new sample, a copy of it stays behind in the original dataset, so the same observation can be drawn again.
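To make “with replacement” concrete, here is a minimal sketch of my own using np.random.choice; notice that the same observation can appear more than once in the resample, while others are left out entirely:

import numpy as np

data = np.array([10, 20, 30, 40, 50])
# Draw a resample of the same size, with replacement:
# some values may repeat, others may not appear at all.
resample = np.random.choice(data, size=len(data), replace=True)
print(resample)  # e.g. [20 20 50 10 40]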

1.1 How does bootstrapping work?

Bootstrapping is an iterative process where we:

  1. Select the number of times we want to perform it (the number of bootstrap iterations).
  2. Select our sample size, m.
  3. For each iteration:

a) Create a new dataset of size m by drawing random samples with replacement.

b) Fit the model on the new dataset.

c) Evaluate it on the data not selected for the new dataset (the out-of-bag data) or on the whole original dataset (which we will do in our example).

4. Calculate central tendencies and confidence intervals over the collected estimates (see the sketch after this list).
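As a warm-up before the regression example, here is a minimal sketch of these four steps for a simple statistic, the mean (my own illustration, not code from the implementation below):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5, scale=2, size=200)  # some observed data

boot = 500     # step 1: number of bootstrap iterations
m = len(data)  # step 2: sample size, here the size of the data

means = []
for _ in range(boot):
    # step 3a: draw a random sample of size m with replacement
    sample = rng.choice(data, size=m, replace=True)
    # step 3b/c: "fit" and "estimate" -- for the mean, just compute it
    means.append(sample.mean())

# step 4: central tendency and spread of the bootstrap estimates
print(np.mean(means), np.std(means))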

Yes, I know we haven’t talked about confidence intervals yet, so let’s do that now.

2. Confidence Intervals

A confidence interval (CI), in machine learning and statistics, is an estimated range that, with a given probability (the confidence level), contains a statistical property of the population. For normally distributed data, CIs can be expressed in units of standard deviations from the mean. For a better understanding, let’s assume our data comes from a normal distribution and inspect the following illustration:

The illustration above, a normal distribution, depicts the three-sigma rule: roughly 68% of the data lies within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations. We can also say that 95% of our data here will lie in the range (mean − 2·std, mean + 2·std); hence the 95% confidence interval is (mean − 2·std, mean + 2·std). For other distributions we can likewise find these confidence intervals, and the implementation will show how: we simply take percentiles of the bootstrap estimates.
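For example, here is a quick sketch of my own (not from the original post) showing both routes to a 95% CI: the two-standard-deviation rule for normal data, and the distribution-free percentile method we will use below:

import numpy as np

rng = np.random.default_rng(1)
values = rng.normal(loc=0, scale=1, size=10000)

# Normal approximation: mean +/- 2 standard deviations
ci_normal = (values.mean() - 2 * values.std(),
             values.mean() + 2 * values.std())

# Percentile method: works for any distribution of values
ci_percentile = (np.percentile(values, 2.5), np.percentile(values, 97.5))

print(ci_normal, ci_percentile)  # the two agree closely for normal data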

Now we will see how we use bootstrapping and confidence intervals side by side.

3. Implementation in Python

We will generate our own dataset today and run simple linear regression inside our bootstrap. We will estimate the 95% CI of our beta1 and check whether the true beta1 lies in that interval, and we will also calculate the 95% CI of our predictions.

How about importing our libraries first:

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

Bravo! Let’s quickly generate our dataset and have a look at it:

# Let's create a dataset and have a look at it
X = np.linspace(0, 1, 100)
Y = 3*X + 2                               # true relationship: beta1 = 3, beta0 = 2
Y = Y + np.random.normal(0, 1, Y.shape)   # add Gaussian noise
X = X.reshape(-1, 1)                      # sklearn expects a 2D feature array
plt.scatter(X, Y)
plt.xlabel('X', fontsize=15)
plt.ylabel('Y', fontsize=15)
plt.title('Relationship b/w X and Y', fontsize=17)
plt.show()

The noise really made our signal go wild XD!

So what are we waiting for? Let’s bootstrap. What do we do first? Choose the number of bootstraps, correct?

#let's select the number of bootstraps we want
boot=500

We will use df.sample() to generate our samples, so first we have to convert the NumPy arrays into a DataFrame.

#Create a bootstrap
df=pd.DataFrame(X,columns=['X'])
df['Y']=Y

Let’s start bootstrapping then :D,

# Iteration begins
betas = []
preds = []
for i in range(boot):
    # Random sample with replacement, the same size as the original dataset
    df1 = df.sample(len(df), replace=True)
    lr = LinearRegression()
    model = lr.fit(df1[['X']], df1['Y'])
    y_pred = model.predict(X)
    plt.plot(X, y_pred, color='red', alpha=0.1)
    betas.append(model.coef_[0])
    preds.append(y_pred)
plt.scatter(X, Y, label='Original')
plt.legend()
plt.title('Bootstrapping Estimations', fontsize=15, color='green')
plt.show()

Note that we have plotted a regression line for each fit, i.e. for each iteration. Do you notice the variation in the predictions? We are using bootstraps to estimate where our predictions may lie most of the time.

Now that we have stored the betas and predictions from each iteration, we will look at the distribution of beta1.
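The original post shows this histogram as an image; a short sketch like the following reproduces it, assuming the betas list collected in the loop above:

# Distribution of the bootstrapped beta1 estimates
plt.hist(betas, bins=30)
plt.xlabel('beta1', fontsize=15)
plt.ylabel('Frequency', fontsize=15)
plt.title('Distribution of beta1', fontsize=17)
plt.show()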

You can see the mean is around 2.4, which is a bit off from our original beta1 of 3. Let’s see whether our CI contains the original value or not.

# Calculating the 95% confidence interval for beta1
CI = (np.percentile(betas, 2.5), np.percentile(betas, 97.5))
print(CI)

Wooh, the original value of beta1 is in the CI! So what does this interval mean? It means: we are 95% confident that the optimal beta1 value lies in this range.

Can I interest you in an illustration of the distribution with the CI marked on it? Let’s have it while you decide.
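The original illustration is an image; here is a sketch of my own that reproduces it, reusing the CI computed above:

# Distribution of beta1 with the 95% CI bounds marked
plt.hist(betas, bins=30)
plt.axvline(CI[0], color='red', linestyle='--', label='2.5th percentile')
plt.axvline(CI[1], color='red', linestyle='--', label='97.5th percentile')
plt.legend()
plt.title('Distribution of beta1 with 95% CI', fontsize=15)
plt.show()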

Cool right! You can do it too!

Now let’s visualize the 95% CI of our predictions.

# 95% confidence interval of the predictions
fig, ax = plt.subplots(figsize=(6, 6))
preds = np.array(preds)
mean = preds.mean(axis=0)   # mean prediction at each X across bootstraps
plt.plot(X, mean, label='Mean of bootstrap predictions', color='blue')
plt.xlabel('X', fontsize=15)
plt.ylabel('Y', fontsize=15)
plt.title('95% Confidence Interval of Predictions', fontsize=17, color='red')
# Shade mean +/- 2 standard deviations at each X
plt.fill_between(X.flatten(),
                 mean - 2*preds.std(axis=0),
                 mean + 2*preds.std(axis=0),
                 alpha=0.2)
plt.legend(fontsize='small')
plt.scatter(X, Y)
plt.tight_layout()
plt.show()

Remember we discussed that 95% of values lie within two standard deviations of the mean; that is exactly the band we shaded with fill_between. We flatten X because fill_between expects a 1D array.

You can find the code here on GitHub.

I hope this was helpful. Any doubts or improvements? Please feel free to use the comment box!
