Ensemble Learning — Part 1 [Bagging & Random Forest]

Aamir Ahmad Ansari
4 min read · May 2, 2022

Introduction

A finger may not be strong on its own, but a fist is. Ensemble learning takes advantage of this simple fact: rather than a single estimator, it uses many of them. It addresses a key shortcoming of a decision tree: as the tree grows deeper, the bias decreases, but the tree starts to overfit the data and the model variance increases. Model variance is the variance in a model's predictions when, with all other hyper-parameters fixed, the model is trained on different subsets of the training data. Some hyper-parameter tuning can help us strike the bias-variance trade-off and avoid over-fitting. Today, we will learn how to reduce the variance without increasing the bias, which is exactly what ensemble learning does.

Motivation

Ensemble learning, in principle, takes n estimators and averages their outputs to produce the final prediction. For example, suppose you have 5 random variables that are independent of one another, each with mean 0 and variance 1 (sampled from a standard normal distribution), and you average them. What would the mean and variance of this average be? The mean of the average is the average of the 5 means, which is still 0, and the Bienaymé formula gives the variance as:

$$\mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \rho\,\sigma^2 + \frac{1-\rho}{n}\,\sigma^2$$

where ρ reflects the dependence between the random variables and σ² is their common variance. For independent variables ρ = 0, and for correlated variables whose pairwise correlation is bounded by some value k, ρ is also bounded by k [1]. In our example the variables are independent, so ρ = 0 and the variance of the average is σ²/n = 1/5 = 0.2. We will build on this idea and cover two important tree-based ensemble methods in this article: bagging and random forests.
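
The example above is easy to verify numerically. Below is a minimal NumPy sketch (the sample size and seed are arbitrary choices of mine) that averages five independent standard-normal variables and confirms the variance drops from 1 to roughly 1/5 = 0.2, the ρ = 0 case of the formula above:

```python
import numpy as np

rng = np.random.default_rng(0)

# 5 independent standard-normal variables (mean 0, variance 1),
# each drawn many times so we can estimate variances empirically.
n_vars, n_draws = 5, 100_000
X = rng.standard_normal((n_draws, n_vars))

# Average the 5 variables for every draw.
averages = X.mean(axis=1)

print(f"variance of a single variable: {X[:, 0].var():.3f}")   # ~1.0
print(f"variance of the average:       {averages.var():.3f}")  # ~0.2 = 1/5
```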

Bagging Algorithm

Bagging is an ensemble learning technique that reduces model variance by building n trees on bootstrapped samples of the training set. Bagging models are more reliable than a single decision tree. The bagging algorithm is quite simple, yet elegant. Follow the steps below to perform bagging (a minimal code sketch follows the list):

  1. Hyper-parameter tune your base estimator, the decision tree.
  2. Create n bootstrap samples of your training set, each of the same size, sampled with replacement.
  3. Fit the tree from step 1 on each of the n bootstrapped samples.
  4. Generate the prediction for a sample by taking the majority vote across all n estimators in a classification problem, or the mean of the n outputs in a regression problem.
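
To make these steps concrete, here is a minimal sketch of the bagging loop using scikit-learn decision trees on a synthetic dataset; the dataset, tree depth, and number of estimators are illustrative assumptions, not prescriptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy binary-classification data as a stand-in for your own training set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(42)
n_estimators = 50
trees = []

# Steps 2-3: bootstrap the training set and fit one tree per bootstrap sample.
for _ in range(n_estimators):
    idx = rng.integers(0, len(X_train), size=len(X_train))  # sample with replacement
    tree = DecisionTreeClassifier(max_depth=5)  # step 1: a (pre-tuned) base tree
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 4: majority vote across the n trees (binary labels, so mean > 0.5 works).
all_preds = np.stack([tree.predict(X_test) for tree in trees])
y_pred = (all_preds.mean(axis=0) > 0.5).astype(int)

print("bagged accuracy:", accuracy_score(y_test, y_pred))
```

In practice, scikit-learn's BaggingClassifier and BaggingRegressor wrap this same loop, so you rarely need to write it by hand.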

A bagging model uses many decision trees to build an ensemble. At every split, each of these trees has access to every feature in the training set when choosing the optimal feature and its corresponding threshold. It is therefore very likely that the trees split on similar features and that their final predictions are correlated, so the variance does not decrease by a factor of n. As the formula above shows, the higher the correlation, the smaller the decrease in variance. To counter this problem, we use the random forest algorithm.

Random Forest Algorithm

The random forest algorithm is like the bagging algorithm with an additional source of randomness: at each split, only a randomly chosen subset of features is considered. Every tree, at every split, therefore sees a different random subset of features. This extra randomness decreases the correlation between the trees, which yields a greater decrease in the overall variance. The random forest algorithm is as follows (a code sketch follows below):

  1. Hyper-parameter tune your base estimator, the decision tree.
  2. Create n bootstrap samples of your training set, each of the same size, sampled with replacement.
  3. Fit the tree from step 1 on each of the n bootstrapped samples, offering only a randomly chosen subset of features at each split.
  4. Generate the prediction for a sample by taking the majority vote across all n estimators in a classification problem, or the mean of the n outputs in a regression problem.

A common choice among practitioners for the size of this random subset is N/3 features for regression and square root (N) features for classification, where N is the total number of features in the dataset.
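
Putting the algorithm and the feature-subset heuristic together, here is a minimal scikit-learn sketch; the dataset and hyper-parameter values are illustrative assumptions on my part:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# max_features="sqrt" applies the square root (N) heuristic for classification;
# for regression, RandomForestRegressor with max_features around N/3
# (e.g. max_features=0.33) plays the same role.
forest = RandomForestClassifier(
    n_estimators=200,       # number of bootstrapped trees
    max_features="sqrt",    # random subset of features considered at each split
    random_state=42,
)
forest.fit(X_train, y_train)

print("random forest accuracy:", accuracy_score(y_test, forest.predict(X_test)))
```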

This article was meant to give you a basic intuition for these tree-based ensemble methods. In the next articles, we'll see how to implement these algorithms, which validation approaches to use, how to calculate feature importances, and some model interpretation techniques.
