I. Ozkan
Spring 2025
To reduce the variance of a base learner, \(\hat f(x)\)
Idea: given a set of \(n\) independent observations \(X_1, \dots, X_n\) each with variance \(\sigma^2\), the variance of the mean, \(\bar X\), of the observations is given by \(\sigma^2/n\)
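In symbols, using the independence of the observations:

\[
\mathrm{Var}(\bar X) = \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
= \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i)
= \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}
\]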
Averaging a set of observations reduces the variance
Let the base learner be trained on many training sets, giving \(\hat f_1(x), \hat f_2(x), \cdots, \hat f_B(x)\). Averaging these predictions reduces the variance of the resulting prediction, \(\hat f^*(x)\)
But there is only one training set
Answer: use the bootstrap to obtain \(B\) bootstrapped training sets
\(\hat f_{bag}(x)=\frac{1}{B} \sum_{b=1}^{B} \hat f^{*b}(x)\)
This can be applied directly to regression trees
Use the majority vote of the \(B\) trees for classification trees
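As a rough illustration of this averaging, here is a minimal hand-rolled bagging sketch in R, assuming the rpart package and a placeholder data frame `train` with a numeric response `y`:

```r
library(rpart)

# Fit B regression trees, each on a bootstrap sample of the training data
bag_trees <- function(train, B = 50) {
  lapply(seq_len(B), function(b) {
    boot <- train[sample(nrow(train), replace = TRUE), ]  # bootstrap sample
    rpart(y ~ ., data = boot)                             # one regression tree
  })
}

# Average the B predictions: f_bag(x) = (1/B) * sum_b f^b(x)
predict_bagged <- function(trees, newdata) {
  Reduce(`+`, lapply(trees, predict, newdata = newdata)) / length(trees)
}
```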
When to Use
Bagging does not improve the cubic regression fit, whose variance is already low, but it does improve and smooth out the decision tree model
Four bootstrapped sets are used to generate the following trees
All have a similar structure (using similar predictors)
 
Tree predictions from the four bootstrapped trees, with their average:

| medv (actual) | tree_1 | tree_2 | tree_3 | tree_4 | tree_mean |
|---|---|---|---|---|---|
| 24.0 | 23.88 | 24.63 | 27.30 | 31.63 | 26.86 | 
| 34.7 | 33.79 | 31.78 | 34.86 | 46.45 | 36.72 | 
| 28.7 | 23.88 | 24.63 | 21.79 | 22.04 | 23.08 | 
| 27.1 | 15.71 | 15.47 | 16.00 | 17.04 | 16.06 | 
| 16.5 | 15.71 | 15.47 | 16.00 | 17.04 | 16.06 | 
| 18.9 | 20.85 | 20.81 | 21.79 | 22.04 | 21.37 | 
With 50 bootstrap replications, the OOB RMSE is \(4.099\)
Training and test results for up to 100 bootstrap replications:
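These results could be reproduced along the following lines with the randomForest package, since bagging is a random forest with `mtry` equal to the number of predictors; the exact numbers depend on the random seed and on the train/test split, which are not given here:

```r
library(MASS)           # Boston housing data
library(randomForest)

set.seed(1)
# Bagging = random forest that considers all 13 predictors at every split
bag_fit <- randomForest(medv ~ ., data = Boston, mtry = 13, ntree = 100)

sqrt(bag_fit$mse[50])                     # OOB RMSE after 50 trees
plot(sqrt(bag_fit$mse), type = "l",
     xlab = "Number of trees", ylab = "OOB RMSE")  # error vs. replications
```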
A Partial Dependence Plot (PDP) is another way to understand these relationships
Let's construct PDPs for the four most influential predictors
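A sketch of the PDPs, reusing `bag_fit` and the Boston data from the sketch above; the four predictors are hard-coded to lstat, rm, dis, and nox, matching the importance ranking reported later in these notes, which is an assumption for this particular fit:

```r
library(randomForest)

# Partial dependence of medv on four influential predictors
op <- par(mfrow = c(2, 2))
partialPlot(bag_fit, pred.data = Boston, x.var = "lstat")
partialPlot(bag_fit, pred.data = Boston, x.var = "rm")
partialPlot(bag_fit, pred.data = Boston, x.var = "dis")
partialPlot(bag_fit, pred.data = Boston, x.var = "nox")
par(op)
```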
Bagging improves the prediction accuracy for high variance (and low bias) models
The use of bagging tends to reduce a model's interpretability: variable importance measures and Partial Dependence Plots help us interpret the bagged learner
Bagging increases the computational cost, but it is easily parallelizable since it consists of independent model fits
Although the sub-samples are drawn and the models are fitted independently, the fitted models are not completely independent of each other (bagging with trees results in correlated trees)
Random Forests enhance the performance of bagged decision trees by reducing the correlation among individual trees, which in turn increases the accuracy of the overall ensemble
Bagging is used to reduce the variance
The bootstrap samples result in correlated models, since sampling with replacement produces highly overlapping samples
Random forests provide an improvement over bagged trees by way of a small tweak that de-correlates the trees. This reduces the variance when we average the trees
When building these decision trees, each time a split in a tree is considered, a random selection of \(m\) predictors is chosen as split candidates from the full set of \(p\) predictors
Typically we choose the number of predictors considered at each split so that it is approximately equal to the square root of the total number of predictors \(m \approx \sqrt{p}\)
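A random forest sketch with the randomForest package, using \(m = 4 \approx \sqrt{13}\) predictors per split and 500 trees as in the results below (the seed is an arbitrary choice):

```r
library(MASS)
library(randomForest)

set.seed(1)
# Random forest: only m = 4 randomly chosen predictors are tried at each split
rf_fit <- randomForest(medv ~ ., data = Boston, mtry = 4, ntree = 500)
tail(rf_fit$mse, 1)   # OOB error (mean squared residuals) after 500 trees
```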
OOB Error, 500 trees with m = 4 variables: 11.2995 
 
OOB Error, 500 trees with m = 5 variables: 10.89118
 
 
OOB error for different numbers of variables tried at each split:

| mtry | OOB Error |
|---|---|
| 2 | 14.91960 | 
| 3 | 12.14668 | 
| 4 | 11.27344 | 
| 6 | 10.80146 | 
| 9 | 10.36396 | 
| 13 | 10.71012 | 
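The table above can be produced by refitting over a small set of `mtry` values, for example (reusing the Boston data loaded above; exact values depend on the seed):

```r
mtry_grid <- c(2, 3, 4, 6, 9, 13)

oob_error <- sapply(mtry_grid, function(m) {
  fit <- randomForest(medv ~ ., data = Boston, mtry = m, ntree = 500)
  tail(fit$mse, 1)                    # OOB error after 500 trees
})

data.frame(mtry = mtry_grid, OOB_Error = oob_error)
```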
Boosting is another general approach that can be applied to many statistical learning methods for regression or classification
Trees are built sequentially, each one fitted to the errors (residuals) of the previous model
Boosting learns slowly, using small trees and the shrinkage parameter \(\lambda\)
There are three tuning parameters: the number of trees \(B\) (a very large \(B\) may result in overfitting), the shrinkage \(\lambda\) (generally a small value such as \(0.01\) or \(0.001\)), and the number of splits in each tree, \(d\); often \(d=1\) (a stump) works well
Algorithm 8.2 (page 323 of ISLR)
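The following is a minimal from-scratch sketch of the idea in Algorithm 8.2, fitting small rpart trees to the residuals and adding them with shrinkage; `train` and its response column `y` are placeholders, and `maxdepth = d` stands in for the generic d-split trees:

```r
library(rpart)

boost_sketch <- function(train, B = 1000, lambda = 0.01, d = 1) {
  r <- train$y                               # 1. residuals start as the response
  trees <- vector("list", B)
  for (b in seq_len(B)) {
    # 2(a). fit a small tree to the current residuals
    fit <- rpart(r ~ . - y, data = transform(train, r = r),
                 control = rpart.control(maxdepth = d, cp = 0))
    trees[[b]] <- fit
    # 2(c). update the residuals with the shrunken tree
    r <- r - lambda * predict(fit, newdata = train)
  }
  list(trees = trees, lambda = lambda)       # 3. boosted model: sum of shrunken trees
}

predict_boost <- function(model, newdata) {
  model$lambda * Reduce(`+`, lapply(model$trees, predict, newdata = newdata))
}
```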
Simulated data: \(y = \sin(x) + \varepsilon, \; \varepsilon \sim N(0, 0.3^2)\)
Number of trees \(B=1000\)
Shrinkage, \(\lambda=0.1\)
Tree depth (number of splits in each tree), \(d=1\) (stump)
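A sketch of this simulated example with the gbm package, using the three parameters above; the sample size and the range of \(x\) are assumptions, since only the noise level is given:

```r
library(gbm)

set.seed(1)
n   <- 200
x   <- runif(n, 0, 2 * pi)
y   <- sin(x) + rnorm(n, sd = 0.3)
sim <- data.frame(x = x, y = y)

boost_sim <- gbm(y ~ x, data = sim, distribution = "gaussian",
                 n.trees = 1000, shrinkage = 0.1, interaction.depth = 1)

# How the fit evolves: predictions after 1, 10, 100, and 1000 trees
grid  <- data.frame(x = seq(0, 2 * pi, length.out = 200))
preds <- predict(boost_sim, newdata = grid, n.trees = c(1, 10, 100, 1000))

plot(x, y, col = "grey")
matlines(grid$x, preds, lty = 1)
```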
The Boston data are split into training and test sets
A model is fitted using the following parameters:
It is possible to tune these parameters
First, we need to create a grid that contains all candidate parameter values (a sketch follows the table below)
Parameter grid (only the first ten rows are shown):

| Lambda | Tree Depth | Min. Obs. | Bag Fraction |
|---|---|---|---|
| 0.01 | 1 | 5 | 0.65 | 
| 0.05 | 1 | 5 | 0.65 | 
| 0.10 | 1 | 5 | 0.65 | 
| 0.01 | 3 | 5 | 0.65 | 
| 0.05 | 3 | 5 | 0.65 | 
| 0.10 | 3 | 5 | 0.65 | 
| 0.01 | 5 | 5 | 0.65 | 
| 0.05 | 5 | 5 | 0.65 | 
| 0.10 | 5 | 5 | 0.65 | 
| 0.01 | 1 | 10 | 0.65 | 
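A grid like this can be built with expand.grid; the candidate values below are read off the grid and results tables:

```r
# Candidate values for each gbm tuning parameter
hyper_grid <- expand.grid(
  shrinkage         = c(0.01, 0.05, 0.10),   # lambda
  interaction.depth = c(1, 3, 5),            # tree depth
  n.minobsinnode    = c(5, 10, 15),          # min. observations per node
  bag.fraction      = c(0.65, 0.80, 1.00)    # subsampling fraction
)

nrow(hyper_grid)      # 81 parameter combinations
head(hyper_grid, 10)  # first ten rows, as in the table above
```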
Parameter grid results (only the first ten rows are shown, sorted by minimum RMSE):

| Lambda | Tree Depth | Min. Obs. | Bag Fraction | Optimal Trees | Min. RMSE |
|---|---|---|---|---|---|
| 0.10 | 3 | 10 | 0.80 | 519 | 3.466070 | 
| 0.05 | 3 | 10 | 0.80 | 928 | 3.549038 | 
| 0.10 | 3 | 15 | 1.00 | 516 | 3.570331 | 
| 0.05 | 3 | 10 | 1.00 | 579 | 3.603082 | 
| 0.10 | 5 | 10 | 0.80 | 273 | 3.604136 | 
| 0.05 | 3 | 10 | 0.65 | 818 | 3.617805 | 
| 0.05 | 5 | 10 | 0.80 | 972 | 3.628066 | 
| 0.05 | 3 | 15 | 1.00 | 763 | 3.630852 | 
| 0.10 | 3 | 10 | 1.00 | 271 | 3.650064 | 
| 0.10 | 3 | 15 | 0.80 | 387 | 3.687217 | 
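A sketch of the search over this grid, assuming each combination is fitted with gbm on the training set and 5-fold cross-validation picks the optimal number of trees (the notes do not state the exact validation scheme, so the numbers will differ):

```r
library(MASS)
library(gbm)

set.seed(1)
train_idx <- sample(nrow(Boston), round(0.7 * nrow(Boston)))  # assumed 70/30 split
train     <- Boston[train_idx, ]

results <- lapply(seq_len(nrow(hyper_grid)), function(i) {    # hyper_grid from above
  fit <- gbm(medv ~ ., data = train, distribution = "gaussian",
             n.trees = 1000,
             shrinkage         = hyper_grid$shrinkage[i],
             interaction.depth = hyper_grid$interaction.depth[i],
             n.minobsinnode    = hyper_grid$n.minobsinnode[i],
             bag.fraction      = hyper_grid$bag.fraction[i],
             cv.folds = 5, verbose = FALSE)
  best <- which.min(fit$cv.error)       # number of trees with lowest CV error
  data.frame(hyper_grid[i, ],
             optimal_trees = best,
             min_RMSE      = sqrt(fit$cv.error[best]))
})
results <- do.call(rbind, results)

head(results[order(results$min_RMSE), ], 10)   # ten best combinations
```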
Relative variable importance (all predictors, most influential first):

| Variable | Rel. Influence |
|---|---|
| lstat | 43.22172095 | 
| rm | 31.67702890 | 
| dis | 6.89444926 | 
| nox | 5.24821280 | 
| ptratio | 3.74173961 | 
| crim | 3.28965497 | 
| age | 2.53138349 | 
| black | 0.98147906 | 
| tax | 0.91887410 | 
| chas | 0.57786684 | 
| rad | 0.52248694 | 
| indus | 0.33829170 | 
| zn | 0.05681139 |
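Relative influence values like these come from summary() on a fitted gbm object; as a sketch, the best combination from the results table is refitted here, reusing `train` from the sketch above:

```r
library(gbm)

set.seed(1)
best_fit <- gbm(medv ~ ., data = train, distribution = "gaussian",
                n.trees = 519, shrinkage = 0.10, interaction.depth = 3,
                n.minobsinnode = 10, bag.fraction = 0.80)

# Returns (and optionally plots) the relative influence of each predictor
summary(best_fit, n.trees = 519, plotit = FALSE)
```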