I. Ozkan
Spring 2025
To reduce the variance of a base learner, \(\hat f(x)\)
Idea: given a set of \(n\) independent observations \(X_1, \dots, X_n\) each with variance \(\sigma^2\), the variance of the mean, \(\bar X\), of the observations is given by \(\sigma^2/n\)
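In symbols, using the independence of the observations:

\[
\mathrm{Var}(\bar X) = \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
= \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i)
= \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}
\]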
Averaging a set of observations reduces the variance
Let the base learner be trained on many training sets, giving \(\hat f_1(x), \hat f_2(x), \cdots, \hat f_B(x)\). Averaging these predictions reduces the variance of the resulting prediction, \(\hat f^*(x)\)
But there is only one training set
Answer: use the bootstrap to obtain \(B\) bootstrapped training sets
\(\hat f_{bag}(x)=\frac{1}{B} \sum_{b=1}^{B} \hat f^{*b}(x)\)
This can be applied directly to regression trees
Use the majority vote of the \(B\) trees for classification trees
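As a rough illustration of this averaging, here is a minimal hand-rolled bagging sketch in R, assuming the rpart package and a placeholder data frame `train` with a numeric response `y`:

```r
library(rpart)

# Fit B regression trees, each on a bootstrap sample of the training data
bag_trees <- function(train, B = 50) {
  lapply(seq_len(B), function(b) {
    boot <- train[sample(nrow(train), replace = TRUE), ]  # bootstrap sample
    rpart(y ~ ., data = boot)                             # one regression tree
  })
}

# Average the B predictions: f_bag(x) = (1/B) * sum_b f^b(x)
predict_bagged <- function(trees, newdata) {
  Reduce(`+`, lapply(trees, predict, newdata = newdata)) / length(trees)
}
```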
When to Use
Bagging does not improve the cubic regression fit, whose variance is already low, but it does improve and smooth out the decision tree model
Four bootstrapped sets are used to generate the following trees
All have a similar structure (using similar predictors)
 
Tree predictions from the four bootstrapped trees, with their average:

| medv (actual) | tree_1 | tree_2 | tree_3 | tree_4 | tree_mean |
|---|---|---|---|---|---|
| 24.0 | 23.88 | 24.63 | 27.30 | 31.63 | 26.86 | 
| 34.7 | 33.79 | 31.78 | 34.86 | 46.45 | 36.72 | 
| 28.7 | 23.88 | 24.63 | 21.79 | 22.04 | 23.08 | 
| 27.1 | 15.71 | 15.47 | 16.00 | 17.04 | 16.06 | 
| 16.5 | 15.71 | 15.47 | 16.00 | 17.04 | 16.06 | 
| 18.9 | 20.85 | 20.81 | 21.79 | 22.04 | 21.37 | 
With 50 bootstrap replications, the OOB RMSE is \(4.099\)
Training and test results for up to 100 bootstrap replications:
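These results could be reproduced along the following lines with the randomForest package, since bagging is a random forest with `mtry` equal to the number of predictors; the exact numbers depend on the random seed and on the train/test split, which are not given here:

```r
library(MASS)           # Boston housing data
library(randomForest)

set.seed(1)
# Bagging = random forest that considers all 13 predictors at every split
bag_fit <- randomForest(medv ~ ., data = Boston, mtry = 13, ntree = 100)

sqrt(bag_fit$mse[50])                     # OOB RMSE after 50 trees
plot(sqrt(bag_fit$mse), type = "l",
     xlab = "Number of trees", ylab = "OOB RMSE")  # error vs. replications
```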
A Partial Dependence Plot (PDP) is another way to understand these relationships
Let's construct PDPs for the four most influential predictors
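A sketch of the PDPs, reusing `bag_fit` and the Boston data from the sketch above; the four predictors are hard-coded to lstat, rm, dis, and nox, matching the importance ranking reported later in these notes, which is an assumption for this particular fit:

```r
library(randomForest)

# Partial dependence of medv on four influential predictors
op <- par(mfrow = c(2, 2))
partialPlot(bag_fit, pred.data = Boston, x.var = "lstat")
partialPlot(bag_fit, pred.data = Boston, x.var = "rm")
partialPlot(bag_fit, pred.data = Boston, x.var = "dis")
partialPlot(bag_fit, pred.data = Boston, x.var = "nox")
par(op)
```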
Bagging improves the prediction accuracy for high variance (and low bias) models
The use of bagging tends to reduce a model's interpretability: variable importance measures and Partial Dependence Plots help us interpret the bagged learner
Bagging increases the computational cost, but it is easily parallelizable since it consists of independent model fits
Although the sub-samples are drawn and the models are fitted independently, the fitted models are not completely independent of each other (bagging with trees results in correlated trees)
Random Forests enhance the performance of bagged decision trees by reducing the correlation among individual trees, which in turn increases the accuracy of the overall ensemble
Bagging is used to reduce the variance
The bootstrap samples result in correlated models, since sampling with replacement produces highly overlapping samples
Random forests provide an improvement over bagged trees by way of a small tweak that de-correlates the trees. This reduces the variance when we average the trees
When building these decision trees, each time a split in a tree is considered, a random selection of \(m\) predictors is chosen as split candidates from the full set of \(p\) predictors
Typically we choose the number of predictors considered at each split so that it is approximately equal to the square root of the total number of predictors \(m \approx \sqrt{p}\)
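A random forest sketch with the randomForest package, using \(m = 4 \approx \sqrt{13}\) predictors per split and 500 trees as in the results below (the seed is an arbitrary choice):

```r
library(MASS)
library(randomForest)

set.seed(1)
# Random forest: only m = 4 randomly chosen predictors are tried at each split
rf_fit <- randomForest(medv ~ ., data = Boston, mtry = 4, ntree = 500)
tail(rf_fit$mse, 1)   # OOB error (mean squared residuals) after 500 trees
```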
OOB Error, 500 trees with m = 4 variables: 11.2995 
 
OOB Error, 500 trees with m = 5 variables: 10.89118
 
 
OOB error for different numbers of variables tried at each split:

| mtry | OOB Error |
|---|---|
| 2 | 14.91960 | 
| 3 | 12.14668 | 
| 4 | 11.27344 | 
| 6 | 10.80146 | 
| 9 | 10.36396 | 
| 13 | 10.71012 | 
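The table above can be produced by refitting over a small set of `mtry` values, for example (reusing the Boston data loaded above; exact values depend on the seed):

```r
mtry_grid <- c(2, 3, 4, 6, 9, 13)

oob_error <- sapply(mtry_grid, function(m) {
  fit <- randomForest(medv ~ ., data = Boston, mtry = m, ntree = 500)
  tail(fit$mse, 1)                    # OOB error after 500 trees
})

data.frame(mtry = mtry_grid, OOB_Error = oob_error)
```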
Boosting is another general approach that can be applied to many statistical learning methods for regression or classification
Trees are built sequentially, each one fitted to the errors (residuals) of the previous model
Boosting learns slowly, using small trees and the shrinkage parameter \(\lambda\)
There are three tuning parameters: the number of trees \(B\) (a very large \(B\) may result in overfitting), the shrinkage \(\lambda\) (generally a small value such as \(0.01\) or \(0.001\)), and the number of splits in each tree, \(d\); often \(d=1\) (a stump) works well
Algorithm 8.2 (page 323 of ISLR)
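The following is a minimal from-scratch sketch of the idea in Algorithm 8.2, fitting small rpart trees to the residuals and adding them with shrinkage; `train` and its response column `y` are placeholders, and `maxdepth = d` stands in for the generic d-split trees:

```r
library(rpart)

boost_sketch <- function(train, B = 1000, lambda = 0.01, d = 1) {
  r <- train$y                               # 1. residuals start as the response
  trees <- vector("list", B)
  for (b in seq_len(B)) {
    # 2(a). fit a small tree to the current residuals
    fit <- rpart(r ~ . - y, data = transform(train, r = r),
                 control = rpart.control(maxdepth = d, cp = 0))
    trees[[b]] <- fit
    # 2(c). update the residuals with the shrunken tree
    r <- r - lambda * predict(fit, newdata = train)
  }
  list(trees = trees, lambda = lambda)       # 3. boosted model: sum of shrunken trees
}

predict_boost <- function(model, newdata) {
  model$lambda * Reduce(`+`, lapply(model$trees, predict, newdata = newdata))
}
```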
Simulated data: \(y = \sin(x) + \varepsilon, \; \varepsilon \sim N(0, 0.3^2)\)
Number of trees \(B=1000\)
Shrinkage, \(\lambda=0.1\)
Tree depth (number of splits in each tree), \(d=1\) (stump)
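A sketch of this simulated example with the gbm package, using the three parameters above; the sample size and the range of \(x\) are assumptions, since only the noise level is given:

```r
library(gbm)

set.seed(1)
n   <- 200
x   <- runif(n, 0, 2 * pi)
y   <- sin(x) + rnorm(n, sd = 0.3)
sim <- data.frame(x = x, y = y)

boost_sim <- gbm(y ~ x, data = sim, distribution = "gaussian",
                 n.trees = 1000, shrinkage = 0.1, interaction.depth = 1)

# How the fit evolves: predictions after 1, 10, 100, and 1000 trees
grid  <- data.frame(x = seq(0, 2 * pi, length.out = 200))
preds <- predict(boost_sim, newdata = grid, n.trees = c(1, 10, 100, 1000))

plot(x, y, col = "grey")
matlines(grid$x, preds, lty = 1)
```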
The Boston data are split into training and test sets
A model is fitted using the following parameters:
It is possible to tune these parameters
First, we need to create a grid that contains all candidate parameter values (a sketch follows the table below)
Parameter grid (only the first ten rows are shown):

| Lambda | Tree Depth | Min. Obs. | Bag Fraction |
|---|---|---|---|
| 0.01 | 1 | 5 | 0.65 | 
| 0.05 | 1 | 5 | 0.65 | 
| 0.10 | 1 | 5 | 0.65 | 
| 0.01 | 3 | 5 | 0.65 | 
| 0.05 | 3 | 5 | 0.65 | 
| 0.10 | 3 | 5 | 0.65 | 
| 0.01 | 5 | 5 | 0.65 | 
| 0.05 | 5 | 5 | 0.65 | 
| 0.10 | 5 | 5 | 0.65 | 
| 0.01 | 1 | 10 | 0.65 | 
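A grid like this can be built with expand.grid; the candidate values below are read off the grid and results tables:

```r
# Candidate values for each gbm tuning parameter
hyper_grid <- expand.grid(
  shrinkage         = c(0.01, 0.05, 0.10),   # lambda
  interaction.depth = c(1, 3, 5),            # tree depth
  n.minobsinnode    = c(5, 10, 15),          # min. observations per node
  bag.fraction      = c(0.65, 0.80, 1.00)    # subsampling fraction
)

nrow(hyper_grid)      # 81 parameter combinations
head(hyper_grid, 10)  # first ten rows, as in the table above
```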
Parameter grid results (only the first ten rows are shown, sorted by minimum RMSE):

| Lambda | Tree Depth | Min. Obs. | Bag Fraction | Optimal Trees | Min. RMSE |
|---|---|---|---|---|---|
| 0.10 | 3 | 10 | 0.80 | 519 | 3.466070 | 
| 0.05 | 3 | 10 | 0.80 | 928 | 3.549038 | 
| 0.10 | 3 | 15 | 1.00 | 516 | 3.570331 | 
| 0.05 | 3 | 10 | 1.00 | 579 | 3.603082 | 
| 0.10 | 5 | 10 | 0.80 | 273 | 3.604136 | 
| 0.05 | 3 | 10 | 0.65 | 818 | 3.617805 | 
| 0.05 | 5 | 10 | 0.80 | 972 | 3.628066 | 
| 0.05 | 3 | 15 | 1.00 | 763 | 3.630852 | 
| 0.10 | 3 | 10 | 1.00 | 271 | 3.650064 | 
| 0.10 | 3 | 15 | 0.80 | 387 | 3.687217 | 
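A sketch of the search over this grid, assuming each combination is fitted with gbm on the training set and 5-fold cross-validation picks the optimal number of trees (the notes do not state the exact validation scheme, so the numbers will differ):

```r
library(MASS)
library(gbm)

set.seed(1)
train_idx <- sample(nrow(Boston), round(0.7 * nrow(Boston)))  # assumed 70/30 split
train     <- Boston[train_idx, ]

results <- lapply(seq_len(nrow(hyper_grid)), function(i) {    # hyper_grid from above
  fit <- gbm(medv ~ ., data = train, distribution = "gaussian",
             n.trees = 1000,
             shrinkage         = hyper_grid$shrinkage[i],
             interaction.depth = hyper_grid$interaction.depth[i],
             n.minobsinnode    = hyper_grid$n.minobsinnode[i],
             bag.fraction      = hyper_grid$bag.fraction[i],
             cv.folds = 5, verbose = FALSE)
  best <- which.min(fit$cv.error)       # number of trees with lowest CV error
  data.frame(hyper_grid[i, ],
             optimal_trees = best,
             min_RMSE      = sqrt(fit$cv.error[best]))
})
results <- do.call(rbind, results)

head(results[order(results$min_RMSE), ], 10)   # ten best combinations
```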
Relative variable importance (all predictors, most influential first):

| Variable | Rel. Influence |
|---|---|
| lstat | 43.22172095 | 
| rm | 31.67702890 | 
| dis | 6.89444926 | 
| nox | 5.24821280 | 
| ptratio | 3.74173961 | 
| crim | 3.28965497 | 
| age | 2.53138349 | 
| black | 0.98147906 | 
| tax | 0.91887410 | 
| chas | 0.57786684 | 
| rad | 0.52248694 | 
| indus | 0.33829170 | 
| zn | 0.05681139 |
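Relative influence values like these come from summary() on a fitted gbm object; as a sketch, the best combination from the results table is refitted here, reusing `train` from the sketch above:

```r
library(gbm)

set.seed(1)
best_fit <- gbm(medv ~ ., data = train, distribution = "gaussian",
                n.trees = 519, shrinkage = 0.10, interaction.depth = 3,
                n.minobsinnode = 10, bag.fraction = 0.80)

# Returns (and optionally plots) the relative influence of each predictor
summary(best_fit, n.trees = 519, plotit = FALSE)
```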