I. Ozkan
Spring 2025
(k-fold) Cross Validation
Repeated Cross Validation
Leave-One-Out Cross-Validation
Bootstrap
The data set is (in general) split into two subsets called the training data and the test data (also called the validation or hold-out set); this is often called the validation or hold-out approach
The training data set is used to fit the model, and the test data set is used to assess the performance of the model (using observations that are new to the model)
The training error rate is often quite different from the test error rate; the former can dramatically underestimate the latter.
To better estimate the test error rate, one approach is to hold out a subset of the training observations from the fitting process and then apply the statistical learning method to those held-out observations
If there is a non-linear relationship between the response and a predictor, polynomial regression may be used, and choosing the polynomial degree is important.
The estimated coefficients can be assessed with p-values
Calibration, model selection, and assessment of coefficients can be performed using a validation set (Model Evaluation: see related slides)
Example: the Auto data set of the ISLR package is used for illustration
mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
---|---|---|---|---|---|---|---|---|
18 | 8 | 307 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
15 | 8 | 350 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
18 | 8 | 318 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
16 | 8 | 304 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
17 | 8 | 302 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
15 | 8 | 429 | 198 | 4341 | 10.0 | 70 | 1 | ford galaxie 500 |
The validation set is obtained by randomly splitting the observations into training and test data. In this example, 30% of the observations are selected randomly for validation
The fitted models are polynomial models (using orthogonal polynomials). The linear model is \(mpg=\beta_0+\beta_1 \cdot horsepower+\varepsilon\)
Assess the change in \(MSE\) for both the training and test data sets, as in the sketch below
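A minimal R sketch of this experiment, assuming a 70/30 split and orthogonal polynomials via poly() (seed and object names are illustrative):

```r
library(ISLR)                        # provides the Auto data set
set.seed(1)                          # illustrative seed

n     <- nrow(Auto)
test  <- sample(n, round(0.3 * n))   # 30% of observations held out
train <- setdiff(seq_len(n), test)

for (d in 1:10) {
  fit  <- lm(mpg ~ poly(horsepower, d), data = Auto, subset = train)
  pred <- predict(fit, Auto)
  cat(sprintf("degree %2d: train MSE %.3f, test MSE %.3f\n", d,
              mean((Auto$mpg[train] - pred[train])^2),
              mean((Auto$mpg[test]  - pred[test])^2)))
}
```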
As seen from both the figure on the previous slide and the table below, the test MSE values suggest using a second-degree polynomial, since increasing the polynomial degree further does not provide a significant performance gain for the randomly split samples
Since randomly selected training data sets are used to fit the model, the test data set MSEs are generally overestimated (as seen in the figure on the previous slide)
Polynomial Degree | Min. Training MSE | Min. Test MSE |
---|---|---|
1 | 22.291 | 20.416 |
2 | 17.493 | 16.767 |
3 | 17.350 | 16.701 |
4 | 17.273 | 16.674 |
5 | 16.817 | 16.025 |
6 | 16.674 | 15.710 |
7 | 16.529 | 15.381 |
8 | 16.529 | 15.420 |
9 | 16.522 | 15.667 |
10 | 16.517 | 16.088 |
Similar to the validation set approach, but the validation set contains only one observation
For each observation \((x_i, y_i), \; i=1,2,\dots,n\), use the remaining \(n-1\) observations as the training set, then predict \(\hat y_i\) and calculate \(MSE_i=(y_i-\hat y_i)^2\)
The average of these \(n\) test MSEs is the LOOCV estimate
\(CV_{(n)} = \frac{1}{n}\sum^n_{i=1}MSE_i\)
The LOOCV estimate for the linear regression \(mpg_i=\beta_0+\beta_1 \cdot horsepower_i+\varepsilon_i\) is 24.232
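In R, the LOOCV estimate can be computed with cv.glm() from the boot package (a glm() fit with the default Gaussian family is identical to lm() here); a minimal sketch:

```r
library(ISLR)
library(boot)

glm.fit <- glm(mpg ~ horsepower, data = Auto)  # same fit as lm()
cv.err  <- cv.glm(Auto, glm.fit)               # K defaults to n, i.e. LOOCV
cv.err$delta[1]                                # LOOCV estimate, approx. 24.23
```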
The LOOCV approach is very computationally expensive for large data sets.
An alternative approach is to use k-Fold Cross Validation
Randomly split the data into k groups (folds) of approximately equal size
Use the first fold as a validation set, fit the model using the remaining data, and calculate \(MSE_1\)
Repeat the above step for the remaining \(k-1\) folds and calculate \(MSE_j, \; j=2,\dots,k\)
Thus, the k-fold Cross Validation estimate is
\(CV_{(k)} = \frac{1}{k}\sum^k_{j=1}MSE_j\)
LOOCV is a special case of the k-fold approach where k is set to the number of observations
Five to ten folds produce good estimates of the test error rate
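The same cv.glm() function performs k-fold CV through its K argument; a sketch of 10-fold CV across polynomial degrees (seed illustrative):

```r
library(ISLR)
library(boot)
set.seed(17)                                   # illustrative seed

cv.error <- rep(0, 10)
for (d in 1:10) {
  glm.fit     <- glm(mpg ~ poly(horsepower, d), data = Auto)
  cv.error[d] <- cv.glm(Auto, glm.fit, K = 10)$delta[1]
}
cv.error                                       # 10-fold CV error by degree
```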
Bootstrapping is a very powerful statistical tool to quantify the uncertainty associated with a given estimator
It involves repeatedly drawing samples from the data set with replacement
An example of a data set with three observations (n=3) and bootstrap samples drawn from it is shown in the figure below:
Each bootstrap data set \(Z^{*1}, Z^{*2}, \dots, Z^{*B}\) contains three observations (n=3) and is used to compute the estimated statistic we are interested in, say \(\hat \alpha^*\)
All B bootstrap estimates \(\hat\alpha^{*1}, \hat\alpha^{*2}, \dots, \hat\alpha^{*B}\) are then used to compute the standard error of \(\hat\alpha\):
\(SE_B(\hat\alpha) = \sqrt{\frac{1}{B-1}\sum^B_{r=1}\bigg(\hat\alpha^{*r}-\frac{1}{B}\sum^B_{r'=1}\hat\alpha^{*r'}\bigg)^2}\)
“Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns of \(X\) and \(Y\) , respectively, where \(X\) and \(Y\) are random quantities. We will invest a fraction \(\alpha\) of our money in \(X\), and will invest the remaining \(1-\alpha\) in \(Y\)”
Since variance is used as a risk measure, we want to minimize \(Var(\alpha X+(1-\alpha)Y)\); the value of \(\alpha\) that minimizes this risk is estimated by:
\(\hat\alpha = \frac{\hat\sigma^2_Y - \hat\sigma_{XY}}{\hat\sigma^2_X +\hat\sigma^2_Y-2\hat\sigma_{XY}}\)
The estimate of \(\hat \alpha\) for the Portfolio data of the ISLR package is 0.5758
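A sketch of this estimator written as an R function in the form boot() expects (the data plus an index vector of resampled rows); the function name is illustrative:

```r
library(ISLR)               # provides the Portfolio data (columns X and Y)

# alpha-hat computed on the rows selected by 'index'
alpha.fn <- function(data, index) {
  X <- data$X[index]
  Y <- data$Y[index]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}

alpha.fn(Portfolio, 1:100)  # full-sample estimate: 0.5758
```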
The estimated \(\hat \alpha\) for 10 bootstrapped samples using the Portfolio data of the ISLR package is shown below.
Bootstrap | Alpha |
---|---|
1 | 0.4483 |
2 | 0.5609 |
3 | 0.5053 |
4 | 0.6836 |
5 | 0.6108 |
6 | 0.5820 |
7 | 0.5013 |
8 | 0.5379 |
9 | 0.6151 |
10 | 0.5374 |
The average of the estimated \(\hat \alpha\) values is 0.5583
Let’s obtain the distribution of \(\hat \alpha\) using 1000 bootstrap replications
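A sketch of the boot() call that produces output like the one below (seed illustrative; alpha.fn as defined earlier):

```r
library(boot)
set.seed(1)                                       # illustrative seed
boot(Portfolio, statistic = alpha.fn, R = 1000)   # 1000 bootstrap replications
```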
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Portfolio, statistic = statistic, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 0.5758321 0.004719558 0.09020046
Let’s use the wage and education data that comes with the AER package
The data contains the \(wage\), \(education\), \(experience\), and \(ethnicity\) variables that we are going to use in this example
The basic model is:
\(ln(wage)=\beta_0 + \beta_1 \: experience+ \beta_2 \: experience^2 + \beta_3 \: education + \beta_4 \: ethnicity + \varepsilon\)
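A sketch of the corresponding fit, assuming the CPS1988 data shipped with AER (which matches the output below):

```r
library(AER)
data("CPS1988")

fit <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity,
          data = CPS1988)
summary(fit)
```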
Call:
lm(formula = log(wage) ~ experience + I(experience^2) + education +
ethnicity, data = CPS1988)
Residuals:
Min 1Q Median 3Q Max
-2.9428 -0.3162 0.0580 0.3756 4.3830
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.321e+00 1.917e-02 225.38 <2e-16 ***
experience 7.747e-02 8.800e-04 88.03 <2e-16 ***
I(experience^2) -1.316e-03 1.899e-05 -69.31 <2e-16 ***
education 8.567e-02 1.272e-03 67.34 <2e-16 ***
ethnicityafam -2.434e-01 1.292e-02 -18.84 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5839 on 28150 degrees of freedom
Multiple R-squared: 0.3347, Adjusted R-squared: 0.3346
F-statistic: 3541 on 4 and 28150 DF, p-value: < 2.2e-16
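The first table of intervals below is consistent with conventional large-sample 95% confidence intervals; a minimal sketch, assuming they come from confint():

```r
confint(fit, level = 0.95)   # conventional 95% confidence intervals
```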
Variable | 2.5% | 97.5% |
---|---|---|
(Intercept) | 4.28381 | 4.35898 |
experience | 0.07575 | 0.07920 |
I(experience^2) | -0.00135 | -0.00128 |
education | 0.08318 | 0.08817 |
ethnicityafam | -0.26868 | -0.21804 |
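The slightly wider intervals in the next table are consistent with bootstrap confidence intervals; a hedged sketch of one way such intervals could be produced with boot() (function and object names illustrative; refitting the model 1000 times on roughly 28,000 rows takes a while):

```r
library(boot)
set.seed(1)                               # illustrative seed

# statistic: refit the model on a bootstrap resample of the rows
coef.fn <- function(data, index) {
  coef(lm(log(wage) ~ experience + I(experience^2) + education + ethnicity,
          data = data[index, ]))
}

boot.out <- boot(CPS1988, coef.fn, R = 1000)

# percentile 95% interval for each coefficient
t(sapply(seq_along(boot.out$t0), function(j)
  boot.ci(boot.out, type = "perc", index = j)$percent[4:5]))
```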
Variable | 2.5% | 97.5% |
---|---|---|
(Intercept) | 4.28110 | 4.36191 |
experience | 0.07548 | 0.07948 |
I(experience^2) | -0.00136 | -0.00127 |
education | 0.08297 | 0.08836 |
ethnicityafam | -0.26951 | -0.21739 |