MIS 302

Cross Validation and Bootstrapping

I. Ozkan

Spring 2025

Preliminary Readings

Learning Objectives

Validation Approach

An Example for Illustration

Auto data from the ISLR package (first six rows)
 mpg  cylinders  displacement  horsepower  weight  acceleration  year  origin  name
  18          8           307         130    3504          12.0    70       1  chevrolet chevelle malibu
  15          8           350         165    3693          11.5    70       1  buick skylark 320
  18          8           318         150    3436          11.0    70       1  plymouth satellite
  16          8           304         150    3433          12.0    70       1  amc rebel sst
  17          8           302         140    3449          10.5    70       1  ford torino
  15          8           429         198    4341          10.0    70       1  ford galaxie 500

An Example

Minimum MSE Values vs Polynomial Degree for All Samples
Polynomial Degree   Min. Training MSE   Min. Test MSE
                1              22.291          20.416
                2              17.493          16.767
                3              17.350          16.701
                4              17.273          16.674
                5              16.817          16.025
                6              16.674          15.710
                7              16.529          15.381
                8              16.529          15.420
                9              16.522          15.667
               10              16.517          16.088
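
The validation approach behind a table like this can be sketched in a few lines of R. This is a minimal sketch for a single random split, not the code that produced the table above (which reports minima over several samples). The seed, the 50/50 split, the helper `mse()`, and the fallback to the built-in `mtcars` data when the ISLR package is not installed are all assumptions made for illustration.

```r
# Data: ISLR's Auto if the package is installed; otherwise the built-in
# mtcars data serves as a stand-in (assumption, for runnability only).
use_auto <- requireNamespace("ISLR", quietly = TRUE)
if (use_auto) {
  dat <- ISLR::Auto[, c("mpg", "horsepower")]
} else {
  dat <- data.frame(mpg = mtcars$mpg, horsepower = mtcars$hp)
}
max_degree <- if (use_auto) 10 else 5   # the stand-in data is too small for degree 10

set.seed(1)                                   # arbitrary seed
n <- nrow(dat)
train_idx <- sample(n, size = floor(n / 2))   # random half of the rows for training
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Hypothetical helper: mean squared prediction error of a fit on a data set
mse <- function(fit, newdata) mean((newdata$mpg - predict(fit, newdata))^2)

degrees <- seq_len(max_degree)
results <- t(sapply(degrees, function(d) {
  fit <- lm(mpg ~ poly(horsepower, d), data = train)   # fit on the training half only
  c(degree = d, train_mse = mse(fit, train), test_mse = mse(fit, test))
}))
results <- as.data.frame(results)
print(results)
```

Training MSE cannot increase as the degree grows because the models are nested, while test MSE depends on which rows land in the validation half. Rerunning with a different seed shows how variable the validation estimate is.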

Leave-One-Out Cross-Validation (LOOCV)

\(CV_{(n)} = \frac{1}{n}\sum^n_{i=1}MSE_i\)

For the linear regression \(mpg_i=\beta_0+\beta_1horsepower_i+\varepsilon_i\), the LOOCV estimate of the test MSE is 24.232.
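
A minimal sketch of how a LOOCV estimate for this regression might be computed in R. It does the computation twice: by refitting the model n times, and by the leverage shortcut that holds for least-squares fits, where the i-th left-out residual equals the ordinary residual divided by one minus the i-th hat value. The fallback to `mtcars` when ISLR is not installed is an assumption for runnability.

```r
# Data: ISLR's Auto if available; otherwise mtcars as a stand-in (assumption).
if (requireNamespace("ISLR", quietly = TRUE)) {
  dat <- ISLR::Auto[, c("mpg", "horsepower")]
} else {
  dat <- data.frame(mpg = mtcars$mpg, horsepower = mtcars$hp)
}
n <- nrow(dat)

# Direct approach: leave observation i out, fit, predict it, record the squared error.
sq_err <- numeric(n)
for (i in seq_len(n)) {
  fit_i  <- lm(mpg ~ horsepower, data = dat[-i, ])
  pred_i <- predict(fit_i, newdata = dat[i, , drop = FALSE])
  sq_err[i] <- (dat$mpg[i] - pred_i)^2
}
cv_loop <- mean(sq_err)          # CV_(n) = average of the n single-point MSEs

# Shortcut for least squares: one fit on all data, residuals rescaled by leverage.
fit <- lm(mpg ~ horsepower, data = dat)
cv_shortcut <- mean((residuals(fit) / (1 - hatvalues(fit)))^2)

print(c(loop = cv_loop, shortcut = cv_shortcut))
```

Both numbers agree up to floating-point error. When run on the Auto data, the result can be compared with the estimate quoted above.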

k-Fold Cross Validation

\(CV_{(k)} = \frac{1}{k}\sum^k_{j=1}MSE_j\)
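
The formula can be sketched by hand in R as follows. This is an illustrative sketch only: k = 10, the seed, the simple linear model, and the `mtcars` fallback are assumptions, not choices taken from the slides.

```r
# Data: ISLR's Auto if available; otherwise mtcars as a stand-in (assumption).
if (requireNamespace("ISLR", quietly = TRUE)) {
  dat <- ISLR::Auto[, c("mpg", "horsepower")]
} else {
  dat <- data.frame(mpg = mtcars$mpg, horsepower = mtcars$hp)
}

set.seed(1)                      # arbitrary seed
k <- 10
n <- nrow(dat)
# Assign each row to one of k folds at random, with near-equal fold sizes.
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold j: fit on the other k - 1 folds, compute the MSE on fold j.
fold_mse <- sapply(seq_len(k), function(j) {
  held_out <- fold == j
  fit <- lm(mpg ~ horsepower, data = dat[!held_out, ])
  mean((dat$mpg[held_out] - predict(fit, dat[held_out, ]))^2)
})

cv_k <- mean(fold_mse)           # CV_(k) = average of the k fold MSEs
print(cv_k)
```

Setting k equal to n reproduces LOOCV. For glm fits, `cv.glm()` in the boot package performs a comparable computation through its `K` argument.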

Bootstrapping

\(SE_B(\hat\alpha) = \sqrt{\frac{1}{B-1}\sum^B_{r=1}\bigg(\hat\alpha^{*r}-\frac{1}{B}\sum^B_{r'=1}\hat\alpha^{*r'}\bigg)^2}\)

Bootstrap: An Example (ISLR, page 187)

\(\hat\alpha = \frac{\hat\sigma^2_Y - \hat\sigma_{XY}}{\hat\sigma^2_X +\hat\sigma^2_Y-2\hat\sigma_{XY}}\)
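
For context, a short derivation of where this expression comes from. In the ISLR example, \(\alpha\) is the fraction invested in asset \(X\) (with \(1-\alpha\) in \(Y\)), chosen to minimise the variance of the combined return. Writing \(\sigma^2_X\), \(\sigma^2_Y\), \(\sigma_{XY}\) for the population variances and covariance:

\(\mathrm{Var}(\alpha X + (1-\alpha)Y) = \alpha^2\sigma^2_X + (1-\alpha)^2\sigma^2_Y + 2\alpha(1-\alpha)\sigma_{XY}\)

Setting the derivative with respect to \(\alpha\) to zero:

\(2\alpha\sigma^2_X - 2(1-\alpha)\sigma^2_Y + 2(1-2\alpha)\sigma_{XY} = 0\)

\(\alpha\,(\sigma^2_X + \sigma^2_Y - 2\sigma_{XY}) = \sigma^2_Y - \sigma_{XY}\)

\(\alpha = \frac{\sigma^2_Y - \sigma_{XY}}{\sigma^2_X + \sigma^2_Y - 2\sigma_{XY}}\)

The second derivative is \(2(\sigma^2_X + \sigma^2_Y - 2\sigma_{XY}) = 2\,\mathrm{Var}(X-Y) \ge 0\), so this is a minimum. Replacing the population quantities with sample estimates gives \(\hat\alpha\).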

Bootstrap   Alpha
        1   0.4483
        2   0.5609
        3   0.5053
        4   0.6836
        5   0.6108
        6   0.5820
        7   0.5013
        8   0.5379
        9   0.6151
       10   0.5374
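
To make the \(SE_B\) formula concrete, the sketch below applies it to the ten values listed above, treating them as bootstrap replicates of \(\hat\alpha\) (an assumption about what the table shows). With only B = 10 replicates the result is purely illustrative. The vector name `alpha_star` is hypothetical.

```r
# The ten alpha values from the table above, treated as bootstrap replicates.
alpha_star <- c(0.4483, 0.5609, 0.5053, 0.6836, 0.6108,
                0.5820, 0.5013, 0.5379, 0.6151, 0.5374)
B <- length(alpha_star)

# SE_B: square root of the sum of squared deviations from the replicate mean,
# divided by B - 1.
se_b <- sqrt(sum((alpha_star - mean(alpha_star))^2) / (B - 1))
print(se_b)

# The formula is the sample standard deviation of the replicates,
# so sd() gives the same number.
print(sd(alpha_star))
```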


ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Portfolio, statistic = statistic, R = 1000)


Bootstrap Statistics :
     original      bias    std. error
t1* 0.5758321 0.004719558  0.09020046
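
Output of this kind comes from `boot()` in the boot package. The call above passes a function named `statistic` whose body is not shown in the slides, so the sketch below supplies a hypothetical version, `alpha_fn`, based on the \(\hat\alpha\) formula. The seed and the simulated fallback data (used only if ISLR is not installed) are likewise assumptions.

```r
library(boot)

# Data: ISLR's Portfolio if available; otherwise simulated X and Y (assumption).
if (requireNamespace("ISLR", quietly = TRUE)) {
  Portfolio <- ISLR::Portfolio
} else {
  set.seed(1)
  X <- rnorm(100)
  Portfolio <- data.frame(X = X, Y = 0.5 * X + rnorm(100))
}

# boot() calls the statistic with the data and a vector of resampled row indices.
alpha_fn <- function(data, index) {
  X <- data$X[index]
  Y <- data$Y[index]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}

set.seed(1)                                      # arbitrary seed
boot_out <- boot(data = Portfolio, statistic = alpha_fn, R = 1000)
print(boot_out)

# The "std. error" column is the standard deviation of the R replicates in boot_out$t.
se_boot <- sd(boot_out$t[, 1])
print(se_boot)
```

In the result, `boot_out$t0` holds the estimate on the original data and `boot_out$t` holds the R bootstrap replicates. Exact numbers depend on the seed.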

Bootstrap Regression Example

\(ln(wage)=\beta_0 + \beta_1 \: experience+ \beta_2 \: experience^2 + \beta_3 \: education + \beta_4 \: ethnicity + \varepsilon\)


Call:
lm(formula = log(wage) ~ experience + I(experience^2) + education + 
    ethnicity, data = CPS1988)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9428 -0.3162  0.0580  0.3756  4.3830 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4.321e+00  1.917e-02  225.38   <2e-16 ***
experience       7.747e-02  8.800e-04   88.03   <2e-16 ***
I(experience^2) -1.316e-03  1.899e-05  -69.31   <2e-16 ***
education        8.567e-02  1.272e-03   67.34   <2e-16 ***
ethnicityafam   -2.434e-01  1.292e-02  -18.84   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5839 on 28150 degrees of freedom
Multiple R-squared:  0.3347,    Adjusted R-squared:  0.3346 
F-statistic:  3541 on 4 and 28150 DF,  p-value: < 2.2e-16

Regression Coefficients CI
Variable              2.5%     97.5%
(Intercept)        4.28381   4.35898
experience         0.07575   0.07920
I(experience^2)   -0.00135  -0.00128
education          0.08318   0.08817
ethnicityafam     -0.26868  -0.21804

Bootstrapped Regression Coefficients CI
Variable              2.5%     97.5%
(Intercept)        4.28110   4.36191
experience         0.07548   0.07948
I(experience^2)   -0.00136  -0.00127
education          0.08297   0.08836
ethnicityafam     -0.26951  -0.21739
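
One way such bootstrapped intervals might be obtained is by resampling rows (cases) with replacement, refitting the regression each time, and taking the 2.5% and 97.5% percentiles of each coefficient. The slides do not state which bootstrap variant was used, so the percentile method, B = 200, and the seed are assumptions. The sketch loads `CPS1988` from the AER package if it is installed and otherwise falls back to simulated data with the same variable names.

```r
# Data: CPS1988 from AER if available; otherwise a simulated stand-in (assumption).
if (requireNamespace("AER", quietly = TRUE)) {
  data("CPS1988", package = "AER")
  dat <- CPS1988
} else {
  set.seed(1)
  m <- 2000
  experience <- round(runif(m, 0, 40))
  education  <- round(runif(m, 8, 18))
  ethnicity  <- factor(sample(c("cauc", "afam"), m, replace = TRUE, prob = c(0.9, 0.1)),
                       levels = c("cauc", "afam"))
  wage <- exp(4.3 + 0.08 * experience - 0.0013 * experience^2 +
              0.09 * education - 0.24 * (ethnicity == "afam") + rnorm(m, sd = 0.6))
  dat <- data.frame(wage, experience, education, ethnicity)
}

form <- log(wage) ~ experience + I(experience^2) + education + ethnicity
fit  <- lm(form, data = dat)
ci_lm <- confint(fit)                 # classical t-based 95% intervals

set.seed(1)                           # arbitrary seed
B <- 200                              # kept small so the sketch runs quickly
n <- nrow(dat)
# Each replicate: resample n rows with replacement, refit, keep the coefficients.
boot_coef <- t(replicate(B, {
  idx <- sample(n, n, replace = TRUE)
  coef(lm(form, data = dat[idx, ]))
}))

# Percentile intervals: 2.5% and 97.5% quantiles of each coefficient's replicates.
ci_boot <- t(apply(boot_coef, 2, quantile, probs = c(0.025, 0.975)))

print(ci_lm)
print(ci_boot)
```

With a sample as large as this one, the two sets of intervals are expected to be close, as in the tables above.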