I. Ozkan
Spring 2025
(k-fold) Cross Validation
Repeated Cross Validation
Leave-One-Out Cross-Validation
Bootstrap
The data set is (in general) split into two subsets called the training data and the test data (also called the validation or hold-out set); this is often called the validation or hold-out approach
The training data set is used to fit the model, and the test data set is used to assess the performance of the model (using observations that are new to the model)
The training error rate is often quite different from the test error rate; the former can dramatically underestimate the latter.
To better estimate the test error rate, one approach is to hold out a subset of the training observations from the fitting process and then apply the statistical learning method to those held-out observations
If there is a non-linear relationship between the response and a predictor, polynomial regression may be used, and choosing the polynomial degree is important.
The estimated coefficients can be assessed with p-values
Calibration, model selection, and assessment of coefficients can be performed using a validation set (Model Evaluation: see related slides)
Example: the Auto data set of the ISLR package is used for illustration
mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
---|---|---|---|---|---|---|---|---|
18 | 8 | 307 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
15 | 8 | 350 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
18 | 8 | 318 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
16 | 8 | 304 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
17 | 8 | 302 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
15 | 8 | 429 | 198 | 4341 | 10.0 | 70 | 1 | ford galaxie 500 |
The validation set is obtained by randomly splitting the observations into training and test data. In this example, 30% of the observations are selected randomly for validation
The fitted models are polynomial models (using orthogonal polynomials). The linear model is \(mpg=\beta_0+\beta_1 \cdot horsepower+\varepsilon\)
Assess the change in \(MSE\) for both the training and test data sets, as in the sketch below
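A minimal R sketch of this experiment, assuming a 70/30 split and orthogonal polynomials via poly() (seed and object names are illustrative):

```r
library(ISLR)                        # provides the Auto data set
set.seed(1)                          # illustrative seed

n     <- nrow(Auto)
test  <- sample(n, round(0.3 * n))   # 30% of observations held out
train <- setdiff(seq_len(n), test)

for (d in 1:10) {
  fit  <- lm(mpg ~ poly(horsepower, d), data = Auto, subset = train)
  pred <- predict(fit, Auto)
  cat(sprintf("degree %2d: train MSE %.3f, test MSE %.3f\n", d,
              mean((Auto$mpg[train] - pred[train])^2),
              mean((Auto$mpg[test]  - pred[test])^2)))
}
```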
As seen from both the figure on the previous slide and the table below, the test MSE values suggest using a second-degree polynomial, since increasing the polynomial degree further does not provide a significant performance gain for the randomly split samples
Since randomly selected training data sets are used to fit the model, the test data set MSEs are generally overestimated (as seen in the figure on the previous slide)
Polynomial Degree | Min. Training MSE | Min. Test MSE |
---|---|---|
1 | 22.291 | 20.416 |
2 | 17.493 | 16.767 |
3 | 17.350 | 16.701 |
4 | 17.273 | 16.674 |
5 | 16.817 | 16.025 |
6 | 16.674 | 15.710 |
7 | 16.529 | 15.381 |
8 | 16.529 | 15.420 |
9 | 16.522 | 15.667 |
10 | 16.517 | 16.088 |
Similar to the validation set approach, but the validation set contains only one observation
For each observation \((x_i, y_i), \; i=1,2,\dots,n\), use the remaining \(n-1\) observations as the training set, then predict \(\hat y_i\) and calculate \(MSE_i=(y_i-\hat y_i)^2\)
The average of these \(n\) test MSEs is the LOOCV estimate
\(CV_{(n)} = \frac{1}{n}\sum^n_{i=1}MSE_i\)
The LOOCV estimate for the linear regression \(mpg_i=\beta_0+\beta_1 \cdot horsepower_i+\varepsilon_i\) is 24.232
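In R, the LOOCV estimate can be computed with cv.glm() from the boot package (a glm() fit with the default Gaussian family is identical to lm() here); a minimal sketch:

```r
library(ISLR)
library(boot)

glm.fit <- glm(mpg ~ horsepower, data = Auto)  # same fit as lm()
cv.err  <- cv.glm(Auto, glm.fit)               # K defaults to n, i.e. LOOCV
cv.err$delta[1]                                # LOOCV estimate, approx. 24.23
```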
The LOOCV approach is very computationally expensive for large data sets.
An alternative approach is to use k-Fold Cross Validation
Randomly split the data into k groups (folds) of approximately equal size
Use the first fold as a validation set, fit the model using the remaining data, and calculate \(MSE_1\)
Repeat the above step for the remaining \(k-1\) folds and calculate \(MSE_j, \; j=2,\dots,k\)
Thus, the k-fold Cross Validation estimate is
\(CV_{(k)} = \frac{1}{k}\sum^k_{j=1}MSE_j\)
LOOCV is a special case of the k-fold approach where k is set to the number of observations
Five to ten folds produce good estimates of the test error rate
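The same cv.glm() function performs k-fold CV through its K argument; a sketch of 10-fold CV across polynomial degrees (seed illustrative):

```r
library(ISLR)
library(boot)
set.seed(17)                                   # illustrative seed

cv.error <- rep(0, 10)
for (d in 1:10) {
  glm.fit     <- glm(mpg ~ poly(horsepower, d), data = Auto)
  cv.error[d] <- cv.glm(Auto, glm.fit, K = 10)$delta[1]
}
cv.error                                       # 10-fold CV error by degree
```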
Bootstrapping is a very powerful statistical tool to quantify the uncertainty associated with a given estimator
It involves repeatedly drawing samples from the data set with replacement
An example of a data set with three observations (n=3) and bootstrap samples drawn from it is shown in the figure below:
Each bootstrap data set \(Z^{*1}, Z^{*2}, \dots, Z^{*B}\) contains three observations (n=3) and is used to compute the estimated statistic we are interested in, say \(\hat \alpha^*\)
All B bootstrap estimates \(\hat\alpha^{*1}, \hat\alpha^{*2}, \dots, \hat\alpha^{*B}\) are then used to compute the standard error of \(\hat\alpha\):
\(SE_B(\hat\alpha) = \sqrt{\frac{1}{B-1}\sum^B_{r=1}\bigg(\hat\alpha^{*r}-\frac{1}{B}\sum^B_{r'=1}\hat\alpha^{*r'}\bigg)^2}\)
“Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns of \(X\) and \(Y\) , respectively, where \(X\) and \(Y\) are random quantities. We will invest a fraction \(\alpha\) of our money in \(X\), and will invest the remaining \(1-\alpha\) in \(Y\)”
Since variance is used as a risk measure, we want to minimize \(Var(\alpha X+(1-\alpha)Y)\); the value of \(\alpha\) that minimizes this risk is estimated by:
\(\hat\alpha = \frac{\hat\sigma^2_Y - \hat\sigma_{XY}}{\hat\sigma^2_X +\hat\sigma^2_Y-2\hat\sigma_{XY}}\)
The estimate of \(\hat \alpha\) for the Portfolio data of the ISLR package is 0.5758
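A sketch of this estimator written as an R function in the form boot() expects (the data plus an index vector of resampled rows); the function name is illustrative:

```r
library(ISLR)               # provides the Portfolio data (columns X and Y)

# alpha-hat computed on the rows selected by 'index'
alpha.fn <- function(data, index) {
  X <- data$X[index]
  Y <- data$Y[index]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}

alpha.fn(Portfolio, 1:100)  # full-sample estimate: 0.5758
```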
The estimated \(\hat \alpha\) for 10 bootstrapped samples using the Portfolio data of the ISLR package is shown below.
Bootstrap | Alpha |
---|---|
1 | 0.4483 |
2 | 0.5609 |
3 | 0.5053 |
4 | 0.6836 |
5 | 0.6108 |
6 | 0.5820 |
7 | 0.5013 |
8 | 0.5379 |
9 | 0.6151 |
10 | 0.5374 |
The average of the estimated \(\hat \alpha\) values is 0.5583
Let’s obtain the distribution of \(\hat \alpha\) using 1000 bootstrap replications
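A sketch of the boot() call that produces output like the one below (seed illustrative; alpha.fn as defined earlier):

```r
library(boot)
set.seed(1)                                       # illustrative seed
boot(Portfolio, statistic = alpha.fn, R = 1000)   # 1000 bootstrap replications
```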
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Portfolio, statistic = statistic, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 0.5758321 0.004719558 0.09020046
Let’s use the wage and education data that comes with the AER package
The data contains the \(wage\), \(education\), \(experience\), and \(ethnicity\) variables that we are going to use in this example
The basic model is:
\(ln(wage)=\beta_0 + \beta_1 \: experience+ \beta_2 \: experience^2 + \beta_3 \: education + \beta_4 \: ethnicity + \varepsilon\)
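A sketch of the corresponding fit, assuming the CPS1988 data shipped with AER (which matches the output below):

```r
library(AER)
data("CPS1988")

fit <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity,
          data = CPS1988)
summary(fit)
```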
Call:
lm(formula = log(wage) ~ experience + I(experience^2) + education +
ethnicity, data = CPS1988)
Residuals:
Min 1Q Median 3Q Max
-2.9428 -0.3162 0.0580 0.3756 4.3830
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.321e+00 1.917e-02 225.38 <2e-16 ***
experience 7.747e-02 8.800e-04 88.03 <2e-16 ***
I(experience^2) -1.316e-03 1.899e-05 -69.31 <2e-16 ***
education 8.567e-02 1.272e-03 67.34 <2e-16 ***
ethnicityafam -2.434e-01 1.292e-02 -18.84 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5839 on 28150 degrees of freedom
Multiple R-squared: 0.3347, Adjusted R-squared: 0.3346
F-statistic: 3541 on 4 and 28150 DF, p-value: < 2.2e-16
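The first table of intervals below is consistent with conventional large-sample 95% confidence intervals; a minimal sketch, assuming they come from confint():

```r
confint(fit, level = 0.95)   # conventional 95% confidence intervals
```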
Variable | 2.5% | 97.5% |
---|---|---|
(Intercept) | 4.28381 | 4.35898 |
experience | 0.07575 | 0.07920 |
I(experience^2) | -0.00135 | -0.00128 |
education | 0.08318 | 0.08817 |
ethnicityafam | -0.26868 | -0.21804 |
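The slightly wider intervals in the next table are consistent with bootstrap confidence intervals; a hedged sketch of one way such intervals could be produced with boot() (function and object names illustrative; refitting the model 1000 times on roughly 28,000 rows takes a while):

```r
library(boot)
set.seed(1)                               # illustrative seed

# statistic: refit the model on a bootstrap resample of the rows
coef.fn <- function(data, index) {
  coef(lm(log(wage) ~ experience + I(experience^2) + education + ethnicity,
          data = data[index, ]))
}

boot.out <- boot(CPS1988, coef.fn, R = 1000)

# percentile 95% interval for each coefficient
t(sapply(seq_along(boot.out$t0), function(j)
  boot.ci(boot.out, type = "perc", index = j)$percent[4:5]))
```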
Variable | 2.5% | 97.5% |
---|---|---|
(Intercept) | 4.28110 | 4.36191 |
experience | 0.07548 | 0.07948 |
I(experience^2) | -0.00136 | -0.00127 |
education | 0.08297 | 0.08836 |
ethnicityafam | -0.26951 | -0.21739 |