Regularization (Shrinkage)

I. Ozkan

Spring 2025

Preliminary Readings

Learning Objectives

Polynomial Regression

\(f(x) \approx f(x_0) + \frac{f'(x_0)}{1!} (x-x_0) + \frac{f''(x_0)}{2!} (x-x_0)^2 + \cdots + \frac{f^{(k)}(x_0)}{k!} (x-x_0)^k + R\)

Where,

\(f'(x_0)=\frac{\mathrm d}{\mathrm d x} \big( f(x) \big)|_{x=x_0}, \; f''(x_0)=\frac{\mathrm d^2}{\mathrm d x^2} \big( f(x) \big)|_{x=x_0}\)

and

\(f^{(k)}(x_0)=\frac{\mathrm d^k}{\mathrm dx^k} \big( f(x) \big)|_{x=x_0}\)

\(f(x,y) \approx f(x_0,y_0)+\frac{\partial f(x_0,y_0)}{\partial x}(x-x_0)+\frac{\partial f(x_0,y_0)}{\partial y}(y-y_0)\)

\(\quad+\frac{1}{2} \bigg(\frac{\partial^2 f(x_0,y_0)}{\partial x^2}(x-x_0)^2+\frac{\partial^2 f(x_0, y_0)}{\partial y^2}(y-y_0)^2+2\frac{\partial^2 f(x_0, y_0)}{\partial x\partial y}(x-x_0)(y-y_0)\bigg) + R\)

A Simple Example

The correlation between income and score, \(Cor(income, score) \approx 0.712\), indicates that districts with higher average income tend to achieve higher test scores.

Linear Model

\[Score_i = \beta_0 + \beta_1 \times income_i + \varepsilon_i\]

Dependent variable:
score
income 1.879***
(0.091)
Constant 625.384***
(1.532)
Observations 420
R2 0.508
Adjusted R2 0.506
Residual Std. Error 13.387 (df = 418)
F Statistic 430.830*** (df = 1; 418)
Note: *p<0.1; **p<0.05; ***p<0.01




\[\widehat{Score}_i = \underset{(1.53)}{625.38} + \underset{(0.09)}{1.88} \times income_i\]

A Simple Example: Quadratic Model

\[Score_i = \beta_0 + \beta_1 \times income_i + \beta_2 \times income_i^2 + \varepsilon_i\]

Dependent variable:
score
income 3.851***
(0.304)
I(income2) -0.042***
(0.006)
Constant 607.302***
(3.046)
Observations 420
R2 0.556
Adjusted R2 0.554
Residual Std. Error 12.724 (df = 417)
F Statistic 261.278*** (df = 2; 417)
Note: *p<0.1; **p<0.05; ***p<0.01



- The fitted values in the figure show that the quadratic model tracks the observed scores more closely than the linear fit


\(\widehat{Score}_i = \underset{(3.046)}{607.302} + \underset{(0.304)}{3.851} \times income_i - \underset{(0.006)}{0.042} \times income_i^2\)
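
A minimal sketch of how these two fits can be produced in R with lm(). The 420 districts and the score/income figures are consistent with the CASchools data from the AER package, but this is an assumption; any data frame with columns score and income would do, and the name schools is used here only for illustration.

# Sketch: linear and quadratic fits with lm().
# Assumes a data frame `schools` with columns `score` and `income`;
# the CASchools data (AER package, 420 districts) has this structure once
# the test score is built as the average of reading and math scores.
library(AER)
data("CASchools")
schools <- CASchools
schools$score <- (schools$read + schools$math) / 2

lin_fit  <- lm(score ~ income, data = schools)
quad_fit <- lm(score ~ income + I(income^2), data = schools)
summary(quad_fit)   # compare with the coefficient table above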






Polynomial Regression

\(y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_k x_i^k + \varepsilon_i\)

Dependent variable:
score
income 5.019***
(0.859)
I(income2) -0.096**
(0.037)
I(income3) 0.001
(0.0005)
Constant 600.079***
(5.830)
Observations 420
R2 0.558
Adjusted R2 0.555
Residual Std. Error 12.707 (df = 416)
F Statistic 175.352*** (df = 3; 416)
Note: *p<0.1; **p<0.05; ***p<0.01


The estimated model:

\(\widehat{Score}_i = \underset{(5.83)}{600.079} + \underset{(0.859)}{5.019} \times income_i - \underset{(0.037)}{0.096} \times income_i^2 + \underset{(4.7\times 10^{-4})}{6.9\times 10^{-4}} \times income_i^3\)








Coefficients of Cubic Model and Model Selection


\(H_0: \beta_2=0, \ \beta_3=0 \ \ \ \text{vs.} \ \ \ H_1: \text{at least one} \ \beta_j\neq0, \ j=2,3\)

Analysis of Variance Table

Model 1: score ~ income
Model 2: score ~ income + I(income^2)
Model 3: score ~ income + I(income^2) + I(income^3)
  Res.Df   RSS Df Sum of Sq       F    Pr(>F)    
1    418 74905                                   
2    417 67510  1    7394.9 45.7985 4.469e-11 ***
3    416 67170  1     340.6  2.1096    0.1471    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
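
The ANOVA table above can be reproduced with anova() on the nested fits; a sketch assuming the same schools data frame as in the earlier lm() example.

# Sequential F tests for the nested polynomial models (beta_2 = 0, then beta_3 = 0).
lin_fit   <- lm(score ~ income, data = schools)
quad_fit  <- lm(score ~ income + I(income^2), data = schools)
cubic_fit <- lm(score ~ income + I(income^2) + I(income^3), data = schools)
anova(lin_fit, quad_fit, cubic_fit)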

Polynomial Model (degree > 3)

Dependent variable:
score
Degree 4 Degree 5 Degree 10
income 9.876*** 17.824*** -271.411
(2.354) (5.613) (324.454)
I(income2) -0.434*** -1.205** 74.358
(0.157) (0.519) (77.211)
I(income3) 0.010** 0.043** -10.733
(0.004) (0.022) (10.260)
I(income4) -0.0001** -0.001* 0.935
(0.00004) (0.0004) (0.845)
I(income5) 0.00000 -0.052
(0.00000) (0.045)
I(income6) 0.002
(0.002)
I(income7) -0.00004
(0.00004)
I(income8) 0.00000
(0.00000)
I(income9) -0.000
(0.000)
I(income10) 0.000
(0.000)
Constant 577.340*** 548.143*** 1,003.863*
(11.793) (22.119) (577.253)
Observations 420 420 420
R2 0.564 0.566 0.569
Adjusted R2 0.559 0.561 0.558
Residual Std. Error 12.648 (df = 415) 12.626 (df = 414) 12.666 (df = 409)
F Statistic 133.975*** (df = 4; 415) 108.036*** (df = 5; 414) 53.908*** (df = 10; 409)
Note: *p<0.1; **p<0.05; ***p<0.01

Polynomial Model (Higher Degrees)

Correlation: Polynomial Degree 4
1 2 3 4
1 1
2 0.96 1
3 0.87 0.97 1
4 0.78 0.92 0.98 1
Correlation: Polynomial Degree 5
1 2 3 4 5
1 1
2 0.96 1
3 0.87 0.97 1
4 0.78 0.92 0.98 1
5 0.7 0.85 0.95 0.99 1
Correlation: Polynomial Degree 10
1 2 3 4 5 6 7 8 9 10
1 1
2 0.96 1
3 0.87 0.97 1
4 0.78 0.92 0.98 1
5 0.7 0.85 0.95 0.99 1
6 0.63 0.8 0.9 0.97 0.99 1
7 0.58 0.74 0.86 0.94 0.98 1 1
8 0.53 0.7 0.82 0.91 0.96 0.98 1 1
9 0.5 0.66 0.79 0.88 0.94 0.97 0.99 1 1
10 0.47 0.63 0.76 0.85 0.92 0.96 0.98 0.99 1 1

Orthogonal Polynomial Model (degree > 3)

Dependent variable:
score
Degree 4 Degree 5 Degree 10
poly(income, 4)1 277.857***
(12.648)
poly(income, 4)2 -85.993***
(12.648)
poly(income, 4)3 18.456
(12.648)
poly(income, 4)4 -28.013**
(12.648)
poly(income, 5)1 277.857***
(12.626)
poly(income, 5)2 -85.993***
(12.626)
poly(income, 5)3 18.456
(12.626)
poly(income, 5)4 -28.013**
(12.626)
poly(income, 5)5 19.686
(12.626)
poly(income, 10)1 277.857***
(12.666)
poly(income, 10)2 -85.993***
(12.666)
poly(income, 10)3 18.456
(12.666)
poly(income, 10)4 -28.013**
(12.666)
poly(income, 10)5 19.686
(12.666)
poly(income, 10)6 -9.054
(12.666)
poly(income, 10)7 3.863
(12.666)
poly(income, 10)8 -5.500
(12.666)
poly(income, 10)9 1.712
(12.666)
poly(income, 10)10 15.736
(12.666)
Constant 654.157*** 654.157*** 654.157***
(0.617) (0.616) (0.618)
Observations 420 420 420
R2 0.564 0.566 0.569
Adjusted R2 0.559 0.561 0.558
Residual Std. Error 12.648 (df = 415) 12.626 (df = 414) 12.666 (df = 409)
F Statistic 133.975*** (df = 4; 415) 108.036*** (df = 5; 414) 53.908*** (df = 10; 409)
Note: *p<0.1; **p<0.05; ***p<0.01
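
The near-perfect correlations among the raw powers of income shown earlier are exactly what poly() removes: its columns are orthogonal by construction, which is why the estimates above do not change as the degree grows. A short sketch, again assuming the schools data frame from the earlier example:

# Raw polynomial terms are strongly collinear ...
raw_terms <- sapply(1:4, function(k) schools$income^k)
round(cor(raw_terms), 2)            # compare with the correlation matrices above

# ... while the orthogonal basis produced by poly() is uncorrelated.
orth_terms <- poly(schools$income, 4)
round(cor(orth_terms), 2)           # essentially the identity matrix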

Model Selection

\[Y=f(X)+\varepsilon=\beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_pX_p+\varepsilon\]

is used to describe the relationship between \(Y\) and \(X_1,X_2,\dots,X_p\), and is typically fitted by least squares (LS).

If the number of observations is much larger than the number of predictors (\(n \gg p\)), the least squares estimates generally have low variance. When this is not the case, the variance of \(\hat f(\cdot)\) increases (overfitting) and prediction performance deteriorates. If \(p>n\), the variance becomes infinite and least squares cannot be used at all. By shrinking (constraining) the coefficients, the variance can be reduced at the cost of some additional bias.

Including irrelevant variables adds unnecessary complexity, which can be avoided by removing them, i.e., setting their coefficients to zero. Feature selection (also called variable or input selection) is the process of excluding such irrelevant variables.

Feature Selection for LS

Subset Selection

Best Subset Selection

A least squares regression is fitted for every possible combination of the \(p\) predictors, and the best model is then selected using a performance measure (cross-validation error, AIC, BIC, or adjusted \(R^2_{adj}\)). This requires considering \(2^p\) candidate models, which quickly becomes computationally demanding (even infeasible) for larger values of \(p\).
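
A sketch of best subset selection using the leaps package; the data frame dat, with response y and the candidate predictors, is hypothetical and only illustrates the call.

# Best subset selection: fits all 2^p subsets and summarizes each model size.
# `dat` is a hypothetical data frame with response `y` and candidate predictors.
library(leaps)
best_fit <- regsubsets(y ~ ., data = dat, nvmax = ncol(dat) - 1)
best_sum <- summary(best_fit)
which.max(best_sum$adjr2)   # size of the model with the largest adjusted R^2
which.min(best_sum$bic)     # size of the model with the smallest BIC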

Stepwise Selection

Best subset selection becomes computationally infeasible once the number of predictors exceeds roughly 40. Moreover, the model that fits the training data best may perform poorly on test data: the enormous search space increases the chance of selecting a model with high variance.

Forward Stepwise Selection

The total number of models fitted, \(1+\sum_{k=0}^{p-1}(p-k)=1+ p(p+1)/2\), is substantially smaller than the \(2^p\) required by best subset selection; see the sketch below.
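
For \(p=10\) this is 56 fits instead of 1024. Forward stepwise selection is available in leaps via method = "forward"; the same hypothetical dat as above is assumed.

p <- 10
1 + p * (p + 1) / 2       # 56 models fitted by forward stepwise selection
2^p                       # 1024 models fitted by best subset selection

fwd_fit <- regsubsets(y ~ ., data = dat, nvmax = p, method = "forward")
summary(fwd_fit)$outmat   # which predictor enters at each step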

Backward Stepwise Selection

Hybrid Approaches

Example

Regularization

Ridge regression shrinks the least squares coefficients toward zero. The ridge estimator has the closed form

\[\hat\beta=(X^TX+\lambda I_d)^{-1}X^TY\]

where \(I_d\) is the \(d \times d\) identity matrix and \(\lambda \ge 0\) is the penalty (tuning) parameter,

or, equivalently, ridge regression minimizes the penalized residual sum of squares

\[\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_jx_{ij}\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2\]
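
A minimal sketch of the closed-form estimate above, with predictors centered and scaled so that the intercept can be left unpenalized (standard practice); the function name ridge_beta is only illustrative.

# Closed-form ridge estimate: beta_hat = (X'X + lambda I)^(-1) X'y,
# computed on centered and scaled predictors; the intercept comes from the means.
ridge_beta <- function(X, y, lambda) {
  Xs <- scale(X)                 # center and scale each predictor
  yc <- y - mean(y)              # center the response
  b  <- solve(t(Xs) %*% Xs + lambda * diag(ncol(Xs)), t(Xs) %*% yc)
  c(intercept = mean(y), b)
}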

Ridge Regression Example (Polynomial Regression)

Let's start with an example of polynomial regression of degree 10 with regularization (ridge regression), using the mtcars data in R:

\(mpg=\beta_0+\beta_1hp+\beta_2hp^2+\cdots+\beta_{10}hp^{10}+\varepsilon\)

Let's first obtain the OLS estimates \(\hat \beta=\{\hat \beta_0,\dots,\hat \beta_{10}\}\) using an orthogonal polynomial basis:


Call:
lm(formula = mpg ~ poly(hp, 10), data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.2029 -1.3077 -0.4330  0.8111  9.4244 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     20.0906     0.5809  34.588  < 2e-16 ***
poly(hp, 10)1  -26.0456     3.2858  -7.927 9.54e-08 ***
poly(hp, 10)2   13.1546     3.2858   4.003 0.000644 ***
poly(hp, 10)3   -2.2419     3.2858  -0.682 0.502508    
poly(hp, 10)4    0.2957     3.2858   0.090 0.929152    
poly(hp, 10)5   -1.3394     3.2858  -0.408 0.687669    
poly(hp, 10)6   -4.7101     3.2858  -1.433 0.166448    
poly(hp, 10)7    0.4862     3.2858   0.148 0.883769    
poly(hp, 10)8    2.6894     3.2858   0.818 0.422273    
poly(hp, 10)9    3.0125     3.2858   0.917 0.369637    
poly(hp, 10)10  -1.5065     3.2858  -0.458 0.651318    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.286 on 21 degrees of freedom
Multiple R-squared:  0.7987,    Adjusted R-squared:  0.7028 
F-statistic:  8.33 on 10 and 21 DF,  p-value: 2.562e-05
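
A sketch of the ridge fit with glmnet, where alpha = 0 selects the ridge penalty. The lambda grid below is an assumption; the tables that follow correspond to evaluating the fitted path at particular lambda values.

# Ridge regression on the degree-10 orthogonal polynomial basis of hp.
library(glmnet)
X <- poly(mtcars$hp, 10)
y <- mtcars$mpg
ridge_fit <- glmnet(X, y, alpha = 0, lambda = 10^seq(3, -2, length.out = 100))
coef(ridge_fit, s = 125.893)   # heavy shrinkage: coefficients pulled toward zero
coef(ridge_fit, s = 0.013)     # light shrinkage: close to the OLS estimates above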


Ridge Regression Example: Estimation



Fitted Coefficients
Selected Lambdas
Lambda: 125.893 Lambda: 0.013
beta-0 20.091 20.091
beta-1 -1.172 -25.990
beta-2 0.592 13.127
beta-3 -0.101 -2.237
beta-4 0.013 0.295
beta-5 -0.060 -1.337
beta-6 -0.212 -4.700
beta-7 0.022 0.485
beta-8 0.121 2.684
beta-9 0.136 3.006
beta-10 -0.068 -1.503

Ridge Regression Example: Prediction

Fitted Coefficients
Selected Lambdas
Lambda: 1000 Lambda: 3.162 Lambda: 0.01
beta-0 20.091 20.091 20.091
beta-1 -0.154 -16.989 -26.002
beta-2 0.078 8.580 13.132
beta-3 -0.013 -1.462 -2.238
beta-4 0.002 0.193 0.295
beta-5 -0.008 -0.874 -1.337
beta-6 -0.028 -3.072 -4.702
beta-7 0.003 0.317 0.485
beta-8 0.016 1.754 2.685
beta-9 0.018 1.965 3.007
beta-10 -0.009 -0.983 -1.504

Ridge Regression Example: Tuning

Fitted Coefficients Using the Optimum Lambda

R-squared: 0.748
Lambda: 1.9953
beta-0 20.091
beta-1 -19.490
beta-2 9.844
beta-3 -1.678
beta-4 0.221
beta-5 -1.002
beta-6 -3.525
beta-7 0.364
beta-8 2.012
beta-9 2.254
beta-10 -1.127
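
The tuned lambda reported above can be obtained by cross-validation with cv.glmnet; the exact optimum depends on the random fold assignment, so the seed below is only illustrative.

# Cross-validated choice of lambda for the ridge fit (alpha = 0).
library(glmnet)
X <- poly(mtcars$hp, 10)
y <- mtcars$mpg
set.seed(1)                          # fold assignment is random
cv_ridge <- cv.glmnet(X, y, alpha = 0)
cv_ridge$lambda.min                  # lambda minimizing the CV error
coef(cv_ridge, s = "lambda.min")     # coefficients at the selected lambda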






Lasso Regression

Recall that the ridge regression loss function is

\(L(\beta)=(Y-X\beta)^T(Y-X\beta)+\lambda\, \beta^T \beta\)

where \(\lambda\) is the penalty parameter. We can generalize the penalty term so that

\(L(\beta)=\sum_{i=1}^{n} \big(y_i-\beta^T \phi(x_i)\big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q\)

If \(q=2\), this reduces to ridge regression (the \(\ell_2\)-norm penalty). If \(q=1\) (\(\|\beta\|_1\), the \(\ell_1\) norm), the method is called the least absolute shrinkage and selection operator, or lasso regression.
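
A sketch of the lasso fit with glmnet on the same polynomial basis, where alpha = 1 selects the \(\ell_1\) penalty; evaluating the coefficients at different lambdas reproduces the pattern in the tables that follow, with many coefficients set exactly to zero.

# Lasso on the degree-10 polynomial basis: alpha = 1 gives the l1 penalty.
library(glmnet)
X <- poly(mtcars$hp, 10)
y <- mtcars$mpg
lasso_fit <- glmnet(X, y, alpha = 1)
coef(lasso_fit, s = 3.162)   # strong penalty: most coefficients exactly zero
coef(lasso_fit, s = 0.01)    # weak penalty: close to the OLS estimates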

Variable Selection Property of the Lasso

Coefficients and Constraints

Lasso Regression Example: Estimation



Lasso Fitted Coefficients
Selected Lambdas
Lambda: 125.893 Lambda: 0.013
beta-0 20.091 20.091
beta-1 0.000 -25.974
beta-2 0.000 13.083
beta-3 0.000 -2.171
beta-4 0.000 0.224
beta-5 0.000 -1.268
beta-6 0.000 -4.639
beta-7 0.000 0.415
beta-8 0.000 2.618
beta-9 0.000 2.941
beta-10 0.000 -1.435

Lasso Regression Example: Prediction

Lasso Fitted Coefficients
Selected Lambda Values
Lambda: 1000 Lambda: 3.162 Lambda: 0.01
beta-0 20.091 20.091 20.091
beta-1 0.000 -8.157 -25.989
beta-2 0.000 0.000 13.098
beta-3 0.000 0.000 -2.185
beta-4 0.000 0.000 0.239
beta-5 0.000 0.000 -1.283
beta-6 0.000 0.000 -4.653
beta-7 0.000 0.000 0.430
beta-8 0.000 0.000 2.633
beta-9 0.000 0.000 2.956
beta-10 0.000 0.000 -1.450

Lasso Regression: Tuning Example

Fitted Coefficients Using the Optimum Lambda

R-squared: 0.748
Lambda: 0.631
beta-0 20.091
beta-1 -22.476
beta-2 9.585
beta-3 0.000
beta-4 0.000
beta-5 0.000
beta-6 -1.141
beta-7 0.000
beta-8 0.000
beta-9 0.000
beta-10 0.000
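
As with ridge, the tuned lasso lambda comes from cross-validation; a sketch with cv.glmnet and alpha = 1 (the selected value varies with the random fold assignment).

# Cross-validated lambda for the lasso fit.
library(glmnet)
X <- poly(mtcars$hp, 10)
y <- mtcars$mpg
set.seed(1)
cv_lasso <- cv.glmnet(X, y, alpha = 1)
cv_lasso$lambda.min                  # tuned penalty
coef(cv_lasso, s = "lambda.min")     # sparse coefficient vector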






Lasso Regression

Here are some observations.

Elastic Net

The elastic net combines the ridge and lasso penalties:

\(\min_\beta \; \frac{1}{2n} \sum_{i=1}^{n} \big(y_i-\beta^T \phi(x_i)\big)^2 + \lambda_2 \sum_{j=1}^{p} |\beta_j| + \lambda_1 \sum_{j=1}^{p} \beta_j^2\)

If we set \(\alpha=\frac{\lambda_2}{\lambda_1+\lambda_2}\) and \(\lambda=\lambda_1+\lambda_2\), then the objective becomes

\(\min_\beta \; \frac{1}{2n} \sum_{i=1}^{n} \big(y_i-\beta^T \phi(x_i)\big)^2 + \lambda \left((1-\alpha) \sum_{j=1}^{p} \beta_j^2+ \alpha \sum_{j=1}^{p} |\beta_j| \right)\)

In practice (as in glmnet) the quadratic term is scaled by \(1/2\), giving

\(\min_\beta \; \frac{1}{2n} \sum_{i=1}^{n} \big(y_i-\beta^T \phi(x_i)\big)^2 + \lambda \left(\frac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^2+ \alpha \sum_{j=1}^{p} |\beta_j| \right)\)
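
In glmnet this objective is selected through the alpha argument (alpha = 0 is ridge, alpha = 1 is the lasso, values in between give the elastic net). A small grid search over alpha combined with cv.glmnet is one way to obtain a tuned (lambda, alpha) pair like the one reported below; the grid and seed here are only illustrative.

# Elastic net: grid search over the mixing parameter alpha, lambda chosen by CV.
library(glmnet)
X <- poly(mtcars$hp, 10)
y <- mtcars$mpg
set.seed(1)
alphas  <- seq(0, 1, by = 0.1)
cv_fits <- lapply(alphas, function(a) cv.glmnet(X, y, alpha = a))
cv_err  <- sapply(cv_fits, function(cv) min(cv$cvm))   # best CV error per alpha
best    <- which.min(cv_err)
alphas[best]                                 # selected mixing parameter
coef(cv_fits[[best]], s = "lambda.min")      # coefficients at the tuned pair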

Elastic Net Example: Tuning

Fitted Coefficients, Tuned Model (R2: 0.76)

Elastic Net: lambda = 0.917, alpha = 0.694
beta-0 20.091
beta-1 -21.430
beta-2 9.123
beta-3 0.000
beta-4 0.000
beta-5 0.000
beta-6 -1.060
beta-7 0.000
beta-8 0.000
beta-9 0.000
beta-10 0.000

Predictions from All Models