Regularization (Shrinkage)

I. Ozkan

Spring 2025

Preliminary Readings

Learning Objectives

Polynomial Regression

\(f(x) \approx f(x_0) + \frac{f'(x_0)}{1!} (x-x_0) + \frac{f''(x_0)}{2!} (x-x_0)^2 + \cdots + \frac{f^{(k)}(x_0)}{k!} (x-x_0)^k + R\)

Where,

\(f'(x_0)=\frac{\mathrm d}{\mathrm d x} \big( f(x) \big)|_{x=x_0}, \; f''(x_0)=\frac{\mathrm d^2}{\mathrm d x^2} \big( f(x) \big)|_{x=x_0}\)

and

\(f^{(k)}(x_0)=\frac{\mathrm d^k}{\mathrm dx^k} \big( f(x) \big)|_{x=x_0}\)

\(f(x,y) \approx f(x_0,y_0)+\frac{\partial f(x_0,y_0)}{\partial x}(x-x_0)+\frac{\partial f(x_0,y_0)}{\partial y}(y-y_0)\)

\(\quad+\frac{1}{2} \bigg(\frac{\partial^2 f(x_0,y_0)}{\partial x^2}(x-x_0)^2+\frac{\partial^2 f(x_0, y_0)}{\partial y^2}(y-y_0)^2+2\frac{\partial^2 f(x_0, y_0)}{\partial x\partial y}(x-x_0)(y-y_0)\bigg) + R\)

A Simple Example

The correlation between income and score, \(Cor(income, score) \approx 0.712\), indicates that districts with higher average income tend to achieve higher test scores.

Linear Model

\[Score_i = \beta_0 + \beta_1 \times income_i + \varepsilon_i\]

Dependent variable:
score
income 1.879***
(0.091)
Constant 625.384***
(1.532)
Observations 420
R2 0.508
Adjusted R2 0.506
Residual Std. Error 13.387 (df = 418)
F Statistic 430.830*** (df = 1; 418)
Note: *p<0.1; **p<0.05; ***p<0.01




\[\widehat{Score}_i = \underset{(1.53)}{625.38} + \underset{(0.09)}{1.88} \times income_i\]

A Simple Example: Quadratic Model

\[Score_i = \beta_0 + \beta_1 \times income_i + \beta_2 \times income_i^2 + \varepsilon_i\]

Dependent variable:
score
income 3.851***
(0.304)
I(income2) -0.042***
(0.006)
Constant 607.302***
(3.046)
Observations 420
R2 0.556
Adjusted R2 0.554
Residual Std. Error 12.724 (df = 417)
F Statistic 261.278*** (df = 2; 417)
Note: *p<0.1; **p<0.05; ***p<0.01



- The fitted values in the figure show that the quadratic model tracks the observed scores more closely than the linear fit


\(\widehat{Score}_i = \underset{(3.046)}{607.302} + \underset{(0.304)}{3.851} \times income_i - \underset{(0.006)}{0.042} \times income_i^2\)
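
A minimal sketch of how these two fits can be produced in R with lm(). The 420 districts and the score/income figures are consistent with the CASchools data from the AER package, but this is an assumption; any data frame with columns score and income would do, and the name schools is used here only for illustration.

# Sketch: linear and quadratic fits with lm().
# Assumes a data frame `schools` with columns `score` and `income`;
# the CASchools data (AER package, 420 districts) has this structure once
# the test score is built as the average of reading and math scores.
library(AER)
data("CASchools")
schools <- CASchools
schools$score <- (schools$read + schools$math) / 2

lin_fit  <- lm(score ~ income, data = schools)
quad_fit <- lm(score ~ income + I(income^2), data = schools)
summary(quad_fit)   # compare with the coefficient table above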






Polynomial Regression

\(y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_k x_i^k + \varepsilon_i\)

Dependent variable:
score
income 5.019***
(0.859)
I(income2) -0.096**
(0.037)
I(income3) 0.001
(0.0005)
Constant 600.079***
(5.830)
Observations 420
R2 0.558
Adjusted R2 0.555
Residual Std. Error 12.707 (df = 416)
F Statistic 175.352*** (df = 3; 416)
Note: *p<0.1; **p<0.05; ***p<0.01


The estimated model:

\(\widehat{Score}_i = \underset{(5.83)}{600.079} + \underset{(0.859)}{5.019} \times income_i - \underset{(0.037)}{0.096} \times income_i^2 + \underset{(4.7\times 10^{-4})}{6.9\times 10^{-4}} \times income_i^3\)








Coefficients of Cubic Model and Model Selection


\(H_0: \beta_2=0, \ \beta_3=0 \ \ \ \text{vs.} \ \ \ H_1: \text{at least one} \ \beta_j\neq0, \ j=2,3\)

Analysis of Variance Table

Model 1: score ~ income
Model 2: score ~ income + I(income^2)
Model 3: score ~ income + I(income^2) + I(income^3)
  Res.Df   RSS Df Sum of Sq       F    Pr(>F)    
1    418 74905                                   
2    417 67510  1    7394.9 45.7985 4.469e-11 ***
3    416 67170  1     340.6  2.1096    0.1471    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
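
The ANOVA table above can be reproduced with anova() on the nested fits; a sketch assuming the same schools data frame as in the earlier lm() example.

# Sequential F tests for the nested polynomial models (beta_2 = 0, then beta_3 = 0).
lin_fit   <- lm(score ~ income, data = schools)
quad_fit  <- lm(score ~ income + I(income^2), data = schools)
cubic_fit <- lm(score ~ income + I(income^2) + I(income^3), data = schools)
anova(lin_fit, quad_fit, cubic_fit)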

Polynomial Model (degree > 3)

Dependent variable:
score
Degree 4 Degree 5 Degree 10
income 9.876*** 17.824*** -271.411
(2.354) (5.613) (324.454)
I(income2) -0.434*** -1.205** 74.358
(0.157) (0.519) (77.211)
I(income3) 0.010** 0.043** -10.733
(0.004) (0.022) (10.260)
I(income4) -0.0001** -0.001* 0.935
(0.00004) (0.0004) (0.845)
I(income5) 0.00000 -0.052
(0.00000) (0.045)
I(income6) 0.002
(0.002)
I(income7) -0.00004
(0.00004)
I(income8) 0.00000
(0.00000)
I(income9) -0.000
(0.000)
I(income10) 0.000
(0.000)
Constant 577.340*** 548.143*** 1,003.863*
(11.793) (22.119) (577.253)
Observations 420 420 420
R2 0.564 0.566 0.569
Adjusted R2 0.559 0.561 0.558
Residual Std. Error 12.648 (df = 415) 12.626 (df = 414) 12.666 (df = 409)
F Statistic 133.975*** (df = 4; 415) 108.036*** (df = 5; 414) 53.908*** (df = 10; 409)
Note: *p<0.1; **p<0.05; ***p<0.01

Polynomial Model (Higher Degrees)

Correlation: Polynomial Degree 4
1 2 3 4
1 1
2 0.96 1
3 0.87 0.97 1
4 0.78 0.92 0.98 1
Correlation: Polynomial Degree 5
1 2 3 4 5
1 1
2 0.96 1
3 0.87 0.97 1
4 0.78 0.92 0.98 1
5 0.7 0.85 0.95 0.99 1
Correlation: Polynomial Degree 10
1 2 3 4 5 6 7 8 9 10
1 1
2 0.96 1
3 0.87 0.97 1
4 0.78 0.92 0.98 1
5 0.7 0.85 0.95 0.99 1
6 0.63 0.8 0.9 0.97 0.99 1
7 0.58 0.74 0.86 0.94 0.98 1 1
8 0.53 0.7 0.82 0.91 0.96 0.98 1 1
9 0.5 0.66 0.79 0.88 0.94 0.97 0.99 1 1
10 0.47 0.63 0.76 0.85 0.92 0.96 0.98 0.99 1 1

Orthogonal Polynomial Model (degree > 3)

Dependent variable:
score
Degree 4 Degree 5 Degree 10
poly(income, 4)1 277.857***
(12.648)
poly(income, 4)2 -85.993***
(12.648)
poly(income, 4)3 18.456
(12.648)
poly(income, 4)4 -28.013**
(12.648)
poly(income, 5)1 277.857***
(12.626)
poly(income, 5)2 -85.993***
(12.626)
poly(income, 5)3 18.456
(12.626)
poly(income, 5)4 -28.013**
(12.626)
poly(income, 5)5 19.686
(12.626)
poly(income, 10)1 277.857***
(12.666)
poly(income, 10)2 -85.993***
(12.666)
poly(income, 10)3 18.456
(12.666)
poly(income, 10)4 -28.013**
(12.666)
poly(income, 10)5 19.686
(12.666)
poly(income, 10)6 -9.054
(12.666)
poly(income, 10)7 3.863
(12.666)
poly(income, 10)8 -5.500
(12.666)
poly(income, 10)9 1.712
(12.666)
poly(income, 10)10 15.736
(12.666)
Constant 654.157*** 654.157*** 654.157***
(0.617) (0.616) (0.618)
Observations 420 420 420
R2 0.564 0.566 0.569
Adjusted R2 0.559 0.561 0.558
Residual Std. Error 12.648 (df = 415) 12.626 (df = 414) 12.666 (df = 409)
F Statistic 133.975*** (df = 4; 415) 108.036*** (df = 5; 414) 53.908*** (df = 10; 409)
Note: *p<0.1; **p<0.05; ***p<0.01
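
The near-perfect correlations among the raw powers of income shown earlier are exactly what poly() removes: its columns are orthogonal by construction, which is why the estimates above do not change as the degree grows. A short sketch, again assuming the schools data frame from the earlier example:

# Raw polynomial terms are strongly collinear ...
raw_terms <- sapply(1:4, function(k) schools$income^k)
round(cor(raw_terms), 2)            # compare with the correlation matrices above

# ... while the orthogonal basis produced by poly() is uncorrelated.
orth_terms <- poly(schools$income, 4)
round(cor(orth_terms), 2)           # essentially the identity matrix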

Model Selection

\[Y=f(X)+\varepsilon=\beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_pX_p+\varepsilon\]

is used to describe the relationship between \(Y\) and \(X_1,X_2,\dots,X_p\), and is typically fitted by least squares (LS).

If the number of observations is much larger than the number of predictors (\(n \gg p\)), the least squares estimates generally have low variance. When this is not the case, the variance of \(\hat f(\cdot)\) increases (overfitting) and prediction performance deteriorates. If \(p>n\), the variance becomes infinite and least squares cannot be used at all. By shrinking (constraining) the coefficients, the variance can be reduced at the cost of some additional bias.

Including irrelevant variables adds unnecessary complexity, which can be avoided by removing them, i.e., setting their coefficients to zero. Feature selection (also called variable or input selection) is the process of excluding such irrelevant variables.

Feature Selection for LS

Subset Selection

Best Subset Selection

A least squares regression is fitted for every possible combination of the \(p\) predictors, and the best model is then selected using a performance measure (cross-validation error, AIC, BIC, or adjusted \(R^2_{adj}\)). This requires considering \(2^p\) candidate models, which quickly becomes computationally demanding (even infeasible) for larger values of \(p\).
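
A sketch of best subset selection using the leaps package; the data frame dat, with response y and the candidate predictors, is hypothetical and only illustrates the call.

# Best subset selection: fits all 2^p subsets and summarizes each model size.
# `dat` is a hypothetical data frame with response `y` and candidate predictors.
library(leaps)
best_fit <- regsubsets(y ~ ., data = dat, nvmax = ncol(dat) - 1)
best_sum <- summary(best_fit)
which.max(best_sum$adjr2)   # size of the model with the largest adjusted R^2
which.min(best_sum$bic)     # size of the model with the smallest BIC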

Stepwise Selection

Best subset selection becomes computationally infeasible once the number of predictors exceeds roughly 40. Moreover, the model that fits the training data best may perform poorly on test data: the enormous search space increases the chance of selecting a model with high variance.

Forward Stepwise Selection

The total number of models fitted, \(1+\sum_{k=0}^{p-1}(p-k)=1+ p(p+1)/2\), is substantially smaller than the \(2^p\) required by best subset selection; see the sketch below.
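
For \(p=10\) this is 56 fits instead of 1024. Forward stepwise selection is available in leaps via method = "forward"; the same hypothetical dat as above is assumed.

p <- 10
1 + p * (p + 1) / 2       # 56 models fitted by forward stepwise selection
2^p                       # 1024 models fitted by best subset selection

fwd_fit <- regsubsets(y ~ ., data = dat, nvmax = p, method = "forward")
summary(fwd_fit)$outmat   # which predictor enters at each step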

Backward Stepwise Selection

Hybrid Approaches

Example

Regularization

Ridge regression shrinks the least squares coefficients toward zero. The ridge estimator has the closed form

\[\hat\beta=(X^TX+\lambda I_d)^{-1}X^TY\]

where \(I_d\) is the \(d \times d\) identity matrix and \(\lambda \ge 0\) is the penalty (tuning) parameter,

or, equivalently, ridge regression minimizes the penalized residual sum of squares

\[\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_jx_{ij}\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2\]
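
A minimal sketch of the closed-form estimate above, with predictors centered and scaled so that the intercept can be left unpenalized (standard practice); the function name ridge_beta is only illustrative.

# Closed-form ridge estimate: beta_hat = (X'X + lambda I)^(-1) X'y,
# computed on centered and scaled predictors; the intercept comes from the means.
ridge_beta <- function(X, y, lambda) {
  Xs <- scale(X)                 # center and scale each predictor
  yc <- y - mean(y)              # center the response
  b  <- solve(t(Xs) %*% Xs + lambda * diag(ncol(Xs)), t(Xs) %*% yc)
  c(intercept = mean(y), b)
}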

Ridge Regression Example (Polynomial Regression)

Let's start with an example of polynomial regression of degree 10 with regularization (ridge regression), using the mtcars data in R:

\(mpg=\beta_0+\beta_1hp+\beta_2hp^2+\cdots+\beta_{10}hp^{10}+\varepsilon\)

Let's first obtain the OLS estimates \(\hat \beta=\{\hat \beta_0,\dots,\hat \beta_{10}\}\) using an orthogonal polynomial basis:


Call:
lm(formula = mpg ~ poly(hp, 10), data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.2029 -1.3077 -0.4330  0.8111  9.4244 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     20.0906     0.5809  34.588  < 2e-16 ***
poly(hp, 10)1  -26.0456     3.2858  -7.927 9.54e-08 ***
poly(hp, 10)2   13.1546     3.2858   4.003 0.000644 ***
poly(hp, 10)3   -2.2419     3.2858  -0.682 0.502508    
poly(hp, 10)4    0.2957     3.2858   0.090 0.929152    
poly(hp, 10)5   -1.3394     3.2858  -0.408 0.687669    
poly(hp, 10)6   -4.7101     3.2858  -1.433 0.166448    
poly(hp, 10)7    0.4862     3.2858   0.148 0.883769    
poly(hp, 10)8    2.6894     3.2858   0.818 0.422273    
poly(hp, 10)9    3.0125     3.2858   0.917 0.369637    
poly(hp, 10)10  -1.5065     3.2858  -0.458 0.651318    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.286 on 21 degrees of freedom
Multiple R-squared:  0.7987,    Adjusted R-squared:  0.7028 
F-statistic:  8.33 on 10 and 21 DF,  p-value: 2.562e-05
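
A sketch of the ridge fit with glmnet, where alpha = 0 selects the ridge penalty. The lambda grid below is an assumption; the tables that follow correspond to evaluating the fitted path at particular lambda values.

# Ridge regression on the degree-10 orthogonal polynomial basis of hp.
library(glmnet)
X <- poly(mtcars$hp, 10)
y <- mtcars$mpg
ridge_fit <- glmnet(X, y, alpha = 0, lambda = 10^seq(3, -2, length.out = 100))
coef(ridge_fit, s = 125.893)   # heavy shrinkage: coefficients pulled toward zero
coef(ridge_fit, s = 0.013)     # light shrinkage: close to the OLS estimates above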


Ridge Regression Example: Estimation



Fitted Coefficients
Selected Lambdas
Lambda: 125.893 Lambda: 0.013
beta-0 20.091 20.091
beta-1 -1.172 -25.990
beta-2 0.592 13.127
beta-3 -0.101 -2.237
beta-4 0.013 0.295
beta-5 -0.060 -1.337
beta-6 -0.212 -4.700
beta-7 0.022 0.485
beta-8 0.121 2.684
beta-9 0.136 3.006
beta-10 -0.068 -1.503

Ridge Regression Example: Prediction

Fitted Coefficients
Selected Lambdas
Lambda: 1000 Lambda: 3.162 Lambda: 0.01
beta-0 20.091 20.091 20.091
beta-1 -0.154 -16.989 -26.002
beta-2 0.078 8.580 13.132
beta-3 -0.013 -1.462 -2.238
beta-4 0.002 0.193 0.295
beta-5 -0.008 -0.874 -1.337
beta-6 -0.028 -3.072 -4.702
beta-7 0.003 0.317 0.485
beta-8 0.016 1.754 2.685
beta-9 0.018 1.965 3.007
beta-10 -0.009 -0.983 -1.504

Ridge Regression Example: Tuning

Fitted Coefficients Using the Optimum Lambda

R-squared: 0.748
Lambda: 1.9953
beta-0 20.091
beta-1 -19.490
beta-2 9.844
beta-3 -1.678
beta-4 0.221
beta-5 -1.002
beta-6 -3.525
beta-7 0.364
beta-8 2.012
beta-9 2.254
beta-10 -1.127
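
The tuned lambda reported above can be obtained by cross-validation with cv.glmnet; the exact optimum depends on the random fold assignment, so the seed below is only illustrative.

# Cross-validated choice of lambda for the ridge fit (alpha = 0).
library(glmnet)
X <- poly(mtcars$hp, 10)
y <- mtcars$mpg
set.seed(1)                          # fold assignment is random
cv_ridge <- cv.glmnet(X, y, alpha = 0)
cv_ridge$lambda.min                  # lambda minimizing the CV error
coef(cv_ridge, s = "lambda.min")     # coefficients at the selected lambda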






Lasso Regression

Recall that the ridge regression loss function is

\(L(\beta)=(Y-X\beta)^T(Y-X\beta)+\lambda\, \beta^T \beta\)

where \(\lambda\) is the penalty parameter. We can generalize the penalty term so that

\(L(\beta)=\sum_{i=1}^{n} \big(y_i-\beta^T \phi(x_i)\big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q\)

If \(q=2\), this reduces to ridge regression (the \(\ell_2\)-norm penalty). If \(q=1\) (\(\|\beta\|_1\), the \(\ell_1\) norm), the method is called the least absolute shrinkage and selection operator, or lasso regression.
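
A sketch of the lasso fit with glmnet on the same polynomial basis, where alpha = 1 selects the \(\ell_1\) penalty; evaluating the coefficients at different lambdas reproduces the pattern in the tables that follow, with many coefficients set exactly to zero.

# Lasso on the degree-10 polynomial basis: alpha = 1 gives the l1 penalty.
library(glmnet)
X <- poly(mtcars$hp, 10)
y <- mtcars$mpg
lasso_fit <- glmnet(X, y, alpha = 1)
coef(lasso_fit, s = 3.162)   # strong penalty: most coefficients exactly zero
coef(lasso_fit, s = 0.01)    # weak penalty: close to the OLS estimates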

Variable Selection Property of the Lasso

Coefficients and Constraints

Lasso Regression Example: Estimation



Lasso Fitted Coefficients
Selected Lambdas
Lambda: 125.893 Lambda: 0.013
beta-0 20.091 20.091
beta-1 0.000 -25.974
beta-2 0.000 13.083
beta-3 0.000 -2.171
beta-4 0.000 0.224
beta-5 0.000 -1.268
beta-6 0.000 -4.639
beta-7 0.000 0.415
beta-8 0.000 2.618
beta-9 0.000 2.941
beta-10 0.000 -1.435

Lasso Regression Example: Prediction

Lasso Fitted Coefficients
Selected Lambda Values
Lambda: 1000 Lambda: 3.162 Lambda: 0.01
beta-0 20.091 20.091 20.091
beta-1 0.000 -8.157 -25.989
beta-2 0.000 0.000 13.098
beta-3 0.000 0.000 -2.185
beta-4 0.000 0.000 0.239
beta-5 0.000 0.000 -1.283
beta-6 0.000 0.000 -4.653
beta-7 0.000 0.000 0.430
beta-8 0.000 0.000 2.633
beta-9 0.000 0.000 2.956
beta-10 0.000 0.000 -1.450

Lasso Regression: Tuning Example

Fitted Coefficients Using the Optimum Lambda

R-squared: 0.748
Lambda: 0.631
beta-0 20.091
beta-1 -22.476
beta-2 9.585
beta-3 0.000
beta-4 0.000
beta-5 0.000
beta-6 -1.141
beta-7 0.000
beta-8 0.000
beta-9 0.000
beta-10 0.000
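
As with ridge, the tuned lasso lambda comes from cross-validation; a sketch with cv.glmnet and alpha = 1 (the selected value varies with the random fold assignment).

# Cross-validated lambda for the lasso fit.
library(glmnet)
X <- poly(mtcars$hp, 10)
y <- mtcars$mpg
set.seed(1)
cv_lasso <- cv.glmnet(X, y, alpha = 1)
cv_lasso$lambda.min                  # tuned penalty
coef(cv_lasso, s = "lambda.min")     # sparse coefficient vector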






Lasso Regression

Here are some observations.

Elastic Net

The elastic net combines the ridge and lasso penalties:

\(\min_\beta \; \frac{1}{2n} \sum_{i=1}^{n} \big(y_i-\beta^T \phi(x_i)\big)^2 + \lambda_2 \sum_{j=1}^{p} |\beta_j| + \lambda_1 \sum_{j=1}^{p} \beta_j^2\)

If we set \(\alpha=\frac{\lambda_2}{\lambda_1+\lambda_2}\) and \(\lambda=\lambda_1+\lambda_2\), then the objective becomes

\(\min_\beta \; \frac{1}{2n} \sum_{i=1}^{n} \big(y_i-\beta^T \phi(x_i)\big)^2 + \lambda \left((1-\alpha) \sum_{j=1}^{p} \beta_j^2+ \alpha \sum_{j=1}^{p} |\beta_j| \right)\)

In practice (as in glmnet) the quadratic term is scaled by \(1/2\), giving

\(\min_\beta \; \frac{1}{2n} \sum_{i=1}^{n} \big(y_i-\beta^T \phi(x_i)\big)^2 + \lambda \left(\frac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^2+ \alpha \sum_{j=1}^{p} |\beta_j| \right)\)
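
In glmnet this objective is selected through the alpha argument (alpha = 0 is ridge, alpha = 1 is the lasso, values in between give the elastic net). A small grid search over alpha combined with cv.glmnet is one way to obtain a tuned (lambda, alpha) pair like the one reported below; the grid and seed here are only illustrative.

# Elastic net: grid search over the mixing parameter alpha, lambda chosen by CV.
library(glmnet)
X <- poly(mtcars$hp, 10)
y <- mtcars$mpg
set.seed(1)
alphas  <- seq(0, 1, by = 0.1)
cv_fits <- lapply(alphas, function(a) cv.glmnet(X, y, alpha = a))
cv_err  <- sapply(cv_fits, function(cv) min(cv$cvm))   # best CV error per alpha
best    <- which.min(cv_err)
alphas[best]                                 # selected mixing parameter
coef(cv_fits[[best]], s = "lambda.min")      # coefficients at the tuned pair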

Elastic Net Example: Tuning

Fitted Coefficients, Tuned Model (R2: 0.76)

Elastic Net: lambda = 0.917, alpha = 0.694
beta-0 20.091
beta-1 -21.430
beta-2 9.123
beta-3 0.000
beta-4 0.000
beta-5 0.000
beta-6 -1.060
beta-7 0.000
beta-8 0.000
beta-9 0.000
beta-10 0.000

Predictions from All Models