I. Ozkan
Spring 2025
Polynomial Regression
Model Selection
Shrinkage (Regularization)
A linear approximation can successfully approximate many nonlinear relationships
If a nonlinear function is differentiable \(k\) times, then it can be approximated by a polynomial of order \(k\)
The Taylor series expansion of a univariate function \(f\) around \(x_0\) is given by:
\(f(x) \approx f(x_0) + \frac{f'(x_0)}{1!} (x-x_0) + \frac{f''(x_0)}{2!} (x-x_0)^2 + \cdots + \frac{f^{(k)}(x_0)}{k!} (x-x_0)^k + R\)
Where,
\(f'(x_0)=\frac{\mathrm d}{\mathrm d x} \big( f(x) \big)|_{x=x_0}, \; f''(x_0)=\frac{\mathrm d^2}{\mathrm d x^2} \big( f(x) \big)|_{x=x_0}\)
and
\(f^{(k)}(x_0)=\frac{\mathrm d^k}{\mathrm dx^k} \big( f(x) \big)|_{x=x_0}\)
For a bivariate function \(f(x,y)\), the expansion around \((x_0,y_0)\) is:
\(f(x,y) \approx f(x_0,y_0)+\frac{\partial f(x_0,y_0)}{\partial x}(x-x_0)+\frac{\partial f(x_0,y_0)}{\partial y}(y-y_0) +\frac{1}{2} \bigg(\frac{\partial^2 f(x_0,y_0)}{\partial x^2}(x-x_0)^2+\frac{\partial^2 f(x_0, y_0)}{\partial y^2}(y-y_0)^2+2\frac{\partial^2 f(x_0, y_0)}{\partial x\partial y}(x-x_0)(y-y_0)\bigg) + R\)
This example is from the book Introduction to Econometrics with R, Section 8
California Test Score Data: test performance, school characteristics and student demographic backgrounds for school districts in California. The AER package contains the CASchools data; type ?CASchools for more details
Adding two more variables:
- size: students per teacher
- score: average of the reading and math scores

The correlation between income and score, \(Cor(income, score) \approx 0.712\), indicates that districts with higher income tend to achieve higher test scores
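A minimal sketch of this setup in R (assuming the AER package is installed; the variable construction follows the description above):

```r
# Load the data and construct the two derived variables described above
library(AER)
data("CASchools")

CASchools$size  <- CASchools$students / CASchools$teachers  # students per teacher
CASchools$score <- (CASchools$read + CASchools$math) / 2    # average of read and math

cor(CASchools$income, CASchools$score)  # approximately 0.712
```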
\[Score_i = \beta_0 + \beta_1 \times income_i + \varepsilon_i\]
| | Dependent variable: score |
|---|---|
| income | 1.879*** (0.091) |
| Constant | 625.384*** (1.532) |
| Observations | 420 |
| R2 | 0.508 |
| Adjusted R2 | 0.506 |
| Residual Std. Error | 13.387 (df = 418) |
| F Statistic | 430.830*** (df = 1; 418) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
\[\widehat{Score}_i = \underset{(1.53)}{625.38} + \underset{(0.09)}{1.88} \times income_i\]
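The fit reported above can be reproduced with a call like the following (a sketch, assuming CASchools prepared as in the earlier snippet; standard errors may differ slightly if robust errors were used in the table):

```r
# Simple linear regression of score on district income
linear_model <- lm(score ~ income, data = CASchools)
summary(linear_model)
```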
\[Score_i = \beta_0 + \beta_1 \times income_i + \beta_2 \times income_i^2 + \varepsilon_i\]
| | Dependent variable: score |
|---|---|
| income | 3.851*** (0.304) |
| I(income^2) | -0.042*** (0.006) |
| Constant | 607.302*** (3.046) |
| Observations | 420 |
| R2 | 0.556 |
| Adjusted R2 | 0.554 |
| Residual Std. Error | 12.724 (df = 417) |
| F Statistic | 261.278*** (df = 2; 417) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
- The fitted values in the figure show that the quadratic specification captures the income-score relationship better than the linear fit
\(\widehat{Score}_i = \underset{(3.046)}{607.302} + \underset{(0.304)}{3.851} \times income_i \underset{(0.006)}{-0.042} \times income_i^2\)
The anova() function in R can be used to test whether the second (quadratic) model is a significant improvement over the linear one. Alternatively, it is easy to test \(H_0:\beta_2=0\) with \(t=\frac{\hat \beta_2-0}{SE(\hat \beta_2)}=-6.7585\); the null is rejected at common significance levels (see the sketch below).
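A sketch of the quadratic fit and these two tests (again assuming CASchools as before):

```r
# Quadratic (degree-2) polynomial regression and tests on the squared term
linear_model    <- lm(score ~ income, data = CASchools)
quadratic_model <- lm(score ~ income + I(income^2), data = CASchools)

summary(quadratic_model)              # t value for I(income^2) is about -6.76
anova(linear_model, quadratic_model)  # F-test of the quadratic against the linear model
```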
The general polynomial regression model of degree \(k\) is \(y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_k x_i^k + \varepsilon_i\)
| | Dependent variable: score |
|---|---|
| income | 5.019*** (0.859) |
| I(income^2) | -0.096** (0.037) |
| I(income^3) | 0.001 (0.0005) |
| Constant | 600.079*** (5.830) |
| Observations | 420 |
| R2 | 0.558 |
| Adjusted R2 | 0.555 |
| Residual Std. Error | 12.707 (df = 416) |
| F Statistic | 175.352*** (df = 3; 416) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
 The estimated model:
\(\widehat{Score}_i = \underset{(5.83)}{600.079} + \underset{(0.859)}{5.019} \times income \underset{(0.037)}{-0.096} \times income^2 + \underset{(4.7\times 10^{-4})}{6.9\times 10^{-4}} \times income^3\)
\(H_0: \beta_2=0, \ \beta_3=0 \ \ \ \text{vs.} \ \ \ H_1: \text{at least one} \ \beta_j\neq0, \ j=2,3\)
One may use an F-test (in this case a hypothesis matrix of restrictions is used)
The joint F-test does not tell us whether the quadratic model performs better or worse than the cubic model. The F-test may be applied sequentially, or an anova() comparison of the nested models can be performed, as in the sketch and output below
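A sketch of this comparison; the anova() call produces output like the table below, and linearHypothesis() from the car package is one way to carry out the joint F-test with a restriction matrix:

```r
# Fit the three nested models and compare them sequentially
fit1 <- lm(score ~ income, data = CASchools)
fit2 <- lm(score ~ income + I(income^2), data = CASchools)
fit3 <- lm(score ~ income + I(income^2) + I(income^3), data = CASchools)

anova(fit1, fit2, fit3)  # sequential F-tests

# Joint test H0: beta_2 = 0 and beta_3 = 0 in the cubic model
library(car)
linearHypothesis(fit3, c("I(income^2)=0", "I(income^3)=0"))
```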
Analysis of Variance Table
Model 1: score ~ income
Model 2: score ~ income + I(income^2)
Model 3: score ~ income + I(income^2) + I(income^3)
  Res.Df   RSS Df Sum of Sq       F    Pr(>F)    
1    418 74905                                   
2    417 67510  1    7394.9 45.7985 4.469e-11 ***
3    416 67170  1     340.6  2.1096    0.1471    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
| Dependent variable: score | Degree 4 | Degree 5 | Degree 10 |
|---|---|---|---|
| income | 9.876*** (2.354) | 17.824*** (5.613) | -271.411 (324.454) |
| I(income^2) | -0.434*** (0.157) | -1.205** (0.519) | 74.358 (77.211) |
| I(income^3) | 0.010** (0.004) | 0.043** (0.022) | -10.733 (10.260) |
| I(income^4) | -0.0001** (0.00004) | -0.001* (0.0004) | 0.935 (0.845) |
| I(income^5) | | 0.00000 (0.00000) | -0.052 (0.045) |
| I(income^6) | | | 0.002 (0.002) |
| I(income^7) | | | -0.00004 (0.00004) |
| I(income^8) | | | 0.00000 (0.00000) |
| I(income^9) | | | -0.000 (0.000) |
| I(income^10) | | | 0.000 (0.000) |
| Constant | 577.340*** (11.793) | 548.143*** (22.119) | 1,003.863* (577.253) |
| Observations | 420 | 420 | 420 |
| R2 | 0.564 | 0.566 | 0.569 |
| Adjusted R2 | 0.559 | 0.561 | 0.558 |
| Residual Std. Error | 12.648 (df = 415) | 12.626 (df = 414) | 12.666 (df = 409) |
| F Statistic | 133.975*** (df = 4; 415) | 108.036*** (df = 5; 414) | 53.908*** (df = 10; 409) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | | |
Polynomial models of higher degree may be cumbersome
The regressors on the right-hand side of the equation (the powers of income) are highly correlated, as the correlation matrices below for degrees 4, 5 and 10 show
| | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | 1 | |||
| 2 | 0.96 | 1 | ||
| 3 | 0.87 | 0.97 | 1 | |
| 4 | 0.78 | 0.92 | 0.98 | 1 | 
| | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 1 | ||||
| 2 | 0.96 | 1 | |||
| 3 | 0.87 | 0.97 | 1 | ||
| 4 | 0.78 | 0.92 | 0.98 | 1 | |
| 5 | 0.7 | 0.85 | 0.95 | 0.99 | 1 | 
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | |||||||||
| 2 | 0.96 | 1 | ||||||||
| 3 | 0.87 | 0.97 | 1 | |||||||
| 4 | 0.78 | 0.92 | 0.98 | 1 | ||||||
| 5 | 0.7 | 0.85 | 0.95 | 0.99 | 1 | |||||
| 6 | 0.63 | 0.8 | 0.9 | 0.97 | 0.99 | 1 | ||||
| 7 | 0.58 | 0.74 | 0.86 | 0.94 | 0.98 | 1 | 1 | |||
| 8 | 0.53 | 0.7 | 0.82 | 0.91 | 0.96 | 0.98 | 1 | 1 | ||
| 9 | 0.5 | 0.66 | 0.79 | 0.88 | 0.94 | 0.97 | 0.99 | 1 | 1 | |
| 10 | 0.47 | 0.63 | 0.76 | 0.85 | 0.92 | 0.96 | 0.98 | 0.99 | 1 | 1 | 
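A short sketch showing where these numbers come from (assuming CASchools as before): the correlation matrix of the raw powers of income (degree 4 shown; degrees 5 and 10 are analogous):

```r
# Correlation matrix of the raw polynomial terms income^1, ..., income^4
X_pow <- sapply(1:4, function(k) CASchools$income^k)
colnames(X_pow) <- paste0("income^", 1:4)
round(cor(X_pow), 2)
```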
Fits with orthogonal polynomials, poly(income, d), for degrees d = 4, 5 and 10 (dependent variable: score):

| | Degree 4 | Degree 5 | Degree 10 |
|---|---|---|---|
| poly(income, d)1 | 277.857*** (12.648) | 277.857*** (12.626) | 277.857*** (12.666) |
| poly(income, d)2 | -85.993*** (12.648) | -85.993*** (12.626) | -85.993*** (12.666) |
| poly(income, d)3 | 18.456 (12.648) | 18.456 (12.626) | 18.456 (12.666) |
| poly(income, d)4 | -28.013** (12.648) | -28.013** (12.626) | -28.013** (12.666) |
| poly(income, d)5 | | 19.686 (12.626) | 19.686 (12.666) |
| poly(income, d)6 | | | -9.054 (12.666) |
| poly(income, d)7 | | | 3.863 (12.666) |
| poly(income, d)8 | | | -5.500 (12.666) |
| poly(income, d)9 | | | 1.712 (12.666) |
| poly(income, d)10 | | | 15.736 (12.666) |
| Constant | 654.157*** (0.617) | 654.157*** (0.616) | 654.157*** (0.618) |
| Observations | 420 | 420 | 420 |
| R2 | 0.564 | 0.566 | 0.569 |
| Adjusted R2 | 0.559 | 0.561 | 0.558 |
| Residual Std. Error | 12.648 (df = 415) | 12.626 (df = 414) | 12.666 (df = 409) |
| F Statistic | 133.975*** (df = 4; 415) | 108.036*** (df = 5; 414) | 53.908*** (df = 10; 409) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | | |
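Because poly() builds an orthogonal basis by default, the lower-order estimates in the table do not change as the degree grows. A minimal sketch of these fits (assuming CASchools as before):

```r
# Degree-4 fit with an orthogonal polynomial basis
ortho4 <- lm(score ~ poly(income, 4), data = CASchools)
summary(ortho4)

# The basis columns are essentially uncorrelated, unlike the raw powers
round(cor(poly(CASchools$income, 4)), 2)
```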
\[Y=f(X)+\varepsilon=\beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_pX_p+\varepsilon\]
is used to describe the relationship between \(Y\) and \(X_1,X_2,\dots,X_p\), and it is typically fit by least squares (LS).
If the number of observations \(n \gg p\), the least squares estimates generally have low variance. If this is not the case, the variance of \(\hat f()\) increases (due to overfitting) and prediction performance deteriorates. If \(p>n\), the variance becomes infinite and the method cannot be used at all. By shrinking (constraining) the coefficients, the variance can be reduced at the cost of increasing bias.
Including irrelevant variables in the model adds complexity that can be avoided by removing them, i.e., setting their coefficients to zero. Feature selection (also called variable or input selection) helps exclude irrelevant variables.
Subset Selection: Identify a subset of the \(p\) predictors that are related to \(Y\) and then fit the model
Shrinkage: Fit a model involving all predictors so that the coefficients are shrunken towards zero. Shrinkage (known as regularization) reduces the variance of \(\hat f()\)
Dimension Reduction: Project all \(p\) predictors into M-dimensional space where \(M<p\) by computing M different linear combinations (or projections)
Best Subset Selection
For each combination of the \(p\) predictors, a least squares regression is fitted, and the best model is then selected based on some performance measure (cross-validation error, AIC, BIC, adjusted \(R^2\)). There are \(2^p\) candidate models to consider, which makes this a time-consuming (even infeasible) task for larger values of \(p\).
Stepwise Selection
Best subset selection becomes computationally difficult once the number of predictors exceeds roughly 40. Moreover, selecting the model that performs best on the training data may result in poor performance on the test data: the large search space increases the chance of ending up with a high-variance model.
Forward Stepwise Selection
Start with a model with no predictors (\(M_0\))
Add predictors to the model one at a time until all the predictors are in the model. At each step, the variable that gives the maximum additional improvement to the fit is added
The total number of models fitted, \(1+\sum_{k=0}^{p-1}(p-k)=1+ p(p+1)/2\), is substantially smaller than the best subset selection
Backward Stepwise Selection
Begin with full least squares model containing all \(p\) predictors
Remove the least useful predictor one at a time at each iteration
Hybrid Approaches
Forward and Backward stepwise selection result in similar but not identical models
Predictors are added to the model sequentially like forward stepwise selection
At each step, after a new predictor is added, any predictor that no longer improves the fit is removed
Credit Card Balance Data (from ISLR2 package)
Let’s apply best subset selection using the leaps package (see the sketch below). The figure shows four selection criteria plotted against the number of variables, from top-left to bottom-right: i) adjusted \(R^2\), ii) \(BIC\), iii) \(RSS\) and iv) Mallow’s \(C_p\) (an unbiased estimate of the test MSE), given by \(C_p=\frac{1}{n}(RSS+2d\hat\sigma^2)\), where \(d\) is the number of predictors
Mallow’s \(C_p\) indicates 6 predictors: Income, Limit, Rating, Cards, Age, StudentYes
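A sketch of this exercise (assuming the ISLR2 and leaps packages are installed; the Cp-optimal size reported above should be reproducible, but may depend on the data version):

```r
# Best subset selection on the Credit data
library(ISLR2)
library(leaps)

best_fit <- regsubsets(Balance ~ ., data = Credit, nvmax = 11)
best_sum <- summary(best_fit)

best_sum$cp             # Mallow's Cp for each model size
which.min(best_sum$cp)  # expected to point to the 6-variable model
coef(best_fit, 6)       # Income, Limit, Rating, Cards, Age, StudentYes
```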
Recall the LSE \(\hat\beta=(X^TX)^{-1}X^TY\)
The matrix inverse may lead to problems if the equations are poorly conditioned (nearly collinear predictors)
Regularization is a technique that modifies the objective by adding a penalty term \(\lambda\) to the diagonal of \(X^TX\):
\[\hat\beta=(X^TX+\lambda I_d)^{-1}X^TY\]
where \(I_d\) is the identity matrix.
or, equivalently, the following objective function is minimized:
\[\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_jx_{ij}\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2\]
This is called the ridge regression estimate (it can be derived by setting the derivative of the loss function with respect to \(\beta\) to zero)
This is also called a shrinkage method
The additional penalty term (the regularizer, or shrinkage penalty) controls excessive fluctuation of the fitted function
If \(\lambda\) is very large, minimizing the loss function drives the coefficients toward zero: \(\hat \beta \to 0\) as \(\lambda \to \infty\)
If \(\lambda\) is very small, the minimizer approaches the ordinary least squares (maximum likelihood) estimate: \(\hat \beta \to \hat \beta^{OLS}\) as \(\lambda \to 0\)
Selecting a good value for \(\lambda\) is critical
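As a toy illustration of the closed form above, here is a minimal sketch in base R (predictors standardized and the response centred so the intercept is left unpenalized; the mtcars variables are chosen only for demonstration):

```r
# Ridge estimate via the closed form (X'X + lambda * I)^(-1) X'y
ridge_beta <- function(X, y, lambda) {
  Xs <- scale(X)     # standardize predictors (important for ridge)
  yc <- y - mean(y)  # center the response so no intercept is needed
  drop(solve(t(Xs) %*% Xs + lambda * diag(ncol(Xs)), t(Xs) %*% yc))
}

X <- as.matrix(mtcars[, c("hp", "wt", "disp")])
cbind(lambda_0   = ridge_beta(X, mtcars$mpg, 0),    # OLS solution (no shrinkage)
      lambda_100 = ridge_beta(X, mtcars$mpg, 100))  # coefficients shrink toward zero
```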
Let’s start with an example: polynomial regression of degree 10 with regularization (ridge regression), using the mtcars data in R
\(mpg=\beta_0+\beta_1hp+\beta_2hp^2+\cdots+\beta_{10}hp^{10}+\varepsilon\)
Let’s first obtain the OLS estimates \(\hat \beta=\{\hat \beta_0,...,\hat \beta_{10}\}\) (using an orthogonal polynomial basis)
Call:
lm(formula = mpg ~ poly(hp, 10), data = mtcars)
Residuals:
    Min      1Q  Median      3Q     Max 
-6.2029 -1.3077 -0.4330  0.8111  9.4244 
Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     20.0906     0.5809  34.588  < 2e-16 ***
poly(hp, 10)1  -26.0456     3.2858  -7.927 9.54e-08 ***
poly(hp, 10)2   13.1546     3.2858   4.003 0.000644 ***
poly(hp, 10)3   -2.2419     3.2858  -0.682 0.502508    
poly(hp, 10)4    0.2957     3.2858   0.090 0.929152    
poly(hp, 10)5   -1.3394     3.2858  -0.408 0.687669    
poly(hp, 10)6   -4.7101     3.2858  -1.433 0.166448    
poly(hp, 10)7    0.4862     3.2858   0.148 0.883769    
poly(hp, 10)8    2.6894     3.2858   0.818 0.422273    
poly(hp, 10)9    3.0125     3.2858   0.917 0.369637    
poly(hp, 10)10  -1.5065     3.2858  -0.458 0.651318    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.286 on 21 degrees of freedom
Multiple R-squared:  0.7987,    Adjusted R-squared:  0.7028 
F-statistic:  8.33 on 10 and 21 DF,  p-value: 2.562e-05
It seems that only the first three estimates are significant
The following graph shows the fit of this function.
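The ridge results summarized below can be reproduced approximately with glmnet (a sketch; the lambda grid, scaling defaults and cross-validation folds may differ from the values used in the tables):

```r
# Ridge regression (alpha = 0) on the degree-10 polynomial in hp
library(glmnet)

X <- poly(mtcars$hp, 10)  # same orthogonal basis as in the OLS fit above
y <- mtcars$mpg

ridge_path <- glmnet(X, y, alpha = 0)     # full coefficient path over lambda
cv_ridge   <- cv.glmnet(X, y, alpha = 0)  # cross-validation to pick lambda
coef(cv_ridge, s = "lambda.min")          # coefficients at the selected lambda
```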
Fitted coefficients (ridge) for selected lambdas:

| | Lambda: 125.893 | Lambda: 0.013 |
|---|---|---|
| beta-0 | 20.091 | 20.091 | 
| beta-1 | -1.172 | -25.990 | 
| beta-2 | 0.592 | 13.127 | 
| beta-3 | -0.101 | -2.237 | 
| beta-4 | 0.013 | 0.295 | 
| beta-5 | -0.060 | -1.337 | 
| beta-6 | -0.212 | -4.700 | 
| beta-7 | 0.022 | 0.485 | 
| beta-8 | 0.121 | 2.684 | 
| beta-9 | 0.136 | 3.006 | 
| beta-10 | -0.068 | -1.503 | 
Fitted coefficients (ridge) for selected lambdas:

| | Lambda: 1000 | Lambda: 3.162 | Lambda: 0.01 |
|---|---|---|---|
| beta-0 | 20.091 | 20.091 | 20.091 | 
| beta-1 | -0.154 | -16.989 | -26.002 | 
| beta-2 | 0.078 | 8.580 | 13.132 | 
| beta-3 | -0.013 | -1.462 | -2.238 | 
| beta-4 | 0.002 | 0.193 | 0.295 | 
| beta-5 | -0.008 | -0.874 | -1.337 | 
| beta-6 | -0.028 | -3.072 | -4.702 | 
| beta-7 | 0.003 | 0.317 | 0.485 | 
| beta-8 | 0.016 | 1.754 | 2.685 | 
| beta-9 | 0.018 | 1.965 | 3.007 | 
| beta-10 | -0.009 | -0.983 | -1.504 | 
Fitted coefficients (ridge) using the optimum lambda, R-squared: 0.748:

| | Lambda: 1.9953 |
|---|---|
| beta-0 | 20.091 | 
| beta-1 | -19.490 | 
| beta-2 | 9.844 | 
| beta-3 | -1.678 | 
| beta-4 | 0.221 | 
| beta-5 | -1.002 | 
| beta-6 | -3.525 | 
| beta-7 | 0.364 | 
| beta-8 | 2.012 | 
| beta-9 | 2.254 | 
| beta-10 | -1.127 | 
Let \(\| \beta \|_2=\sqrt{\sum_{j=1}^{p} \beta_j^2}\), called the \(\ell_2\) norm; as \(\lambda\) increases, \(\|\hat\beta\|_2\) always decreases
Coefficients obtained by least squares estimation (OLS, \(\lambda=0\)) are scale equivariant: if a predictor is multiplied by a constant \(c\), its coefficient estimate is scaled by \(1/c\). Ridge regression coefficient estimates, by contrast, can change substantially when predictors are rescaled. Hence, standardizing all predictors (scaling by their standard deviations) before estimation is usually the best pre-processing step
As \(\lambda\) increases, the variance of \(\hat f()\) decreases but the bias increases; a slight increase in bias may therefore buy a substantial decrease in variance
Recall that the ridge regression loss function is
\(L(\beta)=(Y-X\beta)^T(Y-X\beta)+\lambda \, \beta^T \beta\)
where \(\lambda\) is the penalty parameter. We can generalize the loss function to
\(L(\beta)=(Y-X\beta)^T(Y-X\beta)+\lambda \sum_{j=1}^{p} |\beta_j|^q =\sum_{i=1}^{n} \big(y_i-\beta^T \phi(x_i)\big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q\)
where \(q=2\) gives ridge regression (the \(\ell_2\) norm penalty) and \(q=1\) (the \(\ell_1\) norm, \(\|\beta\|_1\)) gives the least absolute shrinkage and selection operator, lasso regression.
Shaded areas shaped as a diamond (\(|\beta_1|+|\beta_2| \le s\), where \(s\) depends on \(\lambda\)) and a circle (\(\beta_1^2+\beta_2^2 \le s\)) show the constraint regions for lasso and ridge regression, respectively. The contours are contours of the \(RSS\), and \(\hat \beta\) marks the least squares estimate
Since the lasso constraint region is a diamond (a polyhedron for three coefficients, a polytope for more), the optimum often lies at one of its corners, where one or more of the coefficients are exactly zero
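A matching sketch for the lasso (alpha = 1 in glmnet); note the exact zeros in the output, unlike ridge:

```r
# Lasso regression on the same degree-10 polynomial design
library(glmnet)

X <- poly(mtcars$hp, 10)
y <- mtcars$mpg

cv_lasso <- cv.glmnet(X, y, alpha = 1)  # alpha = 1 selects the lasso penalty
coef(cv_lasso, s = "lambda.min")        # several coefficients are exactly zero
```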
Lasso fitted coefficients for selected lambdas:

| | Lambda: 125.893 | Lambda: 0.013 |
|---|---|---|
| beta-0 | 20.091 | 20.091 | 
| beta-1 | 0.000 | -25.974 | 
| beta-2 | 0.000 | 13.083 | 
| beta-3 | 0.000 | -2.171 | 
| beta-4 | 0.000 | 0.224 | 
| beta-5 | 0.000 | -1.268 | 
| beta-6 | 0.000 | -4.639 | 
| beta-7 | 0.000 | 0.415 | 
| beta-8 | 0.000 | 2.618 | 
| beta-9 | 0.000 | 2.941 | 
| beta-10 | 0.000 | -1.435 | 
Lasso fitted coefficients for selected lambda values:

| | Lambda: 1000 | Lambda: 3.162 | Lambda: 0.01 |
|---|---|---|---|
| beta-0 | 20.091 | 20.091 | 20.091 | 
| beta-1 | 0.000 | -8.157 | -25.989 | 
| beta-2 | 0.000 | 0.000 | 13.098 | 
| beta-3 | 0.000 | 0.000 | -2.185 | 
| beta-4 | 0.000 | 0.000 | 0.239 | 
| beta-5 | 0.000 | 0.000 | -1.283 | 
| beta-6 | 0.000 | 0.000 | -4.653 | 
| beta-7 | 0.000 | 0.000 | 0.430 | 
| beta-8 | 0.000 | 0.000 | 2.633 | 
| beta-9 | 0.000 | 0.000 | 2.956 | 
| beta-10 | 0.000 | 0.000 | -1.450 | 
Lasso fitted coefficients using the optimum lambda, R-squared: 0.748:

| | Lambda: 0.631 |
|---|---|
| beta-0 | 20.091 | 
| beta-1 | -22.476 | 
| beta-2 | 9.585 | 
| beta-3 | 0.000 | 
| beta-4 | 0.000 | 
| beta-5 | 0.000 | 
| beta-6 | -1.141 | 
| beta-7 | 0.000 | 
| beta-8 | 0.000 | 
| beta-9 | 0.000 | 
| beta-10 | 0.000 | 
Here are some observations.
Lasso sets some of the coefficients exactly to zero, so it can be used for feature selection (choosing the independent variables that matter); the final model may contain only a subset of the predictors.
Both lasso and ridge handle correlated predictors, but differently: in ridge regression, correlated predictors receive similar coefficients, whereas in lasso one of them gets a large coefficient while the others are set to (almost) zero.
If most of the predictors have a significant impact on the dependent variable, ridge regression tends to perform better. If only a small number of predictors significantly affect the dependent variable, lasso generally performs better. In practice it is always good to perform cross-validation and compare all regression results.
The next question, then, is: “is it possible to use both lasso and ridge penalties together?”
\(\min \; \frac{1}{2n} \sum_{i=1}^{n} \big(y_i-\beta^T \phi(x_i)\big)^2 + \lambda_2 \sum_{j=1}^{p} |\beta_j| + \lambda_1 \sum_{j=1}^{p} \beta_j^2\)
If we set \(\alpha=\frac{\lambda_2}{\lambda_1+\lambda_2}\) and \(\lambda=\lambda_1+\lambda_2\), then
\(\min \; \frac{1}{2n} \sum_{i=1}^{n} \big(y_i-\beta^T \phi(x_i)\big)^2 + \lambda \left((1-\alpha) \sum_{j=1}^{p} \beta_j^2+ \alpha \sum_{j=1}^{p} |\beta_j| \right)\)
When \(\alpha=0\) this corresponds to ridge regression, and to lasso when \(\alpha=1\).
For the interpolation between the \(\ell_1\) and \(\ell_2\) norms, the following minimization is performed in general:
\(\min \; \frac{1}{2n} \sum_{i=1}^{n} \big(y_i-\beta^T \phi(x_i)\big)^2 + \lambda \left(\frac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^2+ \alpha \sum_{j=1}^{p} |\beta_j| \right)\)
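A sketch of tuning both lambda and alpha by cross-validation with glmnet, using a simple grid over alpha with fixed folds; the tuned values reported in the table below (lambda = 0.917, alpha = 0.694) need not be reproduced exactly:

```r
# Elastic net: grid search over alpha, lambda chosen by CV within each alpha
library(glmnet)

X <- poly(mtcars$hp, 10)
y <- mtcars$mpg

set.seed(1)
foldid <- sample(rep(1:10, length.out = nrow(X)))  # fixed folds for a fair comparison

alphas <- seq(0, 1, by = 0.1)
cv_err <- sapply(alphas, function(a)
  min(cv.glmnet(X, y, alpha = a, foldid = foldid)$cvm))

best_alpha <- alphas[which.min(cv_err)]
enet_cv    <- cv.glmnet(X, y, alpha = best_alpha, foldid = foldid)
coef(enet_cv, s = "lambda.min")
```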
Elastic net fitted coefficients, tuned by cross-validation (R2: 0.76; lambda = 0.917, alpha = 0.694):

| | Elastic_Net |
|---|---|
| beta-0 | 20.091 | 
| beta-1 | -21.430 | 
| beta-2 | 9.123 | 
| beta-3 | 0.000 | 
| beta-4 | 0.000 | 
| beta-5 | 0.000 | 
| beta-6 | -1.060 | 
| beta-7 | 0.000 | 
| beta-8 | 0.000 | 
| beta-9 | 0.000 | 
| beta-10 | 0.000 |