I. Ozkan
Spring 2025
Section 7.4 of Introduction to Statistical Learning
Definition: A spline of degree \(d\) is a function formed by connecting polynomial segments of degree \(d\) so that the function and its derivatives satisfy continuity conditions at the points where the segments join (the knots).
Let the cubic polynomial regression be
\(y_i=\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \varepsilon_i\)
The coefficients \(\beta_i,\; i=0,1,2,3\), are constant over the entire range of \(X\).
If we let the \(\beta_i\) change over different ranges of \(x\), the fit becomes a cubic spline. The points where the coefficients change are called knots.
A piecewise cubic polynomial with a single knot at \(c\):
\(y_i = \begin{cases} \beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \beta_{31} x_i^3 + \varepsilon_{1,i} & \quad \text{if } x_i<c \\ \beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \beta_{32} x_i^3 + \varepsilon_{2,i} & \quad \text{if } x_i \geq c \end{cases}\)
Wage Data (ISLR2 package): a sample of the wage and age variables

| wage | age |
|---|---|
| 114.02258 | 26 |
| 65.11085 | 36 |
| 141.77517 | 59 |
| 186.87668 | 50 |
| 173.87949 | 55 |
| 93.87868 | 30 |
Dependent variable: wage

|  | (1) | (2) | (3) |
|---|---|---|---|
| poly(age, 3, raw = T)1 | -0.339 | -91.636* | -525.260 |
|  | (14.837) | (48.915) | (774.593) |
| poly(age, 3, raw = T)2 | 0.082 | 2.778* | 8.813 |
|  | (0.339) | (1.398) | (12.969) |
| poly(age, 3, raw = T)3 | -0.001 | -0.027** | -0.049 |
|  | (0.002) | (0.013) | (0.072) |
| Constant | 57.336 | 1,056.080* | 10,487.720 |
|  | (205.923) | (556.584) | (15,295.300) |
| Observations | 75 | 57 | 18 |
| R2 | 0.088 | 0.154 | 0.048 |
| Adjusted R2 | 0.049 | 0.106 | -0.157 |
| Residual Std. Error | 46.184 (df = 71) | 41.798 (df = 53) | 59.470 (df = 14) |
| F Statistic | 2.283* (df = 3; 71) | 3.204** (df = 3; 53) | 0.233 (df = 3; 14) |

Note: *p<0.1; **p<0.05; ***p<0.01
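The table above shows cubic polynomial fits in age. To make the piecewise idea of the previous equation concrete, here is a minimal hedged sketch: two separate cubic polynomials fitted on either side of a knot at age = 50 using the Wage data (the specific subsamples behind columns (1)-(3) are not documented, so only the approach is illustrated).

```r
# Hedged sketch: separate cubic polynomials on either side of a knot at age = 50 (Wage data)
library(ISLR2)

fit_left  <- lm(wage ~ poly(age, 3, raw = TRUE), data = Wage, subset = age <  50)
fit_right <- lm(wage ~ poly(age, 3, raw = TRUE), data = Wage, subset = age >= 50)

# Plot the two fitted segments; note the discontinuity at age = 50
plot(Wage$age, Wage$wage, col = "gray", xlab = "age", ylab = "wage")
age_l <- seq(min(Wage$age), 50, length.out = 50)
age_r <- seq(50, max(Wage$age), length.out = 50)
lines(age_l, predict(fit_left,  newdata = data.frame(age = age_l)), col = "red",  lwd = 2)
lines(age_r, predict(fit_right, newdata = data.frame(age = age_r)), col = "blue", lwd = 2)
```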
To avoid the discontinuity seen in the left figure, one can impose constraints that force the fitted curve to be continuous at the knot (age = 50).
In a cubic spline, three constraints are imposed at each knot:
The function is continuous at the knot
The first derivative is continuous at the knot
The second derivative is continuous at the knot
A spline of degree \(d\) with \(k\) knots can be written using the truncated power basis:
\[y=\beta_0 + \sum_{i=1}^{d} \beta_i x^i + \sum_{j=1}^{k} b_j (x-\zeta_j)_+^d\]
where
\(\zeta_j\) is the \(j^{th}\) knot and the truncated power function is
\((x-\zeta_j)_+^d = \begin{cases} (x-\zeta_j)^d & \quad \text{if } x \geq \zeta_j \\ 0 & \quad \text{if } x < \zeta_j \end{cases}\)
and
\(b_j\) is the coefficient of the \(j^{th}\) truncated power term (in the examples below, \(b_j(x)\) denotes a basis function).
Example: Linear Spline with one knot at 50
\(y=\beta_0 + \beta_1 b_1(x) + \beta_{1+1} (x-50)_+\)
or since \(b_1(x)\equiv x^1=x\)
\(y=\beta_0 + \beta_1 x + \beta_{1+1} (x-50)_+\)
where,
\((x_i-50)_+ = \begin{cases} (x_i-50) & \quad \text{if } x_i \geq 50 \\ 0 & \quad \text{if } x_i < 50 \end{cases}\)
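As a minimal sketch (assuming the Wage data from ISLR2), this linear spline can be fitted directly with lm() by constructing the truncated term \((age-50)_+\) with pmax():

```r
# Hedged sketch: linear spline in age with one knot at 50, fitted by OLS
library(ISLR2)

fit_ls <- lm(wage ~ age + pmax(age - 50, 0), data = Wage)
coef(fit_ls)  # beta_0, beta_1 (slope below 50), and the change in slope at the knot
```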
Example: Quadratic Spline with one knot at 50
\(y=\beta_0 + \beta_1 b_1(x) + \beta_2 b_2(x) + \beta_{1+2} (x-50)_+^2\)
\((x_i-50)_+^2 = \begin{cases} (x_i-50)^2 & \quad \text{if } x_i \geq 50 \\ 0 & \quad \text{if } x_i < 50 \end{cases}\)
Example: Cubic Spline with one knot at 50
\(y=\beta_0 + \beta_1 b_1(x) + \beta_2 b_2(x) + \beta_3 b_3(x) + \beta_{1+3} (x-50)_+^3\)
\((x_i-50)_+^3 = \begin{cases} (x_i-50)^3 & \quad \text{if } x_i \geq 50 \\ 0 & \quad \text{if } x_i < 50 \end{cases}\)
Cubic Spline with several knots at \(\zeta_1, \zeta_2, \dots, \zeta_K\)
Since there are \(K\) knots, a truncated power basis term like the one above must be added for each knot
\[y=\beta_0 + \beta_1 b_1(x) + \beta_2 b_2(x) + \beta_3 b_3(x) + \beta_{1+3} (x-\zeta_1)_+^3+\cdots+\beta_{K+3} (x-\zeta_K)_+^3\]
Then there are \(K+4\) regression coefficients to be estimated; hence fitting a cubic spline (degree = 3) with \(K\) knots uses \(K+4\) degrees of freedom
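In practice the truncated power basis is rarely typed by hand; a hedged sketch using splines::bs() on the Wage data (knots at 25, 40, 60, matching the figure mentioned below):

```r
# Hedged sketch: cubic spline in age with knots at 25, 40, 60, via the B-spline basis bs()
library(ISLR2)
library(splines)

# K = 3 knots -> K + 3 = 6 basis columns, K + 4 = 7 coefficients including the intercept
fit_bs <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)

age_grid <- seq(min(Wage$age), max(Wage$age), length.out = 100)
pred <- predict(fit_bs, newdata = list(age = age_grid), se.fit = TRUE)

plot(Wage$age, Wage$wage, col = "gray", xlab = "age", ylab = "wage")
lines(age_grid, pred$fit, lwd = 2)
lines(age_grid, pred$fit + 2 * pred$se.fit, lty = 2)  # approximate 95% confidence band
lines(age_grid, pred$fit - 2 * pred$se.fit, lty = 2)
```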
\[ \begin{pmatrix} 1 & x_1 & x_1^2 & x_1^3 & \cdots & x_1^d & (x_1-\zeta_1)_+^d & \cdots & (x_1-\zeta_k)_+^d\\ 1 & x_2 & x_2^2 & x_2^3 & \cdots & x_2^d & (x_2-\zeta_1)_+^d & \cdots & (x_2-\zeta_k)_+^d\\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & x_n^3 & \cdots & x_n^d & (x_n-\zeta_1)_+^d & \cdots & (x_n-\zeta_k)_+^d\\ \end{pmatrix}\]
The design matrix has dimensions \(n \times (1+d+k)\).
The degree \(d\) and the \(k\) knot locations must be determined first.
The design matrix may be very large and its columns may be highly correlated; hence, the estimation may not be stable.
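A hedged sketch of building this truncated power basis design matrix by hand (d = 3, illustrative knots at 25, 40, 60, Wage data), which also shows the high correlation among its columns:

```r
# Hedged sketch: truncated power basis design matrix for a cubic spline, built by hand
library(ISLR2)

x <- Wage$age
knots <- c(25, 40, 60)

X <- cbind(1, x, x^2, x^3,
           sapply(knots, function(z) pmax(x - z, 0)^3))
colnames(X) <- c("const", "x", "x2", "x3", paste0("tp", knots))

dim(X)                   # n x (1 + d + k) = 3000 x 7
round(cor(X[, -1]), 2)   # polynomial and truncated columns are highly correlated
```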
Splines may have very high variance at the boundaries of the predictor range. Hence one might want to add extra constraints at the boundaries; splines with these constraints are called natural splines.
A natural spline is a regression spline with additional boundary constraints: the function is required to be linear at the boundary (in the region where \(X\) is smaller than the smallest knot, or larger than the largest knot).
This additional constraint means that natural splines generally produce more stable estimates at the boundaries.
Natural splines also allow extrapolation beyond the boundary knots. Requiring linearity beyond each boundary amounts to adding 4 extra constraints (2 at each boundary).
Here is the figure showing cubic splines and natural cubic splines with three knots at (25, 40, 60)
Let’s draw the basis functions for both splines and natural splines using R. In this case, the ggfortify (to plot the basis functions) and splines packages are used.
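A minimal sketch of the basis functions (using base matplot() here instead of ggfortify, with illustrative knots at 25, 40, 60):

```r
# Hedged sketch: basis functions of a cubic spline (bs) vs. a natural cubic spline (ns)
library(splines)

x <- seq(18, 80, length.out = 200)        # illustrative age range
B_bs <- bs(x, knots = c(25, 40, 60))      # cubic spline basis: K + 3 = 6 columns
B_ns <- ns(x, knots = c(25, 40, 60))      # natural cubic spline basis: K + 1 = 4 columns

op <- par(mfrow = c(1, 2))
matplot(x, B_bs, type = "l", lty = 1, main = "bs() basis", xlab = "x", ylab = "")
matplot(x, B_ns, type = "l", lty = 1, main = "ns() basis", xlab = "x", ylab = "")
par(op)
```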
How many knots should be chosen (equivalently, how many degrees of freedom)?
Place more knots where we feel the function might vary more rapidly and fewer knots where it seems more stable
Try out different numbers of uniformly distributed knots and check which produces the best-looking curve
Use cross-validation: repeatedly hold out part of the data, predict the left-out observations, and select the number of knots that produces the minimum total \(RSS\) (see the sketch below)
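A hedged sketch of the cross-validation idea (10-fold, over the degrees of freedom of bs(), on the Wage data; the fold assignment and df grid are arbitrary choices):

```r
# Hedged sketch: choose the spline's degrees of freedom by 10-fold cross-validation
library(ISLR2)
library(splines)

set.seed(1)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(Wage)))
df_grid <- 4:12

cv_rss <- sapply(df_grid, function(d) {
  sum(sapply(1:k, function(f) {
    fit  <- lm(wage ~ bs(age, df = d), data = Wage[folds != f, ])
    pred <- predict(fit, newdata = Wage[folds == f, ])
    sum((Wage$wage[folds == f] - pred)^2)   # RSS on the held-out fold
  }))
})

df_grid[which.min(cv_rss)]   # df value with the smallest total RSS
```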
A smoothing spline avoids explicit knot selection: it seeks the function \(g(\cdot)\) that minimizes
\(\underbrace{\sum_{i=1}^n(y_i-g(x_i))^2}_\text{residual squares}+\underbrace{\lambda\int g''(t)^2dt}_\text{roughness penalty}\)
where \(\lambda \geq 0\) is called the tuning parameter.
A side note
Here is the Hodrick-Prescott filter (which is a special case of a smoothing spline):
\(\min_{\{\tau_t\}}\sum_{t=1}^{T}(y_t - \tau_t)^2 +\lambda\sum_{t=3}^{T}\left[\left(\tau_{t}-\tau_{t-1}\right)-\left(\tau_{t-1}-\tau_{t-2}\right)\right]^{2}\)
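The HP filter has a closed-form solution as a penalized least squares problem, \(\hat\tau = (I + \lambda D'D)^{-1} y\) with \(D\) the second-difference matrix; a minimal R sketch (the simulated series is purely illustrative):

```r
# Hedged sketch: Hodrick-Prescott filter via its penalized least squares closed form
hp_filter <- function(y, lambda = 1600) {
  n <- length(y)
  D <- matrix(0, n - 2, n)                      # second-difference matrix, rows (1, -2, 1)
  for (t in 1:(n - 2)) D[t, t:(t + 2)] <- c(1, -2, 1)
  as.numeric(solve(diag(n) + lambda * crossprod(D), y))
}

set.seed(1)
y <- cumsum(rnorm(120))        # illustrative series
trend <- hp_filter(y)
plot(y, type = "l"); lines(trend, col = "red", lwd = 2)
```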
End of a side note
The residual sum of squares is the loss and the integral is the penalty; hence the total criterion is Loss + Penalty.
Loss encourages the function \(g(\cdot)\) to fit the data well
Penalty penalizes the variability in \(g(.)\)
\(g'(t)\) is the slope and \(g''(t)\) is the second derivative, which corresponds to the change in slope; broadly speaking, it is a measure of roughness
When \(\lambda \to \infty\), \(g(\cdot)\) is a linear function that passes as closely as possible to the observations
When \(\lambda = 0\), \(g(\cdot)\) is very jumpy and passes exactly through the observations
\(\lambda\) controls the bias-variance trade-off of the spline.
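A minimal sketch with smooth.spline() on the Wage data (the df = 16 value is just an illustrative choice; cv = TRUE selects \(\lambda\) by leave-one-out cross-validation):

```r
# Hedged sketch: smoothing splines for wage vs. age
library(ISLR2)

fit_df <- smooth.spline(Wage$age, Wage$wage, df = 16)    # fix the effective degrees of freedom
fit_cv <- smooth.spline(Wage$age, Wage$wage, cv = TRUE)  # choose lambda by LOOCV (may warn about tied x values)
fit_cv$df                                                # effective df selected by cross-validation

plot(Wage$age, Wage$wage, col = "gray", xlab = "age", ylab = "wage")
lines(fit_df, col = "red", lwd = 2)
lines(fit_cv, col = "blue", lwd = 2)
```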
“Generalized additive models (GAMs) provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables, while maintaining additivity.”
Linear Regression Model
\(y_i=\beta_0+\beta_1x_{i,1}+\beta_2x_{i,2}+\cdots+\beta_px_{i,p}+\varepsilon_i\)
Each component \(\beta_j x_{i,j}\) can be replaced by a smooth non-linear function \(f_j(x_{i,j})\) to allow for a non-linear relationship between each feature and the response variable
\(y_i=\beta_0+\beta_1x_{i,1}+\beta_2x_{i,2}+\cdots+\beta_px_{i,p}+\varepsilon_i\)
\(y_i=\beta_0 + f_1(x_{i,1}) + f_2(x_{i,2}) + \cdots + f_p(x_{i,p}) + \varepsilon_i\)
This is an example of a GAM. It is called additive because each \(f_j\) is calculated separately and the individual contributions are then added together
\(wage=\beta_0+f_1(year)+f_2(age)+f_3(education)+\varepsilon\)
Education is qualitative (with five levels: <HS, HS, <Coll, Coll, >Coll) and all other variables are quantitative
The effect of education is captured through a step function fit to this qualitative variable
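The summary below comes from a fit of this kind; a hedged sketch of the call (assuming the gam package, with smoothing-spline terms s() for year and age):

```r
# Hedged sketch: GAM for wage with smoothing splines for year and age and dummies for education
library(ISLR2)
library(gam)

gam_fit <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
summary(gam_fit)
plot(gam_fit, se = TRUE, col = "blue")  # one panel per term
```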
Call: gam(formula = wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
Deviance Residuals:
    Min      1Q  Median      3Q     Max 
-119.43  -19.70   -3.33   14.17  213.48 
(Dispersion Parameter for gaussian family taken to be 1235.69)
    Null Deviance: 5222086 on 2999 degrees of freedom
Residual Deviance: 3689770 on 2986 degrees of freedom
AIC: 29887.75 
Number of Local Scoring Iterations: NA 
Anova for Parametric Effects
             Df  Sum Sq Mean Sq F value    Pr(>F)    
s(year, 4)    1   27162   27162  21.981 2.877e-06 ***
s(age, 5)     1  195338  195338 158.081 < 2.2e-16 ***
education     4 1069726  267432 216.423 < 2.2e-16 ***
Residuals  2986 3689770    1236                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova for Nonparametric Effects
            Npar Df Npar F  Pr(F)    
(Intercept)                          
s(year, 4)        3  1.086 0.3537    
s(age, 5)         4 32.380 <2e-16 ***
education                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From left to right, the first two functions are natural splines in year and age, with four and five degrees of freedom, respectively; the third is a step function fit to education
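A hedged sketch of how such a figure could be produced, fitting the natural-spline version with lm() and ns() (degrees of freedom as stated above):

```r
# Hedged sketch: GAM with natural splines for year and age, fitted by ordinary least squares
library(ISLR2)
library(splines)
library(gam)   # provides plot.Gam() for term-by-term plots

gam_ns <- lm(wage ~ ns(year, 4) + ns(age, 5) + education, data = Wage)
par(mfrow = c(1, 3))
plot.Gam(gam_ns, se = TRUE, col = "red")  # one panel per term: year, age, education
```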
This figure can be interpreted as follows:
Non-linear relationships can be modeled
Potentially GAMs make more accurate predictions
It is possible to examine the effect of each predictor on the response individually since the model is additive
Important interactions may be missed since the model is additive
It is possible to end up with a model that has an overfitting problem
Note: The multivariate adaptive regression spline (MARS) is outside the scope of this course