I. Ozkan
Spring 2025
Section 7.4 of Introduction to Statistical Learning
Definition: A spline of degree \(d\) is a function formed by connecting polynomial segments of degree \(d\) so that the function and its derivatives satisfy continuity conditions at the points where the segments join (the knots).
Let the cubic polynomial regression be
\(y_i=\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \varepsilon_i\)
The coefficients \(\beta_i,\; i=0,1,2,3\), are constant over the entire range of \(X\).
If we let the \(\beta_i\) change over different ranges of \(x\), the fit becomes a cubic spline. The points where the coefficients change are called knots.
A piecewise cubic polynomial with a single knot at \(c\):
\(y_i = \begin{cases} \beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \beta_{31} x_i^3 + \varepsilon_{1,i} & \quad \text{if } x_i<c \\ \beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \beta_{32} x_i^3 + \varepsilon_{2,i} & \quad \text{if } x_i \geq c \end{cases}\)
Wage Data (ISLR2 package): a sample of the wage and age variables

| wage | age |
|---|---|
| 114.02258 | 26 |
| 65.11085 | 36 |
| 141.77517 | 59 |
| 186.87668 | 50 |
| 173.87949 | 55 |
| 93.87868 | 30 |
Dependent variable: wage

|  | (1) | (2) | (3) |
|---|---|---|---|
| poly(age, 3, raw = T)1 | -0.339 | -91.636* | -525.260 |
|  | (14.837) | (48.915) | (774.593) |
| poly(age, 3, raw = T)2 | 0.082 | 2.778* | 8.813 |
|  | (0.339) | (1.398) | (12.969) |
| poly(age, 3, raw = T)3 | -0.001 | -0.027** | -0.049 |
|  | (0.002) | (0.013) | (0.072) |
| Constant | 57.336 | 1,056.080* | 10,487.720 |
|  | (205.923) | (556.584) | (15,295.300) |
| Observations | 75 | 57 | 18 |
| R2 | 0.088 | 0.154 | 0.048 |
| Adjusted R2 | 0.049 | 0.106 | -0.157 |
| Residual Std. Error | 46.184 (df = 71) | 41.798 (df = 53) | 59.470 (df = 14) |
| F Statistic | 2.283* (df = 3; 71) | 3.204** (df = 3; 53) | 0.233 (df = 3; 14) |

Note: *p<0.1; **p<0.05; ***p<0.01
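The table above shows cubic polynomial fits in age. To make the piecewise idea of the previous equation concrete, here is a minimal hedged sketch: two separate cubic polynomials fitted on either side of a knot at age = 50 using the Wage data (the specific subsamples behind columns (1)-(3) are not documented, so only the approach is illustrated).

```r
# Hedged sketch: separate cubic polynomials on either side of a knot at age = 50 (Wage data)
library(ISLR2)

fit_left  <- lm(wage ~ poly(age, 3, raw = TRUE), data = Wage, subset = age <  50)
fit_right <- lm(wage ~ poly(age, 3, raw = TRUE), data = Wage, subset = age >= 50)

# Plot the two fitted segments; note the discontinuity at age = 50
plot(Wage$age, Wage$wage, col = "gray", xlab = "age", ylab = "wage")
age_l <- seq(min(Wage$age), 50, length.out = 50)
age_r <- seq(50, max(Wage$age), length.out = 50)
lines(age_l, predict(fit_left,  newdata = data.frame(age = age_l)), col = "red",  lwd = 2)
lines(age_r, predict(fit_right, newdata = data.frame(age = age_r)), col = "blue", lwd = 2)
```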
To avoid the discontinuity seen in the left figure, one can impose constraints that force the fitted curve to be continuous at the knot (age = 50).
In a cubic spline, three constraints are imposed at each knot:
The function is continuous at the knot
The first derivative is continuous at the knot
The second derivative is continuous at the knot
A spline of degree \(d\) with \(k\) knots can be written using the truncated power basis:
\[y=\beta_0 + \sum_{i=1}^{d} \beta_i x^i + \sum_{j=1}^{k} b_j (x-\zeta_j)_+^d\]
where
\(\zeta_j\) is the \(j^{th}\) knot and the truncated power function is
\((x-\zeta_j)_+^d = \begin{cases} (x-\zeta_j)^d & \quad \text{if } x \geq \zeta_j \\ 0 & \quad \text{if } x < \zeta_j \end{cases}\)
and
\(b_j\) is the coefficient of the \(j^{th}\) truncated power term (in the examples below, \(b_j(x)\) denotes a basis function).
Example: Linear Spline with one knot at 50
\(y=\beta_0 + \beta_1 b_1(x) + \beta_{1+1} (x-50)_+\)
or since \(b_1(x)\equiv x^1=x\)
\(y=\beta_0 + \beta_1 x + \beta_{1+1} (x-50)_+\)
where,
\((x_i-50)_+ = \begin{cases} (x_i-50) & \quad \text{if } x_i \geq 50 \\ 0 & \quad \text{if } x_i < 50 \end{cases}\)
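As a minimal sketch (assuming the Wage data from ISLR2), this linear spline can be fitted directly with lm() by constructing the truncated term \((age-50)_+\) with pmax():

```r
# Hedged sketch: linear spline in age with one knot at 50, fitted by OLS
library(ISLR2)

fit_ls <- lm(wage ~ age + pmax(age - 50, 0), data = Wage)
coef(fit_ls)  # beta_0, beta_1 (slope below 50), and the change in slope at the knot
```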
Example: Quadratic Spline with one knot at 50
\(y=\beta_0 + \beta_1 b_1(x) + \beta_2 b_2(x) + \beta_{1+2} (x-50)_+^2\)
\((x_i-50)_+^2 = \begin{cases} (x_i-50)^2 & \quad \text{if } x_i \geq 50 \\ 0 & \quad \text{if } x_i < 50 \end{cases}\)
Example: Cubic Spline with one knot at 50
\(y=\beta_0 + \beta_1 b_1(x) + \beta_2 b_2(x) + \beta_3 b_3(x) + \beta_{1+3} (x-50)_+^3\)
\((x_i-50)_+^3 = \begin{cases} (x_i-50)^3 & \quad \text{if } x_i \geq 50 \\ 0 & \quad \text{if } x_i < 50 \end{cases}\)
Cubic Spline with several knots at \(\zeta_1, \zeta_2, \dots, \zeta_K\)
Since there are \(K\) knots, a truncated power basis term like the one above must be added for each knot
\[y=\beta_0 + \beta_1 b_1(x) + \beta_2 b_2(x) + \beta_3 b_3(x) + \beta_{1+3} (x-\zeta_1)_+^3+\cdots+\beta_{K+3} (x-\zeta_K)_+^3\]
Then there are \(K+4\) regression coefficients to be estimated; hence fitting a cubic spline (degree = 3) with \(K\) knots uses \(K+4\) degrees of freedom
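In practice the truncated power basis is rarely typed by hand; a hedged sketch using splines::bs() on the Wage data (knots at 25, 40, 60, matching the figure mentioned below):

```r
# Hedged sketch: cubic spline in age with knots at 25, 40, 60, via the B-spline basis bs()
library(ISLR2)
library(splines)

# K = 3 knots -> K + 3 = 6 basis columns, K + 4 = 7 coefficients including the intercept
fit_bs <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)

age_grid <- seq(min(Wage$age), max(Wage$age), length.out = 100)
pred <- predict(fit_bs, newdata = list(age = age_grid), se.fit = TRUE)

plot(Wage$age, Wage$wage, col = "gray", xlab = "age", ylab = "wage")
lines(age_grid, pred$fit, lwd = 2)
lines(age_grid, pred$fit + 2 * pred$se.fit, lty = 2)  # approximate 95% confidence band
lines(age_grid, pred$fit - 2 * pred$se.fit, lty = 2)
```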
\[ \begin{pmatrix} 1 & x_1 & x_1^2 & x_1^3 & \cdots & x_1^d & (x_1-\zeta_1)_+^d & \cdots & (x_1-\zeta_k)_+^d\\ 1 & x_2 & x_2^2 & x_2^3 & \cdots & x_2^d & (x_2-\zeta_1)_+^d & \cdots & (x_2-\zeta_k)_+^d\\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & x_n^3 & \cdots & x_n^d & (x_n-\zeta_1)_+^d & \cdots & (x_n-\zeta_k)_+^d\\ \end{pmatrix}\]
The design matrix has dimensions \(n \times (1+d+k)\).
The degree \(d\) and the \(k\) knot locations must be determined first.
The design matrix may be very large and its columns may be highly correlated; hence, the estimation may not be stable.
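A hedged sketch of building this truncated power basis design matrix by hand (d = 3, illustrative knots at 25, 40, 60, Wage data), which also shows the high correlation among its columns:

```r
# Hedged sketch: truncated power basis design matrix for a cubic spline, built by hand
library(ISLR2)

x <- Wage$age
knots <- c(25, 40, 60)

X <- cbind(1, x, x^2, x^3,
           sapply(knots, function(z) pmax(x - z, 0)^3))
colnames(X) <- c("const", "x", "x2", "x3", paste0("tp", knots))

dim(X)                   # n x (1 + d + k) = 3000 x 7
round(cor(X[, -1]), 2)   # polynomial and truncated columns are highly correlated
```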
Splines may have very high variance at the boundaries of the predictor range. Hence one might want to add extra constraints at the boundaries; splines with these constraints are called natural splines.
A natural spline is a regression spline with additional boundary constraints: the function is required to be linear at the boundary (in the region where \(X\) is smaller than the smallest knot, or larger than the largest knot).
This additional constraint means that natural splines generally produce more stable estimates at the boundaries.
Natural splines also allow extrapolation beyond the boundary knots. Requiring linearity beyond each boundary amounts to adding 4 extra constraints (2 at each boundary).
Here is the figure showing cubic splines and natural cubic splines with three knots at (25, 40, 60)
Let’s draw the basis functions for both splines and natural splines using R. In this case, the ggfortify (to plot the basis functions) and splines packages are used.
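A minimal sketch of the basis functions (using base matplot() here instead of ggfortify, with illustrative knots at 25, 40, 60):

```r
# Hedged sketch: basis functions of a cubic spline (bs) vs. a natural cubic spline (ns)
library(splines)

x <- seq(18, 80, length.out = 200)        # illustrative age range
B_bs <- bs(x, knots = c(25, 40, 60))      # cubic spline basis: K + 3 = 6 columns
B_ns <- ns(x, knots = c(25, 40, 60))      # natural cubic spline basis: K + 1 = 4 columns

op <- par(mfrow = c(1, 2))
matplot(x, B_bs, type = "l", lty = 1, main = "bs() basis", xlab = "x", ylab = "")
matplot(x, B_ns, type = "l", lty = 1, main = "ns() basis", xlab = "x", ylab = "")
par(op)
```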
How many knots should be chosen (equivalently, how many degrees of freedom)?
Place more knots where we feel the function might vary more rapidly and fewer knots where it seems more stable
Try out different numbers of uniformly distributed knots and check which produces the best-looking curve
Use cross-validation: repeatedly hold out part of the data, predict the left-out observations, and select the number of knots that produces the minimum total \(RSS\) (see the sketch below)
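A hedged sketch of the cross-validation idea (10-fold, over the degrees of freedom of bs(), on the Wage data; the fold assignment and df grid are arbitrary choices):

```r
# Hedged sketch: choose the spline's degrees of freedom by 10-fold cross-validation
library(ISLR2)
library(splines)

set.seed(1)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(Wage)))
df_grid <- 4:12

cv_rss <- sapply(df_grid, function(d) {
  sum(sapply(1:k, function(f) {
    fit  <- lm(wage ~ bs(age, df = d), data = Wage[folds != f, ])
    pred <- predict(fit, newdata = Wage[folds == f, ])
    sum((Wage$wage[folds == f] - pred)^2)   # RSS on the held-out fold
  }))
})

df_grid[which.min(cv_rss)]   # df value with the smallest total RSS
```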
A smoothing spline avoids explicit knot selection: it seeks the function \(g(\cdot)\) that minimizes
\(\underbrace{\sum_{i=1}^n(y_i-g(x_i))^2}_\text{residual squares}+\underbrace{\lambda\int g''(t)^2dt}_\text{roughness penalty}\)
where \(\lambda \geq 0\) is called the tuning parameter.
A side note
Here is the Hodrick-Prescott filter (which is a special case of a smoothing spline):
\(\min_{\{\tau_t\}}\sum_{t=1}^{T}(y_t - \tau_t)^2 +\lambda\sum_{t=3}^{T}\left[\left(\tau_{t}-\tau_{t-1}\right)-\left(\tau_{t-1}-\tau_{t-2}\right)\right]^{2}\)
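The HP filter has a closed-form solution as a penalized least squares problem, \(\hat\tau = (I + \lambda D'D)^{-1} y\) with \(D\) the second-difference matrix; a minimal R sketch (the simulated series is purely illustrative):

```r
# Hedged sketch: Hodrick-Prescott filter via its penalized least squares closed form
hp_filter <- function(y, lambda = 1600) {
  n <- length(y)
  D <- matrix(0, n - 2, n)                      # second-difference matrix, rows (1, -2, 1)
  for (t in 1:(n - 2)) D[t, t:(t + 2)] <- c(1, -2, 1)
  as.numeric(solve(diag(n) + lambda * crossprod(D), y))
}

set.seed(1)
y <- cumsum(rnorm(120))        # illustrative series
trend <- hp_filter(y)
plot(y, type = "l"); lines(trend, col = "red", lwd = 2)
```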
End of a side note
The residual sum of squares is the loss and the integral is the penalty; hence the total criterion is Loss + Penalty.
Loss encourages the function \(g(\cdot)\) to fit the data well
Penalty penalizes the variability in \(g(.)\)
\(g'(t)\) is the slope and \(g''(t)\) is the second derivative, which corresponds to the change in slope; broadly speaking, it is a measure of roughness
When \(\lambda \to \infty\), \(g(\cdot)\) is a linear function that passes as closely as possible to the observations
When \(\lambda = 0\), \(g(\cdot)\) is very jumpy and passes exactly through the observations
\(\lambda\) controls the bias-variance trade-off of the spline.
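A minimal sketch with smooth.spline() on the Wage data (the df = 16 value is just an illustrative choice; cv = TRUE selects \(\lambda\) by leave-one-out cross-validation):

```r
# Hedged sketch: smoothing splines for wage vs. age
library(ISLR2)

fit_df <- smooth.spline(Wage$age, Wage$wage, df = 16)    # fix the effective degrees of freedom
fit_cv <- smooth.spline(Wage$age, Wage$wage, cv = TRUE)  # choose lambda by LOOCV (may warn about tied x values)
fit_cv$df                                                # effective df selected by cross-validation

plot(Wage$age, Wage$wage, col = "gray", xlab = "age", ylab = "wage")
lines(fit_df, col = "red", lwd = 2)
lines(fit_cv, col = "blue", lwd = 2)
```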
“Generalized additive models (GAMs) provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables, while maintaining additivity.”
Linear Regression Model
\(y_i=\beta_0+\beta_1x_{i,1}+\beta_2x_{i,2}+\cdots+\beta_px_{i,p}+\varepsilon_i\)
Each component \(\beta_j x_{i,j}\) can be replaced by a smooth non-linear function \(f_j(x_{i,j})\) to allow for a non-linear relationship between each feature and the response variable
\(y_i=\beta_0+\beta_1x_{i,1}+\beta_2x_{i,2}+\cdots+\beta_px_{i,p}+\varepsilon_i\)
\(y_i=\beta_0 + f_1(x_{i,1}) + f_2(x_{i,2}) + \cdots + f_p(x_{i,p}) + \varepsilon_i\)
This is an example of a GAM. It is called additive because each \(f_j\) is calculated separately and the individual contributions are then added together
\(wage=\beta_0+f_1(year)+f_2(age)+f_3(education)+\varepsilon\)
Education is qualitative (with five levels: <HS, HS, <Coll, Coll, >Coll) and all other variables are quantitative
The effect of education is captured through a step function fit to this qualitative variable
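The summary below comes from a fit of this kind; a hedged sketch of the call (assuming the gam package, with smoothing-spline terms s() for year and age):

```r
# Hedged sketch: GAM for wage with smoothing splines for year and age and dummies for education
library(ISLR2)
library(gam)

gam_fit <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
summary(gam_fit)
plot(gam_fit, se = TRUE, col = "blue")  # one panel per term
```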
Call: gam(formula = wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
Deviance Residuals:
    Min      1Q  Median      3Q     Max 
-119.43  -19.70   -3.33   14.17  213.48 
(Dispersion Parameter for gaussian family taken to be 1235.69)
    Null Deviance: 5222086 on 2999 degrees of freedom
Residual Deviance: 3689770 on 2986 degrees of freedom
AIC: 29887.75 
Number of Local Scoring Iterations: NA 
Anova for Parametric Effects
             Df  Sum Sq Mean Sq F value    Pr(>F)    
s(year, 4)    1   27162   27162  21.981 2.877e-06 ***
s(age, 5)     1  195338  195338 158.081 < 2.2e-16 ***
education     4 1069726  267432 216.423 < 2.2e-16 ***
Residuals  2986 3689770    1236                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova for Nonparametric Effects
            Npar Df Npar F  Pr(F)    
(Intercept)                          
s(year, 4)        3  1.086 0.3537    
s(age, 5)         4 32.380 <2e-16 ***
education                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From left to right, the first two functions are natural splines in year and age, with four and five degrees of freedom, respectively; the third is a step function fit to education
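A hedged sketch of how such a figure could be produced, fitting the natural-spline version with lm() and ns() (degrees of freedom as stated above):

```r
# Hedged sketch: GAM with natural splines for year and age, fitted by ordinary least squares
library(ISLR2)
library(splines)
library(gam)   # provides plot.Gam() for term-by-term plots

gam_ns <- lm(wage ~ ns(year, 4) + ns(age, 5) + education, data = Wage)
par(mfrow = c(1, 3))
plot.Gam(gam_ns, se = TRUE, col = "red")  # one panel per term: year, age, education
```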
This figure can be interpreted as follows:
Non-linear relationships can be modeled
Potentially GAMs make more accurate predictions
It is possible to examine the effect of each predictor on the response individually since the model is additive
Important interactions may be missed since the model is additive
It is possible to end up with a model that has an overfitting problem
Note: The multivariate adaptive regression spline (MARS) is outside the scope of this course