Statistical Learning, An Intro

I. Ozkan

Fall 2025

Preliminary Reading

Learning Objectives

Keywords:

Predictors and Response (Dependent, Output) Variables

\[Y=f(X)+\varepsilon\]

\[Y=f(X)+\varepsilon=Pattern+Error\]

Function \(f()\)

An Example, Salary & Age

*: Data may be obtained from the Hull website.

First Six Observations of Salary and Age Data
Age Salary
25 135000
55 260000
27 105000
35 220000
60 240000
65 265000
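
For reference, a minimal R sketch that recreates the six observations above as a data frame; the commented file name is only a placeholder, since the source does not give one.

```r
# First six observations of the Salary and Age data shown above
salary_data <- data.frame(
  Age    = c(25, 55, 27, 35, 60, 65),
  Salary = c(135000, 260000, 105000, 220000, 240000, 265000)
)
head(salary_data)

# The full data set may instead be read from a local file
# (file name is a placeholder; the data may be obtained from the Hull website)
# salary_data <- read.csv("salary_vs_age.csv")
```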

An Example, Salary & Age


An Example, Salary & Age (MIS 305, MIS 314, MIS 315, MIS 403)

Regression models are the basic building blocks of many complex models. This part is covered in the MIS 305 course.

Regression Results: (1) Linear, (2) Quadratic, (3) Fifth Degree
                                Dependent variable: Salary
                        (1)                   (2)                    (3)
Age                  3,827.305**          25,377.170**        -2,312,166.000**
                    (1,246.189)           (7,251.933)           (705,471.800)
I(Age^2)                                    -243.289**           107,539.500**
                                             (81.263)             (33,840.510)
I(Age^3)                                                          -2,403.642**
                                                                    (790.506)
I(Age^4)                                                              25.993**
                                                                       (9.005)
I(Age^5)                                                               -0.109*
                                                                       (0.040)
Constant             51,160.420         -382,171.100**         19,198,421.000**
                    (56,360.290)         (150,139.600)          (5,723,899.000)
Observations             10                    10                      10
R2                      0.541                 0.799                   0.969
Adjusted R2             0.484                 0.741                   0.930
Residual Std. Error  52,747.830 (df = 8)  37,341.470 (df = 7)   19,353.310 (df = 4)
F Statistic          9.432** (df = 1; 8)  13.892*** (df = 2; 7)  25.099*** (df = 5; 4)
Note:                *p<0.1; **p<0.05; ***p<0.01
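
The table above follows the layout of a stargazer text table. A hedged sketch of how the three models could be fit in R, assuming a data frame salary_data holding the ten training observations with columns Age and Salary; the exact code behind the slides is not given in the source.

```r
# Linear, quadratic and fifth-degree polynomial regressions of Salary on Age
fit_lin   <- lm(Salary ~ Age, data = salary_data)
fit_quad  <- lm(Salary ~ Age + I(Age^2), data = salary_data)
fit_fifth <- lm(Salary ~ Age + I(Age^2) + I(Age^3) + I(Age^4) + I(Age^5),
                data = salary_data)

# Side-by-side comparison table (requires the stargazer package)
stargazer::stargazer(fit_lin, fit_quad, fit_fifth, type = "text",
                     dep.var.labels = "Salary")
```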

Why Estimate \(f()\), Prediction

\(X=(X_1, X_2, \cdots,X_p)\) are available, but \(Y\) cannot be obtained.

\(\hat Y=\hat f(X)\) since \(E(\varepsilon)=0\)

\(\hat f()\) may be a black box model: its exact form is not important, as long as it predicts \(Y\) accurately.

Reducible error (\(\hat f()\) is not a perfect estimate of \(f\)) and irreducible error (even if \(\hat f()\) were an almost perfect estimate of \(f\), \(Y\) would still depend on \(\varepsilon\))

The expected value of the squared difference between the actual and predicted values of \(Y\):

\(E(Y-\hat Y)^2=E[f(X)+\varepsilon -\hat f(X)]^2\)

\(=\underbrace{E[f(X) -\hat f(X)]^2}_{reducible} +\underbrace{Var(\varepsilon)}_{irreducible}\)

\(\varepsilon\) may contain (i) unmeasured variables and (ii) unmeasurable variation

The focus is on minimizing the reducible error by choosing among different techniques for estimating \(f\), as illustrated in the sketch below.
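
A small simulation sketch of this decomposition, with a made-up \(f\) and noise level: even a perfect estimate of \(f\) would still leave \(Var(\varepsilon)\) in the expected squared error.

```r
set.seed(1)
n   <- 100000
x   <- runif(n, 0, 10)
f   <- function(x) 2 + 3 * x          # "true" f (unknown in practice)
eps <- rnorm(n, mean = 0, sd = 2)     # irreducible noise, Var(eps) = 4
y   <- f(x) + eps

f_hat <- function(x) 2.5 + 2.8 * x    # an imperfect estimate of f

mean((y - f_hat(x))^2)      # total expected squared prediction error (approx.)
mean((f(x) - f_hat(x))^2)   # reducible part: E[f(X) - f_hat(X)]^2
var(eps)                    # irreducible part: Var(eps), approx. 4
```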

An Example, Salary & Age

Age and Salary Predictions (Test Data)
Age Salary Lin_pred Quad_pred Fifth_pred
30 166000 165979.6 160184.1 117292.62
26 78000 150670.4 113172.1 111072.33
58 310000 273144.1 271281.7 245229.64
29 100000 162152.3 149161.0 104937.28
40 260000 204252.6 243653.8 284897.91
27 150000 154497.7 125655.0 99724.96
33 140000 177461.5 190334.1 173237.90
61 220000 284626.0 260559.2 257932.17
27 86000 154497.7 125655.0 99724.96
48 276000 234871.1 275396.0 277161.45
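
A sketch of how the three prediction columns above could be obtained, assuming the fitted models from the earlier sketch and a data frame test_data holding the ten test observations (columns Age and Salary):

```r
# Predictions from the three models on the test observations
test_data$Lin_pred   <- predict(fit_lin,   newdata = test_data)
test_data$Quad_pred  <- predict(fit_quad,  newdata = test_data)
test_data$Fifth_pred <- predict(fit_fifth, newdata = test_data)
test_data
```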

An Example, Salary & Age


Some Model Performance Measures (Training Data)
Models r.squared adj.r.squared sigma logLik AIC BIC nobs
Linear 0.5410820 0.4837173 52747.83 -121.8064 249.6129 250.5206 10
Quadratic 0.7987589 0.7412614 37341.47 -117.6846 243.3692 244.5796 10
Fifth Pol. 0.9691108 0.9304994 19353.30 -108.3141 230.6282 232.7463 10
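
These measures match the output of broom::glance(); a hedged sketch, reusing the fitted models from the earlier sketch (requires the broom and dplyr packages):

```r
library(broom)
library(dplyr)

# Training-data performance measures for each model
perf <- bind_rows(
  Linear       = glance(fit_lin),
  Quadratic    = glance(fit_quad),
  `Fifth Pol.` = glance(fit_fifth),
  .id = "Models"
)
perf[, c("Models", "r.squared", "adj.r.squared", "sigma",
         "logLik", "AIC", "BIC", "nobs")]
```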

An Example, Salary & Age

Root Mean Squared Error (RMSE)
data_set Linear Lin_Q Lin_5
Training 47179.09 31242.12 12240.10
Test 50590.16 34348.43 36832.84
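
The values above appear to be root-mean-squared errors on the training and test samples (dividing by \(n\) rather than the residual degrees of freedom). A sketch of how they could be computed from the earlier fits:

```r
# Root mean squared error of a fitted model on a given data set
rmse <- function(model, data) {
  sqrt(mean((data$Salary - predict(model, newdata = data))^2))
}

data.frame(
  data_set = c("Training", "Test"),
  Linear   = c(rmse(fit_lin,   salary_data), rmse(fit_lin,   test_data)),
  Lin_Q    = c(rmse(fit_quad,  salary_data), rmse(fit_quad,  test_data)),
  Lin_5    = c(rmse(fit_fifth, salary_data), rmse(fit_fifth, test_data))
)
```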

Why Estimate \(f()\), Inference

The main aim is to understand the relationship between \(X\) and \(Y\), not necessarily to make predictions.

\(\hat f()\) should be chosen so that it is interpretable.

Questions are:

Inference: An Example, Advertising Data (ISLR Book)

Advertising Data, First 6 Observations (ISLR Book)
TV radio newspaper sales
230.1 37.8 69.2 22.1
44.5 39.3 45.1 10.4
17.2 45.9 69.3 9.3
151.5 41.3 58.5 18.5
180.8 10.8 58.4 12.9
8.7 48.9 75.0 7.2

– Which media contribute to sales?

– Which media generate the biggest boost in sales?

– How much increase in sales is associated with a given increase in TV advertising?
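
A hedged sketch of the multiple regression behind these questions, assuming the Advertising data from the ISLR book website has been saved locally as Advertising.csv (the local file name is an assumption):

```r
# Advertising data (ISLR book website); local file name is an assumption
advertising <- read.csv("Advertising.csv")

# Regress sales on the three media budgets
fit_adv <- lm(sales ~ TV + radio + newspaper, data = advertising)

# Coefficient signs, sizes and p-values speak to the questions above:
# which media are associated with sales, and by how much per unit of budget
summary(fit_adv)
```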

How Do We Estimate \(f\)?

If the selected model's performance is poor, one may choose a more flexible model. This, however, may result in overfitting the data (the model follows the errors too closely), as illustrated in the sketch below.
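
A small simulated sketch of this trade-off (the data-generating process and degrees are made up here): as flexibility grows, training error keeps falling while test error eventually rises.

```r
set.seed(42)
n      <- 30
x      <- runif(n, 0, 10)
y      <- sin(x) + rnorm(n, sd = 0.3)        # training sample
x_test <- runif(n, 0, 10)
y_test <- sin(x_test) + rnorm(n, sd = 0.3)   # test sample

for (d in c(1, 3, 5, 9)) {
  fit       <- lm(y ~ poly(x, d))
  train_mse <- mean((y - fitted(fit))^2)
  test_mse  <- mean((y_test - predict(fit, newdata = data.frame(x = x_test)))^2)
  cat(sprintf("degree %d: train MSE %.3f, test MSE %.3f\n", d, train_mse, test_mse))
}
```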

Parametric Example (Fig 2.4 ISLR)

As an example, consider the relationship between income, education and seniority, where the true underlying relationship is shown below:

Fig. 2.3 (ISLR)

Parametric Example

\[income=\beta_0 + \beta_1 \times education + \beta_2 \times seniority\]

Then the estimated function:

Fig. 2.4 (ISLR)
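
A hedged sketch of the parametric fit, assuming the Income data used in these figures (available on the ISLR book website) is saved locally as Income2.csv with columns Education, Seniority and Income; the file and column names are assumptions.

```r
# Income data (ISLR book website); file and column names are assumptions
income_data <- read.csv("Income2.csv")

# Parametric approach: assume a linear form for f and estimate beta_0, beta_1, beta_2
fit_par <- lm(Income ~ Education + Seniority, data = income_data)
coef(fit_par)
```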

Non-Parametric Example

Compare this with a non-parametric approach shown below:

Fig. 2.5 (ISLR)
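
For comparison, a non-parametric sketch using local regression (loess); this is only one possible smoother, not necessarily the thin-plate spline shown in the ISLR figure, and the span is an arbitrary choice.

```r
# Non-parametric approach: no fixed functional form is assumed for f
fit_npar <- loess(Income ~ Education + Seniority, data = income_data, span = 0.5)

# Evaluate the fitted surface on a grid spanning the observed predictor ranges
grid <- expand.grid(
  Education = seq(min(income_data$Education), max(income_data$Education), length.out = 25),
  Seniority = seq(min(income_data$Seniority), max(income_data$Seniority), length.out = 25)
)
grid$Income_hat <- predict(fit_npar, newdata = grid)
```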

Flexibility of Models

Prediction Accuracy and Model Interpretability

Supervised vs Unsupervised Learning

Fig 2.8 (ISLR)

Regression vs Classification

Assessing Model Accuracy

Low Bias and Low Variance

Fig. 2.9 (ISLR)

Low Bias and Low Variance

Fig. 2.11 (ISLR)

An Example, Salary & Age

Model Assessment and Model Selection