Linear Models (Regressions)

I. Ozkan

Fall 2025

Preliminary Readings

Learning Objectives

Keywords:

Linear Models (Linear Regression)

An Example, Salary & Age


Age and Salary Predictions

Age   Salary (×10,000)
25    13.5
55    26.0
27    10.5
35    22.0
60    24.0
65    26.5

Data can be obtained from the Hull book website.

Advertising Data (ISLR Book)


Advertising Data (ISLR Book), first two observations in long format

Sales   Media       Adv. Exp.
22.1    TV          230.1
22.1    radio       37.8
22.1    newspaper   69.2
10.4    TV          44.5
10.4    radio       39.3
10.4    newspaper   45.1

How to Estimate: Least Square Estimation

Let:

\(\{(x_1, y_1), (x_2,y_2), \cdots , (x_n, y_n)\}\) be the \(n\) input-output pairs, where \(x_i \in \mathbb{R}^k\) is a vector with \(k\) attributes.

Output (target), \(y_i \in \mathbb{R}\) is univariate.

Then the linear model is;

\(y_i=\beta_0+\beta_1x_{i1}+ \beta_2x_{i2}+ \cdots + \beta_kx_{ik}+\varepsilon_i\)

The estimation problem is then to find \(\hat \beta\) under some regular assumptions (let's skip them and refer to the Data Modeling course).

Find the parameter vector \(\hat \beta\) by minimizing a loss (cost, error, etc.) function \(L(\beta)\). The loss function can be written as:

\[L(\beta)=\sum_{i=1}^{n}\varepsilon_i^2\]

\[L(\beta)=\sum_{i=1}^{n}[y_i-(\hat\beta_0+\hat\beta_1x_{i1}+ \hat\beta_2x_{i2}+ \cdots + \hat\beta_kx_{ik})]^2\]

Then \(\hat \beta\) can be estimated by:

\(\operatorname*{arg\,min}_\beta L(\beta)\)

The loss function can be written in matrix form:

\(L(\beta)=\varepsilon^T\varepsilon=\underbrace{(Y-X\beta)^T}_{\varepsilon^T_{(1\times n)}}\underbrace{(Y-X\beta)}_{\varepsilon_{(n\times 1)}}\)

Both the matrix differentiation needed to find the optimal \(\beta\) vector (hint: \(\frac{\partial X \beta}{\partial \beta}=X^T\) and \(\frac{\partial \beta^T X \beta}{\partial \beta}=2X^T \beta\) for symmetric \(X\)) and direct matrix manipulation lead to the same estimate.

\(\frac {\partial L(\beta)}{\partial \beta}=\frac {\partial }{\partial \beta}\big[Y^TY+\beta^TX^TX\beta-2Y^TX\beta\big]=0\)

\(\implies \frac {\partial L(\beta)}{\partial \beta}=0+2X^TX\beta-2X^TY=0\)

\(2X^TX\beta=2X^TY \implies \hat\beta=(X^TX)^{-1}X^TY\)
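A minimal R sketch (simulated data, illustrative variable names, not from the lecture) of this closed-form solution, compared against lm():

set.seed(123)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)       # true coefficients: 1, 2, -0.5

X <- cbind(1, x1, x2)                        # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)    # beta-hat = (X'X)^{-1} X'Y
beta_hat

coef(lm(y ~ x1 + x2))                        # matches up to numerical error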

How to Estimate: Maximum Likelihood Estimation

Likelihood Function

Let \(X\sim N(\mu, \sigma^2)\); then the density function is,

\[P(x) = \frac{1}{{\sigma \sqrt {2\pi } }}e^{-(x - \mu)^2/2\sigma ^2 }\]
Assume that we observe \(X=\{x_1,x_2,\cdots,x_n\}\). The estimation problem for \(\mu,\sigma\) can then be formulated as the maximization of the likelihood function,

\(P(x_1,x_2,\cdots,x_n \mid \text{parameters})\)

An easy example for illustration (\(\mu\))

4 observations,

\(x_1=0, \: x_2= 1, \: x_3=0.7, \: x_4=1.5\)

and let us assume that,

\(x_i \sim N(\mu,1)=\mu+N(0,1)\)

Given that the observations are independent, the likelihood is,

\(P(x_1,x_2,x_3,x_4|\mu)=P(x_1|\mu)P(x_2|\mu)P(x_3|\mu)P(x_4|\mu)\)

Using R's dnorm() function, let's calculate the likelihood with different values of the parameter \(\mu\).

Hint:

density function for \(N(\mu,\sigma^2)\)

\(\frac{1}{{\sigma \sqrt {2\pi } }}e^{-(x - \mu)^2/2\sigma ^2 }\)

hence, for example, the density of the standard normal distribution \(N(0,1)\) is

\(\frac{1}{\sqrt{2\pi}}e^\frac{-x^2}{2}\)

\(P(x=1|\mu=0, \sigma=1)=dnorm(1) = \frac{1}{\sqrt{2\pi}}e^\frac{-1}{2}=0.242\)

Assume that, \(\mu=1.5, \sigma=1\),

\(P(x=0|\mu=1.5,\sigma=1) = \frac{1}{\sqrt{2\pi}}e^\frac{-(0-1.5)^2}{2}=0.13\)

\(P(x=1|\mu=1.5,\sigma=1) = \frac{1}{\sqrt{2\pi}}e^\frac{-(1-1.5)^2}{2}=0.352\)

\(P(x=0.7|\mu=1.5,\sigma=1) = \frac{1}{\sqrt{2\pi}}e^\frac{-(0.7-1.5)^2}{2}=0.29\)

\(P(x=1.5|\mu=1.5,\sigma=1) = \frac{1}{\sqrt{2\pi}}e^\frac{-(1.5-1.5)^2}{2}=0.399\)

\(\mu=1.5 \implies P(x_1=0,x_2=1,x_3=0.7,x_4=1.5|\mu=1.5,\sigma=1)=0.0053\)

If we set \(\mu\) to a different value, for example \(\mu=1\),


\(P(x_1=0,x_2=1,x_3=0.7,x_4=1.5|\mu=1,\sigma=1)=0.013\)

Which one is higher? And what value of \(\mu\) maximizes the likelihood (hint: the sample mean)?

\(P(x_1,x_2,x_3,x_4|\mu=0.8,\sigma=1)=0.014\)
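The calculation above can be reproduced in R with dnorm(); a short sketch using the four observations from the example:

x <- c(0, 1, 0.7, 1.5)                                  # the four observations above

lik <- function(mu) prod(dnorm(x, mean = mu, sd = 1))   # independence: product of densities

lik(1.5)   # ~ 0.0053
lik(1)     # ~ 0.013
lik(0.8)   # ~ 0.014; the sample mean, mean(x) = 0.8, maximizes the likelihood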

The likelihood for linear regression

Let each component \(y_i\) of \(Y\) satisfy,

\(y_i \sim N(x_i^T\beta, \sigma^2)=x_i^T\beta+N(0, \sigma^2)\)

\(P(Y|X,\beta,\sigma)=\prod_{i=1}^{n}P(y_i|x_i,\beta,\sigma)\)

\(P(Y|X,\beta,\sigma)=\prod_{i=1}^{n}(2 \pi \sigma^2)^{-1/2} e^{-\frac {1}{2\sigma^2}(y_i-x_i^T\beta)^2}\)

\(P(Y|X,\beta,\sigma)=(2 \pi \sigma^2)^{-n/2} e^{-\frac {1}{2\sigma^2}\sum_{i=1}^{n}(y_i-x_i^T\beta)^2}\)

The next reasonable step is to take the logarithm of this function (\(P>0\)), which is called the log-likelihood.

\(\log L(\beta)=l(\beta)=-\frac{n}{2}\ln(2 \pi \sigma^2)- \frac{1}{2 \sigma^2}(Y-X\beta)^T(Y-X\beta)\)

Take the partial derivatives: (i) assuming \(\sigma\) is known, solve for \(\beta\) (this yields the same equation as least squares); and (ii) assuming \(\beta\) is known, solve for \(\sigma\) (this yields the estimator of the variance),

\(E[(y_i-x_i^T\beta)^2]=\frac{1}{n}(Y-X\beta)^T(Y-X\beta)\)

Then the prediction of \(y\) at a new input \(X_*\) is,

\(P(y|X_*,\beta,\sigma)=N(y|X_*^T\beta_{ML},\sigma^2)\)
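A short R sketch (simulated data, not from the lecture) that maximizes the log-likelihood numerically with optim() and confirms that the \(\beta\) estimates coincide with those from lm():

set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n, sd = 0.5)
X <- cbind(1, x)

negloglik <- function(par) {
  beta  <- par[1:2]
  sigma <- exp(par[3])                  # log-parameterization keeps sigma > 0
  -sum(dnorm(y, mean = as.vector(X %*% beta), sd = sigma, log = TRUE))
}

fit <- optim(c(0, 0, 0), negloglik, method = "BFGS")
fit$par[1:2]        # ML estimates of beta
exp(fit$par[3])^2   # ML estimate of sigma^2 (divides by n, not n - k - 1)
coef(lm(y ~ x))     # the beta estimates coincide with least squares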

Advertising Data

Let \(RSS=\varepsilon_1^2+\varepsilon_2^2+\cdots+\varepsilon_n^2=\sum_{i=1}^{n}\varepsilon_i^2\) denote the residual sum of squares.

Linear Models

Simple Linear Regression Example


Journal Data (a few observations)

subs   price   citeprice
14     123     5.8571429
59     20      0.9090909
17     443     20.1363636
2      276     12.5454545
96     295     12.2916667
15     344     14.3333333

Simple Linear Regression Example: Journal Pricing Data

\(ln(subs)_i=\beta_0+\beta_1 ln(citeprice)_i + \varepsilon_i\)


Call:
lm(formula = log(subs) ~ log(citeprice), data = journals)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.72478 -0.53609  0.03721  0.46619  1.84808 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     4.76621    0.05591   85.25   <2e-16 ***
log(citeprice) -0.53305    0.03561  -14.97   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7497 on 178 degrees of freedom
Multiple R-squared:  0.5573,    Adjusted R-squared:  0.5548 
F-statistic:   224 on 1 and 178 DF,  p-value: < 2.2e-16

\(\sqrt{Var(\varepsilon_i)}=\sqrt{\sigma^2_\varepsilon}=\hat \sigma_\varepsilon\)

Even if the true parameters of the model were known and this linear model were the true model, this number shows how much predictions deviate on average from the true value of the dependent variable.

One may want to assess this number by calculating the percent error (the ratio of the RSE to the mean value of the dependent variable).

The RSE is considered a measure of the model's lack of fit.

R-squared & F-test

\(R^2\) is the percentage of the dependent variable’s variance explained by the independent variable(s). It is therefore,

\(R^2 = 1-\frac{\operatorname{var}(\text{residuals})}{\operatorname{var}(\text{dependent variable})}\)

where, var(.): variance

To calculate this by hand (in R):

Variance(residuals) = var(model$residuals) = 0.559

Variance(dependent) = var(log(journals$subs)) = 1.2625

\(1-\frac{var(residuals)}{var(dependent variable)}= 1-\frac{0.559}{1.2625}= 0.557\)

\(\text{Adjusted } R^2: \; R_{adj}^2 = 1 - \frac{(1-R^2)(n-1)}{n-k-1}\)

where \(n\) is the number of observations and \(k\) is the number of independent variables. \(R^2\) increases whenever a new variable is added, while \(R_{adj}^2\) increases only if the added independent variable actually improves the model. This is achieved by penalizing for the number of independent variables.

Hence,

\(n=\text{number of observations}=180\)

\(k=\text{number of independent variables}=1\)

\({R_{adj}^2 = 1 - [\frac{(1-0.557)(180-1)}{180-1-1}]}=0.555\)
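The same hand calculation in R, assuming the journals data frame used in the lm() call above (columns subs and citeprice):

model <- lm(log(subs) ~ log(citeprice), data = journals)

r2 <- 1 - var(residuals(model)) / var(log(journals$subs))
r2                                    # ~ 0.557

n <- nrow(journals)                   # 180 observations
k <- 1                                # one independent variable
1 - (1 - r2) * (n - 1) / (n - k - 1)  # adjusted R^2, ~ 0.555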

F-Test

\(H_0:\beta_1=\beta_2=\cdots=\beta_p=0\)

\(H_1: \text{at least one } \beta_j \neq 0, \; j\neq 0\)

\(\text{Reduced Model: } y_i=\beta_0+\varepsilon_i, \; df=n-1\)

\(\text{Full Model: } y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\cdots+\beta_px_{ip}+\varepsilon_i,\; df=n-(p+1)\)

\(F = \left(\frac{RSS_0-RSS_1}{RSS_1}\right)\left(\frac{n-p_1}{p_1-p_0}\right)=\left(\frac{RSS_0-RSS_1}{p_1-p_0}\right)\left(\frac{n-p_1}{RSS_1}\right)\)

\(\implies F =\left(\frac{RSS_0-RSS_1}{p_1-p_0}\right)/\left(\frac{RSS_1}{n-p_1}\right)=\left(\frac{RSS_0-RSS_1}{df_0-df_1}\right)/\left(\frac{RSS_1}{df_1}\right)\)

and in terms of \(R^2\)

\(=\left(\frac{R^2}{1-R^2}\right)\left(\frac{n-p_1}{p_1-p_0}\right)\)

[1] "F-Stat=224.037"

\(=\left(\frac{0.5572546}{1-0.5572546}\right)\left(\frac{180-2}{2-1}\right) = 224.037\)
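A sketch of the F-statistic computed by hand in R, again assuming the journals data frame from above; both the RSS form and the \(R^2\) form are shown:

model0 <- lm(log(subs) ~ 1, data = journals)                # reduced model (intercept only)
model1 <- lm(log(subs) ~ log(citeprice), data = journals)   # full model

rss0 <- sum(residuals(model0)^2)
rss1 <- sum(residuals(model1)^2)
n  <- nrow(journals)
p1 <- 2; p0 <- 1                                            # number of parameters in each model

((rss0 - rss1) / (p1 - p0)) / (rss1 / (n - p1))             # ~ 224.04

r2 <- summary(model1)$r.squared
(r2 / (1 - r2)) * (n - p1) / (p1 - p0)                      # same value from R^2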

Plot of Simple Linear Regression Example: Journal Pricing Data

Dummy Variable

\(ln(wage)=\beta_0 + \beta_1 \: experience+ \beta_2 \: experience^2 + \beta_3 \: education + \beta_4 \: ethnicity + \varepsilon\)


Call:
lm(formula = log(wage) ~ experience + I(experience^2) + education + 
    ethnicity, data = CPS1988)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9428 -0.3162  0.0580  0.3756  4.3830 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4.321e+00  1.917e-02  225.38   <2e-16 ***
experience       7.747e-02  8.800e-04   88.03   <2e-16 ***
I(experience^2) -1.316e-03  1.899e-05  -69.31   <2e-16 ***
education        8.567e-02  1.272e-03   67.34   <2e-16 ***
ethnicityafam   -2.434e-01  1.292e-02  -18.84   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5839 on 28150 degrees of freedom
Multiple R-squared:  0.3347,    Adjusted R-squared:  0.3346 
F-statistic:  3541 on 4 and 28150 DF,  p-value: < 2.2e-16

How to Compare Models

Analysis of Variance Table

Model 1: log(wage) ~ experience + I(experience^2) + education
Model 2: log(wage) ~ experience + I(experience^2) + education + ethnicity
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1  28151 9719.6                                  
2  28150 9598.6  1    121.02 354.91 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

How to Compare Models

When the number of observations is large, use the Akaike Information Criterion (AIC):

\(AIC=-2\:\ln(\text{Likelihood}) + 2\:p\)

where \(p\) is the number of parameters. Otherwise, apply a correction to get the second-order AIC (AICc):

\(AICc=-2\:\ln(\text{Likelihood}) + 2\:p + \frac{2p(p+1)}{n -(p+1)}\)

where \(n\) is the number of observations.

Bayesian Information Criterion

\(BIC=-2\:\ln(\text{Likelihood}) + \ln(n)\:p\)

AIC with Ethnicity    49614.68
AIC without Ethnicity 49965.43
BIC with Ethnicity    49664.15
BIC without Ethnicity 50006.66
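These values can be reproduced with R's AIC() and BIC() functions; a sketch assuming the CPS1988 data frame used above (available, e.g., via data("CPS1988", package = "AER")):

m_with    <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity,
                data = CPS1988)
m_without <- lm(log(wage) ~ experience + I(experience^2) + education,
                data = CPS1988)

AIC(m_with, m_without)   # lower AIC favors keeping ethnicity
BIC(m_with, m_without)   # BIC agrees despite its heavier ln(n) * p penalty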

Partially Linear Models

\(ln(wage)=\beta_0 + G(experience) + \beta_2 \: education + \beta_3 \: ethnicity + \varepsilon\)

Partially Linear Models


Call:
lm(formula = log(wage) ~ bs(experience, df = 5) + education + 
    ethnicity, data = CPS1988)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9315 -0.3079  0.0565  0.3672  3.9945 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)              2.775582   0.056081   49.49   <2e-16 ***
bs(experience, df = 5)1  1.891673   0.075814   24.95   <2e-16 ***
bs(experience, df = 5)2  2.259468   0.046474   48.62   <2e-16 ***
bs(experience, df = 5)3  2.824582   0.070773   39.91   <2e-16 ***
bs(experience, df = 5)4  2.373082   0.065205   36.39   <2e-16 ***
bs(experience, df = 5)5  1.739341   0.119691   14.53   <2e-16 ***
education                0.088181   0.001258   70.07   <2e-16 ***
ethnicityafam           -0.248202   0.012725  -19.50   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5747 on 28147 degrees of freedom
Multiple R-squared:  0.3557,    Adjusted R-squared:  0.3555 
F-statistic:  2220 on 7 and 28147 DF,  p-value: < 2.2e-16

Partially Linear Models

                        Regr.        Without Ethn.   With Splines
AIC                     49614.6785   49965.4297      48720.0162
Education coefficient   0.0857       0.0874          0.0882

Weighted Least Squares (Reading Only)

The objective changes from \(\sum_{i=1}^{n}(y_i-\beta_0-\beta_1x_i)^2\) to \(\sum_{i=1}^{n}w_i(y_i-\beta_0-\beta_1x_i)^2\)

where \(w_i\) is the weight of observation \(i\).

\(E(\varepsilon_i^2|x_i,z_i)=h(z_i^T\gamma)\), where \(h(\cdot)\) is called the skedastic function.

Some popular specifications

\(E(\varepsilon_i^2|x_i,z_i)=\sigma^2z_i^2\implies w_i=1/z_i^2\)

For this example, we may start with

\[E(\varepsilon^2|x_i,z_i)=\sigma^2z_i^{\gamma_2}\]

\[E(\varepsilon_i^2|x_i,z_i)=\sigma^2 z_i^{\gamma_2}= e^{\gamma_1+\gamma_2 log(x_i)}\]

then fit \(\ln(\hat\varepsilon_i^2)=\gamma_1+\gamma_2\ln(x_i)+v_i\), and set \(w_i=\frac{1}{e^{\hat \gamma_1+\hat \gamma_2\ln(x_i)}}\)

One can also add an iterative process: start with the coefficients obtained above, refit the model using the new error terms, and repeat until the change in the coefficients between iterations becomes negligible.
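A sketch of this feasible weighted least squares procedure in R, assuming the journals data and taking \(z_i =\) citeprice as an illustrative choice:

ols <- lm(log(subs) ~ log(citeprice), data = journals)

# Auxiliary regression for the variance: ln(e_i^2) = gamma_1 + gamma_2 * ln(z_i) + v_i
aux <- lm(log(residuals(ols)^2) ~ log(citeprice), data = journals)

# Weights are the inverse of the fitted variances
w   <- 1 / exp(fitted(aux))
wls <- lm(log(subs) ~ log(citeprice), data = journals, weights = w)
coef(wls)

# To iterate: recompute residuals from wls, refit aux, update w, and repeat
# until the coefficients stop changing.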