Linear Models (Regressions): Overview

I. Ozkan

Spring 2025

Preliminary Readings

Learning Objectives

Keywords:

Linear Models (Linear Regression)

Advertising Data

Advertising Data
   TV  radio  newspaper  sales
230.1   37.8       69.2   22.1
 44.5   39.3       45.1   10.4
 17.2   45.9       69.3    9.3
151.5   41.3       58.5   18.5
180.8   10.8       58.4   12.9
  8.7   48.9       75.0    7.2
 57.5   32.8       23.5   11.8
120.2   19.6       11.6   13.2
  8.6    2.1        1.0    4.8
199.8    2.6       21.2   10.6
199.8 2.6 21.2 10.6

Linear Model, TV Advertising vs Sales
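A hedged sketch of the kind of code behind this figure, assuming a local copy of the Advertising data (Advertising.csv, the data set that accompanies An Introduction to Statistical Learning); the file name is an assumption:

Advertising <- read.csv("Advertising.csv")                 # hypothetical local copy of the data
plot(sales ~ TV, data = Advertising,
     xlab = "TV advertising budget", ylab = "Sales")
abline(lm(sales ~ TV, data = Advertising), col = "blue")   # fitted least-squares line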

Linear Models

Linear Models

Linear Models, Estimation

Least Square Estimation

Let:

\(\{(x_1, y_1), (x_2,y_2), \cdots , (x_n, y_n)\}\) be the \(n\) input-output pairs, where \(x_i \in \mathbb{R}^k\) is a vector with \(k\) attributes.

The output (target) \(y_i \in \mathbb{R}\) is univariate.

Then the linear model is:

\[y_i=\theta_0+\theta_1x_{i1}+ \theta_2x_{i2}+ \cdots + \theta_kx_{ik}+\varepsilon_i\]

In some textbooks this is written with \(\beta\) as:

\[y_i=\beta_0+\beta_1x_{i1}+ \beta_2x_{i2}+ \cdots + \beta_kx_{ik}+\varepsilon_i\]

The estimation problem is then to find \(\hat \theta\) (or \(\hat \beta\)) under some standard assumptions (we skip them here and refer to the Econometrics course).

Find the parameter vector \(\hat \theta\) by minimizing a loss (cost, error, etc.) function \(L(\theta)\). The loss function can be written as:

\[L(\theta)=\sum_{i=1}^{n}\varepsilon_i^2\]

\[L(\theta)=\sum_{i=1}^{n}[y_i-(\hat\theta_0+\hat\theta_1x_{i1}+ \hat\theta_2x_{i2}+ \cdots + \hat\theta_kx_{ik})]^2\]

Then \(\hat \theta\) can be estimated by:

\(\operatorname*{arg\,min}_\theta L(\theta)\)

The loss function can be written in matrix form (using more familiar \(\beta\) term instead of \(\theta\)):

\[L(\beta)=\varepsilon^T\varepsilon=\underbrace{(Y-X\beta)^T}_{\varepsilon^T_{(1\times n)}}\underbrace{(Y-X\beta)}_{\varepsilon_{(n\times 1)}}\]

Either matrix differentiation or direct matrix manipulation yields the same estimator of \(\beta\):

\[\hat\beta=(X^TX)^{-1}X^TY\]
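A minimal sketch of this closed-form estimator in R, on simulated data (all names and values below are illustrative assumptions), checking that it matches lm():

set.seed(1)
n <- 100
X <- cbind(1, x1 = rnorm(n), x2 = rnorm(n))      # design matrix with an intercept column
beta <- c(1, 2, -0.5)                            # "true" parameters (illustrative)
y <- X %*% beta + rnorm(n)                       # simulated outputs
beta_hat <- solve(t(X) %*% X, t(X) %*% y)        # (X^T X)^{-1} X^T Y
beta_hat
coef(lm(y ~ X[, -1]))                            # the same estimates from lm()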

Maximum Likelihood Estimation

Likelihood

Let \(X\sim N(\mu, \sigma^2)\); then the probability density function is,

\[P(x) = \frac{1}{{\sigma \sqrt {2\pi } }}e^{-(x - \mu)^2/2\sigma ^2 }\]
Assume that we observe \(X=\{x_1,x_2,\cdots,x_n\}\). The estimation of \(\mu\) and \(\sigma\) can then be formulated as maximization of the likelihood function,

\[P(x_1,x_2,\cdots,x_n \mid \text{parameters})\]

Now let the outputs \(Y\) follow the linear model,

\[y_i \sim N(x_i^T\beta, \sigma^2)=x_i^T\beta+N(0, \sigma^2)\]

\[P(Y|X,\beta,\sigma)=\prod_{i=1}^{n}P(y_i|x_i,\beta,\sigma)\]

\[P(Y|X,\beta,\sigma)=\prod_{i=1}^{n}(2 \pi \sigma^2)^{-1/2} e^{-\frac {1}{2\sigma^2}(y_i-x_i^T\beta)^2}\]

\[P(Y|X,\beta,\sigma)=(2 \pi \sigma^2)^{-n/2} e^{-\frac {1}{2\sigma^2}\sum_{i=1}^{n}(y_i-x_i^T\beta)^2}\]

Taking the logarithm of this function (well defined since \(P>0\)) gives the log-likelihood:

\[\log L(\beta)=l(\beta)=-\frac{n}{2}\ln(2 \pi \sigma^2)- \frac{1}{2 \sigma^2}(Y-X\beta)^T(Y-X\beta)\]

Take the partial derivatives: (i) assuming \(\sigma\) is known, solve for \(\beta\) (this should give the same equation as Least Squares), and (ii) assuming \(\beta\) is known, solve for \(\sigma\) (this should give the estimator of the variance,

\(E[(y_i-x_i^T\beta)^2]=\frac{1}{n}(Y-X\beta)^T(Y-X\beta)\)).

The predictive distribution for a new input \(X_*\) is then,

\[P(y|X_*,\beta_{ML},\sigma)=N(y|X_*^T\beta_{ML},\sigma^2)\]
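A minimal sketch of maximum likelihood estimation for a simple linear model by direct numerical maximization of the log-likelihood; the simulated data and object names below are illustrative assumptions:

set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)              # simulated data (illustrative)
X <- cbind(1, x)
negloglik <- function(par) {                     # negative Gaussian log-likelihood
  beta <- par[1:2]; sigma <- exp(par[3])         # log-parameterize so that sigma > 0
  -sum(dnorm(y, mean = X %*% beta, sd = sigma, log = TRUE))
}
fit <- optim(c(0, 0, 0), negloglik, method = "BFGS")
fit$par[1:2]        # ML estimates of beta: the same as least squares
exp(fit$par[3])     # ML estimate of sigma (divides by n, not n - k)
coef(lm(y ~ x))     # least-squares estimates for comparison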

An Example: Maximum Likelihood Estimation Visualization for Simple Linear Model

Advertising Data

Let \(RSS=\varepsilon_1^2+\varepsilon_2^2+\cdots+\varepsilon_n^2=\sum_{i=1}^{n}\varepsilon_i^2\)

Linear Models

Simple Linear Regression Example (using R)
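The output below comes from code that is not echoed on the slide; a minimal sketch that could reproduce it, assuming the Journals data set from the AER package, with citeprice defined as price per citation:

library(AER)
data("Journals")
dim(Journals)
names(Journals)
journals <- Journals
journals$citeprice <- Journals$price / Journals$citations   # price per citation
summary(journals[, c("subs", "price", "citeprice")])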

[1] 180  10
 [1] "title"        "publisher"    "society"      "price"        "pages"       
 [6] "charpp"       "citations"    "foundingyear" "subs"         "field"       
      subs            price          citeprice        
 Min.   :   2.0   Min.   :  20.0   Min.   : 0.005223  
 1st Qu.:  52.0   1st Qu.: 134.5   1st Qu.: 0.464495  
 Median : 122.5   Median : 282.0   Median : 1.320513  
 Mean   : 196.9   Mean   : 417.7   Mean   : 2.548455  
 3rd Qu.: 268.2   3rd Qu.: 540.8   3rd Qu.: 3.440171  
 Max.   :1098.0   Max.   :2120.0   Max.   :24.459459  

\[ln(subs)_i=\beta_0+\beta_1 ln(citeprice)_i + \varepsilon_i\]
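A sketch of the fit that could produce the output below (the Call line confirms the formula); the object name jour_lm is an assumption:

jour_lm <- lm(log(subs) ~ log(citeprice), data = journals)
class(jour_lm)     # "lm"
names(jour_lm)     # components stored in the fitted object
summary(jour_lm)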

[1] "lm"
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"        

Call:
lm(formula = log(subs) ~ log(citeprice), data = journals)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.72478 -0.53609  0.03721  0.46619  1.84808 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     4.76621    0.05591   85.25   <2e-16 ***
log(citeprice) -0.53305    0.03561  -14.97   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7497 on 178 degrees of freedom
Multiple R-squared:  0.5573,    Adjusted R-squared:  0.5548 
F-statistic:   224 on 1 and 178 DF,  p-value: < 2.2e-16

Residual standard error (RSE): \(\hat \sigma_\varepsilon=\sqrt{\hat\sigma^2_\varepsilon}\), an estimate of \(\sqrt{Var(\varepsilon_i)}\)

Even if the true parameters of the model were known and this linear model were the true model, this number shows how much predictions deviate on average from the true value of the dependent variable.

RSE is considered a measure of the lack of fit.

Memory Recall: R-squared

\(R^2\) is the percentage of the dependent variable's variance explained by the independent variable(s). It is therefore,

\(R^2 = 1-\frac{var(\text{residuals})}{var(\text{dependent variable})}\)

where \(var(\cdot)\) denotes the variance.

\(Adjusted \: R^2: \; R_{adj}^2 = 1 - \frac{(1-R^2)(n-1)}{n-k-1}\)

\(n\) is the number of observations and \(k\) is the number of independent variables. \(R^2\) increases with the addition of any new variable, whereas \(R_{adj}^2\) increases only if an added independent variable improves the model. This is achieved by penalizing for the number of independent variables.
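A minimal sketch of computing \(R^2\) and \(R_{adj}^2\) by hand for the journals model, assuming the jour_lm object from the earlier sketch:

y <- log(journals$subs)
r2 <- 1 - var(residuals(jour_lm)) / var(y)       # 1 - var(residuals)/var(dependent variable)
n <- length(y); k <- 1                           # one independent variable
r2_adj <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
c(r2 = r2, r2_adj = r2_adj)                      # compare with summary(jour_lm)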

Plot of Simple Linear Regression Example: Journal Pricing Data
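A sketch of the kind of plot this slide shows, again assuming jour_lm from the earlier sketch:

plot(log(subs) ~ log(citeprice), data = journals,
     xlab = "log(citeprice)", ylab = "log(subs)")
abline(jour_lm, col = "blue")                    # fitted regression line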

Another Example

Data set: CPS1988, a data frame from the March 1988 Current Population Survey, US Census Bureau
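A minimal sketch of how this data set might be loaded, assuming the AER package (which provides CPS1988):

library(AER)
data("CPS1988")
str(CPS1988)
summary(CPS1988)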

'data.frame':   28155 obs. of  7 variables:
 $ wage      : num  355 123 370 755 594 ...
 $ education : int  7 12 9 11 12 16 8 12 12 14 ...
 $ experience: int  45 1 9 46 36 22 51 34 0 18 ...
 $ ethnicity : Factor w/ 2 levels "cauc","afam": 1 1 1 1 1 1 1 1 1 1 ...
 $ smsa      : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ region    : Factor w/ 4 levels "northeast","midwest",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ parttime  : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
      wage            education       experience   ethnicity     smsa      
 Min.   :   50.05   Min.   : 0.00   Min.   :-4.0   cauc:25923   no : 7223  
 1st Qu.:  308.64   1st Qu.:12.00   1st Qu.: 8.0   afam: 2232   yes:20932  
 Median :  522.32   Median :12.00   Median :16.0                           
 Mean   :  603.73   Mean   :13.07   Mean   :18.2                           
 3rd Qu.:  783.48   3rd Qu.:15.00   3rd Qu.:27.0                           
 Max.   :18777.20   Max.   :18.00   Max.   :63.0                           
       region     parttime   
 northeast:6441   no :25631  
 midwest  :6863   yes: 2524  
 south    :8760              
 west     :6091              
                             
                             

Multiple Linear Regression

\[ln(wage_i)=\beta_0 + \beta_1 \: experience_i+ \beta_2 \: experience_i^2 + \beta_3 \: education_i + \beta_4 \: ethnicity_i + \varepsilon_i\]
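The fit implied by the Call shown below; the object name cps_lm matches the plot_summs() call that appears later:

cps_lm <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity,
             data = CPS1988)
summary(cps_lm)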


Call:
lm(formula = log(wage) ~ experience + I(experience^2) + education + 
    ethnicity, data = CPS1988)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9428 -0.3162  0.0580  0.3756  4.3830 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4.321e+00  1.917e-02  225.38   <2e-16 ***
experience       7.747e-02  8.800e-04   88.03   <2e-16 ***
I(experience^2) -1.316e-03  1.899e-05  -69.31   <2e-16 ***
education        8.567e-02  1.272e-03   67.34   <2e-16 ***
ethnicityafam   -2.434e-01  1.292e-02  -18.84   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5839 on 28150 degrees of freedom
Multiple R-squared:  0.3347,    Adjusted R-squared:  0.3346 
F-statistic:  3541 on 4 and 28150 DF,  p-value: < 2.2e-16

Multiple Linear Regression

Dependent variable: log(wage)

                     Estimate      (Std. Error)
experience            0.077***     (0.001)
I(experience^2)      -0.001***     (0.00002)
education             0.086***     (0.001)
ethnicityafam        -0.243***     (0.013)
Constant              4.321***     (0.019)

Observations          28,155
R2                    0.335
Adjusted R2           0.335
Residual Std. Error   0.584 (df = 28150)
F Statistic           3,541.036*** (df = 4; 28150)

Note: *p<0.1; **p<0.05; ***p<0.01
plot_summs(cps_lm, scale = TRUE, plot.distributions = TRUE)

How to Compare Models

When the number of observations, \(n\), is large relative to the number of estimated parameters, \(p\):

\(AIC=-2\:ln(Likelihood) + 2\:p\)

Otherwise, a correction gives the second-order AIC (AICc):

\(AICc=-2\:ln(Likelihood) + 2\:p + \frac{2p(p+1)}{n -(p+1)}\)

Bayesian Information Criterion

\(BIC=-2\:ln(Likelihood) + ln(n)\:p\)
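A short sketch of model comparison with AIC and BIC in base R, assuming the cps_lm fit from the earlier slide and a hypothetical restricted model without the ethnicity term:

cps_lm2 <- lm(log(wage) ~ experience + I(experience^2) + education,
              data = CPS1988)
AIC(cps_lm, cps_lm2)     # lower AIC is preferred
BIC(cps_lm, cps_lm2)     # lower BIC is preferred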

What About Coefficient Distributions

Dependent variable: Income

                     Estimate        (Std. Error)
Frost                  -1.254        (2.110)
Illiteracy           -610.715***     (213.138)
Murder                 23.074        (30.940)
Constant            5,111.097***     (416.576)

Observations           50
R2                     0.209
Adjusted R2            0.157
Residual Std. Error    564.079 (df = 46)
F Statistic            4.049** (df = 3; 46)

Note: *p<0.1; **p<0.05; ***p<0.01


Adding HS Grad Variable

Dependent variable: Income

                     Estimate        (Std. Error)
Frost                  -0.241        (1.872)
Illiteracy           -190.516        (217.049)
Murder                 30.005        (27.235)
HS Grad                44.975***     (11.758)
Constant            2,074.005**      (874.221)

Observations           50
R2                     0.403
Adjusted R2            0.350
Residual Std. Error    495.427 (df = 45)
F Statistic            7.594*** (df = 4; 45)

Note: *p<0.1; **p<0.05; ***p<0.01

Here is a comparison of the parameters of both models.
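A hedged sketch of how such a comparison plot might be produced, assuming the two Income models were fitted to the state.x77 data and plotted with jtools::plot_summs(); all object names below are assumptions:

states <- as.data.frame(state.x77)
names(states) <- make.names(names(states))                 # "HS Grad" -> "HS.Grad"
inc_lm1 <- lm(Income ~ Frost + Illiteracy + Murder, data = states)
inc_lm2 <- lm(Income ~ Frost + Illiteracy + Murder + HS.Grad, data = states)
library(jtools)
plot_summs(inc_lm1, inc_lm2, scale = TRUE)                 # coefficient plot for both models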

Prediction

Predicting the specific outcome for given independent variables:

\(y = x^T\beta+\varepsilon\)

\(\implies \hat{y} = x^T\hat{\beta}\) since \(E[\varepsilon]=0\)

The interval for predicting a specific outcome (a prediction interval) is given as:

\(\hat{y}^*\pm t_{n-p}^{(\alpha/2)}\hat{\sigma}\sqrt{x^{*T}(X^TX)^{-1}x^* + 1}\)

Predicting the mean response for given independent variables:

The point estimate is the same, \(\hat{y} = x^T\hat{\beta}\). This time only the uncertainty of \(\hat{\beta}\), i.e. \(Var(\hat{\beta})\), must be considered.

The confidence interval for the mean response is given as:

\(\hat{y}^*\pm t_{n-p}^{(\alpha/2)}\hat{\sigma}\sqrt{x^{*T}(X^TX)^{-1}x^*}\)

where \(t_{n-p}^{(\alpha/2)}\) is the t-statistic with \(n-p\) degrees of freedom and \(100(1-\alpha)\%\) is the confidence level.

See the help page of the predict.lm() function.


Call:
lm(formula = log(subs) ~ log(citeprice), data = journals)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.72478 -0.53609  0.03721  0.46619  1.84808 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     4.76621    0.05591   85.25   <2e-16 ***
log(citeprice) -0.53305    0.03561  -14.97   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7497 on 178 degrees of freedom
Multiple R-squared:  0.5573,    Adjusted R-squared:  0.5548 
F-statistic:   224 on 1 and 178 DF,  p-value: < 2.2e-16
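A sketch of how intervals like the ones below could be obtained with predict.lm(), assuming the jour_lm object from the earlier slides; the exact newdata used on the slide is not shown, so the value here simply follows the slide label:

new_data <- data.frame(citeprice = 2)
predict(jour_lm, newdata = new_data, interval = "prediction", level = 0.95)   # specific outcome
predict(jour_lm, newdata = new_data, interval = "confidence", level = 0.95)   # mean response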


Prediction of Specific Outcome for citeprice=2, level=0.95
       fit      lwr      upr
1 4.180593 2.695128 5.666058


Prediction of Mean Response for citeprice=2, level=0.95
       fit      lwr      upr
1 4.180593 4.047897 4.313289