Linear Models (Regressions): An Overview

I. Ozkan

Fall 2025

Preliminary Readings

Using R for Introductory Econometrics, Florian Heiss, pp:81-158
Introduction to Econometrics with R, Christoph Hanck, Martin Arnold, Alexander Gerber, and Martin Schmelzer, Chapters:4-7
An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Chapter 3

Learning Objectives

Part of Supervised Learning (input(s)-output):

Keywords:

Inputs: independent variable, covariates, predictors, features, regressors.
Output: dependent variable, variates, labels, regressand.
Review of Basics of Linear Regression (we do not go into the details, refer to econometrics course notes).
Regularization
Ridge Regression
Probit Regression
Logit Regression
Time Series Data (skipped!!)

Linear Models (Linear Regression)

Linear Approximation may be used to approximate many process successfully
Can be combined to create a larger model (system)
Possible to obtain analytical solution (may not the case for larger system)
Linear regression provides an introduction for many of the core machine learning concepts

Advertising Data

Let’s look at some observations with a plot of all data.

TV	radio	newspaper	sales
230.1	37.8	69.2	22.1
44.5	39.3	45.1	10.4
17.2	45.9	69.3	9.3
151.5	41.3	58.5	18.5
180.8	10.8	58.4	12.9
8.7	48.9	75.0	7.2

Linear Model, TV Advertising vs Sales

Linear Models

Linear Models, Least Square Estimation

Let:

$\{(x_1, y_1), (x_2,y_2), \cdots , (x_n, y_n)\}$ are the $n$ input-output pairs. $x_i \in \mathbb{R^k}$ is a vector with $k$ attributes.

Output (target), $y_i \in \mathbb{R}$ is univariate.

Then the linear model is;

\[y_i=\theta_0+\theta_1x_1+ \theta_2x_2+ \cdots + \theta_kx_k+\varepsilon_i\]

In some textbooks this is written with $\beta$ as:

\[y_i=\beta_0+\beta_1x_1+ \beta_2x_2+ \cdots + \beta_kx_k+\varepsilon_i\]

The estimation problem is then to find $\hat \theta$ (or $\hat \beta$) with some regular assumptions (lets skip them and refer to Econometrics course).

Find parameter, $\hat \theta$, vector by minimizing Loss (cost, error etc) function, $L(\theta)$. Loss function can be written as:

\[L(\theta)=\sum_{}^{}\varepsilon_i^2\]

\[L(\theta)=\sum_{}^{}[y_i-(\hat\theta_0+\hat\theta_1x_1+ \hat\theta_2x_2+ \cdots + \hat\theta_kx_k)]^2\]

Then $\hat \theta$ can be estimated by:

$\operatorname*{arg\,min}_\theta L(\theta)$

The loss function can be written in matrix form (using more familiar $\beta$ term instead of $\theta$):

\[L(\beta)=\varepsilon^2=\underbrace{(Y-X\beta)^T}_{\varepsilon^T_{(1xn)}}\underbrace{(Y-X\beta)}_{\varepsilon_{(nx1)}}\]

Both the matrix differentiation (hint: $\frac{\partial X \beta}{\partial \beta}=X^T \: and \: \frac{\partial \beta^T X \beta}{\partial \beta}=2X^T \beta$) necessary for finding the optimal $\beta$ vector or matrix manipulation result in the same estimation.

\[\frac {\partial L(\beta)}{\partial \beta}=\frac {\partial }{\partial \beta}\big[Y^TY+\beta^TX^TX\beta-2Y^TX\beta\big]=0\]

\[\implies \frac {\partial L(\beta)}{\partial \beta}=0+2X^TX\beta-2X^TY=0\]
\[2X^TX\beta=2X^TY \implies \hat\beta=(X^TX)^{-1}X^TY\]

Linear Models, Maximum Likelihood Estimation

Likelihood

Let, $X\sim N(\mu, \sigma^2)$ then the distribution function is,

\[P(x) = \frac{1}{{\sigma \sqrt {2\pi } }}e^{-(x - \mu)^2/2\sigma ^2 }\]
Assume that we observe, $X=\{x_1,x_2,\cdots,x_n\}$ and the estimation problem of $\mu,\sigma$ can be formulated as the solution of the maximization of the likelihood function,

\[P(x_1,x_2,...,x_n|parameters)\]

An easy example for illustration ($\mu$)

4 observations,

$x_1=0, \: x_2= 1, \: x_3=0.7, \: x_4=1.5$

and let assume that,

$x_i \sim N(\mu,1)=\mu+N(0,1)$

The likelihood given observations are independent is,

Using dnorm(.) function of R, let’s calculate with different parameter values for $\mu$

Hint:

density function for $N(\mu,\sigma^2)$

\[\frac{1}{{\sigma \sqrt {2\pi } }}e^{-(x - \mu)^2/2\sigma ^2 }\]

hence, for the sake of simplicity, for example, density for standard normal distribution, $N\sim(0,1)$

\[\frac{1}{\sqrt{2\pi}}e^\frac{-x^2}{2}\]

$P(x=1|\mu=0, \sigma=1)=dnorm(1) = \frac{1}{\sqrt{2\pi}}e^\frac{-1}{2}=0.242$

Assume that, $\mu=1.5, \sigma=1$,

$P(x=0|\mu=1.5,\sigma=1) = \frac{1}{\sqrt{2\pi}}e^\frac{-(0-1.5)^2}{2}=0.13$

$P(x=1|\mu=1.5,\sigma=1) = \frac{1}{\sqrt{2\pi}}e^\frac{-(1-1.5)^2}{2}=0.352$

$P(x=0.7|\mu=1.5,\sigma=1) = \frac{1}{\sqrt{2\pi}}e^\frac{-(0.7-1.5)^2}{2}=0.29$

$P(x=1.5|\mu=1.5,\sigma=1) = \frac{1}{\sqrt{2\pi}}e^\frac{-(1.5-1.5)^2}{2}=0.399$

$\mu=1.5 \implies P(x_1=0,x_2=1,x_3=0.7,x_4=1.5|\mu=1.5,\sigma=1)=0.0053$

If we set $\mu$ different value, for example, let $\mu=1$,

$P(x_1=0,x_2=1,x_3=0.7,x_4=1.5|\mu=1,\sigma=1)=0.013$

which one is higher? And what would be $\mu$ value that maximizes likelihood (hint, mean value!).
$P(x_1,x_2,x_3,x_4|\mu=0.8,\sigma=1)=0.014$

The likelihood for linear regression

Let, $Y$,

\[y_i \sim N(x_i^T\beta, \sigma^2)=x_i^T\beta+N(0, \sigma^2)\]

\[P(Y|X,\beta,\sigma)=\prod_{1}^{n}P(y_i|x_i,\beta,\sigma)\]

\[P(Y|X,\theta,\sigma)=\prod_{1}^{n}(2 \pi \sigma^2)^{-1/2} e^{-\frac {1}{2\sigma^2}(y_i-x_i^T\beta)^2}\]

\[P(Y|X,\theta,\sigma)=(2 \pi \sigma^2)^{-n/2} e^{-\frac {1}{2\sigma^2}(\sum_{i=n}^{n}y_i-x_i^T\beta)^2}\]

The next reasonable step to take the logarithm of this function ($P>0$) which is called Log-Likelihood.

\[Log(L(\beta))=l(\beta)=-\frac{n}{2}ln(2 \pi \sigma^2)- \frac{1}{2 \sigma^2}(Y-X\beta)^T(Y-X\beta)\] Take the partial derivatives, (i) assuming $\sigma$ is known to find $\beta$ (should result in the same equation as Least Squares), and (ii) assuming $\beta$ is known to find $sigma$ (should result in the estimator of Variance,

$E[(y_i-x_i^T\beta)^2]=\frac{1}{n}(Y-X\beta)^T(Y-X\beta)$)

Then prediction given data, $(X_*,y)$ is,

\[P(y|X_*,\beta,\sigma)=N(y|X_*^T\beta_{ML},\sigma^2)\]

Maximum Likelihood Estimation

Consistent

$\hat \beta \xrightarrow[{}]{p} \beta$

or

$plim(\hat \beta)=\beta$

or

$\lim\limits_{n \to \infty} p(|\hat \beta-\beta|>\alpha) \to 0$

Asymptotically normal (as $n \to \infty$)
Asymptotically Efficient (has the lowest variance)
Please do refer to Econometrics lecture for the definition of the bias, variance of the estimator.

$bias(\hat \beta)=E_{p(Data|\beta)}(\hat \beta)-\beta$

$V(\hat \beta)=E_{p(Data|\beta)}(\hat \beta -\beta)^2$

Advertising Data

Let $RSS=\varepsilon_1^2+\varepsilon_2^2+\cdots+\varepsilon_n^2=\sum_{1}^{n}\varepsilon_i^2$

Linear Models

Here is an example of multi-variate regression graph (2 regressors)

Simple Linear Regression Example (using R)

This example is from Applied Econometrics with R, Kleiber, Christian, Zeileis, Achim, Chapter 3
Journal Pricing Data will be used

# install AER package only once 
# install.packages("AER")
library(AER)
# data set name is Journals
data("Journals")
# number of rows and columns (180x10) 180 obs. with 10 variables
dim(Journals)

[1] 180  10

# name of the columns
colnames(Journals)

 [1] "title"        "publisher"    "society"      "price"        "pages"       
 [6] "charpp"       "citations"    "foundingyear" "subs"         "field"

# subs and price columns needed, create a new data with only these cols
journals <- Journals[, c("subs", "price")]
# obtain price/citations (here observe $ usage to access the columns of data frame)
# citeprice column added to the data frame
journals$citeprice <- Journals$price/Journals$citations
# start with summary
summary(journals)

      subs            price          citeprice        
 Min.   :   2.0   Min.   :  20.0   Min.   : 0.005223  
 1st Qu.:  52.0   1st Qu.: 134.5   1st Qu.: 0.464495  
 Median : 122.5   Median : 282.0   Median : 1.320513  
 Mean   : 196.9   Mean   : 417.7   Mean   : 2.548455  
 3rd Qu.: 268.2   3rd Qu.: 540.8   3rd Qu.: 3.440171  
 Max.   :1098.0   Max.   :2120.0   Max.   :24.459460

# may be boxplot (to check the summary of distr.)
boxplot(journals)

Simple Linear Regression Example: Journal Pricing Data

Check the distribution (and conclude if transformation is needed)

library(GGally)
# or best use ggpairs() function of GGally package
ggpairs(journals)

Simple Linear Regression Example: Journal Pricing Data

subs vs citeprice graph

formula <- y ~ poly(x, 1, raw = TRUE)

ggplot(journals, aes(citeprice, subs)) +
  geom_point() + 
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = stat(eq.label)), 
               label.x = "right", label.y = "bottom", hjust = "inward",
               formula = formula, parse = TRUE) + 
    stat_fit_tb(method = "lm",
              method.args = list(formula = formula),
              tb.vars = c(Parameter = "term", 
                          Estimate = "estimate", 
                          "s.e." = "std.error", 
                          "italic(t)" = "statistic", 
                          "italic(P)" = "p.value"),
              label.y = "top", label.x = "right",
              parse = TRUE) + 
  ggtitle("subs vs citeprice") + 
  xlab("citeprice") + 
  ylab("subs") + 
  theme_bw()

$ln(subs)$ vs $ln(citeprice)$ graph

# convert using log (here is ln) transformation
journals$lsubs <- log(journals$subs)
journals$lciteprice <- log(journals$citeprice)


formula <- y ~ poly(x, 1, raw = TRUE)

ggplot(journals, aes(lciteprice, lsubs)) +
  geom_point() + 
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = stat(eq.label)), 
               label.x = "right", label.y = "bottom", hjust = "inward",
               formula = formula, parse = TRUE) + 
      stat_fit_tb(method = "lm",
              method.args = list(formula = formula),
              tb.vars = c(Parameter = "term", 
                          Estimate = "estimate", 
                          "s.e." = "std.error", 
                          "italic(t)" = "statistic", 
                          "italic(P)" = "p.value"),
              label.y = "bottom", label.x = "left",
              parse = TRUE) + 
  ggtitle("ln(subs) vs ln(citeprice)") + 
  xlab("ln(citeprice)") + 
  ylab("ln(subs)") + 
  theme_bw()

Simple Linear Regression Example: Journal Pricing Data

Regression with Log-transformation (here is log-log) is better suited
The equation:

\[ln(subs)_i=\beta_0+\beta_1 ln(citeprice)_i + \varepsilon_i\]

# lm() is to estimate linear model
jour_lm <- lm(log(subs) ~ log(citeprice), data = journals)

# class type
class(jour_lm)

[1] "lm"

# structure 
# try this: str(jour_lm)

# names inside
names(jour_lm)

 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"

summary(jour_lm)


Call:
lm(formula = log(subs) ~ log(citeprice), data = journals)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.72478 -0.53609  0.03721  0.46619  1.84808 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     4.76621    0.05591   85.25   <2e-16 ***
log(citeprice) -0.53305    0.03561  -14.97   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7497 on 178 degrees of freedom
Multiple R-squared:  0.5573,    Adjusted R-squared:  0.5548 
F-statistic:   224 on 1 and 178 DF,  p-value: < 2.2e-16

Call: The formula with associated data. Here data=journal formula as given above equation.
Residuals (error terms etc): 5-number summary of $\varepsilon_i$
Coefficients: $\beta_0: \: intercept, \: \beta_1:\: ln(citeprice)$,
- Estimate: $\hat \beta_0$ and $\hat \beta_1$
- Std. Error: Standard Error for Estimates
- t value: t-stats for Estimations, $t−value = {\hat \beta_i \over Std.Error}$
- Pr(>|t|): p-value for hypothesis test, a small p-value rules out the association is due to chance
Residual Standard Error, $RSE=\sqrt{\frac{1}{n-2}RSS}$:

$\sqrt{Var(\varepsilon_i)}=\sqrt{\sigma^2_\varepsilon}=\hat \sigma_\varepsilon$

Even if the true parameters of the model is known and this linear model is the true model, this number shows how prediction deviates on average from the true value of dependent variable.

One may want to assess this number by calculating the percent error (ratio of RSE to mean value of dependent variable)

RSE is considered as lack of fit

Degrees of freedom: $n-p=number \: of \: observation-number \: of \: parameters$
R-squared: $R^2$ and Adjusted R-squared
F-statistics and its p-value

Memory Recall R-squared

$R^2$ is the percentage of the dependent variable’s variance explained by independent variable(s). It must be then,

$1-\frac{var(residuals)}{var(dependent variable)}$

where, var(.): variance

To calculate this by hand (in R):

Variance(residuals)=var(model$residual)=0.559

Variance(dependent)=var(ln(journal$subs))=1.2625

$1-\frac{var(residuals)}{var(dependent variable)}= 1-\frac{0.559}{1.2625}= 0.557$

$Adjusted \: R^2={R_{adj}^2 = 1 - [\frac{(1-R^2)(n-1)}{n-k-1}]}$

where, $n$ is the number of observations and $k$ is the number of independent variables. $R^2$ increases with addition of new variables. $R_{adj}^2$ increases with only if the addition of new independent variables that improves the model. This is done by penalizing with the number of independent variables.

Hence,

$n=number \: of \: observations=180$

$k=number \: of \: dependent \: variables=1$

${R_{adj}^2 = 1 - [\frac{(1-0.557)(180-1)}{180-1-1}]}=0.555$

Plot of Simple Linear Regression Example: Journal Pricing Data

par(mfrow = c(2, 2))
plot(jour_lm)

par(mfrow = c(1, 1))

Hypothesis Test Example: Journal Pricing Data

Test the hypothesis that the elasticity of the number of library subscriptions with respect to the price per citation equals −0.5: $H_0: \beta_1=-0.5$.

linearHypothesis(jour_lm, "log(citeprice) = -0.5")


Linear hypothesis test:
log(citeprice) = - 0.5

Model 1: restricted model
Model 2: log(subs) ~ log(citeprice)

  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    179 100.54                           
2    178 100.06  1   0.48421 0.8614 0.3546

p-value indicates it is not significantly different from -0.5

Multiple Linear Regression

Applied Econometrics with R, Kleiber, Christian, Zeileis, Achim, Chapter 3
Data set: CPS1988 data frame collected in the March 1988 Current Population Survey

# data set name is Journals
data("CPS1988")
# number of rows and columns (28155x7) 28155 obs. with 7 variables
dim(CPS1988)

[1] 28155     7

# name of the columns
colnames(CPS1988)

[1] "wage"       "education"  "experience" "ethnicity"  "smsa"      
[6] "region"     "parttime"

# start with summary
summary(CPS1988)

      wage            education       experience   ethnicity     smsa      
 Min.   :   50.05   Min.   : 0.00   Min.   :-4.0   cauc:25923   no : 7223  
 1st Qu.:  308.64   1st Qu.:12.00   1st Qu.: 8.0   afam: 2232   yes:20932  
 Median :  522.32   Median :12.00   Median :16.0                           
 Mean   :  603.73   Mean   :13.07   Mean   :18.2                           
 3rd Qu.:  783.48   3rd Qu.:15.00   3rd Qu.:27.0                           
 Max.   :18777.20   Max.   :18.00   Max.   :63.0                           
       region     parttime   
 northeast:6441   no :25631  
 midwest  :6863   yes: 2524  
 south    :8760              
 west     :6091

# may be boxplot (to check the summary of distr.)
ggpairs(CPS1988)

wage (is the wage in dollars per week), education (years) and experience (years) are numbers, ethnicity is a factor with levels Caucasian (cauc) and African-American (afam).
Experience = age - education - 6 (hence potential experience, not actual experience in this data)

Multiple Linear Regression

Let’s drop subscript $i$ which is for individual and write a general model of interest in this examples as:

\[ln(wage)=\beta_0 + \beta_1 \: experience+ \beta_2 \: experience^2 + \beta_3 \: education + \beta_4 \: ethnicity + \varepsilon\]
- Please note I() function below. I(experience^2)=$experience^2$

# lm() is to estimate linear model
cps_lm <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity, data = CPS1988)

summary(cps_lm)


Call:
lm(formula = log(wage) ~ experience + I(experience^2) + education + 
    ethnicity, data = CPS1988)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9428 -0.3162  0.0580  0.3756  4.3830 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4.321e+00  1.917e-02  225.38   <2e-16 ***
experience       7.747e-02  8.800e-04   88.03   <2e-16 ***
I(experience^2) -1.316e-03  1.899e-05  -69.31   <2e-16 ***
education        8.567e-02  1.272e-03   67.34   <2e-16 ***
ethnicityafam   -2.434e-01  1.292e-02  -18.84   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5839 on 28150 degrees of freedom
Multiple R-squared:  0.3347,    Adjusted R-squared:  0.3346 
F-statistic:  3541 on 4 and 28150 DF,  p-value: < 2.2e-16

all coefficients have the expected sign, and the corresponding variables are highly significant (not surprising in a sample as large as the present one). Specifically, according to this specification, the return on education is 8.57% per year.
ethnicity appears as ethnicityafam (or better to say ethnicity==afam). But ethnicity may have a level of cauc (called as treatment contrast, treatment afam is compared with the reference group cauc. dummy variable (or indicator variable) for the level afam in econometric jargon.

Multiple Linear Regression

library(jtools)
stargazer(cps_lm, type = 'html')


	Dependent variable:

	log(wage)

experience	0.077^***
	(0.001)

I(experience2)	-0.001^***
	(0.00002)

education	0.086^***
	(0.001)

ethnicityafam	-0.243^***
	(0.013)

Constant	4.321^***
	(0.019)


Observations	28,155
R²	0.335
Adjusted R²	0.335
Residual Std. Error	0.584 (df = 28150)
F Statistic	3,541.036^*** (df = 4; 28150)

Note:	p<0.1; p<0.05; p<0.01

plot_summs(cps_lm)

How to Compare Models

For any two nested models, this can be done using the function anova()
For example to test the relevance of the variable ethnicity

# linear model with no ethnicity 
cps_noeth <- lm(log(wage) ~ experience + I(experience^2) + education, data = CPS1988)

anova(cps_noeth, cps_lm)

Analysis of Variance Table

Model 1: log(wage) ~ experience + I(experience^2) + education
Model 2: log(wage) ~ experience + I(experience^2) + education + ethnicity
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1  28151 9719.6                                  
2  28150 9598.6  1    121.02 354.91 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# can be done using waldtest() function 
waldtest(cps_lm, . ~ . - ethnicity)

Wald test

Model 1: log(wage) ~ experience + I(experience^2) + education + ethnicity
Model 2: log(wage) ~ experience + I(experience^2) + education
  Res.Df Df      F    Pr(>F)    
1  28150                        
2  28151 -1 354.91 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

How to Compare Models

The Akaike’s Information Criterion, AIC, and the Bayesian Information Criterion - BIC (sometimes called as Schwarz criterion)

Number of observations is large

$AIC=-2\:ln(Likelihood) + 2\:p, \: n:number \: of \:observations$

Otherwise correction to get second-order AIC, AICc,

$AICc=-2\:ln(Likelihood) + 2\:p + \frac{2p(p+1)}{n -(p+1)}$ )

Bayesian Information Criterion

$BIC=-2\:ln(Likelihood) + ln(n)\:p$

Minimum AIC or BIC is selected

# AIC 
AIC(cps_lm)

[1] 49614.68

AIC(cps_noeth)

[1] 49965.43

# BIC 
BIC(cps_lm)

[1] 49664.15

BIC(cps_noeth)

[1] 50006.66