I. Ozkan
Fall 2025
Keywords:
Inputs: independent variable, covariates, predictors, features, regressors.
Output: dependent variable, variates, labels, regressand.
Review of Basics of Linear Regression (we do not go into the details; refer to the econometrics course notes).
Linear approximation may be used to approximate many processes successfully
Can be combined to create a larger model (system)
Possible to obtain an analytical solution (this may not be the case for larger systems)
Linear regression provides an introduction for many of the core machine learning concepts
Easy to interpret
Age and Salary Predictions (data can be obtained from the Hull book website)

| Age | Salary (×10,000) |
|---|---|
| 25 | 13.5 |
| 55 | 26.0 |
| 27 | 10.5 |
| 35 | 22.0 |
| 60 | 24.0 |
| 65 | 26.5 |
Advertising Data (ISLR book), shown in long format:

| Sales | Media | Adv. Exp. |
|---|---|---|
| 22.1 | TV | 230.1 |
| 22.1 | radio | 37.8 |
| 22.1 | newspaper | 69.2 |
| 10.4 | TV | 44.5 |
| 10.4 | radio | 39.3 |
| 10.4 | newspaper | 45.1 |
Let:
\(\{(x_1, y_1), (x_2,y_2), \cdots , (x_n, y_n)\}\) be the \(n\) input-output pairs, where each \(x_i \in \mathbb{R}^k\) is a vector with \(k\) attributes.
Output (target), \(y_i \in \mathbb{R}\) is univariate.
Then the linear model is:
\(y_i=\beta_0+\beta_1x_{i1}+ \beta_2x_{i2}+ \cdots + \beta_kx_{ik}+\varepsilon_i\)
The estimation problem is then to find \(\hat \beta\) under some standard assumptions (let's skip them and refer to the Data Modeling course).
Find the parameter vector \(\hat \beta\) by minimizing a loss (cost, error) function \(L(\beta)\). The loss function can be written as:
\[L(\beta)=\sum_{i=1}^{n}\varepsilon_i^2\]
\[L(\beta)=\sum_{i=1}^{n}[y_i-(\hat\beta_0+\hat\beta_1x_{i1}+ \hat\beta_2x_{i2}+ \cdots + \hat\beta_kx_{ik})]^2\]
Then \(\hat \beta\) can be estimated by:
\(\operatorname*{arg\,min}_\beta L(\beta)\)
The loss function can be written in matrix form:
\(L(\beta)=\varepsilon^T\varepsilon=\underbrace{(Y-X\beta)^T}_{\varepsilon^T_{(1\times n)}}\underbrace{(Y-X\beta)}_{\varepsilon_{(n\times 1)}}\)
The optimal \(\beta\) can be found either by matrix differentiation (hint: \(\frac{\partial X \beta}{\partial \beta}=X^T\) and, for symmetric \(X\), \(\frac{\partial \beta^T X \beta}{\partial \beta}=2X^T \beta\)) or by direct matrix manipulation; both result in the same estimator.
\(\frac {\partial L(\beta)}{\partial \beta}=\frac {\partial }{\partial \beta}\big[Y^TY+\beta^TX^TX\beta-2Y^TX\beta\big]=0\)
\(\implies \frac {\partial L(\beta)}{\partial \beta}=0+2X^TX\beta-2X^TY=0\)
\(2X^TX\beta=2X^TY \implies \hat\beta=(X^TX)^{-1}X^TY\)
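To make the closed-form estimator concrete, here is a minimal R sketch on simulated (made-up) data, comparing the matrix formula with lm():

```r
# Closed-form OLS estimate vs. lm(), on simulated data (illustrative only)
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                      # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^{-1} X'Y
beta_hat
coef(lm(y ~ x1 + x2))                      # should give the same estimates
```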
Likelihood Function
Let \(X\sim N(\mu, \sigma^2)\); then the density function is
\[P(x) = \frac{1}{\sigma \sqrt {2\pi}}e^{-(x - \mu)^2/2\sigma ^2 }\]
Assume that we observe \(X=\{x_1,x_2,\cdots,x_n\}\). The estimation of \(\mu\) and \(\sigma\) can then be formulated as maximization of the likelihood function
\(P(x_1,x_2,...,x_n\,|\,\text{parameters})\)
4 observations,
\(x_1=0, \: x_2= 1, \: x_3=0.7, \: x_4=1.5\)
and assume that
\(x_i \sim N(\mu,1)=\mu+N(0,1)\)
Since the observations are independent, the likelihood is
\(P(x_1,x_2,x_3,x_4|\mu)=P(x_1|\mu)P(x_2|\mu)P(x_3|\mu)P(x_4|\mu)\)
Using R's dnorm() function, let's calculate the likelihood for different values of \(\mu\) (see the sketch after the worked example below).
Hint: the density function for \(N(\mu,\sigma^2)\) is
\(\frac{1}{{\sigma \sqrt {2\pi } }}e^{-(x - \mu)^2/2\sigma ^2 }\)
hence, for example, the density of the standard normal distribution \(N(0,1)\) is
\(\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\)
\(P(x=1|\mu=0, \sigma=1)=dnorm(1) = \frac{1}{\sqrt{2\pi}}e^\frac{-1}{2}=0.242\)
Assume that, \(\mu=1.5, \sigma=1\),
\(P(x=0|\mu=1.5,\sigma=1) = \frac{1}{\sqrt{2\pi}}e^\frac{-(0-1.5)^2}{2}=0.13\)
\(P(x=1|\mu=1.5,\sigma=1) = \frac{1}{\sqrt{2\pi}}e^\frac{-(1-1.5)^2}{2}=0.352\)
\(P(x=0.7|\mu=1.5,\sigma=1) = \frac{1}{\sqrt{2\pi}}e^\frac{-(0.7-1.5)^2}{2}=0.29\)
\(P(x=1.5|\mu=1.5,\sigma=1) = \frac{1}{\sqrt{2\pi}}e^\frac{-(1.5-1.5)^2}{2}=0.399\)
\(\mu=1.5 \implies P(x_1=0,x_2=1,x_3=0.7,x_4=1.5|\mu=1.5,\sigma=1)=0.13 \times 0.352 \times 0.29 \times 0.399 = 0.0053\)
If we set \(\mu\) to a different value, for example \(\mu=1\),
\(P(x_1=0,x_2=1,x_3=0.7,x_4=1.5|\mu=1,\sigma=1)=0.013\)
Which one is higher? And what value of \(\mu\) maximizes the likelihood? (Hint: the sample mean, \(0.8\).)
\(P(x_1,x_2,x_3,x_4|\mu=0.8,\sigma=1)=0.014\)
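The same numbers can be reproduced in R with dnorm(); a minimal sketch using the four observations above:

```r
# Likelihood of the four observations for different values of mu (sigma = 1)
x   <- c(0, 1, 0.7, 1.5)
lik <- function(mu) prod(dnorm(x, mean = mu, sd = 1))
lik(1.5)  # ~ 0.0053
lik(1.0)  # ~ 0.013
lik(0.8)  # ~ 0.014, the maximum: mean(x) = 0.8
```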
Now let the outputs \(Y\) satisfy
\(y_i \sim N(x_i^T\beta, \sigma^2)=x_i^T\beta+N(0, \sigma^2)\)
\(P(Y|X,\beta,\sigma)=\prod_{i=1}^{n}P(y_i|x_i,\beta,\sigma)\)
\(P(Y|X,\beta,\sigma)=\prod_{i=1}^{n}(2 \pi \sigma^2)^{-1/2} e^{-\frac {1}{2\sigma^2}(y_i-x_i^T\beta)^2}\)
\(P(Y|X,\beta,\sigma)=(2 \pi \sigma^2)^{-n/2} e^{-\frac {1}{2\sigma^2}\sum_{i=1}^{n}(y_i-x_i^T\beta)^2}\)
The next reasonable step is to take the logarithm of this function (\(P>0\)), which is called the log-likelihood.
\(Log(L(\beta))=l(\beta)=-\frac{n}{2}ln(2 \pi \sigma^2)- \frac{1}{2 \sigma^2}(Y-X\beta)^T(Y-X\beta)\)
Take the partial derivatives: (i) assuming \(\sigma\) is known, solve for \(\beta\) (this should result in the same equation as least squares), and (ii) assuming \(\beta\) is known, solve for \(\sigma\) (this should result in the estimator of the variance):
\(\hat\sigma^2=\widehat{E[(y_i-x_i^T\beta)^2]}=\frac{1}{n}(Y-X\beta)^T(Y-X\beta)\)
Then the predictive distribution for a new input \(X_*\), given the data, is
\(P(y|X_*,\beta,\sigma)=N(y|X_*^T\beta_{ML},\sigma^2)\)
Let \(RSS=\varepsilon_1^2+\varepsilon_2^2+\cdots+\varepsilon_n^2=\sum_{i=1}^{n}\varepsilon_i^2\)
This example is from Applied Econometrics with R (Kleiber and Zeileis), Chapter 3.
Journal Pricing Data will be used
Check the variable distributions (and conclude if transformation is needed)
Journal Data, a few observations:

| subs | price | citeprice |
|---|---|---|
| 14 | 123 | 5.8571429 |
| 59 | 20 | 0.9090909 |
| 17 | 443 | 20.1363636 |
| 2 | 276 | 12.5454545 |
| 96 | 295 | 12.2916667 |
| 15 | 344 | 14.3333333 |
A regression with log-transformation (here log-log) is better suited.
The equation:
\(ln(subs)_i=\beta_0+\beta_1 ln(citeprice)_i + \varepsilon_i\)
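A minimal sketch of how this regression can be fitted in R, assuming the Journals data from the AER package (the object names journals and jour_lm are illustrative):

```r
library("AER")                    # assumed source of the Journals data
data("Journals")
journals <- Journals[, c("subs", "price")]
journals$citeprice <- Journals$price / Journals$citations   # price per citation

jour_lm <- lm(log(subs) ~ log(citeprice), data = journals)
summary(jour_lm)
```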
Call:
lm(formula = log(subs) ~ log(citeprice), data = journals)
Residuals:
     Min       1Q   Median       3Q      Max 
-2.72478 -0.53609  0.03721  0.46619  1.84808 
Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     4.76621    0.05591   85.25   <2e-16 ***
log(citeprice) -0.53305    0.03561  -14.97   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7497 on 178 degrees of freedom
Multiple R-squared:  0.5573,    Adjusted R-squared:  0.5548 
F-statistic:   224 on 1 and 178 DF,  p-value: < 2.2e-16
Call: the model formula together with the associated data. Here data = journals and the formula corresponds to the equation given above.
Residuals (error terms etc): 5-number summary of \(\varepsilon_i\)
Coefficients: \(\beta_0: \: intercept, \: \beta_1:\: ln(citeprice)\),
Residual Standard Error, \(RSE=\sqrt{\frac{1}{n-2}RSS}\):
\(\sqrt{Var(\varepsilon_i)}=\sqrt{\sigma^2_\varepsilon}=\hat \sigma_\varepsilon\)
Even if the true parameters of the model were known and this linear model were the true model, this number shows how much predictions deviate on average from the true value of the dependent variable.
One may want to assess this number by calculating the percent error (the ratio of RSE to the mean of the dependent variable).
RSE is considered a measure of the lack of fit.
Degrees of freedom: \(n-p = \text{number of observations} - \text{number of parameters}\)
R-squared: \(R^2\) and Adjusted R-squared
F-statistics and its p-value
\(R^2\) is the percentage of the dependent variable's variance explained by the independent variable(s). It can therefore be computed as
\(1-\frac{\operatorname{var}(\text{residuals})}{\operatorname{var}(\text{dependent variable})}\)
where, var(.): variance
To calculate this by hand (in R):
Variance(residuals)=var(model$residual)=0.559
Variance(dependent)=var(ln(journal$subs))=1.2625
\(1-\frac{\operatorname{var}(\text{residuals})}{\operatorname{var}(\text{dependent variable})}= 1-\frac{0.559}{1.2625}= 0.557\)
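The same by-hand check in R, assuming the jour_lm fit from the sketch above:

```r
# R^2 computed from the variances of the residuals and of the dependent variable
1 - var(residuals(jour_lm)) / var(log(journals$subs))   # ~ 0.557
summary(jour_lm)$r.squared                              # ~ 0.5573, for comparison
```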
\(\text{Adjusted } R^2: \quad R_{adj}^2 = 1 - \frac{(1-R^2)(n-1)}{n-k-1}\)
where \(n\) is the number of observations and \(k\) is the number of independent variables. \(R^2\) increases with the addition of new variables, whereas \(R_{adj}^2\) increases only if the added independent variables improve the model; this is achieved by penalizing for the number of independent variables.
Hence,
\(n=number \: of \: observations=180\)
\(k=number \: of \: independent \: variables=1\)
\({R_{adj}^2 = 1 - [\frac{(1-0.557)(180-1)}{180-1-1}]}=0.555\)
F-Test
\(H_0:\beta_1=\beta_2=\beta_3=\cdots=\beta_p=0\)
\(H_1: \text{at least one } \beta_j \neq 0, \; j\neq 0\)
\(\text{Reduced Model: } y_i=\beta_0+\varepsilon_i, \; df=n-1\)
\(\text{Full Model: } y_i=\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_px_p+\varepsilon_i,\; df=n-(p+1)\)
\(F = \left(\frac{RSS_0-RSS_1}{RSS_1}\right)\left(\frac{n-p_1}{p_1-p_0}\right)=\left(\frac{RSS_0-RSS_1}{p_1-p_0}\right)\left(\frac{n-p_1}{RSS_1}\right)\)
\(\implies F =\left(\frac{RSS_0-RSS_1}{p_1-p_0}\right)/\left(\frac{RSS_1}{n-p_1}\right)=\left(\frac{RSS_0-RSS_1}{df_0-df_1}\right)/\left(\frac{RSS_1}{df_1}\right)\)
and in terms of \(R^2\)
\(=\left(\frac{R^2}{1-R^2}\right)\left(\frac{n-p_1}{p_1-p_0}\right)\)
[1] "F-Stat=224.037"
\(=\left(\frac{0.5572546}{1-0.5572546}\right)\left(\frac{180-2}{2-1}\right) = 224.037\)
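A small R check of this arithmetic, again assuming the jour_lm fit from above:

```r
# F-statistic from R^2: (R^2 / (1 - R^2)) * (n - p1) / (p1 - p0)
r2 <- summary(jour_lm)$r.squared
n  <- nrow(journals); p1 <- 2; p0 <- 1
(r2 / (1 - r2)) * (n - p1) / (p1 - p0)   # ~ 224.04, matching the F-statistic above
```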
Applied Econometrics with R (Kleiber and Zeileis), Chapter 3.
Data set: CPS1988 data frame collected in the March 1988 Current Population Survey
wage (the wage in dollars per week), education (years), and experience (years) are numeric; ethnicity is a factor with levels Caucasian (cauc) and African-American (afam).
Experience = age - education - 6 (hence potential experience, not actual experience in this data)
\(ln(wage)=\beta_0 + \beta_1 \: experience+ \beta_2 \: experience^2 + \beta_3 \: education + \beta_4 \: ethnicity + \varepsilon\)
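A minimal sketch of the fit in R, assuming the CPS1988 data from the AER package (the object name cps_lm is illustrative):

```r
library("AER")               # assumed source of the CPS1988 data
data("CPS1988")
cps_lm <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity,
             data = CPS1988)
summary(cps_lm)
```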
Call:
lm(formula = log(wage) ~ experience + I(experience^2) + education + 
    ethnicity, data = CPS1988)
Residuals:
    Min      1Q  Median      3Q     Max 
-2.9428 -0.3162  0.0580  0.3756  4.3830 
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4.321e+00  1.917e-02  225.38   <2e-16 ***
experience       7.747e-02  8.800e-04   88.03   <2e-16 ***
I(experience^2) -1.316e-03  1.899e-05  -69.31   <2e-16 ***
education        8.567e-02  1.272e-03   67.34   <2e-16 ***
ethnicityafam   -2.434e-01  1.292e-02  -18.84   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5839 on 28150 degrees of freedom
Multiple R-squared:  0.3347,    Adjusted R-squared:  0.3346 
F-statistic:  3541 on 4 and 28150 DF,  p-value: < 2.2e-16
All coefficients have the expected sign, and the corresponding variables are highly significant (not surprising in a sample as large as the present one). Specifically, according to this specification, the return on education is 8.57% per year.
ethnicity appears as ethnicityafam (i.e., ethnicity == afam). R uses treatment contrasts for factors: the treatment level afam (ethnicityafam) is compared with the reference level cauc. In econometric jargon, ethnicityafam is a dummy (indicator) variable for the level afam.
For any two nested models, this comparison can be done with an ANOVA test (using the function anova()).
For example, to test the relevance of the variable ethnicity:
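A minimal sketch of the comparison, assuming the cps_lm fit from above (cps_noeth is an illustrative name):

```r
# Refit without ethnicity and compare the two nested models
cps_noeth <- update(cps_lm, . ~ . - ethnicity)
anova(cps_noeth, cps_lm)
```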
Analysis of Variance Table
Model 1: log(wage) ~ experience + I(experience^2) + education
Model 2: log(wage) ~ experience + I(experience^2) + education + ethnicity
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1  28151 9719.6                                  
2  28150 9598.6  1    121.02 354.91 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the number of observations is large relative to the number of parameters, the standard AIC can be used:
\(AIC=-2\:ln(Likelihood) + 2\:p\), where \(p\) is the number of parameters.
Otherwise, a small-sample correction gives the second-order AIC, AICc:
\(AICc=-2\:ln(Likelihood) + 2\:p + \frac{2p(p+1)}{n -(p+1)}\), where \(n\) is the number of observations.
Bayesian Information Criterion
\(BIC=-2\:ln(Likelihood) + ln(n)\:p\)
AIC with Ethnicity 49614.68 
AIC without Ethnicity 49965.43 
BIC with Ethnicity 49664.15 
BIC without Ethnicity 50006.66 
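These values can be obtained with the AIC() and BIC() functions, assuming the two fits from the sketches above:

```r
AIC(cps_lm, cps_noeth)   # with vs. without ethnicity
BIC(cps_lm, cps_noeth)
```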
In some models, quadratic terms are common.
It may be good practice to model the role of this variable in a more flexible way:
\(ln(wage)=\beta_0 + G(experience) + \beta_2 \: education + \beta_3 \: ethnicity + \varepsilon\)
where \(G(\cdot)\) is an unknown function to be estimated. For this purpose, regression splines may be used (we will discuss this topic later).
The reason for using a spline is that it fits a piecewise polynomial to the data (here, by default, a cubic polynomial fitted to the experience variable). The observations are first split at suitable quantiles, and a polynomial regression function is fitted on each piece.
The main idea is to model the non-linear part in a more flexible way while still obtaining the linear effects of the other variables.
The degrees of freedom (df parameter) set the number of knots: in our example df = 5, so with a cubic (degree-3) basis there are 5 - 3 = 2 interior knots, which are placed at the 33.33% and 66.67% quantiles so that the data are split evenly.
The number of knots may be chosen by fitting candidate models and comparing their AIC values (see the sketch below).
Interpretation of the spline coefficients is not straightforward and is skipped here.
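A minimal sketch of the spline fit and of choosing df by AIC, assuming CPS1988 as above (cps_spline and the df grid 3:10 are illustrative choices):

```r
library("splines")           # provides bs() for B-spline basis expansions
cps_spline <- lm(log(wage) ~ bs(experience, df = 5) + education + ethnicity,
                 data = CPS1988)

# Compare candidate numbers of knots via AIC
sapply(3:10, function(d)
  AIC(lm(log(wage) ~ bs(experience, df = d) + education + ethnicity,
         data = CPS1988)))
```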
For this specification the return on education is 8.82% per year
Call:
lm(formula = log(wage) ~ bs(experience, df = 5) + education + 
    ethnicity, data = CPS1988)
Residuals:
    Min      1Q  Median      3Q     Max 
-2.9315 -0.3079  0.0565  0.3672  3.9945 
Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)              2.775582   0.056081   49.49   <2e-16 ***
bs(experience, df = 5)1  1.891673   0.075814   24.95   <2e-16 ***
bs(experience, df = 5)2  2.259468   0.046474   48.62   <2e-16 ***
bs(experience, df = 5)3  2.824582   0.070773   39.91   <2e-16 ***
bs(experience, df = 5)4  2.373082   0.065205   36.39   <2e-16 ***
bs(experience, df = 5)5  1.739341   0.119691   14.53   <2e-16 ***
education                0.088181   0.001258   70.07   <2e-16 ***
ethnicityafam           -0.248202   0.012725  -19.50   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5747 on 28147 degrees of freedom
Multiple R-squared:  0.3557,    Adjusted R-squared:  0.3555 
F-statistic:  2220 on 7 and 28147 DF,  p-value: < 2.2e-16
| | Base Regr. | Without Ethn. | With Splines |
|---|---|---|---|
| AIC | 49614.6785 | 49965.4297 | 48720.0162 |
| Education | 0.0857 | 0.0874 | 0.0882 |
Cross-section regressions are often plagued by heteroskedasticity
Weighted Least Squares (WLS) can be considered as one of the remedies
The fitting criterion changes from
\(\sum_{i=1}^{n}(y_i-\beta_0-\beta_1x_i)^2\) to \(\sum_{i=1}^{n}w_i(y_i-\beta_0-\beta_1x_i)^2\)
where \(w_i\) is the weight of each observation.
Assume \(E(\varepsilon_i^2|x_i,z_i)=h(z_i^T\gamma)\), where \(h(\cdot)\) is called the skedastic function.
A popular specification:
\(E(\varepsilon_i^2|x_i,z_i)=\sigma^2z_i^2\implies w_i=1/z_i^2\)
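A minimal WLS sketch in R, assuming the journals data from earlier and taking \(z_i\) = citeprice (lm() interprets weights as multipliers of the squared residuals):

```r
# WLS with weights 1/citeprice^2, i.e. assuming Var(eps_i) proportional to citeprice^2
jour_wls <- lm(log(subs) ~ log(citeprice), data = journals,
               weights = 1/citeprice^2)
summary(jour_wls)
```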
WLS is a special case of Generalized Least Squares (GLS)
More often the form of the skedastic function is not known and must be estimated from the data. This leads to feasible generalized least squares (FGLS).
For this example, we may start with
\[E(\varepsilon_i^2|x_i,z_i)=\sigma^2z_i^{\gamma_2}= e^{\gamma_1+\gamma_2 \ln(z_i)}, \quad \text{taking } z_i=x_i\]
then fit \(\ln(\hat\varepsilon_i^2)=\gamma_1+\gamma_2\ln(x_i)+v_i\), and set \(w_i=\frac{1}{e^{\hat \gamma_1+\hat \gamma_2\ln(x_i)}}\)
One can also add an iterative process: start with the coefficients obtained above, re-estimate the skedastic function from the new residuals, refit the model, and repeat until the change in the coefficients between iterations becomes negligible (see the sketch below).
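A minimal FGLS sketch for the journals regression, assuming \(z_i\) = citeprice and the jour_lm fit from above (the auxiliary regression and the convergence tolerance are illustrative choices):

```r
# Step 1: auxiliary regression of log squared residuals on log(citeprice)
aux <- lm(log(residuals(jour_lm)^2) ~ log(citeprice), data = journals)

# Step 2: refit with weights 1 / exp(gamma1_hat + gamma2_hat * log(citeprice))
jour_fgls <- lm(log(subs) ~ log(citeprice), data = journals,
                weights = 1/exp(fitted(aux)))

# Optional iteration: repeat until the skedastic-function slope stops changing
gamma2 <- coef(aux)[2]
repeat {
  aux       <- lm(log(residuals(jour_fgls)^2) ~ log(citeprice), data = journals)
  jour_fgls <- lm(log(subs) ~ log(citeprice), data = journals,
                  weights = 1/exp(fitted(aux)))
  if (abs(coef(aux)[2] - gamma2) < 1e-6) break
  gamma2 <- coef(aux)[2]
}
coef(jour_fgls)
```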