I. Ozkan
Spring 2025
Keywords:
Inputs: independent variable, covariates, predictors, features, regressors.
Output: dependent variable, response, labels, regressand.
Review of the Basics of Linear Regression (we do not go into the details; refer to the econometrics course notes).
Steps involved in regression analysis:
Model Formulation (Model Selection will not be discussed)
Model Estimation
Model Evaluation
Model Use
Is there a [linear] relationship between the dependent and independent variables?
How strong is this relationship?
How accurately can we predict the dependent variable given the values of the independent variables?
…
Are the assumptions valid given the data?
Which one of the models is a better fit to the data? This one?
Let’s see both in one graph
Dependent variable: y | (1) | (2) |
---|---|---|
x | 3.587*** (0.088) | 0.662** (0.278) |
I(x^2) | | 0.266*** (0.025) |
Constant | 3.333*** (0.536) | 9.542*** (0.678) |
Observations | 91 | 91 |
R2 | 0.949 | 0.978 |
Adjusted R2 | 0.949 | 0.978 |
Residual Std. Error | 2.202 (df = 89) | 1.457 (df = 88) |
F Statistic | 1,665.464*** (df = 1; 89) | 1,961.669*** (df = 2; 88) |

Note: *p<0.1; **p<0.05; ***p<0.01
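A minimal sketch of how such a comparison table could be produced; the data here are assumed to be simulated with a quadratic relationship (the original x and y are not shown), and the stargazer package generates the side-by-side text output:

```r
# Sketch only: simulate data with an (assumed) quadratic relationship,
# then fit linear and quadratic models and compare them side by side.
set.seed(1)
x <- runif(91, 0, 10)
y <- 3 + 0.7 * x + 0.25 * x^2 + rnorm(91, sd = 1.5)

m1 <- lm(y ~ x)           # linear fit
m2 <- lm(y ~ x + I(x^2))  # quadratic fit

stargazer::stargazer(m1, m2, type = "text")  # text table similar to the one above
```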
Linear Relationship
Error term \(\varepsilon\) has zero mean \(E[\varepsilon]=0\)
Error term \(\varepsilon\) has constant variance \(V(\varepsilon_i)=\sigma^2\)
Errors are uncorrelated: \(Cov(\varepsilon_i,\varepsilon_j)=0\) for \(i \neq j\)
Errors are normally distributed \(\varepsilon \sim N(0,\sigma^2)\) (or we have enough observations)
\(\varepsilon_i=y_i-\hat y_i\)
\(\hat y_i = x_i^\top \hat\beta\)
\(E[\varepsilon]=\bar\varepsilon=\frac{1}{n}\sum_{i}\varepsilon_i=0\)
\(V[\varepsilon]=s^2=\frac{1}{n-p-1}\sum_{i}\varepsilon_i^2\)
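As a quick illustration, these quantities can be computed directly from a fitted lm object; a minimal sketch, where the model and data names are placeholders:

```r
# Sketch: residual-based checks for a fitted model `fit` (placeholder names)
fit <- lm(y ~ x1 + x2, data = mydata)

e <- residuals(fit)
mean(e)                            # numerically zero by construction (OLS with intercept)
p <- length(coef(fit)) - 1         # number of predictors
sum(e^2) / (length(e) - p - 1)     # s^2, equal to sigma(fit)^2
```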
TV | radio | newspaper | sales |
---|---|---|---|
230.1 | 37.8 | 69.2 | 22.1 |
44.5 | 39.3 | 45.1 | 10.4 |
17.2 | 45.9 | 69.3 | 9.3 |
151.5 | 41.3 | 58.5 | 18.5 |
180.8 | 10.8 | 58.4 | 12.9 |
8.7 | 48.9 | 75.0 | 7.2 |
57.5 | 32.8 | 23.5 | 11.8 |
120.2 | 19.6 | 11.6 | 13.2 |
8.6 | 2.1 | 1.0 | 4.8 |
199.8 | 2.6 | 21.2 | 10.6 |
Call:
lm(formula = sales ~ ., data = advertising)
Residuals:
Min 1Q Median 3Q Max
-8.8277 -0.8908 0.2418 1.1893 2.8292
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.938889 0.311908 9.422 <2e-16 ***
TV 0.045765 0.001395 32.809 <2e-16 ***
radio 0.188530 0.008611 21.893 <2e-16 ***
newspaper -0.001037 0.005871 -0.177 0.86
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
Linear Relationship: Visual Clues (at least for now; the Ramsey Regression Equation Specification Error (RESET) test may also be used)
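A minimal sketch of running the RESET test in R, assuming the fitted advertising model is stored in lm1 and the lmtest package is available:

```r
# Sketch: Ramsey RESET test for functional-form misspecification
library(lmtest)

lm1 <- lm(sales ~ ., data = advertising)      # the advertising model from above
resettest(lm1, power = 2:3, type = "fitted")  # H0: no omitted nonlinear terms
```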
Error term \(\varepsilon\) has zero mean \(E[\varepsilon]=0\)
\(E[\varepsilon]=\bar \varepsilon=\frac{1}{n}\sum_{i}\varepsilon_i=0\)
To diagnose, visually check plots of residuals against fitted values or predictors for constant variance (at least for now; the Breusch-Pagan test against heteroscedasticity can be used)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 5.355982, Df = 1, p = 0.020651
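A minimal sketch of the calls that could produce output like the above, assuming lm1 is the fitted advertising model; ncvTest() comes from the car package, and bptest() (the Breusch-Pagan test) from lmtest:

```r
# Sketch: score tests for non-constant error variance
library(car)
library(lmtest)

ncvTest(lm1)  # non-constant variance score test against the fitted values
bptest(lm1)   # Breusch-Pagan test against the model's regressors
```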
Omitting relevant variables may result in biased parameter estimates and standard errors
Statistical tests as well as visual tests of independence are usually insufficient
Data collection, sample selection, repeated measures, spatial/temporal observations, etc. are important considerations
Both visual checks (Q-Q plot) and statistical tests (for example, the Shapiro-Wilk test for relatively small samples, \(n \leq 5000\)) can be used to diagnose whether the error terms are normally distributed
Get the standardized residuals (rstandard()) and perform the Shapiro-Wilk test (shapiro.test())
Standardized residuals, \(z_i=\frac{\varepsilon_i}{\sqrt{MSE}}, \; z_i \sim N(0,1)\)
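A minimal sketch of the calls behind the output and Q-Q plot below, assuming lm1 holds the fitted advertising model:

```r
# Sketch: normality check of the standardized residuals
shapiro.test(rstandard(lm1))                    # Shapiro-Wilk test (n <= 5000)
qqnorm(rstandard(lm1)); qqline(rstandard(lm1))  # Q-Q plot for a visual check
```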
Shapiro-Wilk normality test
data: rstandard(lm1)
W = 0.91662, p-value = 3.312e-09
Q-Q Plot
The skewness of the standardized residuals of the advertising model is -1.332
Q-Q plot for a simulated random variable that follows a normal distribution
The Shapiro-Wilk test
Shapiro-Wilk normality test
data: x
W = 0.9995, p-value = 0.9086
Skewed distributions
Positively Skewed
Negatively Skewed
Fat Tails (example, t-distribution with 4 degrees of freedom)
With 2 peaks
With 3 peaks
Problem | Possible remedy |
---|---|
Skew | Fit a GLM (not a part of this course), or transform the dependent variable |
Multiple peaks | Check for omitted categorical variable(s) |
Fat tails | Transform the dependent variable |
The information contained in one predictor should not be largely contained in the other predictor(s)
Multicollinearity results in inflated standard errors
The Variance Inflation Factor (VIF) can be used for diagnosis (VIF values should be below 4 or 5 in most models)
Such variables can be combined (e.g., into factors), or some of them may be dropped
Let \(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i\)
If the correlation between \(x_{1i}\) and \(x_{2i}\) is very high, the estimate of \(\beta_1\) is consistent and unbiased but it has a large variance
\(\sigma^2_{\hat\beta_1} = \frac{1}{n} \left( \frac{1}{1-\rho^2_{X_1,X_2}} \right) \frac{\sigma^2_\varepsilon}{\sigma^2_{X_1}}\)
To see the effect of high correlation, let's look at two simulation examples where the correlation between these variables is 0.25 and 0.85, respectively:
1-
\[x_i = (x_{1i}, x_{2i}) \overset{i.i.d.}{\sim} \mathcal{N} \left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 10 & 2.5 \\ 2.5 & 10 \end{pmatrix} \right]\]
\[\rho_{X_1,X_2} = \frac{Cov(X_1,X_2)}{\sqrt{Var(X_1)}\sqrt{Var(X_2)}} = \frac{2.5}{10} = 0.25\]
Here is the simulated covariance matrix and the joint density estimation of \(\hat{\beta}_i\)
hat_beta_1 hat_beta_2
0.05674375 0.05712459
2-
\[x_i = (x_{1i}, x_{2i}) \overset{i.i.d.}{\sim} \mathcal{N} \left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 10 & 8.5 \\ 8.5 & 10 \end{pmatrix} \right]\]
\[\rho_{X_1,X_2} = \frac{Cov(X_1,X_2)}{\sqrt{Var(X_1)}\sqrt{Var(X_2)}} = \frac{8.5}{10} = 0.85\]
hat_beta_1 hat_beta_2
0.1904949 0.1909056
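A minimal sketch of one way such a simulation could be set up (the exact code behind the numbers above is not shown here); the true coefficients and error variance below are assumed for illustration, and MASS::mvrnorm draws the correlated predictors:

```r
# Sketch: how predictor correlation inflates the spread of beta_1 estimates
library(MASS)

sim_sd_beta1 <- function(cov12, n = 100, reps = 2000) {
  Sigma <- matrix(c(10, cov12, cov12, 10), nrow = 2)
  b1 <- replicate(reps, {
    X <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
    y <- 5 + 2.5 * X[, 1] + 3 * X[, 2] + rnorm(n, sd = 2)  # assumed true model
    coef(lm(y ~ X[, 1] + X[, 2]))[2]                       # estimate of beta_1
  })
  sd(b1)
}

sim_sd_beta1(2.5)  # rho = 0.25
sim_sd_beta1(8.5)  # rho = 0.85 -> noticeably larger spread of beta_1 estimates
```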
Variance Inflation Factors
\(VIF_j=\frac{1}{1-R_{j}^{2}}\)
where \(R_{j}^{2}\) is obtained from a regression of predictor \(j\) on all of the other predictors
\(VIF_j=1\) implies perfectly uncorrelated predictors, whereas \(VIF_j \to \infty\) for perfectly correlated predictors
Advertising Data Example:
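The VIFs for the advertising model could be obtained as in the minimal sketch below, assuming lm1 holds the fitted model and the car package is available:

```r
# Sketch: variance inflation factors for the advertising model
library(car)
vif(lm1)  # values well below 4-5 would suggest no serious multicollinearity
```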
Outliers can have a disproportionately large influence on the predicted values and/or the model parameter estimates
They may be classified into three types: extreme values, high-leverage points, and influential observations
Extreme Values
They can be identified as points with large residuals
Optional: Studentized Residuals
\(MSE\) is an approximation of the variance of \(\varepsilon_i\). One can improve on this by scaling:
\(V(\varepsilon_i)=\sigma^2(1-h_i)\)
where, \(h_i\) is the \(i^{th}\) element on the diagonal of the hat matrix, \(H=X(X^TX)^{-1}X^T\)
Studentized residual, \(r_i=\frac{\varepsilon_i}{\sqrt{MSE(1-h_i)}}\)
\(r_i\) follows a \(t\)-distribution with \(n-p-1\) degrees of freedom. One can consider values outside the range \([-2, 2]\) as potential outliers
The rstudent() function helps us obtain studentized residuals
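A minimal sketch of how such observations could be flagged, assuming lm1 is the fitted advertising model:

```r
# Sketch: flag potential outliers via studentized residuals
r <- rstudent(lm1)
which(abs(r) > 2)  # observations outside the [-2, 2] range
```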
In our advertising model, the potential outliers (extreme values of residuals) are
TV | radio | newspaper | sales | yhat | res_sqrt | res | res_stud |
---|---|---|---|---|---|---|---|
8.7 | 48.9 | 75.0 | 7.2 | 12.478 | 1.791 | -5.278 | -3.288 |
262.9 | 3.5 | 19.5 | 12.0 | 15.610 | 1.472 | -3.610 | -2.189 |
290.7 | 4.1 | 8.5 | 12.8 | 17.007 | 1.592 | -4.207 | -2.571 |
5.4 | 29.9 | 9.4 | 5.3 | 8.813 | 1.454 | -3.513 | -2.132 |
7.8 | 38.9 | 50.6 | 6.6 | 10.577 | 1.547 | -3.977 | -2.422 |
0.7 | 39.6 | 8.7 | 1.6 | 10.428 | 2.310 | -8.828 | -5.758 |
276.7 | 2.3 | 23.7 | 11.8 | 16.011 | 1.592 | -4.211 | -2.570 |
Leverage
Leverage is a measure of the distance between an observation's predictor values and the predictor values of the other observations
Points with high leverage have the potential to influence our model estimates
Values range from 0 to 1, where \(Leverage_i > \frac{2k}{n}\) can be considered a large value* (\(k\) is the number of parameters, \(n\) is the sample size)
*: Belsley, D. A., Kuh, E., and Welsch, R. E. (1980). Regression Diagnostics: Identifying influential data and sources of collinearity. Wiley. https://doi.org/10.1002/0471725153
The hatvalues() function can be used to obtain leverage values
The advertising data model then has a cut-off value of approximately \(Leverage_i > \frac{2k}{n} = 0.04\)
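A minimal sketch of computing leverages and applying this cut-off, assuming lm1 and the advertising data frame from above:

```r
# Sketch: leverage values and the 2k/n cut-off
h <- hatvalues(lm1)
k <- length(coef(lm1))               # number of parameters (including the intercept)
cutoff <- 2 * k / nrow(advertising)  # roughly 0.04 here
which(h > cutoff)                    # high-leverage observations
```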
Here is a table showing these observations
sales | yhat | res_sqrt | res_stud | lev |
---|---|---|---|---|
7.2 | 12.478 | 1.791 | -3.288 | 0.047 |
12.5 | 12.824 | 0.448 | -0.200 | 0.086 |
25.4 | 23.406 | 1.099 | 1.209 | 0.040 |
8.7 | 11.858 | 1.389 | -1.943 | 0.057 |
23.8 | 23.242 | 0.586 | 0.342 | 0.070 |
24.7 | 22.255 | 1.218 | 1.488 | 0.044 |
11.9 | 14.224 | 1.195 | -1.433 | 0.069 |
Influence
Leverage and residuals are abstract outlier metrics
The substantive impact of an observation on the model, its influence, is more important
Influence is a measure of how much an observation affects our model estimates
Widely used measures of influence:
Cook’s Distance
DFFITS: a standardized measure of how much the prediction for a given observation would change if it were deleted from the model
\(DFFITS_i = r_i \times \sqrt{\frac{h_i}{1 - h_i}}\)
where \(r_i\) is studentized residual, \(h_i\) is the leverage
Cut-off value \(|DFFITS_i| > 2 \times \sqrt{ \frac {k} {n} }\) where \(k\) is the number of parameters
The dffits() function can be used
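A minimal sketch of computing DFFITS and applying the cut-off, again assuming lm1 and the advertising data:

```r
# Sketch: DFFITS with the 2*sqrt(k/n) cut-off
d <- dffits(lm1)
k <- length(coef(lm1)); n <- nrow(advertising)
which(abs(d) > 2 * sqrt(k / n))  # observations influential by DFFITS
```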
Here is a table showing ten of these observations
sales | yhat | res_sqrt | res_stud_large | lev_large | dffits |
---|---|---|---|---|---|
9.3 | 12.308 | 1.349 | 0 | 0 | -0.370 |
7.2 | 12.478 | 1.791 | 1 | 1 | -0.734 |
12.0 | 15.610 | 1.472 | 1 | 0 | -0.344 |
12.8 | 17.007 | 1.592 | 1 | 0 | -0.459 |
8.7 | 11.858 | 1.389 | 0 | 1 | -0.478 |
5.3 | 8.813 | 1.454 | 1 | 0 | -0.353 |
6.6 | 10.577 | 1.547 | 1 | 0 | -0.403 |
24.7 | 22.255 | 1.218 | 0 | 1 | 0.318 |
1.6 | 10.428 | 2.310 | 1 | 0 | -1.127 |
12.7 | 15.578 | 1.317 | 0 | 0 | -0.311 |
DFBETAS: standardized differences between regression coefficients in a model with a given observation, and a model without that observation
DFBETAS are standardized by the standard error of the coefficient. A model, then, has \(n \times k\) DFBETAS values, one for each combination of observation and predictor
Cut-off value: \(|DFBETAS_{ij}| > \frac{2}{\sqrt{n}}\)
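A minimal sketch of computing DFBETAS and applying this cut-off, assuming lm1 and the advertising data as before:

```r
# Sketch: DFBETAS (an n x k matrix, one column per coefficient) with the 2/sqrt(n) cut-off
db <- dfbetas(lm1)
n <- nrow(advertising)
which(abs(db) > 2 / sqrt(n), arr.ind = TRUE)  # flagged (observation, coefficient) pairs
```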
Here is a table showing some of these observations
sales | yhat | res_sqrt | res_stud_large | lev_large | dffits_large | dfb_(Intercept) | dfb_TV | dfb_radio | dfb_newspaper |
---|---|---|---|---|---|---|---|---|---|
9.3 | 12.308 | 1.349 | 0 | 0 | 1 | -0.002 | 0.219 | -0.135 | -0.184 |
7.2 | 12.478 | 1.791 | 1 | 1 | 1 | 0.025 | 0.423 | -0.272 | -0.382 |
8.7 | 11.858 | 1.389 | 0 | 1 | 1 | 0.033 | 0.240 | -0.073 | -0.345 |
5.3 | 8.813 | 1.454 | 1 | 0 | 1 | -0.234 | 0.250 | -0.139 | 0.175 |
1.6 | 10.428 | 2.310 | 1 | 0 | 1 | -0.533 | 0.711 | -0.673 | 0.591 |
5.7 | 8.449 | 1.287 | 0 | 0 | 1 | -0.204 | 0.188 | -0.100 | 0.171 |
Will be covered in Machine Learning Course