Evaluating Linear Models (Regressions)

I. Ozkan

Spring 2025

Preliminary Readings

Learning Objectives

Keywords:

Linear Models (Linear Regression)

Steps involved in regression analysis:

Some Questions: Linear Models

Are the assumptions valid given the data?

Linear Relationship: Simple Case

Which one of the models is a better fit to the data? This one:

Linear Relationship: Simple Case

Let’s see both in one graph


=======================================================================
                                    Dependent variable:                
                    ---------------------------------------------------
                                             y                         
                               (1)                       (2)           
-----------------------------------------------------------------------
x                           3.587***                   0.662**         
                             (0.088)                   (0.278)         
                                                                       
I(x2)                                                 0.266***         
                                                       (0.025)         
                                                                       
Constant                    3.333***                  9.542***         
                             (0.536)                   (0.678)         
                                                                       
-----------------------------------------------------------------------
Observations                   91                        91            
R2                            0.949                     0.978          
Adjusted R2                   0.949                     0.978          
Residual Std. Error      2.202 (df = 89)           1.457 (df = 88)     
F Statistic         1,665.464*** (df = 1; 89) 1,961.669*** (df = 2; 88)
=======================================================================
Note:                                       *p<0.1; **p<0.05; ***p<0.01
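A minimal sketch of how a comparison table like the one above can be produced with the stargazer package; the simulated data below is purely illustrative (the coefficients and noise level are assumptions, not the values behind the slides).

set.seed(1)                              # illustrative data, not the slides' data
x <- runif(91, 0, 10)
y <- 9.5 + 0.7 * x + 0.27 * x^2 + rnorm(91, sd = 1.5)
m1 <- lm(y ~ x)                          # linear fit, column (1)
m2 <- lm(y ~ x + I(x^2))                 # quadratic fit, column (2)
library(stargazer)
stargazer(m1, m2, type = "text")         # side-by-side text table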

Coefficients

Residuals

Some Major Assumptions

\(\varepsilon_i=y_i-\hat y_i\)

\(\hat y_i=x_i^T \hat \beta\)

\(E[\varepsilon]=\bar \varepsilon=\frac{1}{n}\sum_{i}\varepsilon_i=0\)
\(V[\varepsilon]=s^2=\frac{1}{n-p-1}\sum_{i}\varepsilon_i^2\)
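These quantities can be checked directly in R; a minimal sketch, assuming the quadratic model m2 from the sketch above is still in the workspace:

eps <- residuals(m2)                 # eps_i = y_i - yhat_i
mean(eps)                            # ~ 0 for OLS with an intercept
sum(eps^2) / df.residual(m2)         # s^2, with df = n - p - 1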

An Example: Advertising Data

Advertising Data
   TV radio newspaper sales
230.1  37.8      69.2  22.1
 44.5  39.3      45.1  10.4
 17.2  45.9      69.3   9.3
151.5  41.3      58.5  18.5
180.8  10.8      58.4  12.9
  8.7  48.9      75.0   7.2
 57.5  32.8      23.5  11.8
120.2  19.6      11.6  13.2
  8.6   2.1       1.0   4.8
199.8   2.6      21.2  10.6


Call:
lm(formula = sales ~ ., data = advertising)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.8277 -0.8908  0.2418  1.1893  2.8292 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.938889   0.311908   9.422   <2e-16 ***
TV           0.045765   0.001395  32.809   <2e-16 ***
radio        0.188530   0.008611  21.893   <2e-16 ***
newspaper   -0.001037   0.005871  -0.177     0.86    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared:  0.8972,    Adjusted R-squared:  0.8956 
F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16
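A sketch of how this model is fit; the CSV file name is an assumption (the data is the Advertising set accompanying James et al., An Introduction to Statistical Learning):

advertising <- read.csv("Advertising.csv")   # file name/path is an assumption
lm1 <- lm(sales ~ ., data = advertising)     # regress sales on all other columns
summary(lm1)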

Residuals

\(E[\varepsilon]=\bar \varepsilon=\frac{1}{n}\sum_{i}\varepsilon_i=0\)


To diagnose, visually check plots of residuals against fitted values or predictors for constant variance (for now, the Breusch-Pagan test for heteroscedasticity can also be used)

Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 5.355982, Df = 1, p = 0.020651
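The output above has the format of car::ncvTest(); a minimal sketch of both this score test and the Breusch-Pagan test:

library(car)
ncvTest(lm1)      # score test of constant variance against the fitted values
library(lmtest)
bptest(lm1)       # studentized Breusch-Pagan test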

Omitting relevant variables may result in biased estimates of the parameters and their standard errors

Statistical tests as well as visual tests of independence are usually insufficient

Data collection, sample selection, repeated measures, spatial/temporal structure of the observations, etc. are important considerations

Both visual (Q-Q plot) and statistical tests (e.g., the Shapiro-Wilk test for relatively small samples, \(n \leq 5000\)) can be used to diagnose whether the error terms are normally distributed

Get the standardized residuals (rstandard()) and perform the Shapiro-Wilk test (shapiro.test())

Standardized residuals, \(z_i=\frac{\varepsilon_i}{\sqrt{MSE}}, \; z_i \sim N(0,1)\)


    Shapiro-Wilk normality test

data:  rstandard(lm1)
W = 0.91662, p-value = 3.312e-09
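A minimal sketch of the two-step check described above:

z <- rstandard(lm1)    # standardized residuals of the advertising model
shapiro.test(z)        # H0: normality; suitable for n <= 5000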

Q-Q Plot

The skewness of the standardized residuals of the advertising model is -1.332

Q-Q plot for a simulated random variable that follows a normal distribution

The Shapiro-Wilk test


    Shapiro-Wilk normality test

data:  x
W = 0.9995, p-value = 0.9086
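A sketch of how such a benchmark Q-Q plot and test can be produced; the sample size and seed are assumptions:

set.seed(42)              # seed is an assumption
x <- rnorm(1000)          # simulated standard-normal sample
qqnorm(x); qqline(x)      # points should track the reference line
shapiro.test(x)           # large p-value: no evidence against normality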

Skewed distributions

Positively Skewed

Negatively Skewed

Fat Tails (example, t-distribution with 4 degrees of freedom)

With 2 peaks

With 3 peaks

Skew: fit a GLM model (not a part of this course), or transform the dependent variable

Multiple peaks: check for omitted categorical variable(s)

Fat-tailed: transform the dependent variable

No Multicollinearity

Let \(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i\)

If the correlation between \(x_{1i}\) and \(x_{2i}\) is very high, the estimate of \(\beta_1\) is consistent and unbiased, but it has a large variance

\(\sigma^2_{\hat\beta_1} = \frac{1}{n} \left( \frac{1}{1-\rho^2_{X_1,X_2}} \right) \frac{\sigma^2_\varepsilon}{\sigma^2_{X_1}}\)

To see the effect of high correlation, consider two simulation examples where the correlation between these variables is 0.25 and 0.85, respectively:

1-

\[x_i = (x_{1i}, x_{2i}) \overset{i.i.d.}{\sim} \mathcal{N} \left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 10 & 2.5 \\ 2.5 & 10 \end{pmatrix} \right]\]

\[\rho_{X_1,X_2} = \frac{Cov(X_1,X_2)}{\sqrt{Var(X_1)}\sqrt{Var{(X_2)}}} = \frac{2.5}{10} = 0.25\]
Here are the simulated covariance matrix and the joint density estimate of \(\hat{\beta}_i\):

hat_beta_1 hat_beta_2 
0.05674375 0.05712459 

2-

\[x_i = (x_{1i}, x_{2i}) \overset{i.i.d.}{\sim} \mathcal{N} \left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 10 & 8.5 \\ 8.5 & 10 \end{pmatrix} \right]\]

\[\rho_{X_1,X_2} = \frac{Cov(X_1,X_2)}{\sqrt{Var(X_1)}\sqrt{Var{(X_2)}}} = \frac{8.5}{10} = 0.85\]

hat_beta_1 hat_beta_2 
 0.1904949  0.1909056 
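A sketch of such a simulation, assuming MASS::mvrnorm for the bivariate normal draws; the true coefficients, error standard deviation, sample size, and replication count are assumptions, so the numbers will not match the slides exactly:

library(MASS)
sd_beta <- function(cov12, n = 100, reps = 2000) {
  Sigma <- matrix(c(10, cov12, cov12, 10), 2, 2)
  betas <- replicate(reps, {
    X <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
    y <- 1 + 2 * X[, 1] + 3 * X[, 2] + rnorm(n, sd = 5)   # hypothetical DGP
    coef(lm(y ~ X))[2:3]                                  # the two slope estimates
  })
  apply(betas, 1, sd)     # simulated standard deviation of each slope estimate
}
sd_beta(2.5)   # Cov = 2.5, rho = 0.25
sd_beta(8.5)   # Cov = 8.5, rho = 0.85: noticeably larger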

Variance Inflation Factors

\(VIF_j=\frac{1}{1-R_{j}^{2}}\)

where \(R_{j}^{2}\) is obtained from a model in which predictor \(j\) is regressed on all of the other predictors

\(VIF_j=1\) implies perfectly uncorrelated predictors, whereas \(VIF_j \to \infty\) for perfectly correlated predictors

Advertising Data Example:
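A minimal sketch using car::vif() on the advertising model:

library(car)
vif(lm1)   # one VIF per predictor; values near 1 suggest little collinearity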

No Outliers

Extreme Values

They can be identified as points with high residuals

Optional: Studentized Residuals

\(MSE\) is an approximation of the variance of \(\varepsilon_i\). One can improve on this by scaling,

\(V(\varepsilon_i)=\sigma^2(1-h_i)\)

where, \(h_i\) is the \(i^{th}\) element on the diagonal of the hat matrix, \(H=X(X^TX)^{-1}X^T\)

Studentized residual: \(r_i=\frac{\varepsilon_i}{\sqrt{MSE(1-h_i)}}\)

\(r_i\) follows a \(t\)-distribution with \(n-p-1\) degrees of freedom. One can consider values outside the range \([-2, 2]\) as potential outliers

The rstudent() function can be used to obtain studentized residuals

In our advertising model, the potential outliers (extreme values of residuals) are:

   TV radio newspaper sales   yhat res_sqrt    res res_stud
  8.7  48.9      75.0   7.2 12.478    1.791 -5.278   -3.288
262.9   3.5      19.5  12.0 15.610    1.472 -3.610   -2.189
290.7   4.1       8.5  12.8 17.007    1.592 -4.207   -2.571
  5.4  29.9       9.4   5.3  8.813    1.454 -3.513   -2.132
  7.8  38.9      50.6   6.6 10.577    1.547 -3.977   -2.422
  0.7  39.6       8.7   1.6 10.428    2.310 -8.828   -5.758
276.7   2.3      23.7  11.8 16.011    1.592 -4.211   -2.570
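A sketch of how such rows can be flagged:

r <- rstudent(lm1)           # externally studentized residuals
advertising[abs(r) > 2, ]    # observations outside [-2, 2]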

Leverage

*: Belsley, D. A., Kuh, E., and Welsch, R. E. (1980). Regression Diagnostics: Identifying influential data and sources of collinearity. Wiley. https://doi.org/10.1002/0471725153

The hatvalues() function can be used to obtain leverages

For the advertising data model, the cut-off value is then approximately \(Leverage_i > \frac {2k} {n}=0.04\) (with \(k=4\) parameters and \(n=200\) observations)

Here is a table showing these observations

sales   yhat res_sqrt res_stud   lev
  7.2 12.478    1.791   -3.288 0.047
 12.5 12.824    0.448   -0.200 0.086
 25.4 23.406    1.099    1.209 0.040
  8.7 11.858    1.389   -1.943 0.057
 23.8 23.242    0.586    0.342 0.070
 24.7 22.255    1.218    1.488 0.044
 11.9 14.224    1.195   -1.433 0.069
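A sketch of the leverage computation and cut-off:

h <- hatvalues(lm1)                            # diagonal of the hat matrix
k <- length(coef(lm1))                         # k = 4 parameters here
advertising[h > 2 * k / nrow(advertising), ]   # leverage above 2k/n = 0.04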

Influence

Measures of influence that are used widely:

\(DFFITS_i=r_i \sqrt{\frac{h_i}{1-h_i}}\)

where \(r_i\) is the studentized residual and \(h_i\) is the leverage

Cut-off value \(|DFFITS_i| > 2 \times \sqrt{ \frac {k} {n} }\) where \(k\) is the number of parameters

The dffits() function can be used

Here is a table showing ten of these observations

sales   yhat res_sqrt res_stud_large lev_large dffits
  9.3 12.308    1.349              0         0 -0.370
  7.2 12.478    1.791              1         1 -0.734
 12.0 15.610    1.472              1         0 -0.344
 12.8 17.007    1.592              1         0 -0.459
  8.7 11.858    1.389              0         1 -0.478
  5.3  8.813    1.454              1         0 -0.353
  6.6 10.577    1.547              1         0 -0.403
 24.7 22.255    1.218              0         1  0.318
  1.6 10.428    2.310              1         0 -1.127
 12.7 15.578    1.317              0         0 -0.311
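A sketch of the DFFITS computation and cut-off:

d <- dffits(lm1)
cutoff <- 2 * sqrt(length(coef(lm1)) / nrow(advertising))   # 2 * sqrt(k/n)
advertising[abs(d) > cutoff, ]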

Cut-off value \(|DFBETAS_{ij}| > \frac {2} {\sqrt{n}}\), where \(DFBETAS_{ij}\) measures the standardized change in the \(j^{th}\) coefficient when observation \(i\) is removed

Here is a table showing some of these observations

sales   yhat res_sqrt res_stud_large lev_large dffits_large dfb_(Intercept) dfb_TV dfb_radio dfb_newspaper
  9.3 12.308    1.349              0         0            1          -0.002  0.219    -0.135        -0.184
  7.2 12.478    1.791              1         1            1           0.025  0.423    -0.272        -0.382
  8.7 11.858    1.389              0         1            1           0.033  0.240    -0.073        -0.345
  5.3  8.813    1.454              1         0            1          -0.234  0.250    -0.139         0.175
  1.6 10.428    2.310              1         0            1          -0.533  0.711    -0.673         0.591
  5.7  8.449    1.287              0         0            1          -0.204  0.188    -0.100         0.171
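A sketch of the DFBETAS computation and cut-off:

db <- dfbetas(lm1)                          # one column per coefficient
cutoff <- 2 / sqrt(nrow(advertising))       # 2 / sqrt(n)
which(abs(db) > cutoff, arr.ind = TRUE)     # flagged (row, coefficient) pairs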

Model Selection



Will be covered in the Machine Learning course