I. Ozkan
Spring 2025
Keywords:
Inputs: independent variable, covariates, predictors, features, regressors.
Output: dependent variable, response, labels, regressand.
Review of the Basics of Linear Regression (we do not go into the details; refer to the econometrics course notes).
Steps involved in regression analysis:
Model Formulation (Model Selection will not be discussed)
Model Estimation
Model Evaluation
Model Use
Is there a [linear] relationship between the dependent and independent variables?
How strong is this relationship?
How accurately can we predict the dependent variable given the values of the independent variables?
…
Are the assumptions valid given the data?
Which one of the models is a better fit to the data? This one?
Let’s see both in one graph
Dependent variable: y | (1) | (2) |
---|---|---|
x | 3.587*** (0.088) | 0.662** (0.278) |
I(x^2) | | 0.266*** (0.025) |
Constant | 3.333*** (0.536) | 9.542*** (0.678) |
Observations | 91 | 91 |
R2 | 0.949 | 0.978 |
Adjusted R2 | 0.949 | 0.978 |
Residual Std. Error | 2.202 (df = 89) | 1.457 (df = 88) |
F Statistic | 1,665.464*** (df = 1; 89) | 1,961.669*** (df = 2; 88) |

Note: *p<0.1; **p<0.05; ***p<0.01
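A minimal sketch of how such a comparison table could be produced; the data here are assumed to be simulated with a quadratic relationship (the original x and y are not shown), and the stargazer package generates the side-by-side text output:

```r
# Sketch only: simulate data with an (assumed) quadratic relationship,
# then fit linear and quadratic models and compare them side by side.
set.seed(1)
x <- runif(91, 0, 10)
y <- 3 + 0.7 * x + 0.25 * x^2 + rnorm(91, sd = 1.5)

m1 <- lm(y ~ x)           # linear fit
m2 <- lm(y ~ x + I(x^2))  # quadratic fit

stargazer::stargazer(m1, m2, type = "text")  # text table similar to the one above
```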
Linear Relationship
Error term \(\varepsilon\) has zero mean \(E[\varepsilon]=0\)
Error term \(\varepsilon\) has constant variance \(V(\varepsilon_i)=\sigma^2\)
Errors are uncorrelated: \(Cov(\varepsilon_i,\varepsilon_j)=0\) for \(i \neq j\)
Errors are normally distributed \(\varepsilon \sim N(0,\sigma^2)\) (or we have enough observations)
\(\varepsilon_i=y_i-\hat y_i\)
\(\hat y_i = x_i^\top \hat\beta\)
\(E[\varepsilon]=\bar\varepsilon=\frac{1}{n}\sum_{i}\varepsilon_i=0\)
\(V[\varepsilon]=s^2=\frac{1}{n-p-1}\sum_{i}\varepsilon_i^2\)
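As a quick illustration, these quantities can be computed directly from a fitted lm object; a minimal sketch, where the model and data names are placeholders:

```r
# Sketch: residual-based checks for a fitted model `fit` (placeholder names)
fit <- lm(y ~ x1 + x2, data = mydata)

e <- residuals(fit)
mean(e)                            # numerically zero by construction (OLS with intercept)
p <- length(coef(fit)) - 1         # number of predictors
sum(e^2) / (length(e) - p - 1)     # s^2, equal to sigma(fit)^2
```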
TV | radio | newspaper | sales |
---|---|---|---|
230.1 | 37.8 | 69.2 | 22.1 |
44.5 | 39.3 | 45.1 | 10.4 |
17.2 | 45.9 | 69.3 | 9.3 |
151.5 | 41.3 | 58.5 | 18.5 |
180.8 | 10.8 | 58.4 | 12.9 |
8.7 | 48.9 | 75.0 | 7.2 |
57.5 | 32.8 | 23.5 | 11.8 |
120.2 | 19.6 | 11.6 | 13.2 |
8.6 | 2.1 | 1.0 | 4.8 |
199.8 | 2.6 | 21.2 | 10.6 |
Call:
lm(formula = sales ~ ., data = advertising)
Residuals:
Min 1Q Median 3Q Max
-8.8277 -0.8908 0.2418 1.1893 2.8292
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.938889 0.311908 9.422 <2e-16 ***
TV 0.045765 0.001395 32.809 <2e-16 ***
radio 0.188530 0.008611 21.893 <2e-16 ***
newspaper -0.001037 0.005871 -0.177 0.86
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
Linear Relationship: Visual Clues (at least for now; the Ramsey Regression Equation Specification Error (RESET) test may also be used)
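A minimal sketch of running the RESET test in R, assuming the fitted advertising model is stored in lm1 and the lmtest package is available:

```r
# Sketch: Ramsey RESET test for functional-form misspecification
library(lmtest)

lm1 <- lm(sales ~ ., data = advertising)      # the advertising model from above
resettest(lm1, power = 2:3, type = "fitted")  # H0: no omitted nonlinear terms
```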
Error term \(\varepsilon\) has zero mean \(E[\varepsilon]=0\)
\(E[\varepsilon]=\bar \varepsilon=\frac{1}{n}\sum_{i}\varepsilon_i=0\)
To diagnose, visually check plots of residuals against fitted values or predictors for constant variance (at least for now; the Breusch-Pagan test against heteroscedasticity can be used)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 5.355982, Df = 1, p = 0.020651
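A minimal sketch of the calls that could produce output like the above, assuming lm1 is the fitted advertising model; ncvTest() comes from the car package, and bptest() (the Breusch-Pagan test) from lmtest:

```r
# Sketch: score tests for non-constant error variance
library(car)
library(lmtest)

ncvTest(lm1)  # non-constant variance score test against the fitted values
bptest(lm1)   # Breusch-Pagan test against the model's regressors
```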
Omitting relevant variables may result in biased parameter estimates and standard errors
Statistical tests as well as visual tests of independence are usually insufficient
Data collection, sample selection, repeated measures, spatial/temporal observations, etc. are important considerations
Both visual checks (Q-Q plot) and statistical tests (for example, the Shapiro-Wilk test for relatively small samples, \(n \leq 5000\)) can be used to diagnose whether the error terms are normally distributed
Get the standardized residuals (rstandard()) and perform the Shapiro-Wilk test (shapiro.test())
Standardized residuals, \(z_i=\frac{\varepsilon_i}{\sqrt{MSE}}, \; z_i \sim N(0,1)\)
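A minimal sketch of the calls behind the output and Q-Q plot below, assuming lm1 holds the fitted advertising model:

```r
# Sketch: normality check of the standardized residuals
shapiro.test(rstandard(lm1))                    # Shapiro-Wilk test (n <= 5000)
qqnorm(rstandard(lm1)); qqline(rstandard(lm1))  # Q-Q plot for a visual check
```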
Shapiro-Wilk normality test
data: rstandard(lm1)
W = 0.91662, p-value = 3.312e-09
Q-Q Plot
The skewness of the standardized residuals of the advertising model is -1.332
Q-Q plot for a simulated random variable that follows a normal distribution
The Shapiro-Wilk test
Shapiro-Wilk normality test
data: x
W = 0.9995, p-value = 0.9086
Skewed distributions
Positively Skewed
Negatively Skewed
Fat Tails (example, t-distribution with 4 degrees of freedom)
With 2 peaks
With 3 peaks
Problem | Possible remedy |
---|---|
Skew | Fit a GLM (not a part of this course), or transform the dependent variable |
Multiple peaks | Check for omitted categorical variable(s) |
Fat tails | Transform the dependent variable |
The information contained in one predictor should not be largely contained in the other predictor(s)
Multicollinearity results in inflated standard errors
The Variance Inflation Factor (VIF) can be used for diagnosis (VIF values should be below 4 or 5 in most models)
Such variables can be combined (e.g., into factors), or some of them may be dropped
Let \(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i\)
If the correlation between \(x_{1i}\) and \(x_{2i}\) is very high, the estimate of \(\beta_1\) is consistent and unbiased but it has a large variance
\(\sigma^2_{\hat\beta_1} = \frac{1}{n} \left( \frac{1}{1-\rho^2_{X_1,X_2}} \right) \frac{\sigma^2_\varepsilon}{\sigma^2_{X_1}}\)
To see the effect of high correlation, let's look at two simulation examples where the correlation between these variables is 0.25 and 0.85, respectively:
1-
\[x_i = (x_{1i}, x_{2i}) \overset{i.i.d.}{\sim} \mathcal{N} \left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 10 & 2.5 \\ 2.5 & 10 \end{pmatrix} \right]\]
\[\rho_{X_1,X_2} = \frac{Cov(X_1,X_2)}{\sqrt{Var(X_1)}\sqrt{Var(X_2)}} = \frac{2.5}{10} = 0.25\]
Here is the simulated covariance matrix and the joint density estimation of \(\hat{\beta}_i\)
hat_beta_1 hat_beta_2
0.05674375 0.05712459
2-
\[x_i = (x_{1i}, x_{2i}) \overset{i.i.d.}{\sim} \mathcal{N} \left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 10 & 8.5 \\ 8.5 & 10 \end{pmatrix} \right]\]
\[\rho_{X_1,X_2} = \frac{Cov(X_1,X_2)}{\sqrt{Var(X_1)}\sqrt{Var(X_2)}} = \frac{8.5}{10} = 0.85\]
hat_beta_1 hat_beta_2
0.1904949 0.1909056
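A minimal sketch of one way such a simulation could be set up (the exact code behind the numbers above is not shown here); the true coefficients and error variance below are assumed for illustration, and MASS::mvrnorm draws the correlated predictors:

```r
# Sketch: how predictor correlation inflates the spread of beta_1 estimates
library(MASS)

sim_sd_beta1 <- function(cov12, n = 100, reps = 2000) {
  Sigma <- matrix(c(10, cov12, cov12, 10), nrow = 2)
  b1 <- replicate(reps, {
    X <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
    y <- 5 + 2.5 * X[, 1] + 3 * X[, 2] + rnorm(n, sd = 2)  # assumed true model
    coef(lm(y ~ X[, 1] + X[, 2]))[2]                       # estimate of beta_1
  })
  sd(b1)
}

sim_sd_beta1(2.5)  # rho = 0.25
sim_sd_beta1(8.5)  # rho = 0.85 -> noticeably larger spread of beta_1 estimates
```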
Variance Inflation Factors
\(VIF_j=\frac{1}{1-R_{j}^{2}}\)
where \(R_{j}^{2}\) is obtained from a regression of predictor \(j\) on all of the other predictors
\(VIF_j=1\) implies perfectly uncorrelated predictors, whereas \(VIF_j \to \infty\) for perfectly correlated predictors
Advertising Data Example:
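The VIFs for the advertising model could be obtained as in the minimal sketch below, assuming lm1 holds the fitted model and the car package is available:

```r
# Sketch: variance inflation factors for the advertising model
library(car)
vif(lm1)  # values well below 4-5 would suggest no serious multicollinearity
```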
Outliers can have a disproportionately large influence on the predicted values and/or the model parameter estimates
They may be classified into three types: extreme values, high-leverage points, and influential observations
Extreme Values
They can be identified as points with large residuals
Optional: Studentized Residuals
\(MSE\) is an approximation of the variance of \(\varepsilon_i\). One can improve on this by scaling:
\(V(\varepsilon_i)=\sigma^2(1-h_i)\)
where, \(h_i\) is the \(i^{th}\) element on the diagonal of the hat matrix, \(H=X(X^TX)^{-1}X^T\)
Studentized residual, \(r_i=\frac{\varepsilon_i}{\sqrt{MSE(1-h_i)}}\)
\(r_i\) follows a \(t\)-distribution with \(n-p-1\) degrees of freedom. One can consider values outside the range \([-2, 2]\) as potential outliers
The rstudent() function helps us obtain studentized residuals
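A minimal sketch of how such observations could be flagged, assuming lm1 is the fitted advertising model:

```r
# Sketch: flag potential outliers via studentized residuals
r <- rstudent(lm1)
which(abs(r) > 2)  # observations outside the [-2, 2] range
```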
In our advertising model, the potential outliers (extreme values of residuals) are
TV | radio | newspaper | sales | yhat | res_sqrt | res | res_stud |
---|---|---|---|---|---|---|---|
8.7 | 48.9 | 75.0 | 7.2 | 12.478 | 1.791 | -5.278 | -3.288 |
262.9 | 3.5 | 19.5 | 12.0 | 15.610 | 1.472 | -3.610 | -2.189 |
290.7 | 4.1 | 8.5 | 12.8 | 17.007 | 1.592 | -4.207 | -2.571 |
5.4 | 29.9 | 9.4 | 5.3 | 8.813 | 1.454 | -3.513 | -2.132 |
7.8 | 38.9 | 50.6 | 6.6 | 10.577 | 1.547 | -3.977 | -2.422 |
0.7 | 39.6 | 8.7 | 1.6 | 10.428 | 2.310 | -8.828 | -5.758 |
276.7 | 2.3 | 23.7 | 11.8 | 16.011 | 1.592 | -4.211 | -2.570 |
Leverage
Leverage is a measure of the distance between an observation's predictor values and the predictor values of the other observations
Points with high leverage have the potential to influence our model estimates
Values range from 0 to 1, where \(Leverage_i > \frac{2k}{n}\) can be considered a large value* (\(k\) is the number of parameters, \(n\) is the sample size)
*: Belsley, D. A., Kuh, E., and Welsch, R. E. (1980). Regression Diagnostics: Identifying influential data and sources of collinearity. Wiley. https://doi.org/10.1002/0471725153
The hatvalues() function can be used to obtain leverage values
The advertising data model then has a cut-off value of approximately \(Leverage_i > \frac{2k}{n} = 0.04\)
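A minimal sketch of computing leverages and applying this cut-off, assuming lm1 and the advertising data frame from above:

```r
# Sketch: leverage values and the 2k/n cut-off
h <- hatvalues(lm1)
k <- length(coef(lm1))               # number of parameters (including the intercept)
cutoff <- 2 * k / nrow(advertising)  # roughly 0.04 here
which(h > cutoff)                    # high-leverage observations
```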
Here is a table showing these observations
sales | yhat | res_sqrt | res_stud | lev |
---|---|---|---|---|
7.2 | 12.478 | 1.791 | -3.288 | 0.047 |
12.5 | 12.824 | 0.448 | -0.200 | 0.086 |
25.4 | 23.406 | 1.099 | 1.209 | 0.040 |
8.7 | 11.858 | 1.389 | -1.943 | 0.057 |
23.8 | 23.242 | 0.586 | 0.342 | 0.070 |
24.7 | 22.255 | 1.218 | 1.488 | 0.044 |
11.9 | 14.224 | 1.195 | -1.433 | 0.069 |
Influence
Leverage and residuals are abstract outlier metrics
The substantive impact of an observation on the model, its influence, is more important
Influence is a measure of how much an observation affects our model estimates
Widely used measures of influence:
Cook’s Distance
DFFITS: a standardized measure of how much the prediction for a given observation would change if it were deleted from the model
\(DFFITS_i = r_i \times \sqrt{\frac{h_i}{1 - h_i}}\)
where \(r_i\) is studentized residual, \(h_i\) is the leverage
Cut-off value \(|DFFITS_i| > 2 \times \sqrt{ \frac {k} {n} }\) where \(k\) is the number of parameters
The dffits() function can be used
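A minimal sketch of computing DFFITS and applying the cut-off, again assuming lm1 and the advertising data:

```r
# Sketch: DFFITS with the 2*sqrt(k/n) cut-off
d <- dffits(lm1)
k <- length(coef(lm1)); n <- nrow(advertising)
which(abs(d) > 2 * sqrt(k / n))  # observations influential by DFFITS
```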
Here is a table showing ten of these observations
sales | yhat | res_sqrt | res_stud_large | lev_large | dffits |
---|---|---|---|---|---|
9.3 | 12.308 | 1.349 | 0 | 0 | -0.370 |
7.2 | 12.478 | 1.791 | 1 | 1 | -0.734 |
12.0 | 15.610 | 1.472 | 1 | 0 | -0.344 |
12.8 | 17.007 | 1.592 | 1 | 0 | -0.459 |
8.7 | 11.858 | 1.389 | 0 | 1 | -0.478 |
5.3 | 8.813 | 1.454 | 1 | 0 | -0.353 |
6.6 | 10.577 | 1.547 | 1 | 0 | -0.403 |
24.7 | 22.255 | 1.218 | 0 | 1 | 0.318 |
1.6 | 10.428 | 2.310 | 1 | 0 | -1.127 |
12.7 | 15.578 | 1.317 | 0 | 0 | -0.311 |
DFBETAS: standardized differences between regression coefficients in a model with a given observation, and a model without that observation
DFBETAS are standardized by the standard error of the coefficient. A model, then, has \(n \times k\) DFBETAS values, one for each combination of observation and predictor
Cut-off value: \(|DFBETAS_{ij}| > \frac{2}{\sqrt{n}}\)
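A minimal sketch of computing DFBETAS and applying this cut-off, assuming lm1 and the advertising data as before:

```r
# Sketch: DFBETAS (an n x k matrix, one column per coefficient) with the 2/sqrt(n) cut-off
db <- dfbetas(lm1)
n <- nrow(advertising)
which(abs(db) > 2 / sqrt(n), arr.ind = TRUE)  # flagged (observation, coefficient) pairs
```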
Here is a table showing some of these observations
sales | yhat | res_sqrt | res_stud_large | lev_large | dffits_large | dfb_(Intercept) | dfb_TV | dfb_radio | dfb_newspaper |
---|---|---|---|---|---|---|---|---|---|
9.3 | 12.308 | 1.349 | 0 | 0 | 1 | -0.002 | 0.219 | -0.135 | -0.184 |
7.2 | 12.478 | 1.791 | 1 | 1 | 1 | 0.025 | 0.423 | -0.272 | -0.382 |
8.7 | 11.858 | 1.389 | 0 | 1 | 1 | 0.033 | 0.240 | -0.073 | -0.345 |
5.3 | 8.813 | 1.454 | 1 | 0 | 1 | -0.234 | 0.250 | -0.139 | 0.175 |
1.6 | 10.428 | 2.310 | 1 | 0 | 1 | -0.533 | 0.711 | -0.673 | 0.591 |
5.7 | 8.449 | 1.287 | 0 | 0 | 1 | -0.204 | 0.188 | -0.100 | 0.171 |
Will be covered in Machine Learning Course