I. Ozkan
Spring 2025
Keywords:
Inputs: independent variables, covariates, predictors, features, regressors.
Outputs: dependent variables, variates, labels, regressands.
Review of Basics of Linear Regression (we do not go into the details; refer to the econometrics course notes).
Linear approximation may be used to approximate many processes successfully
Linear models can be combined to create a larger model (system)
It is possible to obtain an analytical solution (this may not be the case for larger systems)
Linear regression provides an introduction to many of the core machine learning concepts
TV | radio | newspaper | sales |
---|---|---|---|
230.1 | 37.8 | 69.2 | 22.1 |
44.5 | 39.3 | 45.1 | 10.4 |
17.2 | 45.9 | 69.3 | 9.3 |
151.5 | 41.3 | 58.5 | 18.5 |
180.8 | 10.8 | 58.4 | 12.9 |
8.7 | 48.9 | 75.0 | 7.2 |
57.5 | 32.8 | 23.5 | 11.8 |
120.2 | 19.6 | 11.6 | 13.2 |
8.6 | 2.1 | 1.0 | 4.8 |
199.8 | 2.6 | 21.2 | 10.6 |
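As a quick illustration (a minimal sketch using only the ten rows shown above, so the estimates are not meaningful for the full data set), such data can be fit with lm() in R:

```r
# Ten rows of the advertising data shown in the table above
ads <- data.frame(
  TV        = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6, 199.8),
  radio     = c(37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1, 2.6),
  newspaper = c(69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0, 21.2),
  sales     = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6)
)

# Linear model: sales explained by the three advertising channels
fit_ads <- lm(sales ~ TV + radio + newspaper, data = ads)
summary(fit_ads)
```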
Least Squares Estimation
Let:
\(\{(x_1, y_1), (x_2,y_2), \cdots , (x_n, y_n)\}\) be the \(n\) input-output pairs, where \(x_i \in \mathbb{R}^k\) is a vector with \(k\) attributes.
The output (target), \(y_i \in \mathbb{R}\), is univariate.
Then the linear model is:
\[y_i=\theta_0+\theta_1x_{i1}+ \theta_2x_{i2}+ \cdots + \theta_kx_{ik}+\varepsilon_i\]
In some textbooks this is written with \(\beta\) as:
\[y_i=\beta_0+\beta_1x_{i1}+ \beta_2x_{i2}+ \cdots + \beta_kx_{ik}+\varepsilon_i\]
The estimation problem is then to find \(\hat \theta\) (or \(\hat \beta\)) under the standard assumptions (we skip them here; refer to the Econometrics course).
The parameter vector \(\hat \theta\) is found by minimizing a loss (cost, error) function \(L(\theta)\), which can be written as:
\[L(\theta)=\sum_{i=1}^{n}\varepsilon_i^2\]
\[L(\theta)=\sum_{i=1}^{n}[y_i-(\hat\theta_0+\hat\theta_1x_{i1}+ \hat\theta_2x_{i2}+ \cdots + \hat\theta_kx_{ik})]^2\]
Then \(\hat \theta\) can be estimated by:
\(\hat\theta = \operatorname*{arg\,min}_\theta L(\theta)\)
The loss function can be written in matrix form (using the more familiar \(\beta\) instead of \(\theta\)):
\[L(\beta)=\varepsilon^T\varepsilon=\underbrace{(Y-X\beta)^T}_{\varepsilon^T_{(1\times n)}}\underbrace{(Y-X\beta)}_{\varepsilon_{(n\times 1)}}\]
Either matrix differentiation or direct matrix manipulation yields the same estimator of \(\beta\):
\[\hat\beta=(X^TX)^{-1}X^TY\]
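A minimal sketch with simulated data, computing the closed-form estimator directly and comparing it with lm() (variable names are illustrative):

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
X  <- cbind(1, x1, x2)                        # design matrix with intercept column
y  <- as.numeric(X %*% c(2, 0.5, -1) + rnorm(n, sd = 0.3))

# Closed-form OLS: beta_hat = (X'X)^{-1} X'Y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Same model via lm() (intercept added automatically)
fit <- lm(y ~ x1 + x2)
cbind(closed_form = beta_hat, lm = coef(fit))
```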
Maximum Likelihood Estimation
Likelihood
Let \(X\sim N(\mu, \sigma^2)\); then the density function is
\[P(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-(x - \mu)^2/2\sigma^2}\]
Assume that we observe \(X=\{x_1,x_2,\cdots,x_n\}\). The estimation problem for \(\mu,\sigma\) can then be formulated as the maximization of the likelihood function,
\[P(x_1,x_2,...,x_n \mid \text{parameters})\]
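A minimal numerical sketch of this idea with simulated data: maximize the log-likelihood over \((\mu, \sigma)\) with optim() and compare with the analytical ML estimates:

```r
set.seed(123)
x <- rnorm(200, mean = 5, sd = 2)                    # observed sample

# Negative log-likelihood of N(mu, sigma^2) for the sample
negloglik <- function(par) {
  mu <- par[1]; sigma <- par[2]
  if (sigma <= 0) return(Inf)                        # keep sigma in the valid range
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}

# Maximize the likelihood (i.e., minimize the negative log-likelihood)
optim(c(mu = 0, sigma = 1), negloglik)$par
# Analytical ML estimates for comparison: mean(x), sqrt(mean((x - mean(x))^2))
```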
Now let \(Y\) follow the linear regression model, where
\[y_i \sim N(x_i^T\beta, \sigma^2)=x_i^T\beta+N(0, \sigma^2)\]
\[P(Y|X,\beta,\sigma)=\prod_{i=1}^{n}P(y_i|x_i,\beta,\sigma)\]
\[P(Y|X,\beta,\sigma)=\prod_{i=1}^{n}(2 \pi \sigma^2)^{-1/2} e^{-\frac {1}{2\sigma^2}(y_i-x_i^T\beta)^2}\]
\[P(Y|X,\beta,\sigma)=(2 \pi \sigma^2)^{-n/2} e^{-\frac {1}{2\sigma^2}\sum_{i=1}^{n}(y_i-x_i^T\beta)^2}\]
Taking the logarithm of this function (valid since \(P>0\)) gives the log-likelihood:
\[Log(L(\beta))=l(\beta)=-\frac{n}{2}ln(2 \pi \sigma^2)- \frac{1}{2 \sigma^2}(Y-X\beta)^T(Y-X\beta)\]
Take the partial derivatives: (i) assuming \(\sigma\) is known, solve for \(\beta\) (this should result in the same equation as least squares), and (ii) assuming \(\beta\) is known, solve for \(\sigma\) (this should result in the estimator of the variance,
\(E[(y_i-x_i^T\beta)^2]=\frac{1}{n}(Y-X\beta)^T(Y-X\beta)\)).
Then the prediction given new data \(X_*\) is,
\[P(y|X_*,\beta,\sigma)=N(y|X_*^T\beta_{ML},\sigma^2)\]
Let \(RSS=\varepsilon_1^2+\varepsilon_2^2+\cdots+\varepsilon_n^2=\sum_{i=1}^{n}\varepsilon_i^2\)
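A minimal sketch with simulated data, checking numerically that the ML estimate of \(\beta\) coincides with the least-squares estimate and that the ML variance estimate equals \(RSS/n\):

```r
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
X <- cbind(1, x)

# Negative log-likelihood of the linear regression model
negloglik <- function(par) {
  beta <- par[1:2]; sigma <- par[3]
  if (sigma <= 0) return(Inf)
  -sum(dnorm(y, mean = as.vector(X %*% beta), sd = sigma, log = TRUE))
}
ml <- optim(c(0, 0, 1), negloglik)$par

# Compare: ML (beta, sigma) vs. OLS beta and sqrt(RSS/n)
ols <- lm(y ~ x)
rss <- sum(residuals(ols)^2)
rbind(ML = ml, OLS = c(coef(ols), sqrt(rss / n)))
```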
This example is from Applied Econometrics with R, Kleiber, Christian and Zeileis, Achim, Chapter 3.
The journal pricing data (Journals) will be used.
[1] 180 10
[1] "title" "publisher" "society" "price" "pages"
[6] "charpp" "citations" "foundingyear" "subs" "field"
subs price citeprice
Min. : 2.0 Min. : 20.0 Min. : 0.005223
1st Qu.: 52.0 1st Qu.: 134.5 1st Qu.: 0.464495
Median : 122.5 Median : 282.0 Median : 1.320513
Mean : 196.9 Mean : 417.7 Mean : 2.548455
3rd Qu.: 268.2 3rd Qu.: 540.8 3rd Qu.: 3.440171
Max. :1098.0 Max. :2120.0 Max. :24.459459
A regression with log-transformed variables (here, a log-log specification) is better suited.
The equation:
\[ln(subs_i)=\beta_0+\beta_1 ln(citeprice_i) + \varepsilon_i\]
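A sketch of the R code that builds the journals data frame and fits this model (following Kleiber and Zeileis, Chapter 3; it assumes the AER package is installed):

```r
library("AER")            # the Journals data set ships with the AER package
data("Journals")

journals <- Journals[, c("subs", "price")]
journals$citeprice <- Journals$price / Journals$citations   # price per citation

jour_lm <- lm(log(subs) ~ log(citeprice), data = journals)
summary(jour_lm)
```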
[1] "lm"
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
Call:
lm(formula = log(subs) ~ log(citeprice), data = journals)
Residuals:
Min 1Q Median 3Q Max
-2.72478 -0.53609 0.03721 0.46619 1.84808
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.76621 0.05591 85.25 <2e-16 ***
log(citeprice) -0.53305 0.03561 -14.97 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7497 on 178 degrees of freedom
Multiple R-squared: 0.5573, Adjusted R-squared: 0.5548
F-statistic: 224 on 1 and 178 DF, p-value: < 2.2e-16
Call: the formula with the associated data. Here data = journals and the formula corresponds to the equation above.
Residuals (error terms): the five-number summary of the residuals \(\hat\varepsilon_i\)
Coefficients: \(\beta_0: \: intercept, \: \beta_1:\: ln(citeprice)\),
Estimate: \(\hat \beta_0\) and \(\hat \beta_1\)
Std. Error: Standard Error for Estimates
t value: the t-statistic for each estimate, \(t\text{-}value = {\hat \beta_i \over Std.Error}\)
Pr(>|t|): the p-value for the hypothesis test; a small p-value indicates that the association is unlikely to be due to chance
Residual Standard Error, \(RSE=\sqrt{\frac{1}{n-2}RSS}\): an estimate of
\(\sqrt{Var(\varepsilon_i)}=\sqrt{\sigma^2_\varepsilon}=\hat \sigma_\varepsilon\)
Even if the true parameters of the model were known and this linear model were the true model, this number shows how much the prediction deviates on average from the true value of the dependent variable.
RSE is considered a measure of the lack of fit
Degrees of freedom: \(n-p=\text{number of observations}-\text{number of parameters}\)
R-squared: \(R^2\) and Adjusted R-squared
F-statistics and its p-value
Memory Recall: R-squared
\(R^2\) is the proportion of the dependent variable’s variance explained by the independent variable(s). It is therefore
\(R^2 = 1-\frac{var(\text{residuals})}{var(\text{dependent variable})}\)
where \(var(\cdot)\) denotes the variance.
\(Adjusted \: R^2: \quad R_{adj}^2 = 1 - \frac{(1-R^2)(n-1)}{n-k-1}\)
\(n\) is the number of observations and \(k\) is the number of independent variables. \(R^2\) increases with the addition of new variables, whereas \(R_{adj}^2\) increases only if the added independent variable improves the model. This is achieved by penalizing for the number of independent variables.
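A small sketch computing these quantities by hand for the journals regression (assuming jour_lm and journals from the earlier sketch) and comparing them with the values reported by summary():

```r
y   <- log(journals$subs)
res <- residuals(jour_lm)
n   <- length(y)
k   <- 1                                      # one independent variable

r2     <- 1 - var(res) / var(y)               # R-squared
r2_adj <- 1 - (1 - r2) * (n - 1) / (n - k - 1)  # adjusted R-squared
c(R2 = r2, adj_R2 = r2_adj)                   # compare with summary(jour_lm)
```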
Data Set: CPS1988, a data frame from the March 1988 Current Population Survey, US Census Bureau
'data.frame': 28155 obs. of 7 variables:
$ wage : num 355 123 370 755 594 ...
$ education : int 7 12 9 11 12 16 8 12 12 14 ...
$ experience: int 45 1 9 46 36 22 51 34 0 18 ...
$ ethnicity : Factor w/ 2 levels "cauc","afam": 1 1 1 1 1 1 1 1 1 1 ...
$ smsa : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
$ region : Factor w/ 4 levels "northeast","midwest",..: 1 1 1 1 1 1 1 1 1 1 ...
$ parttime : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
wage education experience ethnicity smsa
Min. : 50.05 Min. : 0.00 Min. :-4.0 cauc:25923 no : 7223
1st Qu.: 308.64 1st Qu.:12.00 1st Qu.: 8.0 afam: 2232 yes:20932
Median : 522.32 Median :12.00 Median :16.0
Mean : 603.73 Mean :13.07 Mean :18.2
3rd Qu.: 783.48 3rd Qu.:15.00 3rd Qu.:27.0
Max. :18777.20 Max. :18.00 Max. :63.0
region parttime
northeast:6441 no :25631
midwest :6863 yes: 2524
south :8760
west :6091
wage (in dollars per week), education (number of years of education), and experience (number of years of potential work experience) are numeric/integer variables; ethnicity is a factor with 2 levels, Caucasian (cauc) and African-American (afam).
Experience is computed as age - education - 6 (potential experience).
\[ln(wage_i)=\beta_0 + \beta_1 \: experience_i+ \beta_2 \: experience_i^2 + \beta_3 \: education_i + \beta_4 \: ethnicity_i + \varepsilon_i\]
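A sketch of the R code fitting this model (assuming the AER package for the CPS1988 data):

```r
data("CPS1988", package = "AER")

cps_lm <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity,
             data = CPS1988)
summary(cps_lm)
```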
Call:
lm(formula = log(wage) ~ experience + I(experience^2) + education +
ethnicity, data = CPS1988)
Residuals:
Min 1Q Median 3Q Max
-2.9428 -0.3162 0.0580 0.3756 4.3830
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.321e+00 1.917e-02 225.38 <2e-16 ***
experience 7.747e-02 8.800e-04 88.03 <2e-16 ***
I(experience^2) -1.316e-03 1.899e-05 -69.31 <2e-16 ***
education 8.567e-02 1.272e-03 67.34 <2e-16 ***
ethnicityafam -2.434e-01 1.292e-02 -18.84 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5839 on 28150 degrees of freedom
Multiple R-squared: 0.3347, Adjusted R-squared: 0.3346
F-statistic: 3541 on 4 and 28150 DF, p-value: < 2.2e-16
Interpret the regression table:
For ethnicity there is a single coefficient, ethnicityafam (in other words, ethnicity == afam). The other level, cauc, serves as the reference group (this is called a treatment contrast: the treatment level afam is compared with the reference group cauc). This type of variable is called a dummy variable (or indicator variable).
Variable | Dependent variable: log(wage) |
---|---|
experience | 0.077*** (0.001) |
I(experience^2) | -0.001*** (0.00002) |
education | 0.086*** (0.001) |
ethnicityafam | -0.243*** (0.013) |
Constant | 4.321*** (0.019) |
Observations | 28,155 |
R2 | 0.335 |
Adjusted R2 | 0.335 |
Residual Std. Error | 0.584 (df = 28150) |
F Statistic | 3,541.036*** (df = 4; 28150) |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
Any two nested models can be compared using the function anova(), as sketched below.
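For example, the wage model above can be tested against a restricted model without ethnicity (a sketch, assuming cps_lm from the earlier code):

```r
# Restricted model: drop ethnicity (nested within cps_lm)
cps_noeth <- update(cps_lm, . ~ . - ethnicity)

# F-test comparing the restricted and the full model
anova(cps_noeth, cps_lm)
```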
The Akaike Information Criterion, AIC, and the Bayesian Information Criterion, BIC (sometimes called the Schwarz criterion)
When the number of observations is large:
\(AIC=-2\:ln(Likelihood) + 2\:p\), where \(p\) is the number of parameters and \(n\) the number of observations
Otherwise, a correction gives the second-order AIC, AICc,
\(AICc=-2\:ln(Likelihood) + 2\:p + \frac{2p(p+1)}{n -(p+1)}\)
Bayesian Information Criterion
\(BIC=-2\:ln(Likelihood) + ln(n)\:p\)
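In R, both criteria are available for fitted lm objects (a sketch, assuming the models from the previous example):

```r
# Information criteria for the two nested wage models; smaller is better
AIC(cps_noeth, cps_lm)
BIC(cps_noeth, cps_lm)
```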
Let’s have another example: the state.x77 data set (from the datasets package).
The variables to be used are Income (dependent, per capita income, 1974), Frost (number of days with temperature below freezing, 1931–1960), Illiteracy (1970, percent of population) and Murder (rate per 100,000 population, 1976).
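A sketch of the first model fit, producing the results summarized below (state.x77 is a matrix, so it is converted to a data frame first):

```r
states <- as.data.frame(state.x77)   # from the datasets package

inc_lm1 <- lm(Income ~ Frost + Illiteracy + Murder, data = states)
summary(inc_lm1)
```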
Variable | Dependent variable: Income |
---|---|
Frost | -1.254 (2.110) |
Illiteracy | -610.715*** (213.138) |
Murder | 23.074 (30.940) |
Constant | 5,111.097*** (416.576) |
Observations | 50 |
R2 | 0.209 |
Adjusted R2 | 0.157 |
Residual Std. Error | 564.079 (df = 46) |
F Statistic | 4.049** (df = 3; 46) |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
Adding the HS Grad Variable
Variable | Dependent variable: Income |
---|---|
Frost | -0.241 (1.872) |
Illiteracy | -190.516 (217.049) |
Murder | 30.005 (27.235) |
HS Grad | 44.975*** (11.758) |
Constant | 2,074.005** (874.221) |
Observations | 50 |
R2 | 0.403 |
Adjusted R2 | 0.350 |
Residual Std. Error | 495.427 (df = 45) |
F Statistic | 7.594*** (df = 4; 45) |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
Here is a comparison of the parameters of both models.
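A sketch of such a comparison (assuming states and inc_lm1 from the earlier code); the second model is fit here, and the two nested models are also compared formally:

```r
# Extended model adding HS Grad (backticks because of the space in the column name)
inc_lm2 <- lm(Income ~ Frost + Illiteracy + Murder + `HS Grad`, data = states)

# Side-by-side coefficients and an F-test of the added variable
list(model1 = coef(inc_lm1), model2 = coef(inc_lm2))
anova(inc_lm1, inc_lm2)
```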
predict the specific outcomes for given independent variables
\(y = x^T\beta+\varepsilon\)
\(\implies \hat{y} = x^T\hat{\beta}\) since \(E[\varepsilon]=0\)
Since this is a prediction, there is uncertainty associated with it
Both the uncertainty associated with \(\hat{\beta}\) and the uncertainty about the new observation, the variance of \(\varepsilon\), must be included
The confidence interval for predicting specific outcome is given as:
\(\hat{y}^*\pm t_{n-p}^{(\alpha/2)}\hat{\sigma}\sqrt{x^{*T}(X^TX)^{-1}x^* + 1}\)
predict the mean response for given independent variables
The point estimate is the same, \(\hat{y} = x^T\hat{\beta}\). This time only the uncertainty of \(\hat{\beta}\), \(Var(\hat{\beta})\) must be considered
The confidence interval for the mean response is given as:
\(\hat{y}^*\pm t_{n-p}^{(\alpha/2)}\hat{\sigma}\sqrt{x^{*T}(X^TX)^{-1}x^*}\)
where \(t_{n-p}^{(\alpha/2)}\) is the t-statistic with \(n-p\) degrees of freedom and \(100(1-\alpha)\%\) is the confidence level
See the help page of the predict.lm() function
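A sketch of the predict.lm() calls behind the prediction output shown below (assuming jour_lm from the earlier code; newdata is given on the original citeprice scale):

```r
newjournal <- data.frame(citeprice = 3)

# Interval for a specific outcome (prediction interval)
predict(jour_lm, newdata = newjournal, interval = "prediction", level = 0.95)

# Interval for the mean response (confidence interval)
predict(jour_lm, newdata = newjournal, interval = "confidence", level = 0.95)
```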
Call:
lm(formula = log(subs) ~ log(citeprice), data = journals)
Residuals:
Min 1Q Median 3Q Max
-2.72478 -0.53609 0.03721 0.46619 1.84808
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.76621 0.05591 85.25 <2e-16 ***
log(citeprice) -0.53305 0.03561 -14.97 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7497 on 178 degrees of freedom
Multiple R-squared: 0.5573, Adjusted R-squared: 0.5548
F-statistic: 224 on 1 and 178 DF, p-value: < 2.2e-16
Prediction of Specific Outcome for citeprice=3, level=0.95
fit lwr upr
1 4.180593 2.695128 5.666058
Prediction of Mean Response for citeprice=3, level=0.95
fit lwr upr
1 4.180593 4.047897 4.313289