I. Ozkan
Spring 2025
An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Chapter 4
Using R for Introductory Econometrics, Florian Heiss, Chapter 17, pp. 253-260
Linear Probability Model
Probit Regression
Logit Regression
KNN (Read MIS301 Course Notes - pdf version is available on webonline)
\(\{(y_1, X_1),(y_2, X_2),...,(y_n, X_n)\}\)
with dependent variable
\(y_i \in \{0,1\}\)
and covariates
\(X_i=(x_{i1},x_{i2},\dots,x_{ik})\).
\[E(Y\vert X_1,X_2,\dots,X_k) = P(Y=1\vert X_1, X_2,\dots, X_k)\]
and
\[P(y_i = 1 \vert x_{i1}, x_{i2}, \dots, x_{ik}) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}\]
Each \(\beta_j\) is interpreted as the change in the probability that \(y_i=1\) for a one-unit change in \(x_{ij}\), holding the other variables constant.
Notes: \(R^2\) has no meaning and \(\varepsilon\) is always heteroskedastic (hence robust standard errors should be considered).
Let’s use the HMDA mortgage data available with the AER package. See Introduction to Econometrics with R, Chapter 11.
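A minimal sketch of loading the data (assuming the AER package is installed):

```r
# load the HMDA mortgage data shipped with the AER package
library(AER)
data(HMDA)
head(HMDA)
```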
The first few observations are:
deny pirat hirat lvrat chist mhist phist unemp selfemp insurance condomin
1 no 0.221 0.221 0.8000000 5 2 no 3.9 no no no
2 no 0.265 0.265 0.9218750 2 2 no 3.2 no no no
3 no 0.372 0.248 0.9203980 1 2 no 3.2 no no no
4 no 0.320 0.250 0.8604651 1 2 no 4.3 no no no
5 no 0.360 0.350 0.6000000 1 1 no 3.2 no no no
6 no 0.240 0.170 0.5105263 1 1 no 3.9 no no no
afam single hschool
1 no no yes
2 no yes yes
3 no no yes
4 no no yes
5 no no yes
6 no no yes
The variable deny is a binary variable that indicates whether the application was denied (deny=yes) or accepted (deny=no).
deny is first modeled with one explanatory variable, pirat, the ratio of the expected monthly loan payment to the applicant’s income; the afam (African American) indicator is added later.
The usual graph to start with is shown below (it shows only 3 variables: deny, pirat, and afam):
\[deny_i = \beta_0 + \beta_1 \times (P/I\ ratio)_i + \beta_2 \times afam_i + u_i\]
| | Dependent variable: deny |
|---|---|
| pirat | 0.559*** (0.060) |
| afamyes | 0.177*** (0.018) |
| Constant | -0.091*** (0.021) |
| Observations | 2,380 |
| R2 | 0.076 |
| Adjusted R2 | 0.075 |
| Residual Std. Error | 0.312 (df = 2377) |
| F Statistic | 97.760*** (df = 2; 2377) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
And the robust standard errors
\[\widehat{deny} = \, -\underset{(0.033)}{0.091} + \underset{(0.104)}{0.559} (P/I \ ratio) + \underset{(0.025)}{0.177} \cdot black\]
The weakness of the linear probability model lies in its assumption: the conditional probability function is linear. This does not restrict the predicted probabilities to lie between \(0\) and \(1\).
Many model assumptions do not hold (homoskedasticity of errors, normality of errors, etc.).
Probit and logit regression models are commonly used to overcome this weakness.
Recall that \(y_i\) takes only the values \(0\) or \(1\), \(y_i \in \{0,1\}\); one way to overcome the weakness of the linear probability model is to transform \(y \implies F(y)\) such that \(Y \in (-\infty, +\infty)\).
The transformation of the dependent variable is generally performed with a so-called link function.
Since \(y_i\) is treated as a probability (\(P(y_i=1)=1\) when \(y_i=1\); \(P(y_i=1)=0\) when \(y_i=0\)), an inverse cumulative distribution function, which maps probabilities to the whole real line, can transform \(y_i\) to a continuous variable.
In probit regression:
\[E(Y\vert X) = P(Y=1\vert X) = \Phi(\beta_0 + \beta_1 x_1+ \cdots + \beta_k x_k)\]
\(\Phi(\cdot)\) is the standard normal cumulative distribution function, or,
\[Y = \Phi(\beta_0 + \beta_1 x_1+ \cdots + \beta_k x_k)\]
where, \(\Phi(z) = P(Z \leq z) \ , \ Z \sim \mathcal{N}(0,1)\)
and hence,
\[\Phi^{-1}(Y) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k\]
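A minimal sketch of fitting this probit model with glm() (the object name probit_mod is illustrative); it produces the estimates reported below:

```r
# probit regression of deny on pirat
probit_mod <- glm(deny ~ pirat,
                  family = binomial(link = "probit"),
                  data = HMDA)
summary(probit_mod)
```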
| | Dependent variable: deny |
|---|---|
| pirat | 2.968*** (0.386) |
| Constant | -2.194*** (0.138) |
| Observations | 2,380 |
| Log Likelihood | -831.792 |
| Akaike Inf. Crit. | 1,667.585 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
Call:
glm(formula = deny ~ pirat, family = binomial(link = "probit"),
data = HMDA)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.1941 0.1378 -15.927 < 2e-16 ***
pirat 2.9679 0.3858 7.694 1.43e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1744.2 on 2379 degrees of freedom
Residual deviance: 1663.6 on 2378 degrees of freedom
AIC: 1667.6
Number of Fisher Scoring iterations: 6
\(\text{deviance} = -2 \times \left[\ln(\hat f_{s}) - \ln(\hat f_{model}) \right]\) where \(\hat f_{s}\) is the maximized likelihood of the saturated model, a model which assumes that each observation has its own parameters.
\[\text{pseudo-}R^2 = 1 - \frac{logLik(\hat f_{model})}{logLik(\hat f_{null})}\]
where \(\hat f_{null}\) is a model with just an intercept. This can be calculated to assess the model.
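A sketch of computing this by hand (null_mod is an illustrative name; probit_mod comes from the earlier block):

```r
# McFadden pseudo-R^2: compare the fitted model's log-likelihood
# with that of an intercept-only (null) model
null_mod <- glm(deny ~ 1, family = binomial(link = "probit"), data = HMDA)
1 - as.numeric(logLik(probit_mod)) / as.numeric(logLik(null_mod))
```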
From the summary table above,
the change in deviance is \(1744.2 - 1663.6 = 80.6\) with \(df = 2379 - 2378 = 1\), and the p-value is \(2.78 \times 10^{-19}\).
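This p-value can be reproduced from the chi-squared distribution; a minimal sketch:

```r
# p-value for the change in deviance, chi-squared with 1 degree of freedom
pchisq(1744.2 - 1663.6, df = 2379 - 2378, lower.tail = FALSE)
```

The coefficients are re-tested below with heteroskedasticity-robust standard errors: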
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.19415 0.18901 -11.6087 < 2.2e-16 ***
pirat 2.96787 0.53698 5.5269 3.259e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
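The robust z tests above can be computed with coeftest() from the lmtest package together with a sandwich covariance estimator; a sketch, assuming probit_mod from the earlier block (the HC1 type follows the companion book's convention):

```r
# heteroskedasticity-robust z tests for the probit coefficients
library(lmtest)
library(sandwich)
coeftest(probit_mod, vcov. = vcovHC(probit_mod, type = "HC1"))
```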
\[\widehat{P(deny \vert P/I\ ratio)} = \Phi(-\underset{(0.19)}{2.19} + \underset{(0.54)}{2.97} \, (P/I \ ratio))\]
| | Dependent variable: deny |
|---|---|
| pirat | 2.742*** (0.380) |
| afamyes | 0.708*** (0.083) |
| Constant | -2.259*** (0.137) |
| Observations | 2,380 |
| Log Likelihood | -797.136 |
| Akaike Inf. Crit. | 1,600.272 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.258787 0.176608 -12.7898 < 2.2e-16 ***
pirat 2.741779 0.497673 5.5092 3.605e-08 ***
afamyes 0.708155 0.083091 8.5227 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[\widehat{P(deny\vert P/I \ ratio, black)} = \Phi (-\underset{(0.18)}{2.26} + \underset{(0.50)}{2.74} (P/I \ ratio) + \underset{(0.08)}{0.71} \cdot afam)\]
\(y_i \in \{0,1\}\) can also be transformed via \(y \implies F(y)\) such that \(Y \in [0, +\infty) \implies \ln(Y) \in (-\infty,+\infty)\).
To do so, the odds are used.
Assumption: linear relationship between the predictor variables, and the log-odds of the event that \(\displaystyle Y=1\)
Probability and odds
\(\text{probability}=\frac {N_{Y=1}}{N}\)
For example, rolling a fair die: \(P(X\leq2)=\frac{2}{6} \implies \text{odds}=\frac{2/6}{1-2/6}=\frac{2}{4}\)
\(\text{odds}=\frac {\text{Frequency of Y=1}}{\text{Frequency of Y} \neq 1}\)
\(\implies \text{odds}=\frac {\text{(Frequency of Y=1)}/N}{\text{(Frequency of Y} \neq 1)/N}\)
\(\implies \text{odds}=\frac{\text{probability}}{1-\text{probability}}\)
\(\implies \text{probability}=\frac{\text{odds}}{1+\text{odds}}\)
An odds ratio is the ratio of two odds.
Logit regression is similar to probit regression; only the cumulative distribution function, CDF, differs.
CDF:
\(F(x) = \frac{1}{1+e^{-x}}\)
The model,
\(\begin{align*} P(Y=1\vert X_1, X_2, \dots, X_k) =& \, F(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k) \\ =& \, \frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)}}. \end{align*}\)
To understand this, let’s start with log odds
\(l=ln \left( \frac{P(Y=1)}{1-P(Y=1)} \right)= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k\)
\(\implies P(Y=1)= \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k}}{1+e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k}}=\frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)}}\)
\(P(deny=1 \vert P/I\ ratio) = F(\beta_0 + \beta_1(P/I \ ratio))\)
and
\(P(deny=1 \vert P/I\ ratio, black) = F(\beta_0 + \beta_1(P/I \ ratio) + \beta_2 \, black)\)
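A minimal sketch of fitting the two logit models reported below (object names are illustrative):

```r
# logit fits: pirat alone, then adding the afam indicator
logit_mod1 <- glm(deny ~ pirat,
                  family = binomial(link = "logit"), data = HMDA)
logit_mod2 <- glm(deny ~ pirat + afam,
                  family = binomial(link = "logit"), data = HMDA)
```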
| | Dependent variable: deny | |
|---|---|---|
| | (1) | (2) |
| pirat | 5.884*** (0.734) | 5.370*** (0.728) |
| afamyes | | 1.273*** (0.146) |
| Constant | -4.028*** (0.269) | -4.126*** (0.268) |
| Observations | 2,380 | 2,380 |
| Log Likelihood | -830.094 | -795.695 |
| Akaike Inf. Crit. | 1,664.188 | 1,597.390 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | |
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.02843 0.35898 -11.2218 < 2.2e-16 ***
pirat 5.88450 1.00015 5.8836 4.014e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.12556 0.34597 -11.9245 < 2.2e-16 ***
pirat 5.37036 0.96376 5.5723 2.514e-08 ***
afamyes 1.27278 0.14616 8.7081 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\(\widehat{P(deny=1 \vert P/I\ ratio)} = F(-\underset{(0.27)}{4.03} + \underset{(0.73)}{5.88} (P/I \ ratio))\)
\(\widehat{P(deny=1 \vert P/I ratio, black)} = F(-\underset{(0.35)}{4.13} + \underset{(0.96)}{5.37} (P/I \ ratio) + \underset{(0.15)}{1.27} \cdot afam)\)
Model (1):

| | OddsRatio | 2.5 % | 97.5 % |
|---|---|---|---|
| (Intercept) | 0.018 | 0.010 | 0.030 |
| pirat | 359.422 | 88.342 | 1565.122 |

Model (2):

| | OddsRatio | 2.5 % | 97.5 % |
|---|---|---|---|
| (Intercept) | 0.016 | 0.009 | 0.027 |
| pirat | 214.941 | 53.848 | 931.657 |
| afamyes | 3.571 | 2.675 | 4.747 |
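Such odds-ratio tables can be produced by exponentiating the coefficients and their profile-likelihood confidence bounds; a sketch (model objects from the earlier block):

```r
# odds ratios with 95% confidence intervals
exp(cbind(OddsRatio = coef(logit_mod1), confint(logit_mod1)))
exp(cbind(OddsRatio = coef(logit_mod2), confint(logit_mod2)))
```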
For a one-unit increase in \(\text{P/I ratio}\), the odds of being denied increase by a factor of 359.42 (without the African American variable) or 214.94 (with the African American variable).
Being African American increases the odds of being denied by a factor of 3.57 (equivalently, it changes the log-odds of being denied by 1.273).
To follow the book, let’s calculate the difference in denial probabilities between Black and white applicants given \((P/I\ ratio)=0.3\):
\[P(Y=1|(P/Iratio=0.3, black=1))= \frac{1}{1+e^{-(-4.13 + 5.37 \times 0.3 + 1.27)}} \approx 0.224\]
\[P(Y=1|(P/Iratio=0.3, black=0))= \frac{1}{1+e^{-(-4.13 + 5.37 \times 0.3)}} \approx 0.075\]
There is a \(0.149\) difference in the probability of being denied if the applicant is African American.
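A sketch of this calculation with predict() (logit_mod2 from the earlier block):

```r
# predicted denial probabilities at P/I ratio = 0.3 for afam = yes vs. no
newdat <- data.frame(pirat = 0.3, afam = c("yes", "no"))
p <- predict(logit_mod2, newdata = newdat, type = "response")
p            # roughly 0.224 and 0.075
p[1] - p[2]  # roughly 0.149
```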
Logistic regression and probit regression produce similar results; the choice between them largely depends on the researcher’s preference.
The linear probability model violates the homoskedasticity and normality-of-errors assumptions of linear regression, so the standard errors (and hence the hypothesis tests) are invalid.
If the number of successes relative to the number of failures is very small, the stability of the model is questionable.
Both logit and probit regression require more observations than standard regression because both are estimated by maximum likelihood.
Pseudo-\(R^2\) measures exist, but their interpretation differs from the OLS \(R^2\). See the UCLA IDRE web site for more details.
Measures of performance for a probit or logit model are similar to those for classification models.
Since \(R^2\) is invalid as a measure of fit, \(\text{pseudo-}R^2\) may be used (as explained before)
\(\text{pseudo-}R^2 = 1 - \frac{logLik(\hat f^{}_{model})}{logLik(\hat f^{}_{null})}\)
\(\begin{align*} Y_i = \begin{cases} 1 & \text{if} \ \ \hat P(Y_i|X_{i1}, \dots, X_{ik}) > 0.5, \\ 0 & \text{if} \ \ \hat P(Y_i|X_{i1}, \dots, X_{ik}) \leq 0.5. \\ \end{cases} \end{align*}\)
Then the classified \(Y_i\) can be assessed. Beyond splitting the observations into two classes at the probability threshold, the quality of the prediction is simply neglected. The threshold may be set to other values based on other measures, such as information gain.
With the threshold set to \(0.5\), the misclassification error is 0.1176,
and the confusion matrix (rows: predicted class, columns: actual class) is:
0 1
0 2089 274
1 6 11
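A minimal sketch of producing the confusion matrix and the misclassification error (logit_mod2 from the earlier block):

```r
# classify with a 0.5 threshold; rows are predictions, columns actual outcomes
pred   <- as.numeric(predict(logit_mod2, type = "response") > 0.5)
actual <- as.numeric(HMDA$deny == "yes")
table(pred, actual)   # confusion matrix
mean(pred != actual)  # misclassification error, about 0.1176
```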
\(TPR=\frac{TP}{TP+FN}\) and \(FPR=\frac{FP}{FP+TN}\)
The ROC curve built from these rates can be summarized as the probability that a randomly chosen positive case is ranked above a randomly chosen negative one. Comparing the two models:
\(\widehat{P(deny=1 \vert P/I ratio)} = F(-\underset{(0.27)}{4.03} + \underset{(0.73)}{5.88} (P/I \ ratio))\)
\[\widehat{P(deny=1 \vert P/I ratio, afam)} = F(-\underset{(0.35)}{4.13} + \underset{(0.96)}{5.37} (P/I \ ratio) + \underset{(0.15)}{1.27} \cdot afam)\]
More: read the following nice tutorial:
https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/
Example: interpretation of McFadden’s pseudo-\(R^2\):
https://stats.stackexchange.com/questions/82105/mcfaddens-pseudo-r2-interpretation
For \(\rho^2 = 1 - LL_{mod}/LL_0\) (note that \(LL\) is always negative), values in \(0.2 < \rho^2 < 0.4\) are considered very good, and values above \(0.5\) excellent.
Here are some of the measures:
Probit Regression Model
fitting null model for pseudo-r2
llh llhNull G2 McFadden r2ML
-797.13603842 -872.08530450 149.89853216 0.08594259 0.06104017
r2CU
0.11750696
Logistic Regression Model
fitting null model for pseudo-r2
llh llhNull G2 McFadden r2ML
-795.69520837 -872.08530450 152.78019227 0.08759475 0.06217635
r2CU
0.11969421
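These measures match the output format of pR2() from the pscl package; a minimal sketch of the calls, assuming the models with pirat and afam from the earlier tables (probit2 is an illustrative name):

```r
# pseudo-R^2 measures via pR2() from the pscl package
library(pscl)
probit2 <- glm(deny ~ pirat + afam,
               family = binomial(link = "probit"), data = HMDA)
pR2(probit2)      # probit model
pR2(logit_mod2)   # logit model from the earlier sketch
```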
Linearity
In logistic regression, we assume the relationship is linear on the logit scale (link scale)
Diagnose the violations of linearity by plotting each predictor against its component-plus-residual
Add a linear fit of the points (a dashed red line in the below example), and smoothed conditional mean line (blue line)
If a linear fit is appropriate, the two lines will be close to each other
A systematic change (rise and/or fall) in the smoothed line suggests a nonlinear relationship.
If a nonlinear relationship exists, consider adding a new (or transformed) variable.
Example: HMDA data, logistic regression
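A sketch of producing such component-plus-residual plots with crPlots() from the car package (logit_full is an illustrative name; its specification matches the model used in the multicollinearity example below):

```r
# component-plus-residual plots for the continuous predictors
library(car)
logit_full <- glm(deny ~ pirat + hirat + lvrat + unemp + afam,
                  family = binomial(link = "logit"), data = HMDA)
crPlots(logit_full, terms = ~ pirat + hirat + lvrat + unemp)
```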
Multicollinearity
Diagnosis is similar to the linear model.
The example given above has only one continuous regressor; hence, a new model with several regressors is created below.
Variance Inflation Factors (VIFs)
Call:
glm(formula = deny ~ pirat + hirat + lvrat + unemp + afam, family = binomial(link = "logit"),
data = HMDA)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.34016 0.47195 -13.434 < 2e-16 ***
pirat 5.77636 0.95106 6.074 1.25e-09 ***
hirat -1.17531 1.10280 -1.066 0.28654
lvrat 2.69435 0.44695 6.028 1.66e-09 ***
unemp 0.08117 0.03029 2.680 0.00736 **
afamyes 1.15905 0.15062 7.695 1.41e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1744.2 on 2379 degrees of freedom
Residual deviance: 1544.6 on 2374 degrees of freedom
AIC: 1556.6
Number of Fisher Scoring iterations: 5
Variance Inflation Factors
pirat hirat lvrat unemp afam
1.633048 1.635978 1.023026 1.019203 1.037514
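The VIFs above can be reproduced with vif() from the car package; a sketch (logit_full from the block above):

```r
# variance inflation factors for the richer logit model
library(car)
vif(logit_full)
```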
Outlier Effects
DFFITS and DFBETAS can be used to detect influential observations
DFFITS (one value per observation) measures how much the prediction for that observation would change if it were deleted.
DFBETAS (one value per observation and predictor) measures how much each coefficient would change if the observation were deleted.
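A minimal sketch with the base-R dffits() and dfbetas() functions (logit_full assumed from above):

```r
# influence measures: one DFFITS value per observation,
# one DFBETAS value per observation and coefficient
head(dffits(logit_full))
head(dfbetas(logit_full))
```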