I. Ozkan
Fall 2025
An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Chapter 4
Using R for Introductory Econometrics, Florian Heiss, Chapter 17, pp. 253-260
Linear Probability Model
Probit Regression
Logit Regression
\(\{(y_1, X_1),(y_2, X_2),...,(y_n, X_n)\}\)
with dependent variable
\(y_i \in \{0,1\}\)
and covariates
\(X_i=(x_{i1},x_{i2},\dots,x_{ik})\).
\(E(Y\vert X_1,X_2,\dots,X_k) = P(Y=1\vert X_1, X_2,\dots, X_k)\)
and
\(P(y_i = 1 \vert x_{i1}, x_{i2}, \dots, x_{ik}) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}\)
Each \(\beta_j\) is interpreted as the change in the probability that \(y_i=1\) for a one-unit change in \(x_{ij}\), holding the other variables constant
Notes: \(R^2\) has no meaningful interpretation here, and \(\varepsilon\) is always heteroskedastic (hence robust standard errors should be used)
Let’s use the Home Mortgage Disclosure Act (HMDA) data from the AER package. See Introduction to Econometrics with R, Chapter 11
The variable deny is a binary variable indicating whether a mortgage application is denied (deny = yes) or accepted (deny = no).
deny is first modeled with one explanatory variable, pirat, the ratio of the expected monthly loan payment to the applicant’s income; the afam (African American) variable is added afterwards
The usual first look at the data is shown next (only 3 of the variables are displayed: deny, pirat and afam):
Home Mortgage Disclosure Act Data (first rows; the HMDA dataset is part of the AER package):

| deny | pirat | afam |
|---|---|---|
| no | 0.221 | no |
| no | 0.265 | no |
| no | 0.372 | no |
| no | 0.320 | no |
| no | 0.360 | no |
| no | 0.240 | no |
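A minimal sketch of loading these data in R (HMDA ships with the AER package):

```r
# Load the HMDA data from the AER package
library(AER)
data(HMDA)

# First rows of the three variables shown above
head(HMDA[, c("deny", "pirat", "afam")])
```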
\(deny_i = \beta_0 + \beta_1 \times (P/I\ ratio)_i + u_i\)
| | Dependent variable: |
|---|---|
| | deny |
| pirat | 0.604*** |
| | (0.061) |
| Constant | -0.080*** |
| | (0.021) |
| Observations | 2,380 |
| R2 | 0.040 |
| Adjusted R2 | 0.039 |
| Residual Std. Error | 0.318 (df = 2378) |
| F Statistic | 98.406*** (df = 1; 2378) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
With heteroskedasticity-robust standard errors, the estimated model is
\(\widehat{deny} = -\underset{(0.032)}{0.080} + \underset{(0.098)}{0.604} \, (P/I \ ratio)\)
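A sketch of how this model can be estimated in R (variable names as in HMDA; the object names are illustrative, and deny is converted from a factor to 0/1 first):

```r
# Linear probability model: regress the 0/1 denial indicator on pirat
HMDA$deny_num <- as.numeric(HMDA$deny == "yes")
lpm1 <- lm(deny_num ~ pirat, data = HMDA)

# Heteroskedasticity-robust (HC1) standard errors, as recommended above
library(lmtest)
library(sandwich)
coeftest(lpm1, vcov. = vcovHC(lpm1, type = "HC1"))
```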
\(deny_i = \beta_0 + \beta_1 \times (P/I\ ratio)_i + \beta_2 \times black_i + u_i\)
| | Dependent variable: |
|---|---|
| | deny |
| pirat | 0.559*** |
| | (0.060) |
| blackyes | 0.177*** |
| | (0.018) |
| Constant | -0.091*** |
| | (0.021) |
| Observations | 2,380 |
| R2 | 0.076 |
| Adjusted R2 | 0.075 |
| Residual Std. Error | 0.312 (df = 2377) |
| F Statistic | 97.760*** (df = 2; 2377) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
With heteroskedasticity-robust standard errors, the estimated model is
\(\widehat{deny} = \, -\underset{(0.033)}{0.091} + \underset{(0.104)}{0.559} (P/I \ ratio) + \underset{(0.025)}{0.177} black\)
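The augmented model can be estimated the same way; a sketch, assuming black is created as a copy of the afam factor (the tables above report it as blackyes):

```r
# Add the race indicator; 'black' is a renamed copy of afam
HMDA$black <- HMDA$afam
lpm2 <- lm(deny_num ~ pirat + black, data = HMDA)

# Robust (HC1) standard errors again
coeftest(lpm2, vcov. = vcovHC(lpm2, type = "HC1"))
```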
The new variable (black) has a positive and significant coefficient, indicating that it has explanatory power: holding the P/I ratio constant, being African American raises the estimated denial probability by about 17.7 percentage points.
The weakness of the linear probability model lies in its assumption that the conditional probability function is linear. Linearity does not restrict the predicted probabilities to the interval \([0,1]\).
Probit and logit regression models are commonly used to overcome this weakness
Recall that \(y_i\) takes only the values \(0\) and \(1\), \(y_i \in \{0,1\}\). One way to overcome the weakness of the linear probability model is to transform the modeled probability with a function whose result ranges over \((-\infty, +\infty)\).
This transformation is generally performed with a so-called link function.
Since the quantity being modeled is a probability, \(P(y_i=1) \in [0,1]\), an inverse cumulative distribution function can map it to a continuous variable on the whole real line
In probit regression:
\(E(Y\vert X) = P(Y=1\vert X) = \Phi(\beta_0 + \beta_1 x_1+ \cdots + \beta_k x_k)\)
\(\Phi(\cdot)\) is the standard normal cumulative distribution function, or,
\(Y = \Phi(\beta_0 + \beta_1 x_1+ \cdots + \beta_k x_k)\)
where, \(\Phi(z) = P(Z \leq z) \ , \ Z \sim \mathcal{N}(0,1)\)
and hence,
\(\Phi^{-1}(Y) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k\)
| | Dependent variable: |
|---|---|
| | deny |
| pirat | 2.968*** |
| | (0.386) |
| Constant | -2.194*** |
| | (0.138) |
| Observations | 2,380 |
| Log Likelihood | -831.792 |
| Akaike Inf. Crit. | 1,667.585 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
```
Call:
glm(formula = deny ~ pirat, family = binomial(link = "probit"), 
    data = HMDA)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.1941     0.1378 -15.927  < 2e-16 ***
pirat         2.9679     0.3858   7.694 1.43e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1744.2  on 2379  degrees of freedom
Residual deviance: 1663.6  on 2378  degrees of freedom
AIC: 1667.6

Number of Fisher Scoring iterations: 6
```
\(\text{deviance} = -2 \times \left[\ln(\hat f^{}_{model}) - \ln(\hat f^{}_{s}) \right]\) where \(\hat f^{}_{s}\) is the maximized likelihood of the saturated model, which assumes that each observation has its own parameters.
\(\text{pseudo-}R^2 = 1 - \frac{logLik(\hat f^{}_{model})}{logLik(\hat f^{}_{null})}\)
where \(\hat f^{}_{null}\) is the maximized likelihood of a model with just an intercept. This can be calculated to assess the model fit.
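A minimal sketch of this computation for the probit fit above (the object names probit1 and probit_null are illustrative):

```r
# Probit fit and the intercept-only (null) model
probit1 <- glm(deny ~ pirat, family = binomial(link = "probit"), data = HMDA)
probit_null <- glm(deny ~ 1, family = binomial(link = "probit"), data = HMDA)

# McFadden's pseudo-R^2 = 1 - logLik(model) / logLik(null)
1 - as.numeric(logLik(probit1)) / as.numeric(logLik(probit_null))
```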
From the summary output above,
the change in deviance is \(1744.2 - 1663.6 = 80.6\) with \(df = 2379 - 2378 = 1\), and the \(\chi^2_1\) p-value is about \(2.78 \times 10^{-19}\)
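The corresponding likelihood-ratio test can be reproduced from the reported deviances; a sketch:

```r
# p-value of the change in deviance against chi-squared with df = 1
pchisq(1744.2 - 1663.6, df = 1, lower.tail = FALSE)

# equivalently, an analysis-of-deviance table for the fitted probit
anova(probit1, test = "Chisq")
```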
\(\widehat{P(deny \vert P/I\ ratio)} = \Phi(-\underset{(0.19)}{2.19} + \underset{(0.54)}{2.97} \, (P/I \ ratio))\)
| | Dependent variable: |
|---|---|
| | deny |
| pirat | 2.742*** |
| | (0.380) |
| blackyes | 0.708*** |
| | (0.083) |
| Constant | -2.259*** |
| | (0.137) |
| Observations | 2,380 |
| Log Likelihood | -797.136 |
| Akaike Inf. Crit. | 1,600.272 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
\(\widehat{P(deny\vert P/I \ ratio, black)} = \Phi (-\underset{(0.18)}{2.26} + \underset{(0.50)}{2.74} (P/I \ ratio) + \underset{(0.08)}{0.71} black)\)
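A sketch of the augmented probit fit (using the black variable defined earlier; the object name is illustrative):

```r
# Probit with the race indicator added
probit2 <- glm(deny ~ pirat + black,
               family = binomial(link = "probit"), data = HMDA)
coeftest(probit2, vcov. = vcovHC(probit2, type = "HC1"))
```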
For logit regression, note that a quantity \(Y \in (0, +\infty)\) satisfies \(\ln(Y) \in (-\infty,+\infty)\)
To exploit this, the odds are used
Assumption: a linear relationship between the predictor variables and the log-odds of the event \(Y=1\)
Probability and odds
\(\text{probability}=\frac {N_{Y=1}}{N}\)
Example (rolling a fair six-sided die): \(P(X\leq2)=\frac{2}{6} \implies \text{odds}=\frac{2/6}{1-2/6}=\frac{2}{4}=\frac{1}{2}\)
\(\text{odds}=\frac {\text{Frequency of Y=1}}{\text{Frequency of Y} \neq 1}\)
\(\implies \text{odds}=\frac {\text{(Frequency of Y=1)}/N}{\text{(Frequency of Y} \neq 1)/N}\)
\(\implies \text{odds}=\frac{\text{probability}}{1-\text{probability}}\)
\(\implies \text{probability}=\frac{\text{odds}}{1+\text{odds}}\)
The odds ratio is the ratio of two odds.
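The die example above, verified numerically:

```r
# Probability -> odds -> back to probability (fair-die example)
p <- 2 / 6
odds <- p / (1 - p)   # 0.5
odds / (1 + odds)     # recovers 2/6
```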
Logit regression is similar to probit regression; only the cumulative distribution function differs. For logit it is the standard logistic CDF:
\(F(x) = \frac{1}{1+e^{-x}}\)
The model,
\[\begin{align*} P(Y=1\vert X_1, X_2, \dots, X_k) =& \, F(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k) \\ =& \, \frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)}}. \end{align*}\]
To understand this, let’s start with the log-odds
\[l=ln \left( \frac{P(Y=1)}{1-P(Y=1)} \right)= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k\]
\[\implies P(Y=1)= \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k}}{1+e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k}}=\frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)}}\]
\(P(deny=1 \vert P/I\ ratio) = F(\beta_0 + \beta_1 (P/I \ ratio))\)
and
\(P(deny=1 \vert P/I\ ratio, black) = F(\beta_0 + \beta_1 (P/I \ ratio) + \beta_2 \, black)\)
| | Dependent variable: | |
|---|---|---|
| | deny | |
| | (1) | (2) |
| pirat | 5.884*** | 5.370*** |
| | (0.734) | (0.728) |
| blackyes | | 1.273*** |
| | | (0.146) |
| Constant | -4.028*** | -4.126*** |
| | (0.269) | (0.268) |
| Observations | 2,380 | 2,380 |
| Log Likelihood | -830.094 | -795.695 |
| Akaike Inf. Crit. | 1,664.188 | 1,597.390 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | |
\[\widehat{P(deny=1 \vert P/I\ ratio)} = F(-\underset{(0.27)}{4.03} + \underset{(0.73)}{5.88} \, (P/I \ ratio))\]
\[\widehat{P(deny=1 \vert P/I ratio, black)} = F(-\underset{(0.35)}{4.13} + \underset{(0.96)}{5.37} (P/I \ ratio) + \underset{(0.15)}{1.27} black)\]
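A sketch of how these two logit models can be fit (object names illustrative; black as defined earlier):

```r
# Logit models: P/I ratio alone, then with the race indicator added
logit1 <- glm(deny ~ pirat, family = binomial(link = "logit"), data = HMDA)
logit2 <- glm(deny ~ pirat + black, family = binomial(link = "logit"), data = HMDA)
```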
Model (1):

| | OddsRatio | 2.5 % | 97.5 % |
|---|---|---|---|
| (Intercept) | 0.018 | 0.010 | 0.030 |
| pirat | 359.422 | 88.342 | 1565.122 |

Model (2):

| | OddsRatio | 2.5 % | 97.5 % |
|---|---|---|---|
| (Intercept) | 0.016 | 0.009 | 0.027 |
| pirat | 214.941 | 53.848 | 931.657 |
| blackyes | 3.571 | 2.675 | 4.747 |
For a one-unit increase in the \(\text{P/I ratio}\), the odds of denial (relative to acceptance) increase by a factor of 359.42 (without the African American variable) or 214.94 (with it)
Being African American increases the odds of denial by a factor of 3.57 (equivalently, it changes the log-odds of denial by 1.273)
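The odds-ratio tables above can be reproduced by exponentiating the logit coefficients and their confidence limits; a sketch for model (2):

```r
# Odds ratios with 95% profile-likelihood confidence intervals
exp(cbind(OddsRatio = coef(logit2), confint(logit2)))
```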
Following the ISLR book, let’s calculate the difference in the predicted probability of denial between black and white applicants with \((P/I\ ratio)=0.3\):
\[P(Y=1|(P/Iratio=0.3, black=1))= \frac{1}{1+e^{-(-4.13 + 5.37 \times 0.3 + 1.27)}} \approx 0.224\]
\[P(Y=1|(P/Iratio=0.3, black=0))= \frac{1}{1+e^{-(-4.13 + 5.37 \times 0.3)}} \approx 0.075\]
There is a \(0.224 - 0.075 = 0.149\) difference in the predicted probability of denial if the applicant is African American.
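The same difference can be obtained with predict(); a sketch using the logit2 fit from above:

```r
# Predicted denial probabilities at P/I ratio = 0.3 for both groups
newdat <- data.frame(pirat = c(0.3, 0.3),
                     black = factor(c("yes", "no"), levels = c("no", "yes")))
p_hat <- predict(logit2, newdata = newdat, type = "response")
p_hat                 # approx. 0.224 (black) and 0.075 (white)
p_hat[1] - p_hat[2]   # approx. 0.149
```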
All of the above specifications omit the majority of the available variables.
We may want to estimate a (near) full model.
To follow the book, the lvrat (loan-to-value ratio) variable is converted to a categorical variable:
\[\begin{align*} lvrat = \begin{cases} \text{low} & \text{if} \ \ lvrat < 0.8, \\ \text{medium} & \text{if} \ \ 0.8 \leq lvrat \leq 0.95, \\ \text{high} & \text{if} \ \ lvrat > 0.95 \end{cases} \end{align*}\]
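One way to implement this conversion in R (a sketch, following the companion book’s approach of replacing lvrat with an ordered factor):

```r
# Discretize the loan-to-value ratio into three ordered categories
HMDA$lvrat <- factor(
  ifelse(HMDA$lvrat < 0.8, "low",
         ifelse(HMDA$lvrat <= 0.95, "medium", "high")),
  levels = c("low", "medium", "high")
)
```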
| | Dependent variable: deny | | | | |
|---|---|---|---|---|---|
| | OLS | logistic | probit | probit | probit |
| | (1) | (2) | (3) | (4) | (5) |
| blackyes | 0.084*** | 0.688*** | 0.389*** | 0.371*** | 0.363*** |
| | (0.023) | (0.183) | (0.099) | (0.100) | (0.101) |
| pirat | 0.449*** | 4.764*** | 2.442*** | 2.464*** | 2.622*** |
| | (0.114) | (1.332) | (0.673) | (0.654) | (0.665) |
| hirat | -0.048 | -0.109 | -0.185 | -0.302 | -0.502 |
| | (0.110) | (1.298) | (0.689) | (0.689) | (0.715) |
| lvratmedium | 0.031** | 0.464*** | 0.214*** | 0.216*** | 0.215** |
| | (0.013) | (0.160) | (0.082) | (0.082) | (0.084) |
| lvrathigh | 0.189*** | 1.495*** | 0.791*** | 0.795*** | 0.836*** |
| | (0.050) | (0.325) | (0.183) | (0.184) | (0.185) |
| chist | 0.031*** | 0.290*** | 0.155*** | 0.158*** | 0.344*** |
| | (0.005) | (0.039) | (0.021) | (0.021) | (0.108) |
| mhist | 0.021* | 0.279** | 0.148** | 0.110 | 0.162 |
| | (0.011) | (0.138) | (0.073) | (0.076) | (0.104) |
| phistyes | 0.197*** | 1.226*** | 0.697*** | 0.702*** | 0.717*** |
| | (0.035) | (0.203) | (0.114) | (0.115) | (0.116) |
| insuranceyes | 0.702*** | 4.548*** | 2.557*** | 2.585*** | 2.589*** |
| | (0.045) | (0.576) | (0.305) | (0.299) | (0.306) |
| selfempyes | 0.060*** | 0.666*** | 0.359*** | 0.346*** | 0.342*** |
| | (0.021) | (0.214) | (0.113) | (0.116) | (0.116) |
| singleyes | | | | 0.229*** | 0.230*** |
| | | | | (0.080) | (0.086) |
| hschoolyes | | | | -0.613*** | -0.604** |
| | | | | (0.229) | (0.237) |
| unemp | | | | 0.030* | 0.028 |
| | | | | (0.018) | (0.018) |
| condominyes | | | | | -0.055 |
| | | | | | (0.096) |
| I(mhist == 3) | | | | | -0.107 |
| | | | | | (0.301) |
| I(mhist == 4) | | | | | -0.383 |
| | | | | | (0.427) |
| I(chist == 3) | | | | | -0.226 |
| | | | | | (0.248) |
| I(chist == 4) | | | | | -0.251 |
| | | | | | (0.338) |
| I(chist == 5) | | | | | -0.789* |
| | | | | | (0.412) |
| I(chist == 6) | | | | | -0.905* |
| | | | | | (0.515) |
| Constant | -0.183*** | -5.707*** | -3.041*** | -2.575*** | -2.896*** |
| | (0.028) | (0.484) | (0.250) | (0.350) | (0.404) |
| Observations | 2,380 | 2,380 | 2,380 | 2,380 | 2,380 |
| R2 | 0.266 | | | | |
| Adjusted R2 | 0.263 | | | | |
| Log Likelihood | | -635.637 | -636.847 | -628.614 | -625.064 |
| Akaike Inf. Crit. | | 1,293.273 | 1,295.694 | 1,285.227 | 1,292.129 |
| Residual Std. Error | 0.279 (df = 2369) | | | | |
| F Statistic | 85.974*** (df = 10; 2369) | | | | |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | | | | |
Logistic and probit regression produce similar results. Which one to use largely depends on the researcher’s preference
The linear probability model violates the homoskedasticity and normality-of-errors assumptions of linear regression, so the usual standard errors are invalid (and hence so are the hypothesis tests based on them)
If the number of successes relative to the number of failures is very small, the stability of the model is questionable
Both logit and probit regression require more observations than standard regression because both are estimated by maximum likelihood
Pseudo-\(R^2\) measures exist, but their interpretation differs from the \(R^2\) of OLS regression. See the UCLA IDRE website for more details
Performance measures obtained using probit and logit models are similar
Since \(R^2\) is invalid as a measure of fit, the pseudo-\(R^2\) may be used (as explained before)
\[\text{pseudo-}R^2 = 1 - \frac{logLik(\hat f^{}_{model})}{logLik(\hat f^{}_{null})}\]
\[\begin{align*} \hat Y_i = \begin{cases} 1 & \text{if} \ \ \hat P(Y_i|X_{i1}, \dots, X_{ik}) > 0.5, \\ 0 & \text{if} \ \ \hat P(Y_i|X_{i1}, \dots, X_{ik}) \leq 0.5. \\ \end{cases} \end{align*}\]
The predictions \(\hat Y_i\) can then be assessed against the observed outcomes. Beyond creating two classes with a probability threshold, the quality of the predicted probabilities is neglected. The threshold may be set to other values based on other measures, such as information gain
With the threshold set to \(0.5\), the misclassification error rate is \((274+6)/2380 \approx 0.1176\),
and the confusion matrix is (rows: predicted class, columns: actual class):

| | actual 0 | actual 1 |
|---|---|---|
| predicted 0 | 2089 | 274 |
| predicted 1 | 6 | 11 |
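A sketch of the threshold classification; which fitted model supplies the probabilities is an assumption here (logit2 is used for illustration, so the exact numbers may differ from those above):

```r
# Classify with the 0.5 threshold and tabulate predictions vs. outcomes
p_fit  <- predict(logit2, type = "response")  # replace with the model of interest
pred   <- as.numeric(p_fit > 0.5)
actual <- as.numeric(HMDA$deny == "yes")
table(pred, actual)     # confusion matrix
mean(pred != actual)    # misclassification error rate
```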
\(TPR=\frac{TP}{TP+FN}\) and \(FPR=\frac{FP}{FP+TN}\)
The ROC curve plots TPR against FPR; the area under it (AUC) can be read as the probability that a randomly chosen positive case is ranked above a randomly chosen negative one. Comparing the two fitted logit models,
\(\widehat{P(deny=1 \vert P/I\ ratio)} = F(-\underset{(0.27)}{4.03} + \underset{(0.73)}{5.88} \, (P/I \ ratio))\)
\(\widehat{P(deny=1 \vert P/I ratio, black)} = F(-\underset{(0.35)}{4.13} + \underset{(0.96)}{5.37} (P/I \ ratio) + \underset{(0.15)}{1.27} black)\)
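One way to compare the two fits is by AUC, computed here with the pROC package (an assumption; any ROC implementation works):

```r
# AUC: probability a random positive is ranked above a random negative
library(pROC)
auc(roc(HMDA$deny, predict(logit1, type = "response")))
auc(roc(HMDA$deny, predict(logit2, type = "response")))
```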
For a brief discussion of these measures, please read: