Binary Dependent Variable - Classification

I. Ozkan

Fall 2025

Preliminary Readings

Learning Objectives

Binary Dependent Variable

\(\{(y_1, X_1),(y_2, X_2),...,(y_n, X_n)\}\)

with dependent variable

\(y_i \in \{0,1\}\)

and covariates

\(X_i=(x_{i1},x_{i2},..,x_{ik})\).

\(E(Y\vert X_1,X_2,\dots,X_k) = P(Y=1\vert X_1, X_2,\dots, X_k)\)

and

\(P(y_i = 1 \vert x_{i1}, x_{i2}, \dots, x_{ik}) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}\)

Binary Dependent Variable: Linear Probability Model



Home Mortgage Disclosure Act Data
deny   pirat   afam
no     0.221   no
no     0.265   no
no     0.372   no
no     0.320   no
no     0.360   no
no     0.240   no
HMDA Dataset is part of AER Package


Binary Dependent Variable: Linear Probability Model

\(deny_i = \beta_0 + \beta_1 \times (P/I\ ratio)_i + u_i\)


Dependent variable:
deny
pirat 0.604***
(0.061)
Constant -0.080***
(0.021)
Observations 2,380
R2 0.040
Adjusted R2 0.039
Residual Std. Error 0.318 (df = 2378)
F Statistic 98.406*** (df = 1; 2378)
Note: *p<0.1; **p<0.05; ***p<0.01

\(\widehat{deny} = -\underset{(0.032)}{0.080} + \underset{(0.098)}{0.604} (P/I \ ratio)\)
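As a quick numerical check, a Python sketch (the lecture's estimates come from R) of predictions from the fitted line above; it also illustrates a well-known drawback of the linear probability model, namely that predictions are not confined to \([0,1]\):

```python
# Sketch (Python, not the lecture's R): predictions from the fitted linear
# probability model deny-hat = -0.080 + 0.604 * (P/I ratio).
def lpm_deny_hat(pi_ratio):
    return -0.080 + 0.604 * pi_ratio

print(lpm_deny_hat(0.3))   # about 0.10 for a typical P/I ratio
print(lpm_deny_hat(2.0))   # greater than 1 -- a known LPM drawback
```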





Binary Dependent Variable

\(deny_i = \beta_0 + \beta_1 \times (P/I\ ratio)_i + \beta_2 \times black_i + u_i\)


Dependent variable:
deny
pirat 0.559***
(0.060)
blackyes 0.177***
(0.018)
Constant -0.091***
(0.021)
Observations 2,380
R2 0.076
Adjusted R2 0.075
Residual Std. Error 0.312 (df = 2377)
F Statistic 97.760*** (df = 2; 2377)
Note: *p<0.1; **p<0.05; ***p<0.01

With robust standard errors:

\(\widehat{deny} = \, -\underset{(0.033)}{0.091} + \underset{(0.104)}{0.559} (P/I \ ratio) + \underset{(0.025)}{0.177} black\)





Probit Regression

In probit regression:

\(E(Y\vert X) = P(Y=1\vert X) = \Phi(\beta_0 + \beta_1 x_1+ \cdots + \beta_k x_k)\)

\(\Phi(\cdot)\) is the cumulative distribution function of the standard normal distribution, or,

\(Y = \Phi(\beta_0 + \beta_1 x_1+ \cdots + \beta_k x_k)\)

where, \(\Phi(z) = P(Z \leq z) \ , \ Z \sim \mathcal{N}(0,1)\)

and hence,

\(\Phi^{-1}(Y) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k\)

Probit Regression


Dependent variable:
deny
pirat 2.968***
(0.386)
Constant -2.194***
(0.138)
Observations 2,380
Log Likelihood -831.792
Akaike Inf. Crit. 1,667.585
Note: *p<0.1; **p<0.05; ***p<0.01

Call:
glm(formula = deny ~ pirat, family = binomial(link = "probit"), 
    data = HMDA)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.1941     0.1378 -15.927  < 2e-16 ***
pirat         2.9679     0.3858   7.694 1.43e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1744.2  on 2379  degrees of freedom
Residual deviance: 1663.6  on 2378  degrees of freedom
AIC: 1667.6

Number of Fisher Scoring iterations: 6





\(\text{deviance} = -2 \times \left[\ln(\hat f^{}_{s}) - \ln(\hat f^{}_{model}) \right]\) where \(\hat f^{}_{s}\) is the maximized likelihood of the saturated model, which assumes that each observation has its own parameters.

\(\text{pseudo-}R^2 = 1 - \frac{logLik(\hat f^{}_{model})}{logLik(\hat f^{}_{null})}\)

where \(\hat f^{}_{null}\) is a model with just an intercept. This can be calculated to assess the model.
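As a sketch (Python, though the models here are fit in R), the pseudo-\(R^2\) of the probit model above can be computed from the reported log-likelihood and null deviance, using \(logLik = -\text{deviance}/2\):

```python
# Sketch: McFadden's pseudo-R^2 for the probit model above, using
# logLik(model) = -831.792 and logLik(null) = -(null deviance)/2 = -1744.2/2.
loglik_model = -831.792
loglik_null = -1744.2 / 2

pseudo_r2 = 1 - loglik_model / loglik_null
print(round(pseudo_r2, 3))  # about 0.046
```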

From the summary table above,

change in deviance = 1744.2 - 1663.6 = 80.6 on df = 2379 - 2378 = 1, and the p-value is \(2.78 \times 10^{-19}\)
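The p-value can be reproduced with a short sketch (Python; the lecture's computations are in R). For a chi-squared variable with one degree of freedom, the upper tail probability is \(\text{erfc}(\sqrt{x/2})\):

```python
import math

# Sketch: p-value of the deviance change on df = 1. For a chi-squared
# variable with one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
delta_deviance = 1744.2 - 1663.6   # from the glm summary above
p_value = math.erfc(math.sqrt(delta_deviance / 2))
print(p_value)  # about 2.8e-19
```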

Probit Regression

\(\widehat{P(deny \vert P/I\ ratio)} = \Phi(-\underset{(0.19)}{2.19} + \underset{(0.54)}{2.97} (P/I \ ratio))\)
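Since \(\Phi\) can be written in terms of the error function, the fitted probit prediction can be sketched without any statistics library (Python; rounded coefficients taken from the equation above):

```python
import math

def Phi(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Fitted probit model from the equation above (rounded coefficients).
def probit_deny_hat(pi_ratio):
    return Phi(-2.19 + 2.97 * pi_ratio)

print(round(probit_deny_hat(0.3), 3))  # roughly 0.10
```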

Probit Regression

Dependent variable:
deny
pirat 2.742***
(0.380)
blackyes 0.708***
(0.083)
Constant -2.259***
(0.137)
Observations 2,380
Log Likelihood -797.136
Akaike Inf. Crit. 1,600.272
Note: *p<0.1; **p<0.05; ***p<0.01








\(\widehat{P(deny\vert P/I \ ratio, black)} = \Phi (-\underset{(0.18)}{2.26} + \underset{(0.50)}{2.74} (P/I \ ratio) + \underset{(0.08)}{0.71} black)\)

Logistic Regression (Logit Regression)

\(Y \in (0, +\infty) \implies \ln(Y) \in (-\infty,+\infty)\)

Probability and odds

\(\text{probability}=\frac {N_{Y=1}}{N}\)

For example, for a fair six-sided die, \(P(X\leq2)=\frac{2}{6} \implies odds=\frac{2/6}{1-2/6}=\frac{2}{4}=\frac{1}{2}\)

\(\text{odds}=\frac {\text{Frequency of Y=1}}{\text{Frequency of Y} \neq 1}\)

\(\implies \text{odds}=\frac {\text{(Frequency of Y=1)}/N}{\text{(Frequency of Y} \neq 1)/N}\)

\(\implies \text{odds}=\frac{\text{probability}}{1-\text{probability}}\)

\(\implies \text{probability}=\frac{\text{odds}}{1+\text{odds}}\)

Logistic Regression

\(F(x) = \frac{1}{1+e^{-x}}\)

Logistic Regression

The model,

\[\begin{align*} P(Y=1\vert X_1, X_2, \dots, X_k) =& \, F(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k) \\ =& \, \frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)}}. \end{align*}\]

To understand this, let’s start with log odds

\[l=ln \left( \frac{P(Y=1)}{1-P(Y=1)} \right)= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k\]

\[\implies P(Y=1)= \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k}}{1+e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k}}=\frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)}}\]
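A minimal sketch of the logistic function and its inverse (the log-odds, or logit) confirms the algebra above:

```python
import math

def F(x):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Log-odds: the inverse of F."""
    return math.log(p / (1.0 - p))

# F inverts the log-odds, recovering the probability:
p = 0.2
assert abs(F(logit(p)) - p) < 1e-12
print(F(0.0))  # 0.5: zero log-odds means even odds
```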

\(P(deny=1 \vert P/I\ ratio) = F(\beta_0 + \beta_1(P/I \ ratio))\)

and

\(P(deny=1 \vert P/I ratio, black) = F(\beta_0 + \beta_1(P/I \ ratio) + \beta_2black)\)

Dependent variable:
deny
(1) (2)
pirat 5.884*** 5.370***
(0.734) (0.728)
blackyes 1.273***
(0.146)
Constant -4.028*** -4.126***
(0.269) (0.268)
Observations 2,380 2,380
Log Likelihood -830.094 -795.695
Akaike Inf. Crit. 1,664.188 1,597.390
Note: *p<0.1; **p<0.05; ***p<0.01

Logistic Regression (Model Fit)

\[\widehat{P(deny=1 \vert P/I\ ratio)} = F(-\underset{(0.27)}{4.03} + \underset{(0.73)}{5.88} (P/I \ ratio))\]

\[\widehat{P(deny=1 \vert P/I ratio, black)} = F(-\underset{(0.35)}{4.13} + \underset{(0.96)}{5.37} (P/I \ ratio) + \underset{(0.15)}{1.27} black)\]

Model (1):

            OddsRatio    2.5 %    97.5 %
(Intercept)     0.018    0.010     0.030
pirat         359.422   88.342  1565.122

Model (2):

            OddsRatio    2.5 %    97.5 %
(Intercept)     0.016    0.009     0.027
pirat         214.941   53.848   931.657
blackyes        3.571    2.675     4.747
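The odds ratios in these tables are simply the exponentiated logit coefficients. A sketch using the rounded estimates of model (2) above (results differ slightly from the table's exact values because of rounding):

```python
import math

# Sketch: odds ratios are exp(coefficient); rounded logit estimates of
# model (2) above, so results differ slightly from the table's values.
coefs = {"(Intercept)": -4.126, "pirat": 5.370, "blackyes": 1.273}
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

print(round(odds_ratios["blackyes"], 3))  # about 3.571, as in the table
```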

\[P(Y=1|(P/Iratio=0.3, black=1))= \frac{1}{1+e^{-(-4.13 + 5.37 \times 0.3 + 1.27)}} \approx 0.224\]

\[P(Y=1|(P/Iratio=0.3, black=0))= \frac{1}{1+e^{-(-4.13 + 5.37 \times 0.3)}} \approx 0.075\]

Thus, at a P/I ratio of \(0.3\), the predicted probability of denial is about \(0.149\) higher if the applicant is African American.
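The two predictions can be reproduced directly from the logistic function (a Python sketch using the rounded coefficients above):

```python
import math

def F(x):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-x))

# Predictions at P/I ratio = 0.3, using the rounded coefficients above:
p_black = F(-4.13 + 5.37 * 0.3 + 1.27)   # about 0.22
p_white = F(-4.13 + 5.37 * 0.3)          # about 0.07
print(round(p_black - p_white, 2))       # about 0.15
```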

Logistic Regression

Logistic Regression, Complete Model Specs

\[\begin{align*} lvrat = \begin{cases} \text{low} & \text{if} \ \ lvrat < 0.8, \\ \text{medium} & \text{if} \ \ 0.8 \leq lvrat \leq 0.95, \\ \text{high} & \text{if} \ \ lvrat > 0.95 \end{cases} \end{align*}\]
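The binning above can be sketched as a simple function (Python; the lecture constructs this factor in R):

```python
# Sketch (Python; the lecture builds this factor in R) of the
# loan-to-value binning defined above.
def lvrat_category(lvrat):
    if lvrat < 0.8:
        return "low"
    elif lvrat <= 0.95:
        return "medium"
    return "high"

print([lvrat_category(v) for v in (0.5, 0.9, 1.1)])  # ['low', 'medium', 'high']
```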

Logistic Regression, Complete Model Specs

Dependent variable:
deny
Model: (1) OLS; (2) logistic; (3)-(5) probit
(1) (2) (3) (4) (5)
blackyes 0.084*** 0.688*** 0.389*** 0.371*** 0.363***
(0.023) (0.183) (0.099) (0.100) (0.101)
pirat 0.449*** 4.764*** 2.442*** 2.464*** 2.622***
(0.114) (1.332) (0.673) (0.654) (0.665)
hirat -0.048 -0.109 -0.185 -0.302 -0.502
(0.110) (1.298) (0.689) (0.689) (0.715)
lvratmedium 0.031** 0.464*** 0.214*** 0.216*** 0.215**
(0.013) (0.160) (0.082) (0.082) (0.084)
lvrathigh 0.189*** 1.495*** 0.791*** 0.795*** 0.836***
(0.050) (0.325) (0.183) (0.184) (0.185)
chist 0.031*** 0.290*** 0.155*** 0.158*** 0.344***
(0.005) (0.039) (0.021) (0.021) (0.108)
mhist 0.021* 0.279** 0.148** 0.110 0.162
(0.011) (0.138) (0.073) (0.076) (0.104)
phistyes 0.197*** 1.226*** 0.697*** 0.702*** 0.717***
(0.035) (0.203) (0.114) (0.115) (0.116)
insuranceyes 0.702*** 4.548*** 2.557*** 2.585*** 2.589***
(0.045) (0.576) (0.305) (0.299) (0.306)
selfempyes 0.060*** 0.666*** 0.359*** 0.346*** 0.342***
(0.021) (0.214) (0.113) (0.116) (0.116)
singleyes 0.229*** 0.230***
(0.080) (0.086)
hschoolyes -0.613*** -0.604**
(0.229) (0.237)
unemp 0.030* 0.028
(0.018) (0.018)
condominyes -0.055
(0.096)
I(mhist == 3) -0.107
(0.301)
I(mhist == 4) -0.383
(0.427)
I(chist == 3) -0.226
(0.248)
I(chist == 4) -0.251
(0.338)
I(chist == 5) -0.789*
(0.412)
I(chist == 6) -0.905*
(0.515)
Constant -0.183*** -5.707*** -3.041*** -2.575*** -2.896***
(0.028) (0.484) (0.250) (0.350) (0.404)
Observations 2,380 2,380 2,380 2,380 2,380
R2 0.266
Adjusted R2 0.263
Log Likelihood -635.637 -636.847 -628.614 -625.064
Akaike Inf. Crit. 1,293.273 1,295.694 1,285.227 1,292.129
Residual Std. Error 0.279 (df = 2369)
F Statistic 85.974*** (df = 10; 2369)
Note: *p<0.1; **p<0.05; ***p<0.01

Some Thoughts and Comparisons

Performance of Models

\[\text{pseudo-}R^2 = 1 - \frac{logLik(\hat f^{}_{model})}{logLik(\hat f^{}_{null})}\]

\[\begin{align*} Y_i = \begin{cases} 1 & \text{if} \ \ \hat P(Y_i|X_{i1}, \dots, X_{ik}) > 0.5, \\ 0 & \text{if} \ \ \hat P(Y_i|X_{i1}, \dots, X_{ik}) < 0.5, \\ \end{cases} \end{align*}\]

Then \(Y_i\) can be compared with the observed outcomes. Beyond splitting observations into two classes at a probability threshold, the quality of the predicted probabilities is ignored; the threshold may also be set to other values based on other criteria, such as information gain.

With the threshold set to \(0.5\), the misclassification error is calculated as 0.1176

and the confusion matrix (rows: predicted class; columns: actual class)

       0    1
0   2089  274
1      6   11
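From this matrix, the misclassification error and the true/false positive rates can be sketched as follows (assuming, as with R's `table(predicted, actual)`, that rows are predicted classes and columns are actual classes):

```python
# Sketch: error rates from the confusion matrix above, assuming rows are
# predicted classes and columns are actual classes.
tn, fn = 2089, 274   # predicted 0: (actual 0, actual 1)
fp, tp = 6, 11       # predicted 1: (actual 0, actual 1)
n = tn + fn + fp + tp

error = (fn + fp) / n        # misclassification error
tpr = tp / (tp + fn)         # true positive rate (sensitivity)
fpr = fp / (fp + tn)         # false positive rate

print(round(error, 4))       # about 0.1176, as reported above
```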

ROC Curve

\(TPR=\frac{TP}{TP+FN} \quad \text{and} \quad FPR=\frac{FP}{FP+TN}\)
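The area under the ROC curve can be illustrated with a small sketch: it is the fraction of positive-negative pairs in which the positive case is scored higher (the scores below are hypothetical toy data, not the HMDA fitted probabilities):

```python
# Sketch: AUC as the probability that a randomly chosen positive case is
# scored above a randomly chosen negative case (ties count half).
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy, hypothetical scores -- not the HMDA fitted probabilities:
print(auc([0.9, 0.7, 0.4, 0.2], [1, 0, 1, 0]))  # 0.75
```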

The area under the ROC curve (AUC) can be interpreted as the probability that a randomly chosen positive case is ranked above a randomly chosen negative case. Comparing the two models:

\(\widehat{P(deny=1 \vert P/I\ ratio)} = F(-\underset{(0.27)}{4.03} + \underset{(0.73)}{5.88} (P/I \ ratio))\)

\(\widehat{P(deny=1 \vert P/I ratio, black)} = F(-\underset{(0.35)}{4.13} + \underset{(0.96)}{5.37} (P/I \ ratio) + \underset{(0.15)}{1.27} black)\)


Multinomial, Ordinal, Interval Dependent Variables

For a brief discussion of these, see:

Multinomial Regression

Ordinal Regression

Interval Regression