Binary Classification Review:

Linear Probability Model

Logistic Regression

KNN

I. Ozkan

Spring 2025

Preliminary Readings

Review

Binary Dependent Variable

\(\{(y_1, X_1),(y_2, X_2),...,(y_n, X_n)\}\)

with dependent variable

\(y_i \in \{0,1\}\)

and covariates

\(X_i=(x_{i1},x_{i2},\dots,x_{ik})\).

\[E(Y\vert X_1,X_2,\dots,X_k) = P(Y=1\vert X_1, X_2,\dots, X_k)\]

and

\[P(y_i = 1 \vert x_{i1}, x_{i2}, \dots, x_{ik}) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}\]
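The linear probability model is just OLS with a 0/1 dependent variable, so fitted values are interpreted directly as probabilities. A minimal sketch in Python (the document's examples use R's `lm`; the data here are synthetic, not HMDA):

```python
import random

# Linear probability model: OLS with a 0/1 dependent variable.
# Synthetic data, for illustration only -- not the HMDA data.
random.seed(1)
n = 1000
x = [random.uniform(0.0, 1.0) for _ in range(n)]
# True model: P(y = 1 | x) = 0.1 + 0.5 * x
y = [1 if random.random() < 0.1 + 0.5 * xi else 0 for xi in x]

# OLS slope and intercept in closed form (one regressor).
xbar = sum(x) / n
ybar = sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar

# The fitted value at x = 0.5 is read as an estimate of P(y = 1 | x = 0.5).
p_hat = b0 + b1 * 0.5
```

With 1,000 observations the estimates land close to the true values (slope 0.5, probability 0.35 at \(x=0.5\)).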

Binary Dependent Variable: Linear Probability Model

  deny pirat hirat     lvrat chist mhist phist unemp selfemp insurance condomin
1   no 0.221 0.221 0.8000000     5     2    no   3.9      no        no       no
2   no 0.265 0.265 0.9218750     2     2    no   3.2      no        no       no
3   no 0.372 0.248 0.9203980     1     2    no   3.2      no        no       no
4   no 0.320 0.250 0.8604651     1     2    no   4.3      no        no       no
5   no 0.360 0.350 0.6000000     1     1    no   3.2      no        no       no
6   no 0.240 0.170 0.5105263     1     1    no   3.9      no        no       no
  afam single hschool
1   no     no     yes
2   no    yes     yes
3   no     no     yes
4   no     no     yes
5   no     no     yes
6   no     no     yes

The usual graph to start with is shown below (it shows only three variables: deny, pirat and afam):

Binary Dependent Variable: Linear Probability Model

\[deny_i = \beta_0 + \beta_1 \times (P/I\ ratio)_i + \beta_2 \times afam_i + u_i\]

                      Dependent variable: deny
pirat                  0.559*** (0.060)
afamyes                0.177*** (0.018)
Constant              -0.091*** (0.021)
Observations           2,380
R2                     0.076
Adjusted R2            0.075
Residual Std. Error    0.312 (df = 2377)
F Statistic            97.760*** (df = 2; 2377)
Note: *p<0.1; **p<0.05; ***p<0.01

And the robust standard errors

\[\widehat{deny} = \, -\underset{(0.033)}{0.091} + \underset{(0.104)}{0.559} (P/I \ ratio) + \underset{(0.025)}{0.177} \cdot afam\]
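A known weakness of the LPM is that fitted values are not restricted to \([0,1]\). A quick check using the estimated coefficients above (a sketch; the helper name is mine):

```python
# Fitted LPM from the text: deny_hat = -0.091 + 0.559*(P/I ratio) + 0.177*afam
def lpm_deny(pirat, afam):
    """Predicted 'probability' of denial from the estimated LPM."""
    return -0.091 + 0.559 * pirat + 0.177 * afam

p_afam = lpm_deny(0.3, 1)      # a plausible P/I ratio: a valid probability
p_extreme = lpm_deny(2.0, 1)   # an extreme P/I ratio: prediction exceeds 1
```

The second prediction is greater than one, which is impossible for a probability; this is one motivation for probit and logit below.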

Probit Regression

In probit regression:

\[E(Y\vert X) = P(Y=1\vert X) = \Phi(\beta_0 + \beta_1 x_1+ \cdots + \beta_k x_k)\]

\(\Phi(\cdot)\) is the standard normal cumulative distribution function, or,

\[Y = \Phi(\beta_0 + \beta_1 x_1+ \cdots + \beta_k x_k)\]

where, \(\Phi(z) = P(Z \leq z) \ , \ Z \sim \mathcal{N}(0,1)\)

and hence,

\[\Phi^{-1}(Y) = \beta_0 + \beta_1 x_1+ \cdots + \beta_k x_k\]

Probit Regression

                      Dependent variable: deny
pirat                  2.968*** (0.386)
Constant              -2.194*** (0.138)
Observations           2,380
Log Likelihood        -831.792
Akaike Inf. Crit.      1,667.585
Note: *p<0.1; **p<0.05; ***p<0.01

Call:
glm(formula = deny ~ pirat, family = binomial(link = "probit"), 
    data = HMDA)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.1941     0.1378 -15.927  < 2e-16 ***
pirat         2.9679     0.3858   7.694 1.43e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1744.2  on 2379  degrees of freedom
Residual deviance: 1663.6  on 2378  degrees of freedom
AIC: 1667.6

Number of Fisher Scoring iterations: 6

\(\text{deviance} = -2 \times \left[\ln(\hat f_{s}) - \ln(\hat f_{model}) \right]\) where \(\hat f_{s}\) is the maximized likelihood of the saturated model, i.e. a model which assumes that each observation has its own parameters.

\[\text{pseudo-}R^2 = 1 - \frac{logLik(\hat f^{}_{model})}{logLik(\hat f^{}_{null})}\]

where \(\hat f^{}_{null}\) is a model with just an intercept. This can be calculated to assess the model.

From the summary table above,

change in deviance \(= 1744.2 - 1663.6 = 80.6\) with \(df = 2379 - 2378 = 1\), and the corresponding \(\chi^2\) p-value is \(2.78 \times 10^{-19}\)
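The change-in-deviance test can be reproduced without R: for one degree of freedom the \(\chi^2\) survival function reduces to \(\mathrm{erfc}(\sqrt{x/2})\). A sketch:

```python
import math

# Likelihood-ratio (change-in-deviance) test from the probit summary.
null_dev, resid_dev = 1744.2, 1663.6
g2 = null_dev - resid_dev        # 80.6, chi-square distributed with df = 1
# Survival function of chi-square(1): P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(g2 / 2.0))
```

The resulting p-value is on the order of \(10^{-19}\), matching the value quoted above, so the pirat term clearly improves on the null model.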

Probit Regression


z test of coefficients:

            Estimate Std. Error  z value  Pr(>|z|)    
(Intercept) -2.19415    0.18901 -11.6087 < 2.2e-16 ***
pirat        2.96787    0.53698   5.5269 3.259e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\[\widehat{P(deny \vert P/I\ ratio)} = \Phi(-\underset{(0.19)}{2.19} + \underset{(0.54)}{2.97} (P/I \ ratio))\]
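\(\Phi\) can be evaluated with the error function, \(\Phi(z) = \tfrac{1}{2}\bigl(1+\mathrm{erf}(z/\sqrt{2})\bigr)\). Plugging in the estimates above (a sketch; the chosen P/I ratio of 0.3 is just an example value):

```python
import math

def Phi(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Fitted probit model from the text: P(deny = 1 | P/I ratio)
p = Phi(-2.19 + 2.97 * 0.3)   # predicted denial probability at pirat = 0.3
```

For an applicant with a payment-to-income ratio of 0.3, the model predicts roughly a 10% denial probability.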

Probit Regression

                      Dependent variable: deny
pirat                  2.742*** (0.380)
afamyes                0.708*** (0.083)
Constant              -2.259*** (0.137)
Observations           2,380
Log Likelihood        -797.136
Akaike Inf. Crit.      1,600.272
Note: *p<0.1; **p<0.05; ***p<0.01

z test of coefficients:

             Estimate Std. Error  z value  Pr(>|z|)    
(Intercept) -2.258787   0.176608 -12.7898 < 2.2e-16 ***
pirat        2.741779   0.497673   5.5092 3.605e-08 ***
afamyes      0.708155   0.083091   8.5227 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\[\widehat{P(deny \vert P/I\ ratio, afam)} = \Phi (-\underset{(0.18)}{2.26} + \underset{(0.50)}{2.74} (P/I \ ratio) + \underset{(0.08)}{0.71} \cdot afam)\]

Logistic Regression (Logit Regression)

Probability and odds

\(\text{probability}=\frac {N_{Y=1}}{N}\)

For example, for a fair die, \(P(X\leq2)=\frac{2}{6} \implies \text{odds}=\frac{2/6}{1-2/6}=\frac{2}{4}\)

\(\text{odds}=\frac {\text{Frequency of Y=1}}{\text{Frequency of Y} \neq 1}\)

\(\implies \text{odds}=\frac {\text{(Frequency of Y=1)}/N}{\text{(Frequency of Y} \neq 1)/N}\)

\(\implies \text{odds}=\frac{\text{probability}}{1-\text{probability}}\)

\(\implies \text{probability}=\frac{\text{odds}}{1+\text{odds}}\)
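The odds–probability conversions above are straightforward to code (a sketch; the function names are mine):

```python
def odds_from_prob(p):
    """odds = p / (1 - p)"""
    return p / (1.0 - p)

def prob_from_odds(odds):
    """probability = odds / (1 + odds)"""
    return odds / (1.0 + odds)

# Die example from the text: P(X <= 2) = 2/6 gives odds of 2/4 = 0.5.
o = odds_from_prob(2.0 / 6.0)
```

Applying `prob_from_odds` to `o` recovers the original probability, confirming the two formulas are inverses.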

Logistic Regression

\(F(x) = \frac{1}{1+e^{-x}}\)

Logistic Regression

The model,

\(\begin{align*} P(Y=1\vert X_1, X_2, \dots, X_k) =& \, F(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k) \\ =& \, \frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)}}. \end{align*}\)

To understand this, let’s start with log odds

\(l=ln \left( \frac{P(Y=1)}{1-P(Y=1)} \right)= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k\)

\(\implies P(Y=1)= \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k}}{1+e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k}}=\frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)}}\)
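The derivation above says the logistic function and the log odds are inverses of each other; a quick numerical check (a sketch):

```python
import math

def sigmoid(x):
    """Logistic function F(x) = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Log odds l = ln(p / (1 - p)), the inverse of the logistic function."""
    return math.log(p / (1.0 - p))

round_trip = logit(sigmoid(1.7))   # recovers 1.7 up to floating-point error
```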

\(P(deny=1 \vert P/I\ ratio) = F(\beta_0 + \beta_1(P/I \ ratio))\)

and

\(P(deny=1 \vert P/I ratio, black) = F(\beta_0 + \beta_1(P/I \ ratio) + \beta_2black)\)

                      Dependent variable: deny
                          (1)                 (2)
pirat            5.884*** (0.734)    5.370*** (0.728)
afamyes                              1.273*** (0.146)
Constant        -4.028*** (0.269)   -4.126*** (0.268)
Observations          2,380               2,380
Log Likelihood     -830.094            -795.695
Akaike Inf. Crit. 1,664.188           1,597.390
Note: *p<0.1; **p<0.05; ***p<0.01

z test of coefficients:

            Estimate Std. Error  z value  Pr(>|z|)    
(Intercept) -4.02843    0.35898 -11.2218 < 2.2e-16 ***
pirat        5.88450    1.00015   5.8836 4.014e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

z test of coefficients:

            Estimate Std. Error  z value  Pr(>|z|)    
(Intercept) -4.12556    0.34597 -11.9245 < 2.2e-16 ***
pirat        5.37036    0.96376   5.5723 2.514e-08 ***
afamyes      1.27278    0.14616   8.7081 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Logistic Regression (Model Fit)

\(\widehat{P(deny=1 \vert P/I\ ratio)} = F(-\underset{(0.27)}{4.03} + \underset{(0.73)}{5.88} (P/I \ ratio))\)

\(\widehat{P(deny=1 \vert P/I\ ratio, afam)} = F(-\underset{(0.35)}{4.13} + \underset{(0.96)}{5.37} (P/I \ ratio) + \underset{(0.15)}{1.27} \cdot afam)\)

Model (1):
            OddsRatio    2.5 %    97.5 %
(Intercept)     0.018    0.010     0.030
pirat         359.422   88.342  1565.122

Model (2):
            OddsRatio    2.5 %    97.5 %
(Intercept)     0.016    0.009     0.027
pirat         214.941   53.848   931.657
afamyes         3.571    2.675     4.747
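These odds ratios are simply exponentiated logit coefficients. For model (2), using the estimates reported above (a sketch; the confidence intervals would additionally require the standard errors):

```python
import math

# Logit coefficients for model (2), as reported in the text.
coefs = {"(Intercept)": -4.12556, "pirat": 5.37036, "afamyes": 1.27278}

# Odds ratio = exp(coefficient).
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}
```

For example, the afamyes odds ratio of about 3.57 says the odds of denial are roughly 3.6 times higher for African American applicants, holding the P/I ratio fixed.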

\[P(Y=1|(P/I\ ratio=0.3, afam=1))= \frac{1}{1+e^{-(-4.13 + 5.37 \times 0.3 + 1.27)}} \approx 0.224\]

\[P(Y=1|(P/I\ ratio=0.3, afam=0))= \frac{1}{1+e^{-(-4.13 + 5.37 \times 0.3)}} \approx 0.075\]

At a \(P/I\) ratio of \(0.3\), the predicted probability of denial is therefore about \(0.149\) higher if the applicant is African American.
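The two predicted probabilities can be checked directly (a sketch, using the rounded coefficients from the fitted equation above):

```python
import math

def logistic(z):
    """F(z) = 1 / (1 + e^{-z})"""
    return 1.0 / (1.0 + math.exp(-z))

# Fitted logit model: F(-4.13 + 5.37*pirat + 1.27*afam)
p_afam  = logistic(-4.13 + 5.37 * 0.3 + 1.27)   # afam = 1
p_other = logistic(-4.13 + 5.37 * 0.3)          # afam = 0
diff = p_afam - p_other
```

Note that, unlike in the LPM, the effect of afam on the predicted probability depends on the P/I ratio, because the logistic function is nonlinear.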

Logistic Regression

Some Thoughts and Comparisons

Performance of Models

\(\text{pseudo-}R^2 = 1 - \frac{logLik(\hat f^{}_{model})}{logLik(\hat f^{}_{null})}\)

\(\begin{align*} \hat Y_i = \begin{cases} 1 & \text{if} \ \ \hat P(Y_i|X_{i1}, \dots, X_{ik}) \geq 0.5, \\ 0 & \text{if} \ \ \hat P(Y_i|X_{i1}, \dots, X_{ik}) < 0.5 \end{cases} \end{align*}\)

Then the predicted classes \(\hat Y_i\) can be assessed against the observed \(Y_i\). Beyond splitting observations into two classes at a probability threshold, the quality of the predicted probabilities themselves is neglected. The threshold may also be set to other values based on other measures, such as Information Gain.

With the threshold set to \(0.5\), the misclassification error is calculated as 0.1176

and the confusion matrix

            actual 0  actual 1
predicted 0     2089       274
predicted 1        6        11
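The misclassification error follows directly from the confusion matrix (a sketch; I read the rows as predicted and the columns as actual classes, which is consistent with the 285 denials in the data):

```python
# Confusion matrix counts from the text (rows = predicted, cols = actual).
tn, fn = 2089, 274   # predicted 0: correct rejections-of-denial, misses
fp, tp = 6, 11       # predicted 1: false alarms, hits
n = tn + fn + fp + tp          # 2,380 observations
error = (fn + fp) / n          # share of off-diagonal (misclassified) cases
```

Note how lopsided the errors are: almost all mistakes are denials predicted as approvals, a typical consequence of a 0.5 threshold on an unbalanced outcome.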

ROC Curve

\(TPR=\frac{TP}{TP+FN} \quad \text{and} \quad FPR=\frac{FP}{FP+TN}\)
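At the 0.5 threshold, the ROC coordinates can be computed from the confusion-matrix counts reported earlier (a sketch; I assume the matrix rows are predicted classes):

```python
# Counts at threshold 0.5 from the confusion matrix in the text.
tp, fn = 11, 274     # actual denials: correctly / incorrectly classified
fp, tn = 6, 2089     # actual approvals: incorrectly / correctly classified

tpr = tp / (tp + fn)   # true positive rate (sensitivity)
fpr = fp / (fp + tn)   # false positive rate (1 - specificity)
```

Sweeping the threshold from 0 to 1 and plotting (FPR, TPR) pairs traces out the full ROC curve; the 0.5 threshold is just one point on it.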

The area under the ROC curve can be interpreted as the probability that a randomly chosen positive observation receives a higher predicted probability than a randomly chosen negative one. Comparing

\(\widehat{P(deny=1 \vert P/I ratio)} = F(-\underset{(0.27)}{4.03} + \underset{(0.73)}{5.88} (P/I \ ratio))\)

\[\widehat{P(deny=1 \vert P/I ratio, afam)} = F(-\underset{(0.35)}{4.13} + \underset{(0.96)}{5.37} (P/I \ ratio) + \underset{(0.15)}{1.27} \cdot afam)\]

More: Read the following nice tutorial:
https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/

Example: interpretation of McFadden's pseudo-\(R^2\):
https://stats.stackexchange.com/questions/82105/mcfaddens-pseudo-r2-interpretation

McFadden's \(\rho^2 = 1 - LL_{mod}/LL_{0}\), where \(LL\) is the log-likelihood (always negative for these models). As a rule of thumb, \(0.2 < \rho^2 < 0.4\) already indicates a very good fit, and values above \(0.5\) are excellent.

Here are some of the measures:

Probit Regression Model
fitting null model for pseudo-r2
          llh       llhNull            G2      McFadden          r2ML 
-797.13603842 -872.08530450  149.89853216    0.08594259    0.06104017 
         r2CU 
   0.11750696 


Logistic Regression Model
fitting null model for pseudo-r2
          llh       llhNull            G2      McFadden          r2ML 
-795.69520837 -872.08530450  152.78019227    0.08759475    0.06217635 
         r2CU 
   0.11969421 
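The McFadden values printed above can be reproduced from the reported log-likelihoods (a sketch):

```python
# Log-likelihoods (llh) and null log-likelihood from the pseudo-R2 output.
llh_probit = -797.13603842
llh_logit  = -795.69520837
llh_null   = -872.08530450

# McFadden's pseudo-R2 = 1 - llh_model / llh_null
mcfadden_probit = 1.0 - llh_probit / llh_null
mcfadden_logit  = 1.0 - llh_logit / llh_null
```

Both values are below 0.1, so by the rule of thumb above neither model fits particularly well, though the logit model is marginally better.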

Some More Notes

Linearity

Multicollinearity


Call:
glm(formula = deny ~ pirat + hirat + lvrat + unemp + afam, family = binomial(link = "logit"), 
    data = HMDA)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -6.34016    0.47195 -13.434  < 2e-16 ***
pirat        5.77636    0.95106   6.074 1.25e-09 ***
hirat       -1.17531    1.10280  -1.066  0.28654    
lvrat        2.69435    0.44695   6.028 1.66e-09 ***
unemp        0.08117    0.03029   2.680  0.00736 ** 
afamyes      1.15905    0.15062   7.695 1.41e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1744.2  on 2379  degrees of freedom
Residual deviance: 1544.6  on 2374  degrees of freedom
AIC: 1556.6

Number of Fisher Scoring iterations: 5


Variance Inflation Factors
   pirat    hirat    lvrat    unemp     afam 
1.633048 1.635978 1.023026 1.019203 1.037514 
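Each VIF above is \(1/(1-R_j^2)\), where \(R_j^2\) is the \(R^2\) from regressing that covariate on the remaining covariates. A sketch (the auxiliary \(R^2\) value here is hypothetical, chosen to match pirat's reported VIF):

```python
def vif(r_squared):
    """Variance inflation factor from the auxiliary regression's R^2."""
    return 1.0 / (1.0 - r_squared)

# A hypothetical auxiliary R^2 of about 0.388 reproduces pirat's VIF of 1.633,
# i.e. pirat and hirat are moderately correlated, as expected.
v = vif(0.388)
```

All VIFs in the output are well below the common cut-offs of 5 or 10, so multicollinearity is not a serious concern here.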

Outlier Effects