I. Ozkan
Fall 2025
An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Chapter 4
Using R for Introductory Econometrics, Florian Heiss, Chapter 17, pp. 253-260
Linear Probability Model
Probit Regression
Logit Regression
\(\{(y_1, X_1),(y_2, X_2),...,(y_n, X_n)\}\)
with dependent variable
\(y_i \in \{0,1\}\)
and covariates
\(X_i=(x_{i1},x_{i2},\dots,x_{ik})\).
\(E(Y\vert X_1,X_2,\dots,X_k) = P(Y=1\vert X_1, X_2,\dots, X_k)\)
and
\(P(y_i = 1 \vert x_{i1}, x_{i2}, \dots, x_{ik}) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}\)
Each \(\beta_j\) is interpreted as the change in the probability that \(y_i=1\) for a one-unit change in \(x_{ij}\), holding the other variables constant
Notes: \(R^2\) has no meaningful interpretation here, and \(\varepsilon\) is always heteroskedastic (hence robust standard errors should be used)
Let’s use the Home Mortgage Disclosure Act (HMDA) data from the AER package. See Introduction to Econometrics with R, Chapter 11
The variable deny is a binary variable indicating whether a mortgage application is denied (deny = yes) or accepted (deny = no).
deny is first modeled with one explanatory variable, pirat, the ratio of the expected monthly loan payment to the applicant’s income; the afam (African American) variable is added afterwards
The usual first look at the data is shown next (only 3 of the variables are displayed: deny, pirat and afam):
Home Mortgage Disclosure Act Data (first rows; the HMDA dataset is part of the AER package):

| deny | pirat | afam |
|---|---|---|
| no | 0.221 | no |
| no | 0.265 | no |
| no | 0.372 | no |
| no | 0.320 | no |
| no | 0.360 | no |
| no | 0.240 | no |
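A minimal sketch of loading these data in R (HMDA ships with the AER package):

```r
# Load the HMDA data from the AER package
library(AER)
data(HMDA)

# First rows of the three variables shown above
head(HMDA[, c("deny", "pirat", "afam")])
```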
\(deny_i = \beta_0 + \beta_1 \times (P/I\ ratio)_i + u_i\)
| | Dependent variable: |
|---|---|
| | deny |
| pirat | 0.604*** |
| | (0.061) |
| Constant | -0.080*** |
| | (0.021) |
| Observations | 2,380 |
| R2 | 0.040 |
| Adjusted R2 | 0.039 |
| Residual Std. Error | 0.318 (df = 2378) |
| F Statistic | 98.406*** (df = 1; 2378) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
With heteroskedasticity-robust standard errors, the estimated model is
\(\widehat{deny} = -\underset{(0.032)}{0.080} + \underset{(0.098)}{0.604} \, (P/I \ ratio)\)
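A sketch of how this model can be estimated in R (variable names as in HMDA; the object names are illustrative, and deny is converted from a factor to 0/1 first):

```r
# Linear probability model: regress the 0/1 denial indicator on pirat
HMDA$deny_num <- as.numeric(HMDA$deny == "yes")
lpm1 <- lm(deny_num ~ pirat, data = HMDA)

# Heteroskedasticity-robust (HC1) standard errors, as recommended above
library(lmtest)
library(sandwich)
coeftest(lpm1, vcov. = vcovHC(lpm1, type = "HC1"))
```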
\(deny_i = \beta_0 + \beta_1 \times (P/I\ ratio)_i + \beta_2 \times black_i + u_i\)
| | Dependent variable: |
|---|---|
| | deny |
| pirat | 0.559*** |
| | (0.060) |
| blackyes | 0.177*** |
| | (0.018) |
| Constant | -0.091*** |
| | (0.021) |
| Observations | 2,380 |
| R2 | 0.076 |
| Adjusted R2 | 0.075 |
| Residual Std. Error | 0.312 (df = 2377) |
| F Statistic | 97.760*** (df = 2; 2377) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
With heteroskedasticity-robust standard errors, the estimated model is
\(\widehat{deny} = \, -\underset{(0.033)}{0.091} + \underset{(0.104)}{0.559} (P/I \ ratio) + \underset{(0.025)}{0.177} black\)
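The augmented model can be estimated the same way; a sketch, assuming black is created as a copy of the afam factor (the tables above report it as blackyes):

```r
# Add the race indicator; 'black' is a renamed copy of afam
HMDA$black <- HMDA$afam
lpm2 <- lm(deny_num ~ pirat + black, data = HMDA)

# Robust (HC1) standard errors again
coeftest(lpm2, vcov. = vcovHC(lpm2, type = "HC1"))
```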
The new variable (black) has a positive and significant coefficient, indicating that it has explanatory power: holding the P/I ratio constant, being African American raises the estimated denial probability by about 17.7 percentage points.
The weakness of the linear probability model lies in its assumption that the conditional probability function is linear. Linearity does not restrict the predicted probabilities to the interval \([0,1]\).
Probit and logit regression models are commonly used to overcome this weakness
Recall that \(y_i\) takes only the values \(0\) and \(1\), \(y_i \in \{0,1\}\). One way to overcome the weakness of the linear probability model is to transform the modeled probability with a function whose result ranges over \((-\infty, +\infty)\).
This transformation is generally performed with a so-called link function.
Since the quantity being modeled is a probability, \(P(y_i=1) \in [0,1]\), an inverse cumulative distribution function can map it to a continuous variable on the whole real line
In probit regression:
\(E(Y\vert X) = P(Y=1\vert X) = \Phi(\beta_0 + \beta_1 x_1+ \cdots + \beta_k x_k)\)
\(\Phi(\cdot)\) is the standard normal cumulative distribution function, or,
\(Y = \Phi(\beta_0 + \beta_1 x_1+ \cdots + \beta_k x_k)\)
where, \(\Phi(z) = P(Z \leq z) \ , \ Z \sim \mathcal{N}(0,1)\)
and hence,
\(\Phi^{-1}(Y) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k\)
| | Dependent variable: |
|---|---|
| | deny |
| pirat | 2.968*** |
| | (0.386) |
| Constant | -2.194*** |
| | (0.138) |
| Observations | 2,380 |
| Log Likelihood | -831.792 |
| Akaike Inf. Crit. | 1,667.585 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
```
Call:
glm(formula = deny ~ pirat, family = binomial(link = "probit"), 
    data = HMDA)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.1941     0.1378 -15.927  < 2e-16 ***
pirat         2.9679     0.3858   7.694 1.43e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1744.2  on 2379  degrees of freedom
Residual deviance: 1663.6  on 2378  degrees of freedom
AIC: 1667.6

Number of Fisher Scoring iterations: 6
```
\(\text{deviance} = -2 \times \left[\ln(\hat f^{}_{model}) - \ln(\hat f^{}_{s}) \right]\) where \(\hat f^{}_{s}\) is the maximized likelihood of the saturated model, which assumes that each observation has its own parameters.
\(\text{pseudo-}R^2 = 1 - \frac{logLik(\hat f^{}_{model})}{logLik(\hat f^{}_{null})}\)
where \(\hat f^{}_{null}\) is the maximized likelihood of a model with just an intercept. This can be calculated to assess the model fit.
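A minimal sketch of this computation for the probit fit above (the object names probit1 and probit_null are illustrative):

```r
# Probit fit and the intercept-only (null) model
probit1 <- glm(deny ~ pirat, family = binomial(link = "probit"), data = HMDA)
probit_null <- glm(deny ~ 1, family = binomial(link = "probit"), data = HMDA)

# McFadden's pseudo-R^2 = 1 - logLik(model) / logLik(null)
1 - as.numeric(logLik(probit1)) / as.numeric(logLik(probit_null))
```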
From the summary output above,
the change in deviance is \(1744.2 - 1663.6 = 80.6\) with \(df = 2379 - 2378 = 1\), and the \(\chi^2_1\) p-value is about \(2.78 \times 10^{-19}\)
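The corresponding likelihood-ratio test can be reproduced from the reported deviances; a sketch:

```r
# p-value of the change in deviance against chi-squared with df = 1
pchisq(1744.2 - 1663.6, df = 1, lower.tail = FALSE)

# equivalently, an analysis-of-deviance table for the fitted probit
anova(probit1, test = "Chisq")
```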
\(\widehat{P(deny \vert P/I\ ratio)} = \Phi(-\underset{(0.19)}{2.19} + \underset{(0.54)}{2.97} \, (P/I \ ratio))\)
| | Dependent variable: |
|---|---|
| | deny |
| pirat | 2.742*** |
| | (0.380) |
| blackyes | 0.708*** |
| | (0.083) |
| Constant | -2.259*** |
| | (0.137) |
| Observations | 2,380 |
| Log Likelihood | -797.136 |
| Akaike Inf. Crit. | 1,600.272 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
\(\widehat{P(deny\vert P/I \ ratio, black)} = \Phi (-\underset{(0.18)}{2.26} + \underset{(0.50)}{2.74} (P/I \ ratio) + \underset{(0.08)}{0.71} black)\)
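A sketch of the augmented probit fit (using the black variable defined earlier; the object name is illustrative):

```r
# Probit with the race indicator added
probit2 <- glm(deny ~ pirat + black,
               family = binomial(link = "probit"), data = HMDA)
coeftest(probit2, vcov. = vcovHC(probit2, type = "HC1"))
```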
For logit regression, note that a quantity \(Y \in (0, +\infty)\) satisfies \(\ln(Y) \in (-\infty,+\infty)\)
To exploit this, the odds are used
Assumption: a linear relationship between the predictor variables and the log-odds of the event \(Y=1\)
Probability and odds
\(\text{probability}=\frac {N_{Y=1}}{N}\)
Example (rolling a fair six-sided die): \(P(X\leq2)=\frac{2}{6} \implies \text{odds}=\frac{2/6}{1-2/6}=\frac{2}{4}=\frac{1}{2}\)
\(\text{odds}=\frac {\text{Frequency of Y=1}}{\text{Frequency of Y} \neq 1}\)
\(\implies \text{odds}=\frac {\text{(Frequency of Y=1)}/N}{\text{(Frequency of Y} \neq 1)/N}\)
\(\implies \text{odds}=\frac{\text{probability}}{1-\text{probability}}\)
\(\implies \text{probability}=\frac{\text{odds}}{1+\text{odds}}\)
The odds ratio is the ratio of two odds.
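The die example above, verified numerically:

```r
# Probability -> odds -> back to probability (fair-die example)
p <- 2 / 6
odds <- p / (1 - p)   # 0.5
odds / (1 + odds)     # recovers 2/6
```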
Logit regression is similar to probit regression; only the cumulative distribution function differs. For logit it is the standard logistic CDF:
\(F(x) = \frac{1}{1+e^{-x}}\)
The model,
\[\begin{align*} P(Y=1\vert X_1, X_2, \dots, X_k) =& \, F(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k) \\ =& \, \frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)}}. \end{align*}\]
To understand this, let’s start with the log-odds
\[l=ln \left( \frac{P(Y=1)}{1-P(Y=1)} \right)= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k\]
\[\implies P(Y=1)= \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k}}{1+e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k}}=\frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)}}\]
\(P(deny=1 \vert P/I\ ratio) = F(\beta_0 + \beta_1 (P/I \ ratio))\)
and
\(P(deny=1 \vert P/I\ ratio, black) = F(\beta_0 + \beta_1 (P/I \ ratio) + \beta_2 \, black)\)
| | Dependent variable: | |
|---|---|---|
| | deny | |
| | (1) | (2) |
| pirat | 5.884*** | 5.370*** |
| | (0.734) | (0.728) |
| blackyes | | 1.273*** |
| | | (0.146) |
| Constant | -4.028*** | -4.126*** |
| | (0.269) | (0.268) |
| Observations | 2,380 | 2,380 |
| Log Likelihood | -830.094 | -795.695 |
| Akaike Inf. Crit. | 1,664.188 | 1,597.390 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | |
\[\widehat{P(deny=1 \vert P/I\ ratio)} = F(-\underset{(0.27)}{4.03} + \underset{(0.73)}{5.88} \, (P/I \ ratio))\]
\[\widehat{P(deny=1 \vert P/I ratio, black)} = F(-\underset{(0.35)}{4.13} + \underset{(0.96)}{5.37} (P/I \ ratio) + \underset{(0.15)}{1.27} black)\]
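A sketch of how these two logit models can be fit (object names illustrative; black as defined earlier):

```r
# Logit models: P/I ratio alone, then with the race indicator added
logit1 <- glm(deny ~ pirat, family = binomial(link = "logit"), data = HMDA)
logit2 <- glm(deny ~ pirat + black, family = binomial(link = "logit"), data = HMDA)
```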
Model (1):

| | OddsRatio | 2.5 % | 97.5 % |
|---|---|---|---|
| (Intercept) | 0.018 | 0.010 | 0.030 |
| pirat | 359.422 | 88.342 | 1565.122 |

Model (2):

| | OddsRatio | 2.5 % | 97.5 % |
|---|---|---|---|
| (Intercept) | 0.016 | 0.009 | 0.027 |
| pirat | 214.941 | 53.848 | 931.657 |
| blackyes | 3.571 | 2.675 | 4.747 |
For a one-unit increase in the \(\text{P/I ratio}\), the odds of denial (relative to acceptance) increase by a factor of 359.42 (without the African American variable) or 214.94 (with it)
Being African American increases the odds of denial by a factor of 3.57 (equivalently, it changes the log-odds of denial by 1.273)
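The odds-ratio tables above can be reproduced by exponentiating the logit coefficients and their confidence limits; a sketch for model (2):

```r
# Odds ratios with 95% profile-likelihood confidence intervals
exp(cbind(OddsRatio = coef(logit2), confint(logit2)))
```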
Following the ISLR book, let’s calculate the difference in the predicted probability of denial between black and white applicants with \((P/I\ ratio)=0.3\):
\[P(Y=1|(P/Iratio=0.3, black=1))= \frac{1}{1+e^{-(-4.13 + 5.37 \times 0.3 + 1.27)}} \approx 0.224\]
\[P(Y=1|(P/Iratio=0.3, black=0))= \frac{1}{1+e^{-(-4.13 + 5.37 \times 0.3)}} \approx 0.075\]
There is a \(0.224 - 0.075 = 0.149\) difference in the predicted probability of denial if the applicant is African American.
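The same difference can be obtained with predict(); a sketch using the logit2 fit from above:

```r
# Predicted denial probabilities at P/I ratio = 0.3 for both groups
newdat <- data.frame(pirat = c(0.3, 0.3),
                     black = factor(c("yes", "no"), levels = c("no", "yes")))
p_hat <- predict(logit2, newdata = newdat, type = "response")
p_hat                 # approx. 0.224 (black) and 0.075 (white)
p_hat[1] - p_hat[2]   # approx. 0.149
```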
All of the above specifications omit the majority of the available variables.
We may want to estimate a (near) full model.
To follow the book, the lvrat (loan-to-value ratio) variable is converted to a categorical variable:
\[\begin{align*} lvrat = \begin{cases} \text{low} & \text{if} \ \ lvrat < 0.8, \\ \text{medium} & \text{if} \ \ 0.8 \leq lvrat \leq 0.95, \\ \text{high} & \text{if} \ \ lvrat > 0.95 \end{cases} \end{align*}\]
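One way to implement this conversion in R (a sketch, following the companion book’s approach of replacing lvrat with an ordered factor):

```r
# Discretize the loan-to-value ratio into three ordered categories
HMDA$lvrat <- factor(
  ifelse(HMDA$lvrat < 0.8, "low",
         ifelse(HMDA$lvrat <= 0.95, "medium", "high")),
  levels = c("low", "medium", "high")
)
```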
| | Dependent variable: deny | | | | |
|---|---|---|---|---|---|
| | OLS | logistic | probit | probit | probit |
| | (1) | (2) | (3) | (4) | (5) |
| blackyes | 0.084*** | 0.688*** | 0.389*** | 0.371*** | 0.363*** |
| | (0.023) | (0.183) | (0.099) | (0.100) | (0.101) |
| pirat | 0.449*** | 4.764*** | 2.442*** | 2.464*** | 2.622*** |
| | (0.114) | (1.332) | (0.673) | (0.654) | (0.665) |
| hirat | -0.048 | -0.109 | -0.185 | -0.302 | -0.502 |
| | (0.110) | (1.298) | (0.689) | (0.689) | (0.715) |
| lvratmedium | 0.031** | 0.464*** | 0.214*** | 0.216*** | 0.215** |
| | (0.013) | (0.160) | (0.082) | (0.082) | (0.084) |
| lvrathigh | 0.189*** | 1.495*** | 0.791*** | 0.795*** | 0.836*** |
| | (0.050) | (0.325) | (0.183) | (0.184) | (0.185) |
| chist | 0.031*** | 0.290*** | 0.155*** | 0.158*** | 0.344*** |
| | (0.005) | (0.039) | (0.021) | (0.021) | (0.108) |
| mhist | 0.021* | 0.279** | 0.148** | 0.110 | 0.162 |
| | (0.011) | (0.138) | (0.073) | (0.076) | (0.104) |
| phistyes | 0.197*** | 1.226*** | 0.697*** | 0.702*** | 0.717*** |
| | (0.035) | (0.203) | (0.114) | (0.115) | (0.116) |
| insuranceyes | 0.702*** | 4.548*** | 2.557*** | 2.585*** | 2.589*** |
| | (0.045) | (0.576) | (0.305) | (0.299) | (0.306) |
| selfempyes | 0.060*** | 0.666*** | 0.359*** | 0.346*** | 0.342*** |
| | (0.021) | (0.214) | (0.113) | (0.116) | (0.116) |
| singleyes | | | | 0.229*** | 0.230*** |
| | | | | (0.080) | (0.086) |
| hschoolyes | | | | -0.613*** | -0.604** |
| | | | | (0.229) | (0.237) |
| unemp | | | | 0.030* | 0.028 |
| | | | | (0.018) | (0.018) |
| condominyes | | | | | -0.055 |
| | | | | | (0.096) |
| I(mhist == 3) | | | | | -0.107 |
| | | | | | (0.301) |
| I(mhist == 4) | | | | | -0.383 |
| | | | | | (0.427) |
| I(chist == 3) | | | | | -0.226 |
| | | | | | (0.248) |
| I(chist == 4) | | | | | -0.251 |
| | | | | | (0.338) |
| I(chist == 5) | | | | | -0.789* |
| | | | | | (0.412) |
| I(chist == 6) | | | | | -0.905* |
| | | | | | (0.515) |
| Constant | -0.183*** | -5.707*** | -3.041*** | -2.575*** | -2.896*** |
| | (0.028) | (0.484) | (0.250) | (0.350) | (0.404) |
| Observations | 2,380 | 2,380 | 2,380 | 2,380 | 2,380 |
| R2 | 0.266 | | | | |
| Adjusted R2 | 0.263 | | | | |
| Log Likelihood | | -635.637 | -636.847 | -628.614 | -625.064 |
| Akaike Inf. Crit. | | 1,293.273 | 1,295.694 | 1,285.227 | 1,292.129 |
| Residual Std. Error | 0.279 (df = 2369) | | | | |
| F Statistic | 85.974*** (df = 10; 2369) | | | | |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | | | | |
Logistic and probit regression produce similar results. Which one to use largely depends on the researcher’s preference
The linear probability model violates the homoskedasticity and normality-of-errors assumptions of linear regression, so the usual standard errors are invalid (and hence so are the hypothesis tests based on them)
If the number of successes relative to the number of failures is very small, the stability of the model is questionable
Both logit and probit regression require more observations than standard regression because both are estimated by maximum likelihood
Pseudo-\(R^2\) measures exist, but their interpretation differs from the \(R^2\) of OLS regression. See the UCLA IDRE website for more details
Performance measures obtained using probit and logit models are similar
Since \(R^2\) is invalid as a measure of fit, the pseudo-\(R^2\) may be used (as explained before)
\[\text{pseudo-}R^2 = 1 - \frac{logLik(\hat f^{}_{model})}{logLik(\hat f^{}_{null})}\]
\[\begin{align*} \hat Y_i = \begin{cases} 1 & \text{if} \ \ \hat P(Y_i|X_{i1}, \dots, X_{ik}) > 0.5, \\ 0 & \text{if} \ \ \hat P(Y_i|X_{i1}, \dots, X_{ik}) \leq 0.5. \\ \end{cases} \end{align*}\]
The predictions \(\hat Y_i\) can then be assessed against the observed outcomes. Beyond creating two classes with a probability threshold, the quality of the predicted probabilities is neglected. The threshold may be set to other values based on other measures, such as information gain
With the threshold set to \(0.5\), the misclassification error rate is \((274+6)/2380 \approx 0.1176\),
and the confusion matrix is (rows: predicted class, columns: actual class):

| | actual 0 | actual 1 |
|---|---|---|
| predicted 0 | 2089 | 274 |
| predicted 1 | 6 | 11 |
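A sketch of the threshold classification; which fitted model supplies the probabilities is an assumption here (logit2 is used for illustration, so the exact numbers may differ from those above):

```r
# Classify with the 0.5 threshold and tabulate predictions vs. outcomes
p_fit  <- predict(logit2, type = "response")  # replace with the model of interest
pred   <- as.numeric(p_fit > 0.5)
actual <- as.numeric(HMDA$deny == "yes")
table(pred, actual)     # confusion matrix
mean(pred != actual)    # misclassification error rate
```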
\(TPR=\frac{TP}{TP+FN}\) and \(FPR=\frac{FP}{FP+TN}\)
The ROC curve plots TPR against FPR; the area under it (AUC) can be read as the probability that a randomly chosen positive case is ranked above a randomly chosen negative one. Comparing the two fitted logit models,
\(\widehat{P(deny=1 \vert P/I\ ratio)} = F(-\underset{(0.27)}{4.03} + \underset{(0.73)}{5.88} \, (P/I \ ratio))\)
\(\widehat{P(deny=1 \vert P/I ratio, black)} = F(-\underset{(0.35)}{4.13} + \underset{(0.96)}{5.37} (P/I \ ratio) + \underset{(0.15)}{1.27} black)\)
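One way to compare the two fits is by AUC, computed here with the pROC package (an assumption; any ROC implementation works):

```r
# AUC: probability a random positive is ranked above a random negative
library(pROC)
auc(roc(HMDA$deny, predict(logit1, type = "response")))
auc(roc(HMDA$deny, predict(logit2, type = "response")))
```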
For a brief discussion of these measures, please read: