I. Ozkan
Spring 2025
An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Chapter 4
Using R for Introductory Econometrics, Florian Heiss, Chapter 17, pp. 253-260
Linear Probability Model
Probit Regression
Logit Regression
KNN (Read MIS301 Course Notes - pdf version is available on webonline)
\(\{(y_1, X_1),(y_2, X_2),...,(y_n, X_n)\}\)
with dependent variable
\(y_i \in \{0,1\}\)
and covariates
\(X_i=(x_{i1},x_{i2},\dots,x_{ik})\).
\[E(Y\vert X_1,X_2,\dots,X_k) = P(Y=1\vert X_1, X_2,\dots, X_k)\]
and
\[P(y_i = 1 \vert x_{i1}, x_{i2}, \dots, x_{ik}) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}\]
Each \(\beta_j\) is interpreted as the change in the probability that \(y_i=1\) for a one-unit change in \(x_{ij}\), holding the other variables constant.
Notes: \(R^2\) has no meaning and \(\varepsilon\) is always heteroskedastic (hence robust standard errors should be considered).
Let’s use the HMDA mortgage data available with the AER package. See Introduction to Econometrics with R, Chapter 11.
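A minimal sketch of loading the data (assuming the AER package is installed):

```r
# load the HMDA mortgage data shipped with the AER package
library(AER)
data(HMDA)
head(HMDA)
```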
The first few observations are:
deny pirat hirat lvrat chist mhist phist unemp selfemp insurance condomin
1 no 0.221 0.221 0.8000000 5 2 no 3.9 no no no
2 no 0.265 0.265 0.9218750 2 2 no 3.2 no no no
3 no 0.372 0.248 0.9203980 1 2 no 3.2 no no no
4 no 0.320 0.250 0.8604651 1 2 no 4.3 no no no
5 no 0.360 0.350 0.6000000 1 1 no 3.2 no no no
6 no 0.240 0.170 0.5105263 1 1 no 3.9 no no no
afam single hschool
1 no no yes
2 no yes yes
3 no no yes
4 no no yes
5 no no yes
6 no no yes
The variable deny is a binary variable that indicates whether the application was denied (deny=yes) or accepted (deny=no).
deny is first modeled with one explanatory variable, pirat, the ratio of the expected monthly loan payment to the applicant’s income; the afam (African American) indicator is added later.
The usual graph to start with is shown below (it shows only 3 variables: deny, pirat, and afam):
\[deny_i = \beta_0 + \beta_1 \times (P/I\ ratio)_i + \beta_2 \times afam_i + u_i\]
| | Dependent variable: deny |
|---|---|
| pirat | 0.559*** (0.060) |
| afamyes | 0.177*** (0.018) |
| Constant | -0.091*** (0.021) |
| Observations | 2,380 |
| R2 | 0.076 |
| Adjusted R2 | 0.075 |
| Residual Std. Error | 0.312 (df = 2377) |
| F Statistic | 97.760*** (df = 2; 2377) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
And the robust standard errors
\[\widehat{deny} = \, -\underset{(0.033)}{0.091} + \underset{(0.104)}{0.559} (P/I \ ratio) + \underset{(0.025)}{0.177} \cdot black\]
The weakness of the linear probability model lies in its assumption: the conditional probability function is linear. This does not restrict the predicted probabilities to lie between \(0\) and \(1\).
Many model assumptions do not hold (homoskedasticity of errors, normality of errors, etc.).
Probit and logit regression models are commonly used to overcome this weakness.
Recall that \(y_i\) takes only the values \(0\) or \(1\), \(y_i \in \{0,1\}\); one way to overcome the weakness of the linear probability model is to transform \(y \implies F(y)\) such that \(Y \in (-\infty, +\infty)\).
The transformation of the dependent variable is generally performed with a so-called link function.
Since \(y_i\) is treated as a probability (\(P(y_i=1)=1\) when \(y_i=1\); \(P(y_i=1)=0\) when \(y_i=0\)), an inverse cumulative distribution function, which maps probabilities to the whole real line, can transform \(y_i\) to a continuous variable.
In probit regression:
\[E(Y\vert X) = P(Y=1\vert X) = \Phi(\beta_0 + \beta_1 x_1+ \cdots + \beta_k x_k)\]
\(\Phi(\cdot)\) is the standard normal cumulative distribution function, or,
\[Y = \Phi(\beta_0 + \beta_1 x_1+ \cdots + \beta_k x_k)\]
where, \(\Phi(z) = P(Z \leq z) \ , \ Z \sim \mathcal{N}(0,1)\)
and hence,
\[\Phi^{-1}(Y) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k\]
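A minimal sketch of fitting this probit model with glm() (the object name probit_mod is illustrative); it produces the estimates reported below:

```r
# probit regression of deny on pirat
probit_mod <- glm(deny ~ pirat,
                  family = binomial(link = "probit"),
                  data = HMDA)
summary(probit_mod)
```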
| | Dependent variable: deny |
|---|---|
| pirat | 2.968*** (0.386) |
| Constant | -2.194*** (0.138) |
| Observations | 2,380 |
| Log Likelihood | -831.792 |
| Akaike Inf. Crit. | 1,667.585 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
Call:
glm(formula = deny ~ pirat, family = binomial(link = "probit"),
data = HMDA)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.1941 0.1378 -15.927 < 2e-16 ***
pirat 2.9679 0.3858 7.694 1.43e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1744.2 on 2379 degrees of freedom
Residual deviance: 1663.6 on 2378 degrees of freedom
AIC: 1667.6
Number of Fisher Scoring iterations: 6
\(\text{deviance} = -2 \times \left[\ln(\hat f_{s}) - \ln(\hat f_{model}) \right]\) where \(\hat f_{s}\) is the maximized likelihood of the saturated model, a model which assumes that each observation has its own parameters.
\[\text{pseudo-}R^2 = 1 - \frac{logLik(\hat f_{model})}{logLik(\hat f_{null})}\]
where \(\hat f_{null}\) is a model with just an intercept. This can be calculated to assess the model.
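A sketch of computing this by hand (null_mod is an illustrative name; probit_mod comes from the earlier block):

```r
# McFadden pseudo-R^2: compare the fitted model's log-likelihood
# with that of an intercept-only (null) model
null_mod <- glm(deny ~ 1, family = binomial(link = "probit"), data = HMDA)
1 - as.numeric(logLik(probit_mod)) / as.numeric(logLik(null_mod))
```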
From the summary table above,
the change in deviance is \(1744.2 - 1663.6 = 80.6\) with \(df = 2379 - 2378 = 1\), and the p-value is \(2.78 \times 10^{-19}\).
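This p-value can be reproduced from the chi-squared distribution; a minimal sketch:

```r
# p-value for the change in deviance, chi-squared with 1 degree of freedom
pchisq(1744.2 - 1663.6, df = 2379 - 2378, lower.tail = FALSE)
```

The coefficients are re-tested below with heteroskedasticity-robust standard errors: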
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.19415 0.18901 -11.6087 < 2.2e-16 ***
pirat 2.96787 0.53698 5.5269 3.259e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
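The robust z tests above can be computed with coeftest() from the lmtest package together with a sandwich covariance estimator; a sketch, assuming probit_mod from the earlier block (the HC1 type follows the companion book's convention):

```r
# heteroskedasticity-robust z tests for the probit coefficients
library(lmtest)
library(sandwich)
coeftest(probit_mod, vcov. = vcovHC(probit_mod, type = "HC1"))
```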
\[\widehat{P(deny \vert P/I\ ratio)} = \Phi(-\underset{(0.19)}{2.19} + \underset{(0.54)}{2.97} \, (P/I \ ratio))\]
| | Dependent variable: deny |
|---|---|
| pirat | 2.742*** (0.380) |
| afamyes | 0.708*** (0.083) |
| Constant | -2.259*** (0.137) |
| Observations | 2,380 |
| Log Likelihood | -797.136 |
| Akaike Inf. Crit. | 1,600.272 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.258787 0.176608 -12.7898 < 2.2e-16 ***
pirat 2.741779 0.497673 5.5092 3.605e-08 ***
afamyes 0.708155 0.083091 8.5227 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[\widehat{P(deny\vert P/I \ ratio, black)} = \Phi (-\underset{(0.18)}{2.26} + \underset{(0.50)}{2.74} (P/I \ ratio) + \underset{(0.08)}{0.71} \cdot afam)\]
\(y_i \in \{0,1\}\) can also be transformed via \(y \implies F(y)\) such that \(Y \in [0, +\infty) \implies \ln(Y) \in (-\infty,+\infty)\).
To do so, the odds are used.
Assumption: linear relationship between the predictor variables, and the log-odds of the event that \(\displaystyle Y=1\)
Probability and odds
\(\text{probability}=\frac {N_{Y=1}}{N}\)
For example, rolling a fair die: \(P(X\leq2)=\frac{2}{6} \implies \text{odds}=\frac{2/6}{1-2/6}=\frac{2}{4}\)
\(\text{odds}=\frac {\text{Frequency of Y=1}}{\text{Frequency of Y} \neq 1}\)
\(\implies \text{odds}=\frac {\text{(Frequency of Y=1)}/N}{\text{(Frequency of Y} \neq 1)/N}\)
\(\implies \text{odds}=\frac{\text{probability}}{1-\text{probability}}\)
\(\implies \text{probability}=\frac{\text{odds}}{1+\text{odds}}\)
An odds ratio is the ratio of two odds.
Logit regression is similar to probit regression; only the cumulative distribution function, CDF, differs.
CDF:
\(F(x) = \frac{1}{1+e^{-x}}\)
The model,
\(\begin{align*} P(Y=1\vert X_1, X_2, \dots, X_k) =& \, F(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k) \\ =& \, \frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)}}. \end{align*}\)
To understand this, let’s start with log odds
\(l=ln \left( \frac{P(Y=1)}{1-P(Y=1)} \right)= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k\)
\(\implies P(Y=1)= \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k}}{1+e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k}}=\frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)}}\)
\(P(deny=1 \vert P/I\ ratio) = F(\beta_0 + \beta_1(P/I \ ratio))\)
and
\(P(deny=1 \vert P/I\ ratio, black) = F(\beta_0 + \beta_1(P/I \ ratio) + \beta_2 \, black)\)
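A minimal sketch of fitting the two logit models reported below (object names are illustrative):

```r
# logit fits: pirat alone, then adding the afam indicator
logit_mod1 <- glm(deny ~ pirat,
                  family = binomial(link = "logit"), data = HMDA)
logit_mod2 <- glm(deny ~ pirat + afam,
                  family = binomial(link = "logit"), data = HMDA)
```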
| | Dependent variable: deny | |
|---|---|---|
| | (1) | (2) |
| pirat | 5.884*** (0.734) | 5.370*** (0.728) |
| afamyes | | 1.273*** (0.146) |
| Constant | -4.028*** (0.269) | -4.126*** (0.268) |
| Observations | 2,380 | 2,380 |
| Log Likelihood | -830.094 | -795.695 |
| Akaike Inf. Crit. | 1,664.188 | 1,597.390 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | |
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.02843 0.35898 -11.2218 < 2.2e-16 ***
pirat 5.88450 1.00015 5.8836 4.014e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.12556 0.34597 -11.9245 < 2.2e-16 ***
pirat 5.37036 0.96376 5.5723 2.514e-08 ***
afamyes 1.27278 0.14616 8.7081 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\(\widehat{P(deny=1 \vert P/I\ ratio)} = F(-\underset{(0.27)}{4.03} + \underset{(0.73)}{5.88} (P/I \ ratio))\)
\(\widehat{P(deny=1 \vert P/I ratio, black)} = F(-\underset{(0.35)}{4.13} + \underset{(0.96)}{5.37} (P/I \ ratio) + \underset{(0.15)}{1.27} \cdot afam)\)
Model (1):

| | OddsRatio | 2.5 % | 97.5 % |
|---|---|---|---|
| (Intercept) | 0.018 | 0.010 | 0.030 |
| pirat | 359.422 | 88.342 | 1565.122 |

Model (2):

| | OddsRatio | 2.5 % | 97.5 % |
|---|---|---|---|
| (Intercept) | 0.016 | 0.009 | 0.027 |
| pirat | 214.941 | 53.848 | 931.657 |
| afamyes | 3.571 | 2.675 | 4.747 |
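Such odds-ratio tables can be produced by exponentiating the coefficients and their profile-likelihood confidence bounds; a sketch (model objects from the earlier block):

```r
# odds ratios with 95% confidence intervals
exp(cbind(OddsRatio = coef(logit_mod1), confint(logit_mod1)))
exp(cbind(OddsRatio = coef(logit_mod2), confint(logit_mod2)))
```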
For a one-unit increase in \(\text{P/I ratio}\), the odds of being denied increase by a factor of 359.42 (without the African American variable) or 214.94 (with the African American variable).
Being African American increases the odds of being denied by a factor of 3.57 (equivalently, it changes the log-odds of being denied by 1.273).
To follow the book, let’s calculate the difference in denial probabilities between Black and white applicants given \((P/I\ ratio)=0.3\):
\[P(Y=1|(P/Iratio=0.3, black=1))= \frac{1}{1+e^{-(-4.13 + 5.37 \times 0.3 + 1.27)}} \approx 0.224\]
\[P(Y=1|(P/Iratio=0.3, black=0))= \frac{1}{1+e^{-(-4.13 + 5.37 \times 0.3)}} \approx 0.075\]
There is a \(0.149\) difference in the probability of being denied if the applicant is African American.
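A sketch of this calculation with predict() (logit_mod2 from the earlier block):

```r
# predicted denial probabilities at P/I ratio = 0.3 for afam = yes vs. no
newdat <- data.frame(pirat = 0.3, afam = c("yes", "no"))
p <- predict(logit_mod2, newdata = newdat, type = "response")
p            # roughly 0.224 and 0.075
p[1] - p[2]  # roughly 0.149
```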
Logistic regression and probit regression produce similar results; the choice between them largely depends on the researcher’s preference.
The linear probability model violates the homoskedasticity and normality-of-errors assumptions of linear regression, so the standard errors (and hence the hypothesis tests) are invalid.
If the number of successes relative to the number of failures is very small, the stability of the model is questionable.
Both logit and probit regression require more observations than standard regression because both are estimated by maximum likelihood.
Pseudo-\(R^2\) measures exist, but their interpretation differs from the OLS \(R^2\). See the UCLA IDRE web site for more details.
Measures of performance for a probit or logit model are similar to those for classification models.
Since \(R^2\) is invalid as a measure of fit, \(\text{pseudo-}R^2\) may be used (as explained before)
\(\text{pseudo-}R^2 = 1 - \frac{logLik(\hat f^{}_{model})}{logLik(\hat f^{}_{null})}\)
\(\begin{align*} Y_i = \begin{cases} 1 & \text{if} \ \ \hat P(Y_i|X_{i1}, \dots, X_{ik}) > 0.5, \\ 0 & \text{if} \ \ \hat P(Y_i|X_{i1}, \dots, X_{ik}) \leq 0.5. \\ \end{cases} \end{align*}\)
Then the classified \(Y_i\) can be assessed. Beyond splitting the observations into two classes at the probability threshold, the quality of the prediction is simply neglected. The threshold may be set to other values based on other measures, such as information gain.
With the threshold set to \(0.5\), the misclassification error is 0.1176,
and the confusion matrix (rows: predicted class, columns: actual class) is:
0 1
0 2089 274
1 6 11
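A minimal sketch of producing the confusion matrix and the misclassification error (logit_mod2 from the earlier block):

```r
# classify with a 0.5 threshold; rows are predictions, columns actual outcomes
pred   <- as.numeric(predict(logit_mod2, type = "response") > 0.5)
actual <- as.numeric(HMDA$deny == "yes")
table(pred, actual)   # confusion matrix
mean(pred != actual)  # misclassification error, about 0.1176
```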
\(TPR=\frac{TP}{TP+FN}\) and \(FPR=\frac{FP}{FP+TN}\)
The ROC curve built from these rates can be summarized as the probability that a randomly chosen positive case is ranked above a randomly chosen negative one. Comparing the two models:
\(\widehat{P(deny=1 \vert P/I ratio)} = F(-\underset{(0.27)}{4.03} + \underset{(0.73)}{5.88} (P/I \ ratio))\)
\[\widehat{P(deny=1 \vert P/I ratio, afam)} = F(-\underset{(0.35)}{4.13} + \underset{(0.96)}{5.37} (P/I \ ratio) + \underset{(0.15)}{1.27} \cdot afam)\]
More: read the following nice tutorial:
https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/
Example: interpretation of McFadden’s pseudo-\(R^2\):
https://stats.stackexchange.com/questions/82105/mcfaddens-pseudo-r2-interpretation
For \(\rho^2 = 1 - LL_{mod}/LL_0\) (note that \(LL\) is always negative), values in \(0.2 < \rho^2 < 0.4\) are considered very good, and values above \(0.5\) excellent.
Here are some of the measures:
Probit Regression Model
fitting null model for pseudo-r2
llh llhNull G2 McFadden r2ML
-797.13603842 -872.08530450 149.89853216 0.08594259 0.06104017
r2CU
0.11750696
Logistic Regression Model
fitting null model for pseudo-r2
llh llhNull G2 McFadden r2ML
-795.69520837 -872.08530450 152.78019227 0.08759475 0.06217635
r2CU
0.11969421
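These measures match the output format of pR2() from the pscl package; a minimal sketch of the calls, assuming the models with pirat and afam from the earlier tables (probit2 is an illustrative name):

```r
# pseudo-R^2 measures via pR2() from the pscl package
library(pscl)
probit2 <- glm(deny ~ pirat + afam,
               family = binomial(link = "probit"), data = HMDA)
pR2(probit2)      # probit model
pR2(logit_mod2)   # logit model from the earlier sketch
```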
Linearity
In logistic regression, we assume the relationship is linear on the logit scale (link scale)
Diagnose the violations of linearity by plotting each predictor against its component-plus-residual
Add a linear fit of the points (a dashed red line in the below example), and smoothed conditional mean line (blue line)
If a linear fit is appropriate, the two lines will be close to each other
A systematic change (rise and/or fall) in the smoothed line suggests a nonlinear relationship.
If a nonlinear relationship exists, consider adding a new (or transformed) variable.
Example: HMDA data, logistic regression
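A sketch of producing such component-plus-residual plots with crPlots() from the car package (logit_full is an illustrative name; its specification matches the model used in the multicollinearity example below):

```r
# component-plus-residual plots for the continuous predictors
library(car)
logit_full <- glm(deny ~ pirat + hirat + lvrat + unemp + afam,
                  family = binomial(link = "logit"), data = HMDA)
crPlots(logit_full, terms = ~ pirat + hirat + lvrat + unemp)
```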
Multicollinearity
Diagnosis is similar to the linear model.
The example given above has only one continuous regressor; hence, a new model with several regressors is created below.
Variance Inflation Factors (VIFs)
Call:
glm(formula = deny ~ pirat + hirat + lvrat + unemp + afam, family = binomial(link = "logit"),
data = HMDA)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.34016 0.47195 -13.434 < 2e-16 ***
pirat 5.77636 0.95106 6.074 1.25e-09 ***
hirat -1.17531 1.10280 -1.066 0.28654
lvrat 2.69435 0.44695 6.028 1.66e-09 ***
unemp 0.08117 0.03029 2.680 0.00736 **
afamyes 1.15905 0.15062 7.695 1.41e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1744.2 on 2379 degrees of freedom
Residual deviance: 1544.6 on 2374 degrees of freedom
AIC: 1556.6
Number of Fisher Scoring iterations: 5
Variance Inflation Factors
pirat hirat lvrat unemp afam
1.633048 1.635978 1.023026 1.019203 1.037514
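The VIFs above can be reproduced with vif() from the car package; a sketch (logit_full from the block above):

```r
# variance inflation factors for the richer logit model
library(car)
vif(logit_full)
```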
Outlier Effects
DFFITS and DFBETAS can be used to detect influential observations
DFFITS (one value per observation) measures how much the prediction for that observation would change if it were deleted.
DFBETAS (one value per observation and predictor) measures how much each coefficient would change if the observation were deleted.
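A minimal sketch with the base-R dffits() and dfbetas() functions (logit_full assumed from above):

```r
# influence measures: one DFFITS value per observation,
# one DFBETAS value per observation and coefficient
head(dffits(logit_full))
head(dfbetas(logit_full))
```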