Decision Tree

I. Ozkan

Spring 2025

Objectives

Reading Suggestions

Book Chapters

Internet Resources

Keywords

Example: Titanic Data (Classification Tree)

Classification Tree Example (FFTrees Package)
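A minimal sketch of how such a tree might be fit with the FFTrees package; the formula and the use of the titanic data bundled with the package are assumptions, not the exact call behind the slide:

library(FFTrees)                      # fast-and-frugal classification trees

fft_fit <- FFTrees(survived ~ .,      # predict survival from the remaining columns
                   data = titanic)    # Titanic data shipped with FFTrees (assumed)
plot(fft_fit)                         # tree diagram and accuracy statistics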

Example: Boston Housing Data


Linear Regression: Boston Housing Data


Call:
lm(formula = medv ~ ., data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.595  -2.730  -0.518   1.777  26.199 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288    
chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229    
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
black        9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7338 
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
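The output above can be reproduced with the call shown in its header (the Boston data ship with the MASS package):

library(MASS)                           # Boston housing data
summary(lm(medv ~ ., data = Boston))    # regress median home value on all predictors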

Example: Boston Housing Data (Regression Tree)
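A minimal sketch of a regression tree for the same data, assuming the rpart package (the exact package and settings behind the slide are not shown):

library(MASS)
library(rpart)                                   # CART-style trees (Breiman et al., 1984)

boston_tree <- rpart(medv ~ ., data = Boston, method = "anova")
print(boston_tree)                               # split rules and node predictions
boston_tree$variable.importance                  # importance implied by the splits
plot(boston_tree); text(boston_tree)             # simple tree diagram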

Example: Teaching Ratings

Data on course evaluations, course characteristics, and professor characteristics for 463 courses for the academic years 2000–2002 at the University of Texas at Austin. The data are available in the R AER package.
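A minimal sketch for loading the data (the data set is called TeachingRatings in AER):

library(AER)                 # Applied Econometrics with R companion package
data("TeachingRatings")      # 463 courses, UT Austin, 2000-2002
str(TeachingRatings)         # evaluation, beauty, gender, age, tenure, ...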


Decision Tree

Rule 1:

If rm < 6.9 and lstat >= 14 and crim >= 7 then medv = 12

Rule 2:

If rm >= 6.941 then medv = 45.1

… etc.

Importance:

Given this tree structure, rm is the most important variable, followed by the other splitting variables (lstat, crim, …).

The algorithm used to split the data is based on:

Breiman L., Friedman J. H., Olshen R. A., and Stone, C. J. (1984) Classification and Regression Trees. Wadsworth.

Splitting

Gini Impurity and Gini Gain

\(Gini \; Impurity= \sum_{i=1}^{S}p_i(1-p_i)=\sum_{i=1}^{S}(p_i-p_i^2)=1-\sum_{i=1}^{S}p_i^2\)

Let the attributes be \(z_j, \; j=1, \cdots,k\). The Gini gain for attribute \(z_j\) is given as

\(Gini \; Impurity\:(Before\;Split)-Weighted \; Gini \; Impurity \:(After \;Split \; z_j<z_j^*)\)
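Both quantities can be written as small helper functions. This is a sketch that simply mirrors the formulas above; the function names are my own:

# Gini impurity of a vector of class labels
gini_impurity <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Gini gain of splitting the labels y at x < threshold
gini_gain <- function(x, y, threshold) {
  left  <- y[x <  threshold]
  right <- y[x >= threshold]
  gini_impurity(y) -
    (length(left)  / length(y) * gini_impurity(left) +
     length(right) / length(y) * gini_impurity(right))
}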

Gini Impurity and Gini Gain

Ex: (5 Red and 5 Green)

\(X\) \(Y\)
2.1 Green
1.1 Red
3 Green
0.5 Red
2.5 Green
1.5 Red
0.2 Red
5 Green
1.8 Red
4.2 Green

\(Gini\:Impurity(y)=\frac {5}{10} \times (1-\frac {5}{10})+\frac {5}{10} \times (1-\frac {5}{10})=0.5\)

or

\(1- \Big(\big(\frac {5}{10}\big)^2+\big(\frac {5}{10}\big)^2\Big)=1-\frac {1}{2}=0.5\)

Gini Impurity and Gini Gain

Assume the data are split at \(X<1.7\); the numbers of observations are then 4 and 6 respectively. The first set contains only Red. The second set has 5 greens and 1 red.

\(Gini\:Impurity(Y \: split \: at \; X<1.7)= \frac {4}{10} \times 0 + \frac {6}{10} \times \big[\frac {1}{6}(1-\frac {1}{6})+\frac {5}{6}(1-\frac {5}{6}) \big]\)

\(Gini\:Impurity(Y \: split \: at \; X<1.7)= \frac {4}{10} \times 0 + \frac {6}{10} \times 0.278=0.167\)

\(Gini\:Gain(Y \: split \: at \; X<1.7)= 0.5-0.167=0.333\)

One of the best splits is \(X<2\), which divides the data so that each subset contains only Green or only Red; the Gini impurity after this split is therefore 0.
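Using the helper functions sketched earlier, the numbers in this example can be checked directly:

x <- c(2.1, 1.1, 3, 0.5, 2.5, 1.5, 0.2, 5, 1.8, 4.2)
y <- c("Green", "Red", "Green", "Red", "Green",
       "Red", "Red", "Green", "Red", "Green")

gini_impurity(y)        # 0.5   (before the split)
gini_gain(x, y, 1.7)    # 0.333 (split at X < 1.7)
gini_gain(x, y, 2)      # 0.5   (split at X < 2 leaves two pure subsets)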

Entropy, Conditional Entropy, Information Gain

\(H(X)=-\sum_{x_j \in S } p_j \: log(p_j)\), with the convention \(p_j \: log(p_j) \equiv 0\) for \(p_j=0\)

Another way of seeing this function:

\(H(X)=\sum_{x_j \in S } p_j \: log\big(\frac{1}{p_j}\big)=E\Big[log \: \frac{1}{p(X)}\Big]\)

Entropy, Conditional Entropy, Information Gain

Let the attributes be \(z_j, \; j=1, \cdots,k\). The Information Gain for attribute \(z_j\) is given as

\(IG(Y|z_j; z_j^*)=Entropy \:(Before\;Split)- Weighted \; Entropy \:(After \;Split \; z_j<z_j^*)\)

\(IG(Y|z_j; z_j^*)=H(Y) - \big(p(z_j< z_j^*) H(Y|z_j < z_j^*)+p(z_j \geq z_j^*) H(Y|z_j \geq z_j^*)\big)\)
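As with the Gini measures, entropy and information gain can be written as small helper functions; a sketch with my own function names, mirroring the formulas above:

# Entropy (base 2) of a vector of class labels
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]                 # convention: terms with p = 0 contribute 0
  -sum(p * log2(p))
}

# Information gain of splitting the labels y at x < threshold
info_gain <- function(x, y, threshold) {
  left  <- y[x <  threshold]
  right <- y[x >= threshold]
  entropy(y) -
    (length(left)  / length(y) * entropy(left) +
     length(right) / length(y) * entropy(right))
}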

Information Gain

Ex: (5 Red and 5 Green)

\(X\) \(Y\)
2.1 Green
1.1 Red
3 Green
0.5 Red
2.5 Green
1.5 Red
0.2 Red
5 Green
1.8 Red
4.2 Green

\(P(Y=Green)=P(Y=Red)=\frac{5}{10}=\frac{1}{2}=2^{-1}\)

\(H(Y)=-P(Y=Green) \times log_2(P(Y=Green)) - (1-P(Y=Green)) \times log_2(1-P(Y=Green))\)

Information Gain

\(H(Y)=-\frac {1}{2} \times log_2(\frac {1}{2})-(1-\frac {1}{2}) \times log_2(1-\frac {1}{2})=1\)

Assume the data are split at \(X<1.7\); the numbers of observations are then 4 and 6 respectively. The first set contains only Red, hence its entropy is zero. The second set has 5 greens and 1 red. The entropy of the second set is then

\(H(Y|X \geq 1.7)=-\frac {1}{6} \times log_2(\frac {1}{6})-\frac {5}{6} \times log_2(\frac {5}{6})=0.65\)

\(Entropy(Y \: split \: at \; X<1.7)= \frac {n(X<1.7)}{10} \times 0 + \frac {n(X\geq 1.7)}{10} \times 0.65\)

\(H(Y \: split \: at \; X<1.7)= \frac {4}{10} \times 0 + \frac {6}{10} \times 0.65=0.39\)

\(Information\:Gain(Y | X;1.7)= 1-0.39=0.61\)
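With the entropy helpers sketched earlier, the worked example can be verified:

x <- c(2.1, 1.1, 3, 0.5, 2.5, 1.5, 0.2, 5, 1.8, 4.2)
y <- c("Green", "Red", "Green", "Red", "Green",
       "Red", "Red", "Green", "Red", "Green")

entropy(y)              # 1    (before the split)
info_gain(x, y, 1.7)    # 0.61 (split at X < 1.7)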

Model Based Tree

Rule 1 (Node 6, with 69 Obs.):

If gender = female and age < 40 then \(eval=4+0.12 \times beauty+\varepsilon\)

..

etc.

*: These examples were produced using different R packages.
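One way to fit such a model-based tree is lmtree() from the partykit package; the set of partitioning variables below is an assumption about which course and professor characteristics were used:

library(AER)
library(partykit)                    # model-based recursive partitioning (lmtree)

data("TeachingRatings")
mob_fit <- lmtree(eval ~ beauty | gender + age + minority + native + tenure + division,
                  data = TeachingRatings)
mob_fit                              # a separate eval ~ beauty regression in each node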

Some Pros and Cons of Decision Trees

Simple Example using iris data (Classification Tree)
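A minimal sketch of the iris tree, assuming the rpart package:

library(rpart)

iris_tree <- rpart(Species ~ ., data = iris, method = "class")
print(iris_tree)                        # e.g. Petal.Length < 2.45 -> setosa
plot(iris_tree); text(iris_tree)        # simple tree diagram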

More Complex Classification Tree

Pruning Decision Tree

\(\sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat y_{R_m})^2 + \alpha |T|\)

where \(\alpha\) is a non-negative tuning parameter, \(|T|\) is the number of terminal nodes, \(R_m\) is the region corresponding to the \(m^{th}\) terminal node, and \(\hat y_{R_m}\) is the mean of the training observations in \(R_m\).
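In rpart the complexity parameter cp plays the role of \(\alpha\) (up to scaling), so cost-complexity pruning can be sketched as follows; the Boston data and the starting cp value are illustrative assumptions:

library(MASS)
library(rpart)

big_tree <- rpart(medv ~ ., data = Boston, cp = 0.001)   # deliberately deep tree
printcp(big_tree)                                         # cross-validated error per cp

# prune back to the cp with the smallest cross-validated error
best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(big_tree, cp = best_cp)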

Model Based Example: Demand for Economic Journals

Example: Demand for Economic Journals

Rules for Journals Data

Rule 1 (Young Journals, 53 Observations):

if \(age \leq 18\) then \(ln(subs)=4.35 -0.605 \times ln(\frac {price}{citations}) + \varepsilon_{young}\)

Rule 2 (Old Journals, 127 Observations):

if \(age > 18\) then \(ln(subs)=5.01 -0.403\times ln(\frac {price}{citations})+ \varepsilon_{old}\)
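These rules correspond to a model-based tree of the kind fit by partykit::lmtree(). A sketch under the assumption that the Journals data from AER are used, with age and price per citation constructed from the raw variables:

library(AER)
library(partykit)

data("Journals")
journals <- transform(Journals,
                      age       = 2000 - foundingyear,   # journal age in 2000
                      citeprice = price / citations)     # price per citation

journal_tree <- lmtree(log(subs) ~ log(citeprice) | age, data = journals)
journal_tree          # one regression of log(subs) on log(price/citations) per node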