Decision Tree

I. Ozkan

Spring 2025

Learning Objectives

Reading Suggestions

Book Chapters

Internet Resources

Keywords

Tree Like Structure of Data (Titanic Data)

Classification Tree Example (FFTrees Package)

Another Example Boston Housing Data

Boston Housing Data

Boston Housing Data (Linear Regression)


Call:
lm(formula = medv ~ ., data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.595  -2.730  -0.518   1.777  26.199 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288    
chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229    
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
black        9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7338 
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
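
The summary above can be reproduced with a few lines of R. This is a minimal sketch, assuming the Boston data come from the MASS package (whose variable names match the output shown):

    # Linear regression benchmark on the Boston housing data (sketch)
    library(MASS)                           # provides the Boston data frame
    fit_lm <- lm(medv ~ ., data = Boston)   # regress median value on all predictors
    summary(fit_lm)                         # coefficient table as shown above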

Boston Housing Data (Regression Tree)
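
A regression tree for the same response can be grown along the following lines. This is a sketch assuming the rpart package is used; the slide's actual code may differ.

    # Regression tree on the Boston housing data (sketch, rpart assumed)
    library(MASS)      # Boston data
    library(rpart)     # CART-style trees in R
    fit_tree <- rpart(medv ~ ., data = Boston, method = "anova")  # regression tree
    print(fit_tree)                                # text description of the splits
    plot(fit_tree); text(fit_tree, use.n = TRUE)   # simple plot of the fitted tree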

Teaching Ratings (skipped)

Data on course evaluations, course characteristics, and professor characteristics for 463 courses for the academic years 2000–2002 at the University of Texas at Austin. The data are available in the R AER package.
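
A one-line sketch for loading these data, assuming the dataset is the TeachingRatings data shipped with the AER package:

    # Load the course evaluation data (sketch; dataset name assumed)
    data("TeachingRatings", package = "AER")
    str(TeachingRatings)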

Another Example Teaching Ratings (skipped)

Decision Tree

- The tree is drawn upside down: rules are read starting from the top (root) node.

Decision Tree

Rule 1:

If rm < 6.941 and lstat >= 14.4 and crim >= 6.99237 then medv = 11.98

Rule 2:

If rm >= 6.941 then medv = 45.1

… etc.

Importance:

rm is the most important variable, lstat the second most important, and so on, given this tree structure.
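
With an rpart fit such as the regression-tree sketch shown earlier, the split rules and the importance ranking can be inspected directly. The calls below are a sketch, not the slide's original code.

    # Continuing the regression-tree sketch (rpart assumed)
    library(MASS); library(rpart)
    fit_tree <- rpart(medv ~ ., data = Boston)
    print(fit_tree)                   # leaves (marked *) give the fitted medv for each rule path
    fit_tree$variable.importance      # variables ranked by importance, given this tree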

The algorithm used to split the data is based on:

Breiman L., Friedman J. H., Olshen R. A., and Stone, C. J. (1984) Classification and Regression Trees. Wadsworth.

Splitting

Gini Impurity and Gini Gain

\(Gini \; Impurity= \sum_{i=1}^{S}p_i(1-p_i)=\sum_{i=1}^{S}(p_i-p_i^2)=1-\sum_{i=1}^{S}p_i^2\)

Let the attributes be \(z_j, \; j=1, \cdots,k\). The Gini gain for attribute \(z_j\) is given as,

\(Gini \; Impurity\:(Before\;Split)-Weighted \; Gini \; Impurity \:(After \;Split \; z_j<z_j^*)\)

Gini Impurity and Gini Gain

Ex: (5 Red and 5 Green)

\(X\)    \(Y\)
2.1      Green
1.1      Red
3        Green
0.5      Red
2.5      Green
1.5      Red
0.2      Red
5        Green
1.8      Red
4.2      Green

\(Gini\:Impurity(y)=\frac {5}{10} \times (1-\frac {5}{10})+\frac {5}{10} \times (1-\frac {5}{10})=0.5\)

or

\(1- \Big(\big(\frac {5}{10}\big)^2+\big(\frac {5}{10}\big)^2\Big)=1-\frac {1}{2}=0.5\)

Gini Impurity and Gini Gain

Assume the data are split at \(X<1.7\); the numbers of observations are then 4 and 6, respectively. The first set contains only Red. The second set has 5 Greens and 1 Red.

\(Gini\:Impurity(Y \: split \: at \; X<1.7)= \frac {4}{10} \times 0 + \frac {6}{10} \times \big[\frac {1}{6}(1-\frac {1}{6})+\frac {5}{6}(1-\frac {5}{6}) \big]\)

\(Gini\:Impurity(Y \: split \: at \; X<1.7)= \frac {4}{10} \times 0 + \frac {6}{10} \times 0.278 \approx 0.167\)

\(Gini\:Gain(Y \: split \: at \; X<1.7)= 0.5-0.167 \approx 0.333\)

One of the best splits is at \(X<2\), which separates the data so that each subset contains only Green or only Red, which means the Gini impurity after splitting is 0.
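
The figures above can be checked with a few lines of R. A minimal sketch using the ten observations of the toy example:

    # Gini impurity and Gini gain for the toy example (sketch)
    x <- c(2.1, 1.1, 3, 0.5, 2.5, 1.5, 0.2, 5, 1.8, 4.2)
    y <- c("G", "R", "G", "R", "G", "R", "R", "G", "R", "G")
    gini <- function(labels) 1 - sum(prop.table(table(labels))^2)
    gini(y)                                    # 0.5 before the split
    left  <- y[x <  1.7]                       # 4 observations, all Red
    right <- y[x >= 1.7]                       # 6 observations, 5 Green / 1 Red
    w_gini <- length(left) / length(y) * gini(left) +
              length(right) / length(y) * gini(right)
    w_gini                                     # about 0.167
    gini(y) - w_gini                           # Gini gain, about 0.333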

Entropy, Conditional Entropy, Information Gain

\(H(X)=-\sum_{x_j \in S } p_j \: log(p_j), \: with \: the \: convention \: p_j \: log(p_j) \equiv 0 \: for \: p_j=0\)

Another way to write this function:

\(H(X)=\sum_{x_j \in S } p_j \: log\big(p_j^{-1}\big)=\sum_{x_j \in S } p_j \: log \frac{1}{p_j}=E\Big[log \frac{1}{p_j}\Big]\)

Entropy, Conditional Entropy, Information Gain

Let the attributes be \(z_j, \; j=1, \cdots,k\). The Information Gain for attribute \(z_j\) is given as,

\(IG(Y|z_j; z_j^*)=Entropy \:(Before\;Split)- Weighted \; Entropy \:(After \;Split \; z_j<z_j^*)\)

\(IG(Y|z_j; z_j^*)=H(Y) - \big(p(z_j< z_j^*) H(Y|z_j < z_j^*)+p(z_j \geq z_j^*) H(Y|z_j \geq z_j^*)\big)\)

Information Gain

Ex: (5 Red and 5 Green)

\(X\)    \(Y\)
2.1      Green
1.1      Red
3        Green
0.5      Red
2.5      Green
1.5      Red
0.2      Red
5        Green
1.8      Red
4.2      Green

\(P(Y=Green)=P(Y=Red)=\frac{5}{10}=\frac{1}{2}=2^{-1}\)

\(H(Y)=-P(Y=Green) \times log_2(P(Y=Green)) - (1-P(Y=Green)) \times log_2(1-P(Y=Green))\)

Information Gain

\(H(Y)=-\frac {1}{2} \times log_2(\frac {1}{2})-(1-\frac {1}{2}) \times log_2(1-\frac {1}{2})=1\)

Assume the data are split at \(X<1.7\); the numbers of observations are then 4 and 6, respectively. The first set contains only Red, hence its entropy is zero. The second set has 5 Greens and 1 Red. The entropy of the second set is then,

\(H(Y|X \geq 1.7)=-\frac {1}{6} \times log_2(\frac {1}{6})-\frac {5}{6} \times log_2(\frac {5}{6})=0.65\)

\(Entropy(Y \: split \: at \; X<1.7)= \frac {\#\{X<1.7\}}{10} \times 0 + \frac {\#\{X \geq 1.7\}}{10} \times 0.65\)

\(H(Y \: split \: at \; X<1.7)= \frac {4}{10} \times 0 + \frac {6}{10} \times 0.65=0.39\)

\(Information\:Gain(Y | X;1.7)= 1-0.39=0.61\)
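
The same split can be evaluated with entropy and information gain in a few lines of R; a sketch for the same toy data:

    # Entropy and information gain for the toy example (sketch)
    x <- c(2.1, 1.1, 3, 0.5, 2.5, 1.5, 0.2, 5, 1.8, 4.2)
    y <- c("G", "R", "G", "R", "G", "R", "R", "G", "R", "G")
    entropy <- function(labels) {
      p <- prop.table(table(labels))   # class proportions (never zero here)
      -sum(p * log2(p))
    }
    entropy(y)                                  # 1 bit before the split
    left  <- y[x <  1.7];  right <- y[x >= 1.7]
    w_ent <- length(left) / length(y) * entropy(left) +
             length(right) / length(y) * entropy(right)
    w_ent                                       # about 0.39
    entropy(y) - w_ent                          # information gain, about 0.61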

Simple Example using iris data (Classification Tree)
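
The iris slides are figure-based; the sketch below shows one way such a classification tree could be fit, assuming rpart (not necessarily the code used on the slides).

    # Classification tree on the iris data (sketch, rpart assumed)
    library(rpart)
    iris_fit <- rpart(Species ~ ., data = iris, method = "class")
    print(iris_fit)                                # split rules and class frequencies
    plot(iris_fit); text(iris_fit, use.n = TRUE)
    table(predicted = predict(iris_fit, type = "class"), actual = iris$Species)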

Simple Example using iris data (A Complex one)
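
A deliberately over-grown ("complex") tree can be obtained by relaxing the stopping rules; the control values below are illustrative assumptions only.

    # An over-grown classification tree on iris (sketch)
    library(rpart)
    iris_big <- rpart(Species ~ ., data = iris, method = "class",
                      control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))
    plot(iris_big); text(iris_big, use.n = TRUE)   # far more splits than the simple tree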

Pruning Decision Tree (Statistical Learning Course Slides)

Pruning Decision Tree

\[\sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat y_{R_m})^2 + \alpha |T|\] where \(\alpha\) is a non-negative tuning parameter, \(|T|\) is the number of terminal nodes, \(R_m\) is the subset (region) corresponding to the \(m^{th}\) terminal node, and \(\hat y_{R_m}\) is the mean of the training observations in \(R_m\).
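
In rpart, cost-complexity pruning amounts to choosing the complexity parameter cp (a rescaled version of \(\alpha\)) and calling prune(). A sketch on the Boston regression tree, with cp picked by cross-validated error:

    # Cost-complexity pruning (sketch)
    library(MASS); library(rpart)
    fit <- rpart(medv ~ ., data = Boston, cp = 0.001)             # grow a large tree
    printcp(fit)                                                  # CV error for each cp value
    best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
    pruned  <- prune(fit, cp = best_cp)                           # subtree with lowest CV error
    print(pruned)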

Some Pros and Cons of Decision Trees