I. Ozkan
Spring 2025
Book Chapters
Internet Resources
Nodes
Edges
Leaves, Terminal Nodes
Top-Down Induction of Trees
Regression Trees, Classification Trees (generic term: Classification And Regression Trees, CART)
Entropy, Conditional Entropy, Information Gain (not in great detail)
Decision Rules (Rule Based Systems)
Call:
lm(formula = medv ~ ., data = Boston)
Residuals:
    Min      1Q  Median      3Q     Max 
-15.595  -2.730  -0.518   1.777  26.199 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288    
chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229    
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
black        9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7338 
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
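The summary above can be reproduced with a call along the following lines; a minimal sketch, assuming the Boston housing data shipped with the MASS package.

```r
# Minimal sketch: reproduce the linear model summary above,
# assuming the Boston data from the MASS package.
library(MASS)                          # provides the Boston housing data
fit_lm <- lm(medv ~ ., data = Boston)  # regress medv on all remaining variables
summary(fit_lm)                        # coefficient table, R-squared, F-statistic
```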
Data on course evaluations, course characteristics, and professor characteristics for 463 courses for the academic years 2000–2002 at the University of Texas at Austin. The data are available from the R AER package.
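A minimal sketch for loading these data, assuming they are the TeachingRatings data set provided by the AER package.

```r
# Load the course-evaluation data (assumed to be the TeachingRatings
# data set of the AER package).
# install.packages("AER")  # if not already installed
data("TeachingRatings", package = "AER")
str(TeachingRatings)       # course, instructor and evaluation variables
```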
A tree-like structure of the data
The algorithm creates regions that are easily converted to rules
- Rules are read from the top down, starting with the root node.
- The order from top to bottom also indicates the relative importance of the variables.
The above examples are, respectively, a Classification Tree (Titanic), a Regression Tree (Boston) and a Model-Based Tree (Teaching Ratings).
A Classification Tree is a tree whose outcome is a categorical label
A Regression Tree is a tree whose outcome is a real number
A Model-Based Tree is a tree whose terminal nodes contain fitted models
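As an illustration of the third type, here is a hedged sketch of a model-based tree fitted with partykit::lmtree on the Teaching Ratings data; the regressor (beauty) and the partitioning variables chosen below are illustrative assumptions, not necessarily the specification used for the figure above.

```r
# Hedged sketch of a model-based tree: each terminal node contains its own
# linear model of evaluation score on beauty; the partitioning variables
# listed after "|" are illustrative choices.
library(partykit)
data("TeachingRatings", package = "AER")
mb_tree <- lmtree(eval ~ beauty | gender + minority + tenure + division,
                  data = TeachingRatings)
plot(mb_tree)  # tree with a fitted model in every leaf
```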
Rule 1:
If rm < 6.941 and lstat >= 14.4 and crim >= 6.99237 then medv = 11.98
Rule 2:
If rm >= 6.941 then medv = 45.1
… etc.
Importance:
rm is the most important variable, lstat is the second most important variable, ..., given this tree structure
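A minimal sketch of how such a tree and the rules above can be obtained, assuming the rpart package and the MASS Boston data; with default settings the first splits are typically on rm and lstat, as in the rules shown.

```r
# Minimal sketch: regression tree for Boston house prices (rpart defaults).
library(rpart)
library(MASS)
tree_boston <- rpart(medv ~ ., data = Boston)
print(tree_boston)               # splits and leaf means, i.e. the rules above
tree_boston$variable.importance  # importance ranking implied by the tree
```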
The algorithm used to split the data is based on:
Breiman L., Friedman J. H., Olshen R. A., and Stone, C. J. (1984) Classification and Regression Trees. Wadsworth.
For Classification and Regression Trees,
in order to split the data set, either
Gini Gain (based on Gini Impurity) or
Information Gain is used
For Model-Based Trees, splits are typically chosen using statistical tests of parameter instability (model-based recursive partitioning)
Assume that there are \(S\) classes. The probability of picking an observation from class \(i\) is \(p_i, \; i=1,\cdots, S\).
Gini Impurity is then:
\(Gini \; Impurity= \sum_{i=1}^{S}p_i(1-p_i)=\sum_{i=1}^{S}(p_i-p_i^2)=1-\sum_{i=1}^{S}p_i^2\)
Gini Impurity results in a value of 0 if the data set contains only one class (\(p_i=1\)).
Gini Gain measures how much the Gini Impurity is reduced by splitting the data on an attribute.
Let the attributes be \(z_j, \; j=1, \cdots,k\). The Gini Gain for attribute \(z_j\) with split point \(z_j^*\) is given as,
\(Gini \; Impurity\:(Before\;Split)-Weighted \; Gini \; Impurity \:(After \;Split \; z_j<z_j^*)\)
Then the best split is obtained by maximizing the Gini Gain.
Splitting the data set continues until no further gain is possible.
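The computation can be written down in a few lines; a minimal sketch with helper functions written for this example (not from any package), assuming a numeric attribute split at a threshold of the form \(x < x^*\).

```r
# Helper functions (written for illustration, not from a package):
# Gini impurity of a label vector and Gini gain of a split x < threshold.
gini_impurity <- function(y) {
  p <- table(y) / length(y)     # class proportions p_i
  1 - sum(p^2)                  # 1 - sum(p_i^2)
}

gini_gain <- function(x, y, threshold) {
  left  <- y[x <  threshold]
  right <- y[x >= threshold]
  weighted <- length(left)  / length(y) * gini_impurity(left) +
              length(right) / length(y) * gini_impurity(right)
  gini_impurity(y) - weighted   # impurity before minus weighted impurity after
}
```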
Ex: (5 Red and 5 Green)
| \(X\) | \(Y\) | 
|---|---|
| 2.1 | Green | 
| 1.1 | Red | 
| 3 | Green | 
| 0.5 | Red | 
| 2.5 | Green | 
| 1.5 | Red | 
| 0.2 | Red | 
| 5 | Green | 
| 1.8 | Red | 
| 4.2 | Green | 
\(Gini\:Impurity(y)=\frac {5}{10} \times (1-\frac {5}{10})+\frac {5}{10} \times (1-\frac {5}{10})=0.5\)
or
\(1- \Big(\big(\frac {5}{10}\big)^2+\big(\frac {5}{10}\big)^2\Big)=1-\frac {1}{2}=0.5\)
Assume we split the data at \(X<1.7\); then the numbers of observations are 4 and 6, respectively. The first set contains only Red. The second set has 5 Greens and 1 Red.
\(Gini\:Impurity(Y \: split \: at \; X<1.7)= \frac {4}{10} \times 0 + \frac {6}{10} \times \big[\frac {1}{6}(1-\frac {1}{6})+\frac {5}{6}(1-\frac {5}{6}) \big]\)
\(Gini\:Impurity(Y \: split \: at \; X<1.7)= \frac {4}{10} \times 0 + \frac {6}{10} \times 0.278=0.167\)
\(Gini\:Gain(Y \: split \: at \; X<1.7)= 0.5-0.167=0.333\)
One of the best splits is \(X<2\), which divides the data so that each subset contains only Green or only Red; the weighted Gini Impurity after this split is therefore 0
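Applying these helpers to the toy data confirms the numbers above (a sketch, assuming gini_impurity() and gini_gain() as defined earlier).

```r
# Toy data: 5 Red and 5 Green observations, as in the table above.
x <- c(2.1, 1.1, 3, 0.5, 2.5, 1.5, 0.2, 5, 1.8, 4.2)
y <- c("Green", "Red", "Green", "Red", "Green",
       "Red", "Red", "Green", "Red", "Green")

gini_impurity(y)      # 0.5
gini_gain(x, y, 1.7)  # 0.5 - 0.167 = 0.333
gini_gain(x, y, 2)    # 0.5 - 0     = 0.5, a perfect split
```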
Information Size, Information Content, Uncertainty
Information *generated*, *stored* or *transmitted* as a variable
Information Entropy is the average rate at which information is produced by a stochastic source of data (Wikipedia).
\(H(X)=-\sum_{x_j \in S } p_j \: log(p_j), \:for \; p_j \neq 0, \: and \: log(p_j) \equiv 0 \: for \: p_j=0\)
Another way of seeing this function:
\(H(X)=\sum_{x_j \in S } p_j \: \log\big(\tfrac{1}{p_j}\big)=E\big[\log \tfrac{1}{p_j}\big]\)
The logarithm is just a transformation.
\(\frac{1}{p_j}\) increases as \(p_j\) decreases: a rare event has a large inverse probability, hence a high information content.
For \(p_j=0\) or \(p_j=1\) the entropy is \(0\) (certain cases); otherwise it increases towards the most uncertain case.
For example, flip a coin: \(P(head)+P(tail)=1\). Let \(P(head)=0.5\); then \(H(Flip \: Coin)=-0.5 \log_2(0.5)-0.5 \log_2(0.5)=1\). For all other cases (\(P(head) \neq 0.5\)) the entropy is less than \(1\).
Hence entropy is related to the uncertainty of the outcome
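A small numerical illustration of the entropy formula (base-2 logarithm), written for this example; entropy() below is a helper, not a library function.

```r
# Entropy of a discrete distribution given as a vector of probabilities,
# using the convention that terms with p = 0 contribute nothing.
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

entropy(c(0.5, 0.5))  # fair coin: 1 bit (maximum uncertainty)
entropy(c(0.9, 0.1))  # biased coin: about 0.47 bits
entropy(c(1, 0))      # certain outcome: 0 bits
```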
Let the attributes be \(z_j, \; j=1, \cdots,k\). The Information Gain for attribute \(z_j\) with split point \(z_j^*\) is given as,
\(IG(Y|z_j; z_j^*)=Entropy \:(Before\;Split)- Weighted \; Entropy \:(After \;Split \; z_j<z_j^*)\)
\(IG(Y|z_j; z_j^*)=H(Y) - \big(p(z_j< z_j^*) H(Y|z_j < z_j^*)+p(z_j \geq z_j^*) H(Y|z_j \geq z_j^*)\big)\)
Then the best split is obtained by maximizing the Information Gain.
Splitting the data set continues until no further gain is possible.
Ex: (5 Red and 5 Green)
| \(X\) | \(Y\) | 
|---|---|
| 2.1 | Green | 
| 1.1 | Red | 
| 3 | Green | 
| 0.5 | Red | 
| 2.5 | Green | 
| 1.5 | Red | 
| 0.2 | Red | 
| 5 | Green | 
| 1.8 | Red | 
| 4.2 | Green | 
\(P(Y=Green)=P(Y=Red)=\frac{5}{10}=\frac{1}{2}=2^{-1}\)
\(H(Y)=-P(Y=Green) \times log_2(P(Y=Green)) - (1-P(Y=Green)) \times log_2(1-P(Y=Green))\)
\(H(Y)=-\frac {1}{2} \times log_2(\frac {1}{2})-(1-\frac {1}{2}) \times log_2(1-\frac {1}{2})=1\)
Assume we split the data at \(X<1.7\); then the numbers of observations are 4 and 6, respectively. The first set contains only Red, hence its entropy is zero. The second set has 5 Greens and 1 Red. The entropy of the second set is then,
\(H(Y|X \geq 1.7)=-\frac {1}{6} \times log_2(\frac {1}{6})-\frac {5}{6} \times log_2(\frac {5}{6})=0.65\)
\(H(Y \: split \: at \; X<1.7)= \frac {n(X<1.7)}{10} \times 0 + \frac {n(X\geq 1.7)}{10} \times 0.65\)
\(H(Y \: split \: at \; X<1.7)= \frac {4}{10} \times 0 + \frac {6}{10} \times 0.65=0.39\)
\(Information\:Gain(Y | X;1.7)= 1-0.39=0.61\)
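The same numbers can be checked in a few lines; a self-contained sketch for the toy data (entropy_of() is a helper written here, not a package function).

```r
# Entropy of the labels in a vector (helper written for this example).
entropy_of <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

x <- c(2.1, 1.1, 3, 0.5, 2.5, 1.5, 0.2, 5, 1.8, 4.2)
y <- c("Green", "Red", "Green", "Red", "Green",
       "Red", "Red", "Green", "Red", "Green")

h_before <- entropy_of(y)            # 1 bit
h_left   <- entropy_of(y[x < 1.7])   # 0 (only Red)
h_right  <- entropy_of(y[x >= 1.7])  # about 0.65
h_before - (4/10 * h_left + 6/10 * h_right)  # information gain: about 0.61
```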
Let’s start with the usual summary plot
Petal.Length and Petal.Width may be used to find the classification rules
Let’s start with the usual summary plot
Sepal.Length and Sepal.Width may be used to find the classification rules
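A minimal sketch of the corresponding classification tree, assuming the rpart package and the built-in iris data; with default settings the splits typically involve Petal.Length and Petal.Width.

```r
# Classification tree for the iris species (rpart defaults).
library(rpart)
tree_iris <- rpart(Species ~ ., data = iris)
print(tree_iris)                  # classification rules, read from the root downwards
plot(tree_iris); text(tree_iris)  # quick base-graphics view of the tree
```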
The process described above may produce good predictions on the training set, but is likely to overfit the data, leading to poor test set performance.
A smaller tree with fewer splits (that is, fewer regions \(R_1\),…,\(R_J\) ) might lead to lower variance and better interpretation at the cost of a little bias.
One possible alternative to the process described above is to grow the tree only so long as the decrease in the RSS due to each split exceeds some (high) threshold.
This strategy will result in smaller trees, but it is too short-sighted: a seemingly worthless split early in the tree might be followed by a very good split, that is, a split that leads to a large reduction in RSS later on.
Start with a large tree, \(T_0\), then prune it back to obtain a sub-tree.
Cost complexity pruning, also known as weakest link pruning, is used to do this, where the following objective function is minimized:
\[\sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat y_{R_m})^2 + \alpha |T|\] where \(\alpha\) is a non-negative tuning parameter, \(|T|\) is the number of terminal nodes, \(R_m\) is the region corresponding to the \(m^{th}\) terminal node, and \(\hat y_{R_m}\) is the mean of the training observations in \(R_m\).
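A hedged sketch of cost-complexity pruning with rpart (its cp parameter plays the role of a rescaled \(\alpha\)); the tuning rule below, choosing the minimum cross-validated error, is one common rule of thumb.

```r
# Grow a large tree T0, inspect the cost-complexity (cp) table,
# then prune back to the subtree chosen by cross-validation.
library(rpart)
library(MASS)                                          # Boston data, as above
big_tree <- rpart(medv ~ ., data = Boston,
                  control = rpart.control(cp = 0.001)) # grow a large tree T0
printcp(big_tree)                                      # cp values with cross-validated errors
best_cp  <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
pruned   <- prune(big_tree, cp = best_cp)              # weakest-link pruning at the chosen cp
```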
Simple to understand
For both numerical and categorical data
Not a black-box model; easy to interpret
Data preparation is not a problem (may require little preparation)
Close to human decision making
…
Can be very sensitive to small changes in features
Easily overfits the data (can create very complex trees)
May be biased by the distributions of the input variables (search for conditional inference trees on the internet)