Decision Tree

I. Ozkan

Spring 2025

Objectives

Reading Suggestions

Book Chapters

Internet Resources

Keywords

Example: Titanic Data (Classification Tree)

Classification Tree Example (FFTrees Package)
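A minimal sketch of how such a tree might be fit with the FFTrees package; the formula and the use of the titanic data bundled with the package are assumptions, not the exact call behind the slide:

library(FFTrees)                      # fast-and-frugal classification trees

fft_fit <- FFTrees(survived ~ .,      # predict survival from the remaining columns
                   data = titanic)    # Titanic data shipped with FFTrees (assumed)
plot(fft_fit)                         # tree diagram and accuracy statistics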

Example: Boston Housing Data


Linear Regression: Boston Housing Data


Call:
lm(formula = medv ~ ., data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.595  -2.730  -0.518   1.777  26.199 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288    
chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229    
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
black        9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7338 
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
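The output above can be reproduced with the call shown in its header (the Boston data ship with the MASS package):

library(MASS)                           # Boston housing data
summary(lm(medv ~ ., data = Boston))    # regress median home value on all predictors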

Example: Boston Housing Data (Regression Tree)
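A minimal sketch of a regression tree for the same data, assuming the rpart package (the exact package and settings behind the slide are not shown):

library(MASS)
library(rpart)                                   # CART-style trees (Breiman et al., 1984)

boston_tree <- rpart(medv ~ ., data = Boston, method = "anova")
print(boston_tree)                               # split rules and node predictions
boston_tree$variable.importance                  # importance implied by the splits
plot(boston_tree); text(boston_tree)             # simple tree diagram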

Example: Teaching Ratings

Data on course evaluations, course characteristics, and professor characteristics for 463 courses for the academic years 2000–2002 at the University of Texas at Austin. The data are available in the R AER package.
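A minimal sketch for loading the data (the data set is called TeachingRatings in AER):

library(AER)                 # Applied Econometrics with R companion package
data("TeachingRatings")      # 463 courses, UT Austin, 2000-2002
str(TeachingRatings)         # evaluation, beauty, gender, age, tenure, ...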


Decision Tree

Rule 1:

If rm < 6.9 and lstat >= 14 and crim >= 7 then medv = 12

Rule 2:

If rm >= 6.941 then medv = 45.1

… etc.

Importance:

Given this tree structure, rm is the most important variable, followed by the other splitting variables (lstat, crim, …).

The algorithm used to split the data is based on:

Breiman L., Friedman J. H., Olshen R. A., and Stone, C. J. (1984) Classification and Regression Trees. Wadsworth.

Splitting

Gini Impurity and Gini Gain

\(Gini \; Impurity= \sum_{i=1}^{S}p_i(1-p_i)=\sum_{i=1}^{S}(p_i-p_i^2)=1-\sum_{i=1}^{S}p_i^2\)

Let the attributes be \(z_j, \; j=1, \cdots,k\). The Gini gain for attribute \(z_j\) is given as

\(Gini \; Impurity\:(Before\;Split)-Weighted \; Gini \; Impurity \:(After \;Split \; z_j<z_j^*)\)
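Both quantities can be written as small helper functions. This is a sketch that simply mirrors the formulas above; the function names are my own:

# Gini impurity of a vector of class labels
gini_impurity <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Gini gain of splitting the labels y at x < threshold
gini_gain <- function(x, y, threshold) {
  left  <- y[x <  threshold]
  right <- y[x >= threshold]
  gini_impurity(y) -
    (length(left)  / length(y) * gini_impurity(left) +
     length(right) / length(y) * gini_impurity(right))
}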

Gini Impurity and Gini Gain

Ex: (5 Red and 5 Green)

\(X\) \(Y\)
2.1 Green
1.1 Red
3 Green
0.5 Red
2.5 Green
1.5 Red
0.2 Red
5 Green
1.8 Red
4.2 Green

\(Gini\:Impurity(y)=\frac {5}{10} \times (1-\frac {5}{10})+\frac {5}{10} \times (1-\frac {5}{10})=0.5\)

or

\(1- \Big(\big(\frac {5}{10}\big)^2+\big(\frac {5}{10}\big)^2\Big)=1-\frac {1}{2}=0.5\)

Gini Impurity and Gini Gain

Assume the data are split at \(X<1.7\); the numbers of observations are then 4 and 6 respectively. The first set contains only Red. The second set has 5 greens and 1 red.

\(Gini\:Impurity(Y \: split \: at \; X<1.7)= \frac {4}{10} \times 0 + \frac {6}{10} \times \big[\frac {1}{6}(1-\frac {1}{6})+\frac {5}{6}(1-\frac {5}{6}) \big]\)

\(Gini\:Impurity(Y \: split \: at \; X<1.7)= \frac {4}{10} \times 0 + \frac {6}{10} \times 0.278=0.167\)

\(Gini\:Gain(Y \: split \: at \; X<1.7)= 0.5-0.167=0.333\)

One of the best splits is \(X<2\), which divides the data so that each subset contains only Green or only Red; the Gini impurity after this split is therefore 0.
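Using the helper functions sketched earlier, the numbers in this example can be checked directly:

x <- c(2.1, 1.1, 3, 0.5, 2.5, 1.5, 0.2, 5, 1.8, 4.2)
y <- c("Green", "Red", "Green", "Red", "Green",
       "Red", "Red", "Green", "Red", "Green")

gini_impurity(y)        # 0.5   (before the split)
gini_gain(x, y, 1.7)    # 0.333 (split at X < 1.7)
gini_gain(x, y, 2)      # 0.5   (split at X < 2 leaves two pure subsets)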

Entropy, Conditional Entropy, Information Gain

\(H(X)=-\sum_{x_j \in S } p_j \: log(p_j)\), with the convention \(p_j \: log(p_j) \equiv 0\) for \(p_j=0\)

Another way of seeing this function:

\(H(X)=\sum_{x_j \in S } p_j \: log\big(\frac{1}{p_j}\big)=E\Big[log \: \frac{1}{p(X)}\Big]\)

Entropy, Conditional Entropy, Information Gain

Let the attributes be \(z_j, \; j=1, \cdots,k\). The Information Gain for attribute \(z_j\) is given as

\(IG(Y|z_j; z_j^*)=Entropy \:(Before\;Split)- Weighted \; Entropy \:(After \;Split \; z_j<z_j^*)\)

\(IG(Y|z_j; z_j^*)=H(Y) - \big(p(z_j< z_j^*) H(Y|z_j < z_j^*)+p(z_j \geq z_j^*) H(Y|z_j \geq z_j^*)\big)\)
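As with the Gini measures, entropy and information gain can be written as small helper functions; a sketch with my own function names, mirroring the formulas above:

# Entropy (base 2) of a vector of class labels
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]                 # convention: terms with p = 0 contribute 0
  -sum(p * log2(p))
}

# Information gain of splitting the labels y at x < threshold
info_gain <- function(x, y, threshold) {
  left  <- y[x <  threshold]
  right <- y[x >= threshold]
  entropy(y) -
    (length(left)  / length(y) * entropy(left) +
     length(right) / length(y) * entropy(right))
}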

Information Gain

Ex: (5 Red and 5 Green)

\(X\) \(Y\)
2.1 Green
1.1 Red
3 Green
0.5 Red
2.5 Green
1.5 Red
0.2 Red
5 Green
1.8 Red
4.2 Green

\(P(Y=Green)=P(Y=Red)=\frac{5}{10}=\frac{1}{2}=2^{-1}\)

\(H(Y)=-P(Y=Green) \times log_2(P(Y=Green)) - (1-P(Y=Green)) \times log_2(1-P(Y=Green))\)

Information Gain

\(H(Y)=-\frac {1}{2} \times log_2(\frac {1}{2})-(1-\frac {1}{2}) \times log_2(1-\frac {1}{2})=1\)

Assume the data are split at \(X<1.7\); the numbers of observations are then 4 and 6 respectively. The first set contains only Red, hence its entropy is zero. The second set has 5 greens and 1 red. The entropy of the second set is then

\(H(Y|X \geq 1.7)=-\frac {1}{6} \times log_2(\frac {1}{6})-\frac {5}{6} \times log_2(\frac {5}{6})=0.65\)

\(Entropy(Y \: split \: at \; X<1.7)= \frac {n(X<1.7)}{10} \times 0 + \frac {n(X\geq 1.7)}{10} \times 0.65\)

\(H(Y \: split \: at \; X<1.7)= \frac {4}{10} \times 0 + \frac {6}{10} \times 0.65=0.39\)

\(Information\:Gain(Y | X;1.7)= 1-0.39=0.61\)
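With the entropy helpers sketched earlier, the worked example can be verified:

x <- c(2.1, 1.1, 3, 0.5, 2.5, 1.5, 0.2, 5, 1.8, 4.2)
y <- c("Green", "Red", "Green", "Red", "Green",
       "Red", "Red", "Green", "Red", "Green")

entropy(y)              # 1    (before the split)
info_gain(x, y, 1.7)    # 0.61 (split at X < 1.7)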

Model Based Tree

Rule 1 (Node 6, with 69 Obs.):

If gender = female and age < 40 then \(eval=4+0.12 \times beauty+\varepsilon\)

..

etc.

*: These examples were produced using different R packages.
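One way to fit such a model-based tree is lmtree() from the partykit package; the set of partitioning variables below is an assumption about which course and professor characteristics were used:

library(AER)
library(partykit)                    # model-based recursive partitioning (lmtree)

data("TeachingRatings")
mob_fit <- lmtree(eval ~ beauty | gender + age + minority + native + tenure + division,
                  data = TeachingRatings)
mob_fit                              # a separate eval ~ beauty regression in each node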

Some Pros and Cons of Decision Trees

Simple Example using iris data (Classification Tree)
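A minimal sketch of the iris tree, assuming the rpart package:

library(rpart)

iris_tree <- rpart(Species ~ ., data = iris, method = "class")
print(iris_tree)                        # e.g. Petal.Length < 2.45 -> setosa
plot(iris_tree); text(iris_tree)        # simple tree diagram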

More Complex Classification Tree

Pruning Decision Tree

\(\sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat y_{R_m})^2 + \alpha |T|\)

where \(\alpha\) is a non-negative tuning parameter, \(|T|\) is the number of terminal nodes, \(R_m\) is the region corresponding to the \(m^{th}\) terminal node, and \(\hat y_{R_m}\) is the mean of the training observations in \(R_m\).
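In rpart the complexity parameter cp plays the role of \(\alpha\) (up to scaling), so cost-complexity pruning can be sketched as follows; the Boston data and the starting cp value are illustrative assumptions:

library(MASS)
library(rpart)

big_tree <- rpart(medv ~ ., data = Boston, cp = 0.001)   # deliberately deep tree
printcp(big_tree)                                         # cross-validated error per cp

# prune back to the cp with the smallest cross-validated error
best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(big_tree, cp = best_cp)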

Model Based Example: Demand for Economic Journals

Example: Demand for Economic Journals

Rules for Journals Data

Rule 1 (Young Journals, 53 Observations):

if \(age \leq 18\) then \(ln(subs)=4.35 -0.605 \times ln(\frac {price}{citations}) + \varepsilon_{young}\)

Rule 2 (Old Journals, 127 Observations):

if \(age > 18\) then \(ln(subs)=5.01 -0.403\times ln(\frac {price}{citations})+ \varepsilon_{old}\)
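These rules correspond to a model-based tree of the kind fit by partykit::lmtree(). A sketch under the assumption that the Journals data from AER are used, with age and price per citation constructed from the raw variables:

library(AER)
library(partykit)

data("Journals")
journals <- transform(Journals,
                      age       = 2000 - foundingyear,   # journal age in 2000
                      citeprice = price / citations)     # price per citation

journal_tree <- lmtree(log(subs) ~ log(citeprice) | age, data = journals)
journal_tree          # one regression of log(subs) on log(price/citations) per node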