I. Ozkan
Spring 2025
Book Chapters
Internet Resources
Nodes
Edges
Leaves, Terminal Nodes
Top-Down Induction of Trees
Regression Trees, Classification Trees (generic term: Classification And Regression Trees, CART)
Entropy, Conditional Entropy, Information Gain (not in great detail)
Decision Rules (Rule Based Systems)
Call:
lm(formula = medv ~ ., data = Boston)
Residuals:
    Min      1Q  Median      3Q     Max 
-15.595  -2.730  -0.518   1.777  26.199 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288    
chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229    
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
black        9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7338 
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
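The summary above can be reproduced with a call along the following lines; a minimal sketch, assuming the Boston housing data shipped with the MASS package.

```r
# Minimal sketch: reproduce the linear model summary above,
# assuming the Boston data from the MASS package.
library(MASS)                          # provides the Boston housing data
fit_lm <- lm(medv ~ ., data = Boston)  # regress medv on all remaining variables
summary(fit_lm)                        # coefficient table, R-squared, F-statistic
```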
Data on course evaluations, course characteristics, and professor characteristics for 463 courses for the academic years 2000–2002 at the University of Texas at Austin. The data are available from the R AER package.
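A minimal sketch for loading these data, assuming they are the TeachingRatings data set provided by the AER package.

```r
# Load the course-evaluation data (assumed to be the TeachingRatings
# data set of the AER package).
# install.packages("AER")  # if not already installed
data("TeachingRatings", package = "AER")
str(TeachingRatings)       # course, instructor and evaluation variables
```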
A tree-like structure of the data
The algorithm creates regions that are easily converted to rules
- Rules are read from the top down, starting with the root node.
- The order from top to bottom also indicates the relative importance of the variables.
The above examples are, respectively, a Classification Tree (Titanic), a Regression Tree (Boston) and a Model-Based Tree (Teaching Ratings).
A Classification Tree is a tree whose outcome is a categorical label
A Regression Tree is a tree whose outcome is a real number
A Model-Based Tree is a tree whose terminal nodes contain fitted models
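As an illustration of the third type, here is a hedged sketch of a model-based tree fitted with partykit::lmtree on the Teaching Ratings data; the regressor (beauty) and the partitioning variables chosen below are illustrative assumptions, not necessarily the specification used for the figure above.

```r
# Hedged sketch of a model-based tree: each terminal node contains its own
# linear model of evaluation score on beauty; the partitioning variables
# listed after "|" are illustrative choices.
library(partykit)
data("TeachingRatings", package = "AER")
mb_tree <- lmtree(eval ~ beauty | gender + minority + tenure + division,
                  data = TeachingRatings)
plot(mb_tree)  # tree with a fitted model in every leaf
```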
Rule 1:
If rm < 6.941 and lstat >= 14.4 and crim >= 6.99237 then medv = 11.98
Rule 2:
If rm >= 6.941 then medv = 45.1
… etc.
Importance:
rm is the most important variable, lstat is the second most important variable, ..., given this tree structure
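A minimal sketch of how such a tree and the rules above can be obtained, assuming the rpart package and the MASS Boston data; with default settings the first splits are typically on rm and lstat, as in the rules shown.

```r
# Minimal sketch: regression tree for Boston house prices (rpart defaults).
library(rpart)
library(MASS)
tree_boston <- rpart(medv ~ ., data = Boston)
print(tree_boston)               # splits and leaf means, i.e. the rules above
tree_boston$variable.importance  # importance ranking implied by the tree
```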
The algorithm used to split the data is based on:
Breiman L., Friedman J. H., Olshen R. A., and Stone, C. J. (1984) Classification and Regression Trees. Wadsworth.
For Classification and Regression Trees,
in order to split the data set, either
Gini Gain (based on Gini Impurity) or
Information Gain is used
For Model-Based Trees, splits are typically chosen using statistical tests of parameter instability (model-based recursive partitioning)
Assume that there are \(S\) classes. The probability of picking an observation from class \(i\) is \(p_i, \; i=1,\cdots, S\).
Gini Impurity is then:
\(Gini \; Impurity= \sum_{i=1}^{S}p_i(1-p_i)=\sum_{i=1}^{S}(p_i-p_i^2)=1-\sum_{i=1}^{S}p_i^2\)
Gini Impurity results in a value of 0 if the data set contains only one class (\(p_i=1\)).
Gini Gain measures how much the Gini Impurity is reduced by splitting the data on an attribute.
Let the attributes be \(z_j, \; j=1, \cdots,k\). The Gini Gain for attribute \(z_j\) with split point \(z_j^*\) is given as,
\(Gini \; Impurity\:(Before\;Split)-Weighted \; Gini \; Impurity \:(After \;Split \; z_j<z_j^*)\)
Then the best split is obtained by maximizing the Gini Gain.
Splitting the data set continues until no further gain is possible.
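The computation can be written down in a few lines; a minimal sketch with helper functions written for this example (not from any package), assuming a numeric attribute split at a threshold of the form \(x < x^*\).

```r
# Helper functions (written for illustration, not from a package):
# Gini impurity of a label vector and Gini gain of a split x < threshold.
gini_impurity <- function(y) {
  p <- table(y) / length(y)     # class proportions p_i
  1 - sum(p^2)                  # 1 - sum(p_i^2)
}

gini_gain <- function(x, y, threshold) {
  left  <- y[x <  threshold]
  right <- y[x >= threshold]
  weighted <- length(left)  / length(y) * gini_impurity(left) +
              length(right) / length(y) * gini_impurity(right)
  gini_impurity(y) - weighted   # impurity before minus weighted impurity after
}
```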
Ex: (5 Red and 5 Green)
| \(X\) | \(Y\) | 
|---|---|
| 2.1 | Green | 
| 1.1 | Red | 
| 3 | Green | 
| 0.5 | Red | 
| 2.5 | Green | 
| 1.5 | Red | 
| 0.2 | Red | 
| 5 | Green | 
| 1.8 | Red | 
| 4.2 | Green | 
\(Gini\:Impurity(y)=\frac {5}{10} \times (1-\frac {5}{10})+\frac {5}{10} \times (1-\frac {5}{10})=0.5\)
or
\(1- \Big(\big(\frac {5}{10}\big)^2+\big(\frac {5}{10}\big)^2\Big)=1-\frac {1}{2}=0.5\)
Assume we split the data at \(X<1.7\); then the numbers of observations are 4 and 6, respectively. The first set contains only Red. The second set has 5 Greens and 1 Red.
\(Gini\:Impurity(Y \: split \: at \; X<1.7)= \frac {4}{10} \times 0 + \frac {6}{10} \times \big[\frac {1}{6}(1-\frac {1}{6})+\frac {5}{6}(1-\frac {5}{6}) \big]\)
\(Gini\:Impurity(Y \: split \: at \; X<1.7)= \frac {4}{10} \times 0 + \frac {6}{10} \times 0.278=0.167\)
\(Gini\:Gain(Y \: split \: at \; X<1.7)= 0.5-0.167=0.333\)
One of the best splits is \(X<2\), which divides the data so that each subset contains only Green or only Red; the weighted Gini Impurity after this split is therefore 0
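Applying these helpers to the toy data confirms the numbers above (a sketch, assuming gini_impurity() and gini_gain() as defined earlier).

```r
# Toy data: 5 Red and 5 Green observations, as in the table above.
x <- c(2.1, 1.1, 3, 0.5, 2.5, 1.5, 0.2, 5, 1.8, 4.2)
y <- c("Green", "Red", "Green", "Red", "Green",
       "Red", "Red", "Green", "Red", "Green")

gini_impurity(y)      # 0.5
gini_gain(x, y, 1.7)  # 0.5 - 0.167 = 0.333
gini_gain(x, y, 2)    # 0.5 - 0     = 0.5, a perfect split
```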
Information Size, Information Content, Uncertainty
Information *generated*, *stored* or *transmitted* as a variable
Information Entropy is the average rate at which information is produced by a stochastic source of data (Wikipedia).
\(H(X)=-\sum_{x_j \in S } p_j \: log(p_j), \:for \; p_j \neq 0, \: and \: log(p_j) \equiv 0 \: for \: p_j=0\)
Another way of seeing this function:
\(H(X)=\sum_{x_j \in S } p_j \: \log\big(\tfrac{1}{p_j}\big)=E\big[\log \tfrac{1}{p_j}\big]\)
The logarithm is just a transformation.
\(\frac{1}{p_j}\) increases as \(p_j\) decreases: a rare event has a large inverse probability, hence a high information content.
For \(p_j=0\) or \(p_j=1\) the entropy is \(0\) (certain cases); otherwise it increases towards the most uncertain case.
For example, flip a coin: \(P(head)+P(tail)=1\). Let \(P(head)=0.5\); then \(H(Flip \: Coin)=-0.5 \log_2(0.5)-0.5 \log_2(0.5)=1\). For all other cases (\(P(head) \neq 0.5\)) the entropy is less than \(1\).
Hence entropy is related to the uncertainty of the outcome
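A small numerical illustration of the entropy formula (base-2 logarithm), written for this example; entropy() below is a helper, not a library function.

```r
# Entropy of a discrete distribution given as a vector of probabilities,
# using the convention that terms with p = 0 contribute nothing.
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

entropy(c(0.5, 0.5))  # fair coin: 1 bit (maximum uncertainty)
entropy(c(0.9, 0.1))  # biased coin: about 0.47 bits
entropy(c(1, 0))      # certain outcome: 0 bits
```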
Let the attributes be \(z_j, \; j=1, \cdots,k\). The Information Gain for attribute \(z_j\) with split point \(z_j^*\) is given as,
\(IG(Y|z_j; z_j^*)=Entropy \:(Before\;Split)- Weighted \; Entropy \:(After \;Split \; z_j<z_j^*)\)
\(IG(Y|z_j; z_j^*)=H(Y) - \big(p(z_j< z_j^*) H(Y|z_j < z_j^*)+p(z_j \geq z_j^*) H(Y|z_j \geq z_j^*)\big)\)
Then the best split is obtained by maximizing the Information Gain.
Splitting the data set continues until no further gain is possible.
Ex: (5 Red and 5 Green)
| \(X\) | \(Y\) | 
|---|---|
| 2.1 | Green | 
| 1.1 | Red | 
| 3 | Green | 
| 0.5 | Red | 
| 2.5 | Green | 
| 1.5 | Red | 
| 0.2 | Red | 
| 5 | Green | 
| 1.8 | Red | 
| 4.2 | Green | 
\(P(Y=Green)=P(Y=Red)=\frac{5}{10}=\frac{1}{2}=2^{-1}\)
\(H(Y)=-P(Y=Green) \times log_2(P(Y=Green)) - (1-P(Y=Green)) \times log_2(1-P(Y=Green))\)
\(H(Y)=-\frac {1}{2} \times log_2(\frac {1}{2})-(1-\frac {1}{2}) \times log_2(1-\frac {1}{2})=1\)
Assume we split the data at \(X<1.7\); then the numbers of observations are 4 and 6, respectively. The first set contains only Red, hence its entropy is zero. The second set has 5 Greens and 1 Red. The entropy of the second set is then,
\(H(Y|X \geq 1.7)=-\frac {1}{6} \times log_2(\frac {1}{6})-\frac {5}{6} \times log_2(\frac {5}{6})=0.65\)
\(H(Y \: split \: at \; X<1.7)= \frac {n(X<1.7)}{10} \times 0 + \frac {n(X\geq 1.7)}{10} \times 0.65\)
\(H(Y \: split \: at \; X<1.7)= \frac {4}{10} \times 0 + \frac {6}{10} \times 0.65=0.39\)
\(Information\:Gain(Y | X;1.7)= 1-0.39=0.61\)
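The same numbers can be checked in a few lines; a self-contained sketch for the toy data (entropy_of() is a helper written here, not a package function).

```r
# Entropy of the labels in a vector (helper written for this example).
entropy_of <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

x <- c(2.1, 1.1, 3, 0.5, 2.5, 1.5, 0.2, 5, 1.8, 4.2)
y <- c("Green", "Red", "Green", "Red", "Green",
       "Red", "Red", "Green", "Red", "Green")

h_before <- entropy_of(y)            # 1 bit
h_left   <- entropy_of(y[x < 1.7])   # 0 (only Red)
h_right  <- entropy_of(y[x >= 1.7])  # about 0.65
h_before - (4/10 * h_left + 6/10 * h_right)  # information gain: about 0.61
```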
Let’s start with the usual summary plot
Petal.Length and Petal.Width may be used to find the classification rules
Let’s start with the usual summary plot
Sepal.Length and Sepal.Width may be used to find the classification rules
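A minimal sketch of the corresponding classification tree, assuming the rpart package and the built-in iris data; with default settings the splits typically involve Petal.Length and Petal.Width.

```r
# Classification tree for the iris species (rpart defaults).
library(rpart)
tree_iris <- rpart(Species ~ ., data = iris)
print(tree_iris)                  # classification rules, read from the root downwards
plot(tree_iris); text(tree_iris)  # quick base-graphics view of the tree
```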
The process described above may produce good predictions on the training set, but is likely to overfit the data, leading to poor test set performance.
A smaller tree with fewer splits (that is, fewer regions \(R_1\),…,\(R_J\) ) might lead to lower variance and better interpretation at the cost of a little bias.
One possible alternative to the process described above is to grow the tree only so long as the decrease in the RSS due to each split exceeds some (high) threshold.
This strategy will result in smaller trees, but it is too short-sighted: a seemingly worthless split early in the tree might be followed by a very good split, that is, a split that leads to a large reduction in RSS later on.
Start with a large tree, \(T_0\), then prune it back to obtain a sub-tree.
Cost complexity pruning, also known as weakest link pruning, is used to do this, where the following objective function is minimized:
\[\sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat y_{R_m})^2 + \alpha |T|\] where \(\alpha\) is a non-negative tuning parameter, \(|T|\) is the number of terminal nodes, \(R_m\) is the region corresponding to the \(m^{th}\) terminal node, and \(\hat y_{R_m}\) is the mean of the training observations in \(R_m\).
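A hedged sketch of cost-complexity pruning with rpart (its cp parameter plays the role of a rescaled \(\alpha\)); the tuning rule below, choosing the minimum cross-validated error, is one common rule of thumb.

```r
# Grow a large tree T0, inspect the cost-complexity (cp) table,
# then prune back to the subtree chosen by cross-validation.
library(rpart)
library(MASS)                                          # Boston data, as above
big_tree <- rpart(medv ~ ., data = Boston,
                  control = rpart.control(cp = 0.001)) # grow a large tree T0
printcp(big_tree)                                      # cp values with cross-validated errors
best_cp  <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
pruned   <- prune(big_tree, cp = best_cp)              # weakest-link pruning at the chosen cp
```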
Simple to understand
For both numerical and categorical data
Not a black-box model; easy to interpret
Data preparation is not a problem (may require little preparation)
Close to human decision making
…
Can be very sensitive to small changes in features
Easily overfits the data (can create very complex trees)
May be biased by the distributions of the input variables (search for conditional inference trees on the internet)