Machine Learning 01: Decision Tree, Entropy, and Overfitting
Course notes from Professor Tom Mitchell's Machine Learning course @ CMU
Machine Learning is the study of algorithms that:
- improve their performance P
- at some task T
- with experience E
A well-defined learning task: <P, T, E>
Function Approximation and Decision Tree Learning
[Wikipedia] A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (the decision taken after computing all attributes).
- Set of possible instances X
- Unknown target function f : X -> Y
- Set of candidate hypotheses H = { h | h : X -> Y }
- Training examples {<x^(i), y^(i)>} of the unknown target function f
- Output: the hypothesis h ∈ H that best approximates the target function f
Note that for decision trees:
- Y must take discrete values
- each instance x in X is a feature vector x = <x1, x2, ..., xn>
- each hypothesis we get is a tree
How to generate a decision tree?
A top-down induction of decision trees, given by [ID3, C4.5, Quinlan], is a greedy method:
Main loop:
- A <- the "best" decision attribute for the next node
- Assign A as the decision attribute for node
- For each value of A, create a new descendant of node
- Sort training examples to the leaf nodes
- If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes
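The main loop above can be sketched in Python. This is a minimal sketch, not the course's reference implementation: the dataset format (lists of dicts) and function names are my own, and the "best attribute" criterion anticipates the entropy-based Information Gain developed in the next section.

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum p * log2(p) over the label values present."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Expected reduction in label entropy from splitting on attr."""
    n = len(labels)
    expected = 0.0
    for value in set(row[attr] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[attr] == value]
        expected += len(subset) / n * entropy(subset)
    return entropy(labels) - expected

def id3(rows, labels, attrs):
    """Greedy top-down induction: returns a nested-dict tree or a leaf label."""
    if len(set(labels)) == 1:
        return labels[0]                                   # perfectly classified: STOP
    if not attrs:
        return Counter(labels).most_common(1)[0][0]        # no attributes left: majority vote
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(row[best] for row in rows):           # one descendant per value of A
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attrs if a != best])
    return tree
```

On a toy dataset where `sky` alone determines the label, `id3` splits on `sky` once and stops.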
The current known best solution is to calculate Entropy.
Entropy measures how non-uniform the data are. The entropy of a random variable X is:

H(X) = -Σ_{i=1}^{n} P(X = x_i) log2 P(X = x_i)

The entropy of Y conditioned on a specific value X = x_i, and its expectation over X, are:

H(Y | X = x_i) = -Σ_{j=1}^{m} P(Y = y_j | X = x_i) log2 P(Y = y_j | X = x_i)

H(Y | X) = Σ_{i=1}^{n} P(X = x_i) H(Y | X = x_i)

Note that the n in the formula is the number of possible values of X.
Also, H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code).
The third formula above is the expectation of the conditional entropy.
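The relationship between these formulas can be checked numerically. A minimal sketch, assuming a small made-up joint distribution over a variable X with values {a, b} and a binary Y:

```python
import math

def H(dist):
    """Entropy in bits of a distribution given as {value: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# A made-up joint distribution P(X, Y): under X='a', Y is uniform; under X='b', Y is fixed.
joint = {('a', 0): 0.25, ('a', 1): 0.25, ('b', 0): 0.5, ('b', 1): 0.0}

px = {}
for (x, _), p in joint.items():
    px[x] = px.get(x, 0.0) + p                    # marginal P(X)

def cond_dist(x):
    """P(Y | X = x), obtained by renormalizing the joint."""
    return {y: p / px[x] for (xv, y), p in joint.items() if xv == x}

# H(Y|X) = sum_x P(X=x) * H(Y|X=x): the expectation of the conditional entropy.
H_Y_given_X = sum(px[x] * H(cond_dist(x)) for x in px)
print(H_Y_given_X)   # X='a' leaves Y uniform (1 bit), X='b' pins Y down (0 bits)
```

Here H(X) = 1 bit, while H(Y|X) = 0.5 bits: observing X halves the expected coding cost of Y.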
When applying Entropy and Information Gain to attribute choice while constructing a decision tree, we should choose the attribute that most decreases the non-uniformity of the labels.
We define Information Gain here as the mutual information I(A, Y) between the node's attribute A and the target label variable Y.
Information gain is the expected reduction in entropy of the target label variable Y for data sample S, due to sorting on attribute variable A:

Gain(S, A) = H_S(Y) - H_S(Y | A)
For the left tree above, we can compute the entropy of the parent node [29+, 35-] as well as of the children nodes [21+, 5-] and [8+, 30-]. Then we subtract the (weighted) children's entropy from the parent's, and we get the Information Gain of choosing attribute A1 as the attribute for the parent node. In the same way, we can calculate the Information Gain of choosing A2, and take the attribute with the larger Information Gain so that the tree is finished as soon as possible.
Which attribute should we choose in the example above? The left one, humidity.
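The A1 computation above can be checked directly. A small sketch; the split sizes [29+, 35-] -> [21+, 5-], [8+, 30-] are taken from the example:

```python
import math

def entropy2(pos, neg):
    """Entropy of a two-class sample given its positive/negative counts."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

parent = entropy2(29, 35)                        # entropy of the parent node [29+, 35-]
children = [(21, 5), (8, 30)]                    # the two children produced by A1
n = sum(p + q for p, q in children)
expected = sum((p + q) / n * entropy2(p, q) for p, q in children)
gain_A1 = parent - expected                      # Information Gain of A1, roughly 0.27 bits
print(round(gain_A1, 3))
```

The same `entropy2`/weighted-subtraction recipe applied to A2's split gives the competing gain to compare against.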
A common practice is to use Occam's razor: prefer the simplest hypothesis that fits the data.
Why is the simplest the best in most cases? An interesting idea mentioned in the Machine Learning book by [Zhihua Zhou] is that the simplest may not be the best; the best-matched one is the best. The different models obtained by machine learning training correspond to different inductive biases, i.e., what kind of model the algorithm considers a better fit for the data. In practice, the model whose inductive bias matches the problem itself gives better results, so it is meaningless to call an algorithm good without specifying the problem.
Another point in the book is also interesting: even Occam's razor itself is not easy to apply. There are situations where two different models have exactly the same number of parameters — then which one is better?
Big Map
Assume the input vector X has n attributes x, each of them boolean (only two possible values). The output value Y is also boolean. How many distinct instances x can we get? And how many possible decision trees (hypotheses) that fit the data can we get, as an upper bound?
2^n, and 2^(2^n).
This is the hypothesis space H = { f: X -> Y } in the big map. It is important and needs to be remembered.
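Both counts can be enumerated for a small n. A quick sketch with n = 2:

```python
from itertools import product

n = 2
# All distinct instances: one tuple of n boolean feature values each.
instances = list(product([0, 1], repeat=n))
# A boolean function f: X -> {0, 1} is one complete labeling of all 2^n instances,
# so the number of distinct functions is 2^(2^n).
functions = list(product([0, 1], repeat=len(instances)))
print(len(instances), len(functions))   # 4 16, i.e. 2^n and 2^(2^n)
```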
For the big picture:
Evolution takes care of it. Tom's hypothesis of evolution building smart brains: imagine that evolution wants to build a very smart brain, but it has a limit on brain size, so it cannot develop very complex sensors for us. Since "simple" is what it must work with, the combinations of simple sensors have to be able to represent what complex sensors would; only then is building a smart brain within limited space possible. It's kind of self-fulfilling.
Overfitting in Decision Tree
Consider adding a noisy training example #15:
Sunny, Mild, Normal, Strong, PlayTennis = No
Fitting this noisy example forces extra splits into the tree: overfitting.
Consider a hypothesis h and its
- error rate over the training data: error_train(h)
- true error rate over all data: error_true(h)
We say h overfits the training data if error_true(h) > error_train(h).
Amount of overfitting = error_true(h) - error_train(h)
Avoid Overfitting
- Stop growing when the data split is not statistically significant
- Grow the full tree, then prune it
Reduced-Error Pruning
- Split data into a training set and a validation set
- Learn a tree that classifies the training set correctly
- Do until further pruning is harmful:
  - For each non-leaf node, evaluate the impact on the validation set of converting it to a leaf node
  - Greedily select the node whose conversion most improves validation set accuracy, and convert it to a leaf
- This produces the smallest version of the most accurate (over the validation set) subtree
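A minimal sketch of this procedure, assuming the nested-dict tree representation `{attr: {value: subtree-or-label}}`. As a simplification, the majority of leaf labels under a node stands in for the majority label of the training examples reaching it:

```python
import copy

def classify(tree, row):
    """Walk a nested-dict tree {attr: {value: subtree-or-label}} to a leaf label."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]   # assumes every value was seen in training
    return tree

def accuracy(tree, rows, labels):
    return sum(classify(tree, r) == y for r, y in zip(rows, labels)) / len(labels)

def internal_paths(tree, path=()):
    """Yield the (attr, value, attr, value, ...) path to every non-leaf node."""
    if isinstance(tree, dict):
        yield path
        attr = next(iter(tree))
        for value, sub in tree[attr].items():
            yield from internal_paths(sub, path + (attr, value))

def leaf_labels(tree):
    if not isinstance(tree, dict):
        yield tree
    else:
        attr = next(iter(tree))
        for sub in tree[attr].values():
            yield from leaf_labels(sub)

def pruned_copy(tree, path):
    """Copy of tree with the node at `path` collapsed to its majority leaf label."""
    new = copy.deepcopy(tree)
    node, parent = new, None
    for i in range(0, len(path), 2):
        parent, node = node, node[path[i]][path[i + 1]]
    labels = list(leaf_labels(node))
    leaf = max(set(labels), key=labels.count)       # majority label under the node
    if parent is None:
        return leaf                                 # pruned the root itself
    parent[path[-2]][path[-1]] = leaf
    return new

def reduced_error_prune(tree, val_rows, val_labels):
    """Greedily convert nodes to leaves while validation accuracy does not drop."""
    while True:
        base = accuracy(tree, val_rows, val_labels)
        candidates = [pruned_copy(tree, p) for p in internal_paths(tree)]
        best = max(candidates, default=None,
                   key=lambda t: accuracy(t, val_rows, val_labels))
        if best is None or accuracy(best, val_rows, val_labels) < base:
            return tree                             # further pruning is harmful: stop
        tree = best
```

For example, a subtree that only memorizes noise gets collapsed when the validation set disagrees with it.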
The success of pruning further supports the claim that small trees generalize better than large ones.
Decision Forests
- Learn a collection of many trees
- Classify by taking a weighted vote of the trees
Empirically successful and widely used in industry:
- human pose recognition in Microsoft Kinect
- medical imaging — cortical parcellation
- classifying disease from gene expression data
To make the trees differ:
- train on different random subsets of the data
- randomize the choice of decision nodes
This methodology is effective at overcoming overfitting.
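The two randomization ideas can be sketched in a toy ensemble. Real decision forests grow full trees; this illustrative sketch uses one-level stumps for brevity, and all names and the dataset format are my own:

```python
import random
from collections import Counter

def train_forest(rows, labels, n_trees=25, seed=0):
    """Train a collection of simple 'trees' (here: one-level decision stumps),
    each on a bootstrap sample with a randomly chosen decision attribute."""
    rng = random.Random(seed)
    attrs = list(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # random subset (with replacement)
        attr = rng.choice(attrs)                         # randomized decision-node choice
        votes = {}
        for i in idx:
            votes.setdefault(rows[i][attr], Counter())[labels[i]] += 1
        # the stump maps each seen attribute value to the majority label there
        forest.append((attr, {v: c.most_common(1)[0][0] for v, c in votes.items()}))
    return forest

def predict(forest, row):
    """Classify by an (unweighted) vote of the trees."""
    ballots = Counter(stump[row[attr]] for attr, stump in forest
                      if row[attr] in stump)
    return ballots.most_common(1)[0][0]
```

Because each stump sees a different resample, their errors decorrelate and the vote smooths out individual overfitting.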
You should know
- Well-posed function approximation problems:
  - Instance space, X
  - Sample of labeled training data { <x^(i), y^(i)> }
  - Hypothesis space, H = { f: X -> Y }
- Learning is a search/optimization problem over H
  - Various objective functions to define the goal
    - minimize training error (0-1 loss)
    - minimize validation error (0-1 loss)
    - among hypotheses that minimize error, select the smallest (?)
- Decision tree learning
  - Greedy top-down learning of decision trees (ID3, C4.5, ...)
  - Overfitting and post-pruning
  - Extensions to continuous values, probabilistic classification
  - Widely used commercially: decision forests
Probabilistic Function Approximation
Instead of learning F: X -> Y, learn P(Y|X).
Informally, A is a random variable if A denotes something about which we are uncertain, perhaps the outcome of a randomized experiment.
Examples:
- A = True if a randomly drawn person from our class is female
- A = the hometown of a randomly drawn person from our class
- A = True if two randomly drawn persons from our class have the same birthday
Define P(A) as "the fraction of possible worlds in which A is true", or "the fraction of times A holds, in repeated runs of the random experiment".
- the set of possible worlds is called the sample space, S
- a random variable A is a function defined over S: A: S -> {0, 1}
More formally, we have:
- a sample space S (e.g., the set of students in our class), aka the set of possible worlds
- a random variable, a function defined over the sample space:
  - Gender: S -> { m, f }
  - Height: S -> Reals
- an event, a subset of S:
  - e.g., the subset of S for which Gender = f
  - e.g., the subset of S for which (Gender = m) AND (Height > 2m)
- we're often interested in probabilities of specific events
- and of specific events conditioned on other specific events
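The "fraction of possible worlds" view can be made concrete. A sketch with a made-up roster as the sample space; events are predicates picking out subsets of S:

```python
from fractions import Fraction

# Toy sample space: (gender, height_cm) for a hypothetical six-person roster.
S = [('f', 160), ('f', 172), ('m', 205), ('m', 180), ('m', 168), ('f', 201)]

def P(event):
    """Probability of an event = fraction of possible worlds where it holds."""
    return Fraction(sum(1 for w in S if event(w)), len(S))

print(P(lambda w: w[0] == 'f'))                    # Gender = f
print(P(lambda w: w[0] == 'm' and w[1] > 200))     # (Gender = m) AND (Height > 2m)
```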
The Axioms of Probability
- 0 <= P(A) <= 1
- P(True) = 1
- P(False) = 0
- P(A or B) = P(A) + P(B) - P(A and B)
The Chain Rule:
P(A ^ B) = P(A|B) P(B)
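Both the inclusion-exclusion axiom and the chain rule can be verified on a small joint distribution. The numbers below are made up for illustration; fractions keep the arithmetic exact:

```python
from fractions import Fraction

# A made-up joint distribution P(A, B) over two binary events.
joint = {(True, True): Fraction(1, 8), (True, False): Fraction(3, 8),
         (False, True): Fraction(3, 8), (False, False): Fraction(1, 8)}

P_A = joint[(True, True)] + joint[(True, False)]     # marginal P(A)
P_B = joint[(True, True)] + joint[(False, True)]     # marginal P(B)
P_A_given_B = joint[(True, True)] / P_B              # P(A|B) by renormalization

# Chain rule: P(A ^ B) = P(A|B) P(B)
assert P_A_given_B * P_B == joint[(True, True)]

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B)
P_A_or_B = sum(p for (a, b), p in joint.items() if a or b)
assert P_A_or_B == P_A + P_B - joint[(True, True)]
```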