Machine Learning 08: Reinforcement, Semi-supervised, and Ensemble
 Reinforcement Learning: AlphaGo
 An Example: Self-driving Car
 Semi-supervised Learning
 NELL: Never-Ending Language Learner
 Resources on Semi-supervised Learning
 Ensemble Methods
 Halving Algorithm
 Weighted Majority Algorithm
 Boosting
Course notes from Professor Tom Mitchell's Machine Learning course @ CMU, 2017
Reinforcement Learning: AlphaGo
This is a guest lecture by Dr. Igor Labutov.
DeepMind, 2016
 Learning task:
 choose moves at arbitrary board states
 Training:
 initially supervised: trained to mimic 30 million expert book moves
 then reinforcement learning: played many games against itself
 Algorithm:
 reinforcement learning + neural network
 The Oct 2015 version used 1202 CPUs and 176 GPUs.
 Result:
 world-class Go player
DeepMind, 2017
 Mastering the Game of Go without Human Knowledge (link)
 Improvement:
 training without mimicking human moves
An Example: Self-driving Car

There is no (or not enough) labeled data.

The next state may depend on the previous state (both input and output).
Instead of minimizing a total loss, reinforcement learning aims to maximize total reward.
The last equation is worth unpacking term by term:

$$\mathrm{TR}_{0\ldots k}(S_0) = \max_{a_0} \sum_{i\in S} P(S_1 = i \mid S_0, a_0)\,\big(R(S_1 = i, S_0, a_0) + \mathrm{TR}_{1\ldots k}(i)\big)$$

 Consider the probability of a state transition: $P(S_1 = i \mid S_0, a_0)$
 Sum over all possible next states: $\sum_{i\in S}$
 Add the reward of taking this step: $R(S_1 = i, S_0, a_0)$
 Also maximize the total future reward: $\mathrm{TR}_{1\ldots k}$
 Finally, maximize over actions: $\max\{\ldots\}$
As the equation above shows, the total reward can be computed with a recursive function.
```python
import gym

# The notes do not name the environment; FrozenLake is assumed here
# because its env.P exposes the transition table used below.
env = gym.make('FrozenLake-v0')
cache = {}

def max_reward(s, k, k_max):
    if (s, k) in cache:
        return cache[s, k]
    if k == k_max:
        return 0.0
    # reward array for every possible action in the action space
    Q_sak = [0.0] * env.action_space.n
    for a in range(env.action_space.n):
        # P[s][a] lists (probability, next state, reward, done) tuples
        for prob, s_next, reward, _ in env.env.P[s][a]:
            Q_sak[a] += prob * (reward + max_reward(s_next, k + 1, k_max))
    max_Q = max(Q_sak)
    cache[s, k] = max_Q
    return max_Q

k_max = 10  # maximum number of steps
for s in range(env.observation_space.n):
    print(max_reward(s, 0, k_max))
```
Sometimes we want the process to run indefinitely rather than stopping after k_max steps.
That amounts to learning Q without knowing the state-transition probabilities.
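This is exactly what tabular Q-learning does: it updates Q from sampled transitions, never touching the transition probabilities. A minimal sketch on a toy 5-state chain MDP (the MDP and all hyperparameters below are invented for illustration, not from the lecture):

```python
import random

# Toy chain MDP: states 0..4, actions 0 = left, 1 = right;
# reaching state 4 yields reward 1 and ends the episode.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(s, a):
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == GOAL else 0.0
    return s_next, reward, s_next == GOAL

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.5, 0.9, 0.3   # step size, discount, exploration

random.seed(0)
for _ in range(500):                    # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.randrange(N_ACTIONS)
        else:
            a = max(range(N_ACTIONS), key=lambda ai: Q[s][ai])
        s_next, r, done = step(s, a)
        # move Q(s, a) toward the sampled target r + gamma * max_a' Q(s', a')
        # -- no transition probabilities are needed
        target = r + (0.0 if done else gamma * max(Q[s_next]))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next

# the learned greedy policy moves right from every non-goal state
policy = [max(range(N_ACTIONS), key=lambda ai: Q[s][ai]) for s in range(GOAL)]
print(policy)
```

The key contrast with the recursive computation above: there, `env.env.P` (the model) drives the expectation; here, the expectation is estimated implicitly from sampled transitions.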
Semi-supervised Learning
When can unlabeled data help supervised learning?
Unlabeled data is plentiful; labeled data is expensive. Examples:
 Medical outcomes (x = …, y = outcome)
 Text classification (x = document, y = relevance)
 Customer modeling (x = user actions, y = user intent)
 Sensor interpretation (x = …)
Problem setting (the PAC learning setting):
 Set X of instances drawn from unknown distribution P(X)
 Wish to learn target function f: X → Y (or P(Y|X))
 Given a set H of possible hypotheses for f
Given:
 i.i.d. labeled examples L = {<x_1, y_1>, ..., <x_m, y_m>}
 i.i.d. unlabeled examples U = {x_{m+1}, ..., x_{m+n}}
Wish to find the hypothesis with lowest true error:
\hat{f} ← argmin_{h∈H} Pr_{x∼P(X)}[h(x) ≠ f(x)]
Idea 1: Use U to reweight labeled examples
 Most learning algorithms minimize error over the labeled examples
 But we really want to minimize the true error
 If we knew the underlying distribution P(X), we could weight each labeled training example by its probability P(X = x)
 Unlabeled data allows us to estimate P(X)
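A small sketch of this idea over a discrete instance space (the data, the candidate hypotheses, and the histogram density estimate are all invented for illustration):

```python
from collections import Counter

# Hypothetical discrete instance space: x ∈ {0, ..., 9}.
# The plentiful unlabeled sample is used to estimate P(X).
unlabeled = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 8, 9]
p_hat = {x: c / len(unlabeled) for x, c in Counter(unlabeled).items()}

# Small labeled sample; each example is weighted by P_hat(X = x),
# so training error better reflects true error under P(X).
labeled = [(0, 1), (2, 1), (8, 0), (9, 0)]

def weighted_error(h):
    total = sum(p_hat.get(x, 0.0) for x, _ in labeled)
    wrong = sum(p_hat.get(x, 0.0) for x, y in labeled if h(x) != y)
    return wrong / total

# two candidate threshold hypotheses
h_a = lambda x: 1 if x < 5 else 0   # consistent with every label
h_b = lambda x: 1 if x < 9 else 0   # errs on the rare region x = 8
print(weighted_error(h_a), weighted_error(h_b))
```

Because x = 8 is rare under the estimated P(X), the error h_b makes there contributes little weight; with a different unlabeled sample the same labeled mistakes could matter much more.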
Idea 2: Use labeled and unlabeled data to train Bayes Net for P(X, Y)
Recall that in the Bayes Net part we discussed the EM algorithm, which can train a hypothesis even when some attribute values are missing:
Summary: Semi-supervised Learning with EM and a Naïve Bayes Model
 If all data are unlabeled, this corresponds to unsupervised, mixture-of-multinomials clustering
 If there are both labeled and unlabeled data, the unlabeled data help if the Bayes net modeling assumptions are correct (e.g., P(X) is a mixture of class-conditional multinomials with conditionally independent X_i)
 Of course we could use Bayes net models other than Naïve Bayes
 Can unlabeled data be useful even if the Bayes net makes no conditional independence assumptions?
 The weaker the assumptions, the less information we can extract from the unlabeled data.
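A compact sketch of semi-supervised EM with a Bernoulli (rather than multinomial) naive Bayes model, which keeps the code short; the tiny binary dataset is invented for illustration:

```python
import math

# Labeled (x, y) pairs and unlabeled x's; x is a binary feature vector.
labeled = [([1, 1, 0, 0], 0), ([1, 0, 1, 0], 0),
           ([0, 0, 1, 1], 1), ([0, 1, 0, 1], 1)]
unlabeled = [[1, 1, 1, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
n_feat, n_cls = 4, 2

def fit(examples):
    """M-step: estimate priors and per-feature probabilities with
    Laplace smoothing; examples pair each x with soft class weights."""
    prior = [1.0] * n_cls
    theta = [[1.0] * n_feat for _ in range(n_cls)]   # smoothing pseudo-counts
    totals = [2.0] * n_cls
    for x, w in examples:
        for y in range(n_cls):
            prior[y] += w[y]
            totals[y] += w[y]
            for j in range(n_feat):
                theta[y][j] += w[y] * x[j]
    z = sum(prior)
    return ([p / z for p in prior],
            [[t / totals[y] for t in theta[y]] for y in range(n_cls)])

def posterior(x, prior, theta):
    """E-step for one example: P(Y | X = x) under the current model."""
    logp = [math.log(prior[y]) + sum(
        math.log(theta[y][j] if x[j] else 1 - theta[y][j])
        for j in range(n_feat)) for y in range(n_cls)]
    m = max(logp)
    e = [math.exp(l - m) for l in logp]
    return [v / sum(e) for v in e]

# labeled examples keep hard (0/1) class weights throughout
hard = [(x, [1.0 if y == c else 0.0 for c in range(n_cls)]) for x, y in labeled]
prior, theta = fit(hard)                 # initialize from labeled data only
for _ in range(10):                      # EM iterations
    soft = [(x, posterior(x, prior, theta)) for x in unlabeled]   # E-step
    prior, theta = fit(hard + soft)                               # M-step

print([posterior(x, prior, theta) for x in unlabeled])
```

The unlabeled examples enter the parameter estimates through their fractional (soft) class weights, exactly the mechanism the summary above relies on.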
Idea 2.5: Similarly, use labeled and unlabeled data to train a Support Vector Machine
Transductive Support Vector Machines:
Optimize for the separator with the largest margin with respect to both the labeled and the unlabeled data.
Heuristic (Joachims), high-level idea:
 First maximize the margin over the labeled points
 Use the resulting separator to give initial labels to the unlabeled points
 Try flipping labels of unlabeled points to see if doing so increases the margin
 Keep going until no more improvements; this finds a locally-optimal solution.
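The three steps can be caricatured in one dimension, where the "separator" is just a threshold and the margin is half the gap between the classes (a real TSVM optimizes a hinge-loss objective; all data here are invented):

```python
labeled = [(0.0, -1), (10.0, 1)]
unlabeled = [4.5, 5.5]

def margin(points):
    neg = [x for x, y in points if y == -1]
    pos = [x for x, y in points if y == 1]
    return (min(pos) - max(neg)) / 2.0   # half the gap; < 0 if classes overlap

# Step 1: maximize the margin over the labeled points alone
threshold = (max(x for x, y in labeled if y == -1) +
             min(x for x, y in labeled if y == 1)) / 2.0
# Step 2: give the unlabeled points initial labels from this separator
guess = [(x, -1 if x < threshold else 1) for x in unlabeled]

# Step 3: flip unlabeled labels while a flip increases the margin
improved = True
while improved:
    improved = False
    for i, (x, y) in enumerate(guess):
        flipped = guess[:i] + [(x, -y)] + guess[i + 1:]
        if margin(labeled + flipped) > margin(labeled + guess):
            guess, improved = flipped, True

print(guess, margin(labeled + guess))
```

Here the unlabeled pair near the middle initially straddles the threshold (margin 0.5); flipping 4.5 to positive moves the whole gap to one side and widens the margin, after which no further flip helps — a locally-optimal stop, as the heuristic promises.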
Idea 3: Co-Training (Coupled Training)
 When learning f: X → Y, the available features of X are sometimes redundantly sufficient to predict Y; we can then train two classifiers on disjoint subsets of the features of X
 These two classifiers should agree on the classification of each unlabeled example
 Therefore, we can use the unlabeled data to constrain the joint training of both classifiers
Co-Training Algorithm
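A minimal sketch of the loop, with two invented views and a deliberately simple per-view classifier (predict the majority label previously seen for that view value; a view is "confident" when it has seen the value before):

```python
from collections import defaultdict

def train_view(pairs):
    """Majority-label lookup classifier for one view."""
    votes = defaultdict(lambda: [0, 0])
    for v, y in pairs:
        votes[v][y] += 1
    return {v: 0 if c[0] >= c[1] else 1 for v, c in votes.items()}

# Each instance is (x1, x2); either view alone suffices to predict y
# (here y = 1 iff x1 >= 5, equivalently iff x2 == 'b').
labeled = [((2, 'a'), 0), ((7, 'b'), 1)]
unlabeled = [(3, 'a'), (8, 'b'), (1, 'a'), (6, 'b')]

view1 = [(x1, y) for (x1, x2), y in labeled]
view2 = [(x2, y) for (x1, x2), y in labeled]

for _ in range(3):                       # a few co-training rounds
    g1, g2 = train_view(view1), train_view(view2)
    still = []
    for x1, x2 in unlabeled:
        if x2 in g2:                     # g2 is confident: it teaches g1
            view1.append((x1, g2[x2]))
        elif x1 in g1:                   # g1 is confident: it teaches g2
            view2.append((x2, g1[x1]))
        else:
            still.append((x1, x2))
    unlabeled = still

g1, g2 = train_view(view1), train_view(view2)
print(g1, g2)
```

In this toy run view 2 generalizes immediately (it has only two values), so its confident labels let the view-1 classifier cover x1 values it never saw labeled — the unlabeled data did the coupling.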
Co-Training Summary
 Unlabeled data improve supervised learning when the example features are redundantly sufficient
 A family of algorithms that train multiple classifiers
 Theoretical results: if X1 and X2 are conditionally independent given Y, then
 the target is PAC learnable from a weak initial classifier plus unlabeled data
 the disagreement between g1(x1) and g2(x2) bounds the final classifier's error
 Many real-world problems of this type:
 semantic lexicon generation [Riloff, Jones 99], [Collins, Singer 99]
 web page classification [Blum, Mitchell 98]
 word sense disambiguation
 speech recognition [de Sa, Ballard 98]
 visual classification of cars [Levin, Viola, Freund 03]
NELL: Never-Ending Language Learner
The task:
 run 24x7, forever
 each day:
 extract more facts from the web to populate the ontology
 learn to read (perform #1) better than yesterday
Inputs:
 initial ontology (categories and relations)
 a dozen examples of each ontology predicate
 the web
 occasional interaction with human trainers
Learning from Unlabeled Data in NELL
 Key Idea 1: coupled semi-supervised training of thousands of functions
 Key Idea 2: discover new coupling constraints
Full slides about NELL from this lecture: PDF, pp. 26-53
Resources on Semi-supervised Learning
 Semi-Supervised Learning, O. Chapelle, B. Schölkopf, and A. Zien (eds.), MIT Press, 2006. (book)
 Semi-Supervised Learning, Encyclopedia of Machine Learning, Jerry Zhu, 2010.
 EM for Naïve Bayes classifiers: K. Nigam et al., 2000. "Text Classification from Labeled and Unlabeled Documents using EM", Machine Learning, 39, pp. 103-134.
 Co-Training: A. Blum and T. Mitchell, 1998. "Combining Labeled and Unlabeled Data with Co-Training," Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT 1998).
 Never-Ending Learning, T. Mitchell et al., CACM, Dec. 2017.
 Model selection: D. Schuurmans and F. Southey, 2002. "Metric-Based Methods for Adaptive Model Selection and Regularization," Machine Learning, 48, 51-84.
Ensemble Methods
 Key idea: k hypotheses might be better than one
 Examples:
 Candidate elimination algorithm (aka Halving Algorithm)
 Bayes optimal classifier
 Weighted Majority algorithm
 AdaBoost
 Decision forests
Halving Algorithm
Candidate-Elimination Algorithm (aka Halving Algorithm)
 Initialize the version space VS ← H
 For each training example:
 remove from VS every hypothesis that misclassifies it
Note that the remaining candidate hypotheses have zero training error.
Classify a new example x:
 every hypothesis remaining in VS votes equally
 y ← majority vote
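The algorithm can be sketched with an invented hypothesis class of threshold functions on the integers 0..10 (so |H| = 11 and the mistake bound below is log2 11 ≈ 3.46):

```python
# H: h_t(x) = 1 iff x >= t, for thresholds t = 0..10 (illustrative).
H = [lambda x, t=t: int(x >= t) for t in range(11)]
VS = list(H)                           # version space starts as all of H

def predict(x):
    votes = sum(h(x) for h in VS)
    return int(votes * 2 >= len(VS))   # majority vote of the version space

mistakes = 0
# training stream labeled by the target concept "threshold at 6"
for x, y in [(2, 0), (8, 1), (5, 0), (7, 1), (6, 1)]:
    if predict(x) != y:
        mistakes += 1
    VS = [h for h in VS if h(x) == y]  # drop every inconsistent hypothesis

print(len(VS), mistakes)
```

On this stream the majority vote errs once (on x = 5), well inside the log2 |H| bound, and the version space collapses to the single correct threshold.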
Mistake Bounds: Halving Algorithm
Consider the Halving Algorithm:
 learn the concept using the version space of the Candidate-Elimination Algorithm
 classify new instances by majority vote of the version space members
How many mistakes can it make before converging to the correct h?
 In the worst case: initially VS = H. On each mistake the majority of VS was wrong, so at least half the hypotheses are removed: |VS| ≤ |H|/2 after one mistake, and |VS| ≤ (1/2)^k |H| after k mistakes. This gives the upper bound
k ≤ log_2 |H|
 In the best case, no mistake is ever made, so k = 0.
Optimal Mistake Bounds
Conclusion: VC(C) is a lower bound on the optimal mistake bound.
Weighted Majority Algorithm
The Halving algorithm discards a hypothesis outright as soon as it misclassifies a single example. The Weighted Majority algorithm softens this: instead of setting a weight to 0, it penalizes each mistaken predictor by multiplying its weight by a constant β < 1.
Here q_0 is the total weight mass of the predictors that classify the example as 0 (and q_1 the mass for those predicting 1).
Proof:
Solving this inequality (from the slides) for k yields the mistake bound above.
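A minimal sketch of the update (the predictor pool, data stream, and β are invented; note β = 0 would recover the Halving algorithm's hard elimination):

```python
beta = 0.5

# Illustrative pool: constant-0, constant-1, parity, and anti-parity.
predictors = [lambda x: 0, lambda x: 1, lambda x: x % 2, lambda x: (x + 1) % 2]
w = [1.0] * len(predictors)

stream = [(x, x % 2) for x in range(20)]   # true labels follow parity
mistakes = 0
for x, y in stream:
    # q0 / q1: weight mass voting for label 0 / label 1
    q0 = sum(wi for wi, h in zip(w, predictors) if h(x) == 0)
    q1 = sum(wi for wi, h in zip(w, predictors) if h(x) == 1)
    pred = 0 if q0 >= q1 else 1
    if pred != y:
        mistakes += 1
    # penalize every predictor that erred (not only on ensemble mistakes)
    w = [wi * beta if h(x) != y else wi for wi, h in zip(w, predictors)]

print(w, mistakes)
```

The always-correct parity predictor keeps weight 1.0 while the always-wrong anti-parity predictor decays by β each round, so the weighted vote quickly locks onto the good predictor without ever discarding anyone outright.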
Boosting
 The Weighted Majority algorithm learns a weight for each given predictor/hypothesis
 It is an example of an ensemble method: it combines the predictions of multiple hypotheses
 Boosting learns the weights and also the hypotheses themselves
 It leads to one of the most popular learning methods in practice: decision forests
Key idea [Rob Schapire]:
 Use a learner that produces better-than-chance hypotheses h(x)
 Train it multiple times, on reweighted training examples
 Each time, upweight the incorrectly classified examples and downweight the correctly classified ones
 Final prediction: a weighted vote of the multiple h_i(x)
This idea is practically useful and theoretically interesting.
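The reweight-and-revote loop above can be sketched as AdaBoost with threshold stumps on a toy 1-D dataset (all data and constants are invented; no single stump can label this set correctly):

```python
import math

X = [0, 1, 2, 3, 4, 5, 6, 7]
y = [1, 1, 1, -1, -1, -1, 1, 1]            # needs an interval, not one threshold

def stumps():
    """All threshold stumps: predict s if x < t else -s."""
    for t in range(9):
        for s in (1, -1):
            yield lambda x, t=t, s=s: s if x < t else -s

def best_stump(w):
    """Weak learner: the stump with lowest weighted error."""
    best, best_err = None, 1.0
    for h in stumps():
        err = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
        if err < best_err:
            best, best_err = h, err
    return best, best_err

w = [1.0 / len(X)] * len(X)                # uniform example weights
ensemble = []                              # (alpha, h) pairs
for _ in range(3):                         # boosting rounds
    h, err = best_stump(w)
    alpha = 0.5 * math.log((1 - err) / max(err, 1e-12))
    ensemble.append((alpha, h))
    # upweight mistakes, downweight correct answers, renormalize
    w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
    z = sum(w)
    w = [wi / z for wi in w]

def predict(x):
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

print([predict(x) for x in X])
```

Each round the previous round's mistakes carry more weight, so the next stump must attend to them; after three rounds the weighted vote of three individually weak thresholds fits the +,+,+,-,-,-,+,+ pattern exactly.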
A toy example
Training Error
The detailed formula derivation: PDF
Actual Typical Run and Theory of Margins