Machine Learning 08: Reinforcement, Semi-supervised, and Ensemble
- Reinforcement Learning: AlphaGo
- An example: Self-driving Car
- Semi-supervised Learning
- NELL: Never-Ending Language Learner
- Resources of Semi-supervised
- Ensemble Methods
- Halving Algorithm
- Weighted Majority Algorithm
- Boosting
Course Notes of Professor Tom Mitchell Machine Learning Course @ CMU, 2017
Reinforcement Learning: AlphaGo
This is a guest lecture by Dr. Igor Labutov.
Deep Mind, 2016
- Learning task:
- choose moves at arbitrary board states
- Training:
- initially supervised: trained to mimic 30 million expert book moves
- then reinforcement learning: played many games against itself
- Algorithm:
- reinforcement learning + neural network
- Oct 2015 version used: 1202 CPUs, 176 GPUs.
- Result:
- World-class Go player
Deep Mind, 2017
- Mastering the game of Go without human knowledge, Link
- Improvement:
- Training without mimicking human play
An example: Self-driving Car
- No (or not enough) labeled data;
- The next state may depend on the previous state (both input and output).
Instead of minimizing the total loss, reinforcement learning aims to maximize rewards.
The last equation is very interesting:
TR_{0...k}(S_0) = max_{a_0} { \sum_{i∈S} P(S_1 = i | S_0, a_0) [ R(S_1 = i, S_0, a_0) + TR_{1...k}(i) ] }
- P(S_1 = i | S_0, a_0): the probability of the state transition;
- \sum_{i∈S}: sum over all possible next states;
- R(S_1 = i, S_0, a_0): the reward of taking this step;
- TR_{1...k}: the future total rewards, which we also intend to maximize;
- max{...}: maximize over all actions in the end.
As the equation above shows, the total reward can be implemented as a recursive function.
import gym

env = gym.make('FrozenLake-v0')  # any small discrete environment exposing env.env.P works
cache = {}

def max_reward(s, k, k_max):
    if (s, k) in cache:
        return cache[s, k]
    if k == k_max:
        return 0.0
    Q_sak = [0.0] * env.action_space.n  # reward array for every possible action in the action space
    for a in range(env.action_space.n):
        for prob, s_next, reward, _ in env.env.P[s][a]:
            Q_sak[a] += prob * (reward + max_reward(s_next, k + 1, k_max))
    max_Q = max(Q_sak)
    cache[s, k] = max_Q
    return max_Q

k_max = 10  # maximum number of steps
for s in range(env.observation_space.n):
    print(max_reward(s, 0, k_max))
In some situations we want the process to go on forever instead of limiting it to k_max steps.
This amounts to learning Q without knowing the state-transition probabilities.
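The idea above can be sketched with tabular Q-learning on a toy chain MDP (a hypothetical 5-state environment invented here for illustration, not the gym environment used earlier); it learns Q purely from sampled transitions, without ever reading the transition probabilities:

```python
import random

# Toy 5-state chain MDP: states 0..4, actions 0 (left) / 1 (right).
# Entering state 4 yields reward 1 and ends the episode.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    reward = 1.0 if s_next == GOAL else 0.0
    return s_next, reward, s_next == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    random.seed(seed)
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.randrange(N_ACTIONS)
            else:
                a = max(range(N_ACTIONS), key=lambda ai: Q[s][ai])
            s_next, r, done = step(s, a)
            # update uses only the sampled transition -- no P(s'|s,a) needed
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q
```

The discount factor gamma replaces the hard k_max cutoff, so the process can in principle run forever while the total reward stays finite.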
Semi-supervised Learning
When can unlabeled data help supervised learning?
Unlabeled data is plentiful; labeled data is expensive.
- Medical outcomes (x=, y=outcome)
- Text classification (x=document, y=relevance)
- Customer modeling (x=user actions, y=user intent)
- Sensor interpretation (x=
Problem setting (the PAC learning setting):
- Set X of instances drawn from unknown distribution P(X)
- Wish to learn target function f: X → Y (or, P(Y|X))
- Given a set H of possible hypotheses for f
Given:
- i.i.d. labeled examples
L = {<x1, y1> ... <x_m, y_m>}
- i.i.d. unlabeled examples
U = {x_{m+1}, ... x_{m+n}}
Wish to find hypothesis with lowest true error:
f̂ ← argmin_{h∈H} Pr_{x∼P(X)}[h(x) ≠ f(x)]
Idea 1: Use U to reweight labeled examples
- Most learning algorithms minimize errors over labeled examples
- But we really want to minimize true error
- If we knew the underlying distribution P(X), we could weight each labeled training example by its probability P(X=x)
- Unlabeled data allows us to estimate P(X)
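A minimal sketch of Idea 1, assuming a discrete feature space (the helper `reweight_labeled` is a name invented here, not from the lecture):

```python
from collections import Counter

def reweight_labeled(labeled_xs, unlabeled_xs):
    """Weight each labeled example by P(X=x) estimated from the unlabeled pool.

    Discrete x for simplicity; the returned weights can then multiply each
    example's loss in any supervised learner.
    """
    n = len(unlabeled_xs)
    p_hat = Counter(unlabeled_xs)          # empirical counts of each x
    return [p_hat[x] / n for x in labeled_xs]

# 'a' occurs in 3/4 of the unlabeled pool, 'b' in 1/4
weights = reweight_labeled(['a', 'b'], ['a', 'a', 'a', 'b'])
```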
Idea 2: Use labeled and unlabeled data to train Bayes Net for P(X, Y)
Recall that in the Bayes Net part, we discussed the EM algorithm, which can train a hypothesis even when some attribute values are missing:
Summary : Semisupervised Learning with EM and Naïve Bayes Model
- If all data unlabeled, corresponds to unsupervised, mixture-of-multinomial clustering
- If both labeled and unlabeled data, then unlabeled data helps if the Bayes net modeling assumptions are correct (e.g., P(X) is a mixture of class-conditional multinomials with conditionally independent X_i )
- Of course we could use Bayes net models other than Naive Bayes
- Can unlabeled data be useful even if Bayes net makes no conditional independence assumptions?
- The weaker the assumption, the less information we can milk from the unlabeled data.
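A minimal sketch of semi-supervised EM with a two-class Bernoulli Naive Bayes model (binary features assumed; this is an illustrative implementation, not the lecture's exact pseudocode):

```python
import numpy as np

def semi_supervised_nb(X_l, y_l, X_u, n_iter=20, eps=1e-6):
    """EM for a 2-class Bernoulli Naive Bayes using labeled + unlabeled data.

    X_l, X_u: binary feature matrices; y_l: 0/1 labels.
    Returns class priors and per-class feature probabilities theta.
    """
    R_l = np.eye(2)[y_l]                    # one-hot responsibilities, fixed for labeled data
    R_u = np.full((len(X_u), 2), 0.5)       # uniform initial guess for unlabeled data
    X = np.vstack([X_l, X_u]).astype(float)
    X_u = np.asarray(X_u, dtype=float)
    for _ in range(n_iter):
        R = np.vstack([R_l, R_u])
        # M-step: re-estimate parameters from expected counts (eps smooths away 0/1)
        prior = R.sum(axis=0) / len(X)
        theta = (R.T @ X + eps) / (R.sum(axis=0)[:, None] + 2 * eps)
        # E-step: recompute responsibilities for the unlabeled data only
        log_p = np.log(prior) + X_u @ np.log(theta.T) + (1 - X_u) @ np.log(1 - theta.T)
        log_p -= log_p.max(axis=1, keepdims=True)
        p = np.exp(log_p)
        R_u = p / p.sum(axis=1, keepdims=True)
    return prior, theta
```

The unlabeled examples enter the M-step through their soft responsibilities, which is exactly how they sharpen the estimate of P(X, Y) when the modeling assumptions hold.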
Idea 2.5: Similarly use labeled and unlabeled data to train Support Vector Machine
Transductive Support Vector Machines:
Optimize for the separator with a large margin with respect to both the labeled and unlabeled data.
Heuristic (Joachims), high-level idea:
- First maximize the margin over the labeled points
- Use the resulting separator to give initial labels to the unlabeled points
- Try flipping labels of unlabeled points to see if doing so can increase the margin
- Keep going until no more improvements; this finds a locally-optimal solution
Idea 3: CoTraining, Coupled Training
- When learning f: X → Y, sometimes available features of X are redundantly sufficient to predict Y. We can then train two classifiers based on disjoint subsets of X
- Of course these two classifiers should agree on the classification for each unlabeled example
- Therefore, we can use the unlabeled data to constrain joint training of both classifiers
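A toy sketch of the co-training loop, assuming two categorical views and a trivial majority-vote base learner (all names and the base learner are illustrative, not from the lecture):

```python
def train_view(examples):
    """Base learner: map each view value to the majority label seen for it."""
    votes = {}
    for v, y in examples:
        votes.setdefault(v, []).append(y)
    return {v: max(set(ys), key=ys.count) for v, ys in votes.items()}

def co_train(labeled, unlabeled, rounds=5):
    """labeled: list of ((x1, x2), y); unlabeled: list of (x1, x2)."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        h1 = train_view([(x1, y) for (x1, _), y in labeled])  # classifier on view 1
        h2 = train_view([(x2, y) for (_, x2), y in labeled])  # classifier on view 2
        remaining = []
        for x1, x2 in pool:
            if x1 in h1:
                labeled.append(((x1, x2), h1[x1]))   # h1 labels it; h2 learns from it
            elif x2 in h2:
                labeled.append(((x1, x2), h2[x2]))   # h2 labels it; h1 learns from it
            else:
                remaining.append((x1, x2))
        pool = remaining
    h1 = train_view([(x1, y) for (x1, _), y in labeled])
    h2 = train_view([(x2, y) for (_, x2), y in labeled])
    return h1, h2

h1, h2 = co_train([(('a', 'p'), 1), (('b', 'q'), 0)], [('a', 'r'), ('c', 'r')])
```

In the usage example, h1 confidently labels ('a', 'r'), which teaches h2 that view value 'r' means class 1; h2 then labels ('c', 'r'), teaching h1 about 'c' — each view's confident predictions become training data for the other.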
CoTraining Algorithm
CoTraining Summary
- Unlabeled data improves supervised learning when example features are redundantly sufficient
- Family of algorithms that train multiple classifiers
- Theoretical results
- If X1,X2 conditionally independent given Y, Then
- PAC learnable from weak initial classifier plus unlabeled data
- disagreement between g1(x1) and g2(x2) bounds final classifier error
- Many real-world problems of this type
- Semantic lexicon generation [Riloff, Jones 99], [Collins, Singer 99]
- Web page classification [Blum, Mitchell 98]
- Word sense disambiguation
- Speech recognition [de Sa, Ballard 98]
- Visual classification of cars [Levin, Viola, Freund 03]
NELL: Never-Ending Language Learner
The task:
- run 24x7, forever
- each day:
- extract more facts from the web to populate the ontology
- learn to read (perform #1) better than yesterday
Inputs:
- initial ontology (categories and relations)
- a dozen examples of each ontology predicate
- the web
- occasional interaction with human trainers
Learning from Unlabeled Data in NELL
- Key Idea 1: Coupled semi-supervised training of 1000’s of functions
- Key Idea 2: Discover New Coupling Constraints
Full slides about NELL in this lecture: PDF p26-53
Resources of Semi-supervised
- Semi-Supervised Learning, O. Chapelle, B. Schölkopf, and A. Zien (eds.), MIT Press, 2006. (book)
- Semi-Supervised Learning, Encyclopedia of Machine Learning, Jerry Zhu, 2010.
- EM for Naïve Bayes classifiers: K. Nigam, et al., 2000. "Text Classification from Labeled and Unlabeled Documents using EM," Machine Learning, 39, pp. 103–134.
- CoTraining: A. Blum and T. Mitchell, 1998. "Combining Labeled and Unlabeled Data with Co-Training," Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT-98).
- Never Ending Learning, T. Mitchell et al., CACM, Dec. 2017.
- Model selection: D. Schuurmans and F. Southey, 2002. "Metric-Based Methods for Adaptive Model Selection and Regularization," Machine Learning, 48, 51–84.
Ensemble Methods
- Key idea: k hypotheses might be better than one
- Examples:
- Candidate elimination algorithm (aka Halving Algorithm)
- Bayes optimal classifier
- Weighted Majority algorithm
- AdaBoost
- Decision forests
Halving Algorithm
Candidate-Elimination Algorithm
Candidate Elimination Algorithm (aka Halving Algorithm)
- Initialize Version Space VS <— H
- For each training example:
- remove from VS every hypothesis that misclassifies it
Note: the remaining candidate hypotheses have zero training error
Classify new example x:
- every hypothesis remaining in VS votes equally
- y <— majority vote
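A minimal sketch of the halving algorithm over a hypothetical class of 1-D threshold hypotheses (the hypothesis class and target are invented for illustration):

```python
def halving(H, examples, queries):
    """Candidate elimination + majority vote.

    H: list of hypotheses (functions x -> bool); examples: (x, y) pairs.
    """
    VS = list(H)
    for x, y in examples:
        VS = [h for h in VS if h(x) == y]   # drop every inconsistent hypothesis
    # classify new instances by majority vote of the surviving hypotheses
    return [sum(h(x) for h in VS) * 2 > len(VS) for x in queries]

H = [lambda x, t=t: x >= t for t in range(9)]   # all thresholds t in 0..8
target = lambda x: x >= 5                        # hypothetical target concept
examples = [(x, target(x)) for x in (0, 4, 6)]   # leaves t in {5, 6} consistent
preds = halving(H, examples, [3, 5, 7])
```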
Mistake Bounds: Halving Algorithm
Consider the Halving Algorithm:
- Learn the concept using the version-space Candidate-Elimination Algorithm
- Classify new instances by majority vote of version-space members
How many mistakes are made before converging to the correct h?
- In the worst case: initially |VS| = |H|. Whenever a mistake is made, the majority of hypotheses voted wrong, so at least half of them are removed: |VS| ≤ (1/2)|H| after one mistake, and |VS| ≤ (1/2)^k |H| after k mistakes. Since at least one hypothesis (the correct one) always survives, the number of mistakes is bounded by k ≤ log₂|H|.
- In the best case: no mistake is ever made, so k = 0.
Optimal Mistake Bounds
Conclusion: VC(C) is a lower bound on the optimal mistake bound.
Weighted Majority Algorithm
The halving algorithm discards a hypothesis outright as soon as it misclassifies one example; the weighted majority algorithm improves on this by applying a penalty (multiplying the weight by a factor β < 1) instead of setting the weight to 0.
Here q_0 is the total weight mass voting to classify the example as 0 (and q_1 the mass voting for 1).
Proof:
Solving for the number of mistakes k in this inequality yields the bound above.
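A minimal sketch of the Weighted Majority algorithm, with the weight masses q_0 and q_1 computed as described above (β = 1/2 here; the predictors in the usage example are invented):

```python
def weighted_majority(predictors, stream, beta=0.5):
    """Multiply the weight of every predictor that errs by beta,
    instead of discarding it outright as the halving algorithm does.

    predictors: list of functions x -> 0/1; stream: (x, y) pairs.
    Returns the final weights and the number of mistakes made.
    """
    w = [1.0] * len(predictors)
    mistakes = 0
    for x, y in stream:
        q0 = sum(wi for wi, h in zip(w, predictors) if h(x) == 0)  # mass voting 0
        q1 = sum(wi for wi, h in zip(w, predictors) if h(x) == 1)  # mass voting 1
        pred = 1 if q1 >= q0 else 0
        if pred != y:
            mistakes += 1
        # penalise every predictor that was wrong on this example
        w = [wi * beta if h(x) != y else wi for wi, h in zip(w, predictors)]
    return w, mistakes

predictors = [lambda x: x % 2, lambda x: 0, lambda x: 1]   # one perfect, two constant
w, m = weighted_majority(predictors, [(x, x % 2) for x in range(10)])
```

The perfect predictor keeps weight 1.0 while each constant predictor, wrong on half of the 10 examples, decays to (1/2)^5.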
Boosting
- The Weighted Majority algorithm learns a weight for each given predictor/hypothesis
- It is an example of an ensemble method: it combines the predictions of multiple hypotheses
- Boosting learns the weights and also the hypotheses themselves
- Leads to one of the most popular learning methods in practice: Decision Forests
Key idea: [Rob Schapire]
- Use a learner that produces better-than-chance h(x)’s
- Train it multiple times, on reweighted training examples
- Each time, upweight the incorrectly classified examples, downweight the correctly classified examples
- Final prediction: a weighted vote of the multiple h_i(x)'s
This idea is practically useful and theoretically interesting.
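A compact AdaBoost sketch with 1-D threshold stumps as the weak learner (an illustrative implementation of the reweighting idea above, not Schapire's exact pseudocode):

```python
import math

def adaboost(xs, ys, n_rounds=10):
    """AdaBoost with threshold stumps on 1-D data; labels ys in {-1, +1}."""
    n = len(xs)
    D = [1.0 / n] * n                      # example weights
    ensemble = []                          # (alpha, threshold, sign) triples
    for _ in range(n_rounds):
        # weak learner: exhaustively pick the stump with lowest weighted error
        best = None
        for t in sorted(set(xs)):
            for sgn in (+1, -1):
                err = sum(d for d, x, y in zip(D, xs, ys)
                          if sgn * (1 if x >= t else -1) != y)
                if best is None or err < best[0]:
                    best = (err, t, sgn)
        err, t, sgn = best
        err = min(max(err, 1e-10), 1 - 1e-10)          # clamp to avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)        # hypothesis weight
        ensemble.append((alpha, t, sgn))
        # upweight incorrectly classified examples, downweight correct ones
        D = [d * math.exp(-alpha * y * sgn * (1 if x >= t else -1))
             for d, x, y in zip(D, xs, ys)]
        Z = sum(D)
        D = [d / Z for d in D]                         # renormalise
    def predict(x):
        s = sum(a * sgn * (1 if x >= t else -1) for a, t, sgn in ensemble)
        return 1 if s >= 0 else -1
    return predict

f = adaboost([0, 1, 2, 3, 4, 5], [-1, -1, -1, 1, 1, 1], n_rounds=5)
```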
A toy example
Training Error
The detailed formula derivation: PDF
Actual Typical Run and Theory of Margins