Hubert Wang

I am
Hubert Wang

Wechat Official Account
Find fun things here!

Machine Learning 08: Reinforcement, Semi-supervised, and Ensemble

Course Notes of Professor Tom Mitchell Machine Learning Course @ CMU, 2017

Reinforcement Learning: AlphaGo

This is a Guest lecture by Dr. Igor Labutov

Deep Mind, 2016

  • Learning task:
    • chose move at arbitrary board states
  • Training:
    • initially supervised: trained to mimic 30 millionexpert book moves
    • then reinforcement learning: played manygames against itself
  • Algorithm:
    • reinforcement learning + neural network
    • Oct 2015 version used: 1202 CPUs, 176 GPUs.
  • Result:
    • World-class Go player

Deep Mind, 2017

  • Mastering the game of Go without human knowledge, Link
  • Improvement:
    • Training without mimicing human

An example: Self-driving Car

What's the difference between predicting action in self-driving car and classifying image net?

  1. No (enough) labeled data;

  2. The next state may be relying on the previoud state (both input and output).

Instead of minizing the total loss, reinforcement learning intends to maximize rewards.

The equation at the last is very interesting:

  1. Consider the proability of state transit P(S_1 = i | S_0, a_0);
  2. Consider all possible next states \sum_{i∈S}
  3. Reward of taking this step: R(S_1 = i, S_0, a_0);
  4. Also intend to maximize future total rewards: TR_{1...k};
  5. Maximize all rewards in the end: max{...}.

As shown in the equation above, the implementation of the total rewards is actually a recursive function.

import gym
from code import interact

cache = {}
def max_reward(s, k, k_max):
    Q_sak = [ 0. ] * env.action_space.n # reward array for every possible action in action space
    if (s,k) in cache:
        return cache[s,k]
    if k = k_max:
        return 0.0
    for a in range(env.action_space.n):
        for prob, s_next, reward, _ = env.env.P[s][a]:
            Q_sak[a] += prob * (reward + max_reward(s_next, k+1, k_max))
    max_Q = max(Q_sak)
    cache[s,k] = max_Q
    return max_Q
k = 0 #current step
k_max = 10 #maximum steps
s0 = 0 #initial state

for s in range(env.observation_space.n):
    print max_reward(s, 0, k_max)

What's the problem?

There is situation that we want the process to go forward all the time instead of setting limit k_max steps.

What if we don't have P(Si|S{i-1}, a)?

It equals to choose a Q without know the probability of state transfer.

Semi-supervised Learning

When can unlabeled data help supervised learning?

unlabled data is plentiful, labeled data is expensive.

  • Medical outcomes (x=, y=outcome)
  • Text classification (x=document, y=relevance)
  • Customer modeling (x=user actions, y=user intent)
  • Sensor interpretation (x=, y=who’s there)

Problem setting (the PAC learning setting):

  • Set X of instances drawn from unknown distribution P(X)
  • Wish to learn target function f: X → Y (or, P(Y|X))
  • Given a set H of possible hypotheses for f


  • i.i.d. labeled examples L = {<x1, y1> ... <x_m, y_m>}
  • i.i.d. unlabeled examples U = {x_{m+1}, ... x_{m+n}}

Wish to find hypothesis with lowest true error:

f^ ← argmin_{h∈H} Pr[h(x)≠f(x)]

Note unlabeled data helps us estimate P(X).

Idea 1: Use U to reweight labeled examples

  • Most learning algorithms minimize errors over labeled examples
  • But we really want to minimize true error
  • If we know the underlying distribution P(X), we could weight each labeled training example by its probability according to P(X=x)
  • Unlabeled data allows us to estimate P(X)

For me, it is like to evaluate each labeled data's importance by the probability of happening, i.e. the frequently happening examples matter more in training. The unlabeled data helps to improve the frequency estimation.

Idea 2: Use labeled and unlabeled data to train Bayes Net for P(X, Y)

Recall that in the Bayes Net part, we've discussed EM algorithm which can training hypothesis even some attributes are missing:

Summary : Semisupervised Learning with EM and Naïve Bayes Model

  • If all data unlabeled, corresponds to unsupervised, mixture-of-multinomial clustering
  • If both labeled and unlabeled data, then unlabeled data helps if the Bayes net modeling assumptions are correct (e.g., P(X) is a mixture of class-conditional multinomials with conditionally independent X_i )
  • Of course we could use Bayes net models other than Naive Bayes
  • Can unlabeled data be useful even if Bayes net makes no conditional independence assumptions?
    • The weaker the assumption, the less information we can milk from the unlabeled data.

Idea 2.5: Similarly use labeled and unlabeled data to train Support Vector Machine

Transductive Support Vector Machines:

Optimize for the separator with large margin wrt labeled and unlabeled data.

Heuristic (Joachims) high level idea:

  • First maximize margin over the labeled points
  • Use this to give initial labels to unlabeled points based on this separator.
  • Try flipping labels of unlabeled points to see if doing so can increase margin

Keep going until no more improvements. Finds a locally-optimal solution.

Idea 3: CoTraining, Coupled Training

  • When learning f: X → Y, sometimes available features of X are redundantly sufficient to predict Y. We can then train two classifiers based on disjoint subsets of X
  • Of course these two classifiers should agree on the classification for each unlabeled example
  • Therefore, we can use the unlabeled data to constrain joint training of both classifiers

CoTraining Algorithm

CoTraining Summary

  • Unlabeled data improves supervised learning when example features are redundantly sufficient
    • Family of algorithms that train multiple classifiers
  • Theoretical results
    • If X1,X2 conditionally independent given Y, Then
    • PAC learnable from weak initial classifier plus unlabeled data
    • disagreement between g1(x1) and g2(x2) bounds final classifier error
  • Many real-world problems of this type
    • Semantic lexicon generation [Riloff, Jones 99], [Collins, Singer 99] – Web page classification [Blum, Mitchell 98]
    • Word sense disambiguation
    • Speech recognition [de Sa, Ballard 98]
    • Visual classification of cars [Levin, Viola, Freund 03]

NELL: Never-Ending Language Learner

The task:

  • run 24x7, forever
  • each day:

    1. extract more facts from the web to populate the ontology
    2. learn to read (perform #1) better than yesterday


  • initial ontology (categories and relations)
  • dozen examples of each ontology predicate
  • the web
  • occasional interaction with human trainers

Learning from Unlabeled Data in NELL

Coupled training of thousands of functions

  • Key Idea 1: Coupled semi-supervised training of 1000’s of functions
  • Key Idea 2: Discover New Coupling Constraints

Full slides about NELL in this lecture: PDF p26-53

Resources of Semi-supervised

  • Semi-Supervised Learning, O. Chapelle, B. Sholkopf, and A. Zien(eds.), MIT Press, 2006. (book)
  • Semi-Supervised Learning. Encyclopedia of Machine Learning. Jerry Zhu, 2010
  • EM for Naïve Bayes classifiers: K.Nigam, et al., 2000. "Text Classification from Labeled and Unlabeled Documents using EM", Machine Learning, 39, pp.103—134.
  • CoTraining: A. Blum and T. Mitchell, 1998. “Combining Labeled and Unlabeled Data with Co-Training,” Proceedings of the 11th Annual Conference on Computational Learning Theory(COLT-98).
  • Never Ending Learning, T. Mitchell et al., CACM, Dec. 2017.
  • Model selection: D. Schuurmans and F. Southey, 2002. “Metric-Based methods for Adaptive Model Selection and Regularization,” Machine Learning, 48, 51—84.

Ensemble Methods

  • Key idea: k hypotheses might be better than one
  • Examples:
    • Candidate elimination algorithm (aka Halving Algorithm)
    • Bayes optimal classifier
    • Weighted Majority algorithm
    • AdaBoost
    • Decision forests

Halving Algorithm

Candidate-Elimination Algorithm

Candidate Elimination Algorithm (aka Halving Algorithm)

  • Initialize Version Space VS <— H
  • For each training example
    • remove from VS every hypothesis that misclassifies

Note remaining candidate hypotheses have zero training error

Classify new example x:

  • every hypothesis remaining in VS votes equally
  • y <— majority vote

Mistake Bounds: Halving Algorithm

Consider the Halving Algorithm:

  • Learn concept using version space VS Candidate-Elimination Algorithm
  • Classify new instances by majority vote of version space members

How many mistakes before converging to correct h?

  • In worst case: initially |VS| = |H|, after one mistake, most of the hypothesis are wrong, |VS| <= 1/2 |H|, after k mistakes, |VS| <= (1/2)^k |H|, so we get the number of mistakes' upper bound: k <= |log2|H||
  • In best case: no mistake will be made! so k=0

Optimal Mistake Bounds

Conclusion: VC(C) is the lower bound.

Weighted Majority Algorithm

Halving algorithm directly abandon the hypothesis if misclassified one example, weighted majority algorithm improved this by giving penalty instead of setting weight to 0.

The q_0 is the weight mass for classifying the example as 0.


Solving k in this equation, we can get the algorithms above.


  • Weighted Majority algorithm learns weight foreach given predictor/hypothesis

  + It’s an example of an ensemble method: combines predictions of multiple hypotheses

  • Boosting learns weight, and also the hypotheses
  • Leads to one of the most popular learning methods in practice: Decision Forests

Key idea: [Rob Schapire]

  • Use a learner that produces better-than-chance h(x)’s
  • Train it multiple times, on reweighted training examples
    • Each time, upweight the incorrectly classified examples, downweight the correctly classified examples
  • Final prediction: weighted vote of the multiple hi(x)’s

This idea is practically useful and theoretically interesting.

A toy example

Training Error

The detailed formula derivation: PDF

Actual Typical Run and Theory of Margins

Write a Comment