Machine Learning 06: PAC Theory

Course Notes of Professor Tom Mitchell Machine Learning Course @ CMU, 2017

Computational Learning Theory

What general laws constrain inductive learning?
Want theory to relate
- Number of training examples
- Complexity of hypothesis space
- Accuracy to which target function is approximated
- Manner in which training examples are presented
- Probability of successful learning

Function Approximation: The Big Picture

Hypothesis: h(·) is the approximation of the unknow target function f(·), with which we can get y = f(x) ≈ h(x).

Relations between hypothesis and instance:

A hypothesis shadows a subset of all instances.
Giving hypothesis H and data D. This is always true: Σ_h P(H=h|D=d) = 1.

Sample Complexity

How many examples suffice to learn target concept:

If learner proposes instances as queries to teacher?
- learner proposes x, teacher provides f(x)
If teacher (who knows f(x)) generates training examples?
- teacher proposes sequence {, … }
If some random process (e.g., nature) generates

instances, and teacher labels them?

instances drawn according to P(X)

Case One

For the first question above: If students choose x, then teacher labels.

Example:

X = points in the plane

H = rectangles, classify x positive if inside rectangle

h = if a<x1<b and c<x2<d then Y = 1, else Y = 0

It will give the smallest rectangles as the students can know the label, then grows to the largest ground truth as samples number grows.

After students gets the two "+" as shown in the diagram, students will get the inner rectangle. After the students get the four "—", students will get the outer rectangle. To get the most possible one, the student needs to ask samples outside the inner rectangle, but within the larger rectangle.

Best case: learner plays 20 questions: chooses each x so that half the remaining hypotheses label it positive, half neg. —> learn in log2 |H| queries in best case. (like binary search)

Case Two

For the second question: if teacher proposes instances and labels them.

Example:

X = points in the plane

H = rectangles, classify x positive if inside rectangle

h = if a<x1<b and c<x2<d then Y = 1, else Y = 0

How many samples do we need to get the ground truth? 6

It's natural, as the people giving labels knows the groundtruth, fewer samples are needed to get the rectangle.

Problem setting:

Set of instances X
Set of hypotheses H = {h: X —> {0,1}}
Set of possible target functions C = {c: X —> {0,1}}
Sequence of training instances drawn at random from P(X), teacher provides noise-free label c(x)

Learner outputs a hypothesis such that:

h = argmin_h∈H error_train(h)

True Error and Overfitting

The Error of a Hypothesis

The true error of h is the probability that it will misclassify an example drawn at random from P(X).

Two Notions of Error

According to the comparison above, we can see that the true error is the difference between estimates and probability distribution of X. The train error is the difference between estimates and training samples.

Overfitting

Consider a hypothesis h and its

Error rate over training data: error_train (h)
True error rate over all data: error_true (h)

We say h overfits the training data if

error_true (h) > error_train (h)

Amount of overfitting =

error_true (h) - error_train (h)

Can we bound error_true (h) in terms of error_train (h)?

Yes, we can:

if D was a set of examples drawn from P(X) and independent of h, then we could use standard statistical confidence intervals to determine that with 95% probability, error_true (h) lies in the interval:

but D is the training data for h …...

Version Spaces

Exhausting the Version Space

Definition: The version space VS_H,D with respect to training data D is said to be ϵ-exhausted if every hypothesis in VS_H,D has true error less than ϵ.

How many examples will ϵ-exhausted the VS?

Interesting! This bounds the probablity that any consistent learner will output a hypothesis h with error_true (h) >= ϵ.

Specifically:

Hypothesis space H
Target function C(X)
# of training example m
Let h1, h2, ... hk be the hypothesis in H with true error >= ϵ

To get a ϵ-exhausted version space, we need to exempt all h1 through hk.

What's the probability h1 will be consistent with one training example?

< (1 - ϵ). Because the prob. that h1 cannot correctly label the first training example is >= ϵ. So the prob. that it can correctly label the first training exmaple is < (1 - ϵ).

What's the probability that h1 will be consistent with m training examples?

< (1- ϵ)^m. Since all training examples are independent.

What's the probability that at least one of h1…hk will be consistent with m examples?

< k * (1 - ϵ)^m. Since we do not know the independence of h1…hk, we can only give an upper bound by timing k.

Also, we do know that k is less than or equal to the number of hypothesis H: k <= |H|. So we can get:

< |H|* (1 - ϵ)^m.

It is true that: if 0 <= ϵ <= 1, then (1 - ϵ) <= e^(-ϵ). So we proved:

< |H| e^(-ϵm)

Example 1

Note that in the example, X1,…X4 can be value of 0, 1, or "whatever". So the size of H: |H| = 3^4.

Also, here are some statics according to the computation: Giving , n = 10, m >= 312; n = 100, m >= 2290.

Example 2

Note that in the example, to construct the tree structure, there are N (N-1) / 2 possible combinations to get the 2-depth / 2-variable tree. As each leaf is a boolean value, so we get 2^4 combinations. In all, the size of H: |H| = N (N-1) / 2 * 2^4 = 8N(N-1).

Example 3: Depth n Decision Trees

What is the |H| = the number of hypothesis in this case?

Note that the possible combination of varibales are |X| = 2^n. Then the hypothesis numer are: |H| = 2^|X| = 2^(2^n).

Also note that in both examples, the assumption is that we can find the hypothesis with zero training error.

But what if the size of H is infinite and we cannot find a hypothesis with zero training error?

Agnostic Learning

So far, assumed c ∈ H, while Agnostic learning setting: don't assume c ∈ H

What do we want then?
- The hypothesis h that makes fewest errors on training data
What is sample complexity in this case?

This conclusion can be conducted from:

But, the output h with lowest training error might not give us the h* with lowest true error. How far can true error of h be from h* ?

VC Dimensions

Question: If H = {h | h: X —> Y} is infinite, what measure of complexity should we use in place of |H|?

Answer: The largest subset of X for which H can guarantee zero training error (regardless of how it is labeled)

— This is the VC dimension of H

Shattering a Set of Instances

Definition: a dichotomy of a set S is a partition of S into two disjoint subsets.

Definition: a set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.

Equivalently Definition: If H can guarantee zero train error regardless of how they are labeled.

For example:

Assuming ⊙ is a boolean variable in S the right diagram. Then there are eight possible dichotomy of S as circled. No matter how these three ⊙ are labeled, it is matches certain hypothesis in H. In this way, the Hypothesis space H with these eight hypotheses is shattering S.

The Vapnik-Chervonenkis Dimension

Definition: The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large sets of X can be shattered by H, then VC(H) ≡ ∞.

In the diagram above, the VC(H) = 3.

Sample Complexity based on VC dimension

How many randomly drawn examples suffice to ε-exhaust VS_H,D with probability at least (1-δ)?

ie., to guarantee that any hypothesis that perfectly fits the training data is probably (1-δ) approximately (ε) correct.

Compare to our earlier results based on H:

VC Dimension Example 1:

Consider X = set of real numbers, want to learn c: X —> {0, 1}. What is the VC dimension of:

Open intervals:
- H1: if x > a then y = 1 else y = 0
- VC(H1) = 1. For example, if given two variables in X, this case is not satisfying the hypothesis. ———(x1 = 1)———(x2 = 0)———>
- H2: if x > a then y = 1 else y = 0 or, if x> a then y = 0 else y = 1
- VC(H2) = 2. For the two points shown below, whatever they are (1, 1) or (1, 0) or (0, 1) or (0, 0), we can always find a that suits the hypothesis!
Closed intervals:
- H3: if a < x < b then y = 1 else y = 0
- VC(H3) = 2. Imagine we got three points (1, 0, 1), then at most two points can fit.
- H4: if a < x < b then y = 1 else y = 0 or, if a < x < b then y = 0 else y = 1
- VC(H4) = 3.

VC Dimension Example 2:

What is VC dimension of lines in a plane?

H2 = { ((w0 + w1x1 + w2x2) > 0 —> y=1) }

See the 3 points on the left, there are 8 possible combinations. For any one of the combinations, we can always find a line to seperate them perfectly.

See the 3 points on the right. We found there is no possible line to seperate (0, 1, 0).

But the VC(H2) = 3. Because according to the definition, we just want to find the maximum number of possible variables.

For H_n = linear separating hyperplanes in n dimensions, VC(Hn) = n + 1.

Also note that this is exactly logistic regression so this conclusion also suits for logistic regression!!!

Question: For any finite hypothesis space H, can you give an upper bound on VC(H) in term of |H|?

Answer: Yes! Imagine VC(H) = k. —> can guarantee zero training error for k variables. —> H has at least 2^k hypothesis. So: VC(H) <= log2(|H|)

Tightness of Bounds on Sample Complexity

How many examples m suffice to assure that any hypothesis that fits the training data perfectly is probably (1-δ) approximately (ε) correct?

How tight is this bound?

Note that the equation above gives the suffice upper bound saying that how many examples are enough (the worst case!!). But we can also get a lower bound saying how many examples are not enough (the best case!!!)

Lower bound on sample complexity (Ehrenfeucht et al., 1989):

Consider any class C of concepts such that VC(C) > 1, any learner L, any 0 < ε < 1/8, and any 0 < δ < 0.01. Then there exists a distribution D = P(x) and a target concept in C, such that if L observes fewer examples than:

Then with probability at least δ, L outputs a hypothesis with error_D(h) > ε.

Agnostic Learning: VC Bounds

With probability at least (1- δ) every h ∈ H satisfies

Note that the right half part is the upper bound of overfitting. What if we double the number of training examples? The overfitting decreases! The line of train error and true error (test error) will get closer to each other.

What you should know about PAC

Sample complexity varies with the learning setting
- Learner actively queries trainer
- Examples arrive at random
Within the PAC learning setting, we can bound the probability that learner will output hypothesis with true error within ε of training error
- For ANY consistent learner (case where c ∈ H, zero training error)
- For ANY “best fit” hypothesis (agnostic learning, where perhaps c not in H)
VC dimension as measure of complexity of H
More results coming this semester
Conference on Learning Theory
ML Department course on Machine Learning Theory

2017-12-11 18:01

2648

tomml

TOC