Machine Learning 05: Graphical Models

Graphical Models
- Conditional Independence
- Margin Independence
Bayesian Networks
Bayesian Networks Definition
Conditional Independence, Revisited
Markov Blanket
- Why Markov Blanket is Useful for Inference
Probabilities in Bayes Nets
Learning of Bayes Nets
Learning CPTs from Fully Observed Data
- MLE
- MAP
Estimate θ from partly observed data
How can we learn Bayes Net graph structure?
Kullback-Leibler Divergence
Chow-Liu Algorithm
- Chow-Liu for improving Naïve Bayes
Resources

Course Notes of Professor Tom Mitchell Machine Learning Course @ CMU, 2017

Graphical Models

Key Ideas:
- Conditional Independence assumptions useful to reduce # of parameters
- but Naive Bayes is extreme!
- Graphical models express sets of conditional independence assumptions via graph structure
- Graph structure plus associated parameters define joint probability distribution over sets of parameters
Why Grapical Models:
- xxx

Conditional Independence

P(thunder | rain, light) = P(thunder | light)

Margin Independence

(Any i, j) P(X = x_i | Y = y_j) = P(X = x_i)

(Any i, j) P(Y = y_j | X = x_i) = P(Y = y_j)

Bayesian Networks

Bayes Nets are convinient representation for encoding dependencies / conditional independence.

Bayesian Networks Definition

How many parameters do we need to represent the diagram?

4 for P(W | L, R) = P(W | Pa), as shown in W column in the diagram
2 for
2
2
1 for P(StormClouds | ?) = P(StormClouds)

Bayesian Networks Joint Distribution

The joint distribution over all variables:

P(X1 ... Xn) = ∏_i P(Xi | Pa(Xi))
in which, Pa(X) = Parents = Immediat Parents

So the joint distribution of the graph is:

xxxxxx

What can we say about conditional independencies in a Bayes Net?

Each node is conditionally independent of its non-descendents, given only its immediate parents.

In this case, for each node Xi actually describes P(Xi | Pa(Xi)).

According to chain rule:

P(S,L,R,T,W) = P(S) P(L|S) P(R|S,L) P(T|S,L,R) P(W|S,L,R,T)

But in a Bayes Net:

P(X1 ... Xn) = ∏_i P(Xi | Pa(Xi))

P(S,L,R,T,W) = P(S) P(L|S) P(R|S) P(T|L) P(W|L,R)
P(S=1,L=0,R=1,T=0,W=1) = P(S=1) P(L=0|S=1) P(R=1|S=1) P(T=0|L=0) P(W=1|L=0,R=1)

The last item P(W=1|L=0,R=1)=0.2 according to the table above.

Learning a Bayes Net

Bayes Network for X1…4 with NO assumed conditional independencies

P() = P(X1)P(X2|X1)P(X3|X1,X2)P(X4|X1,X2,X3)

Bayes Network for Naive Bayes

P(Y|X1...Xn) = P(X1...Xn|Y) P(Y) / P(X1...Xn)
             = P(X1|Y) P(X2|Y) ... P(Xn|Y) P(Y)

What if variables are mix of dicrete and real values?

Bayes Network for a Hidden Markov Model

Bayesian Networks Definition

A Bayes network represents the joint probability distribution over a collection of random variables.

A Bayes network is a directed acyclic graph and a set of conditional probability distributions (CPD’s)

Each node denotes a random variable
Edges denote dependencies
For each node Xi its CPD defines P(Xi | Pa(Xi))
The joint distribution over all variables is defined to be:

Conditional Independence, Revisited

We said:
- Each node is conditionally independent of its non-descendents, given its immediate parents.
Does this rule give us all of the conditional independence relations implied by the Bayes network?
- No!
- E.g., X1 and X4 are conditionally indep given {X2, X3}
- But X1 and X4 not conditionally indep given X3
- For this, we need to understand D-separation

Easy Network 1: Head to Tail

Easy Network 2: Tail to Tail

Easy Network 3: Head to Head

X and Y are conditionally independent given Z, if and only if X and Y are D-seperated by Z.

Suppose we have three sets of random variables: X, Y and Z

X and Y are D-separated by Z (and therefore conditionally indep, given Z) iff every path from every variable in X to every variable in Y is blocked.

A path from variable A to variable B is blocked by Z if it includes a node such that either:

arrows on the path meet either head-to-tail or tail-to-tail at the node and this node is in Z

the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, is in Z

Exercise:

X1 indep of X3 given X2? Y
X3 indep of X1 given X2? Y
X4 indep of X1 given X2? Y
X4 indep of X1 given X3? N
X4 indep of X1 given {X3, X2}? Y
X4 indep of X1 given {}? Y

Another Exercise:

a indep of b given c? N
a indep of b given f? Y
a indep of b given {}? Y

A simpler way to judge D-seperate (cond. indep. / marg. indep.): Ancestral / Moralize / Disorient / Delete Givens

Markov Blanket

It is actually a "sufficient condition rule": saying that to compute xi, we need and only need to consider the nodes in purple.

Why Markov Blanket is Useful for Inference

According to Markov Blanket, we can get:

P(h|a,c,h,g) = P(h|c,j)

Solve the equation as following:

Probabilities in Bayes Nets

All following parts in this section are based on this graph:

Prob. of joint assignment: easy

Suppose we are interested in joint assignment <F=f,A=a,S=s,H=h,N=n>

P(f,a,s,h,n) = P(f)P(a)P(s|f,a)P(h|s)P(n|s)

Prob. of marginal probabilities P(Xi): not so easy

How to calculate P(N=n)?

P(N=1) = ΣfΣaΣhΣs P(N=1,f,a,h,s)

In this equation, we need calculate 16 values to get the marginal probabilities.

Generating a sample from joint distribution: easy

How can we generate random samples drawn according to P(F,A,S,H,N)?

Any 21th century programming language has a primitive function you can call to generate random variables between 0 ~ 1 following uniform distribution.

To generate a random sample for roots of network ( F or A ):

let θ = P(F=1) # look up from CPD
r = random number drawn uniformly between 0 and 1
if r<θ then output 1, else 0

To generate a random sample for S, given F, A:

let θ = P(S=1|F=f,A=a) # look up from CPD
r = random number drawn uniformly between 0 and 1
if r<θ then output 1, else 0

Note we can estimate anything!

e.g., P(N=n) by generating many samples from joint distribution, then count the fraction of samples for which N=n

Similarly, for anything else we care about: P(F=1|H=1, N=0)

Gibbs Sampling

A commonly used method in research.

Goal: Directly sample conditional distributions P(X1,…,Xn | Xn+1, ..., Xm)

Approach:

start with arbitrary initial values for unobserved X1(0) ,…,Xn(0) (and the observed Xn+1, ..., Xm)
iterate for s=0 to a big number:

Eventually (after burn-in), the collection of samples will constitute a sample of the true P(X1,…,Xn | Xn+1, ..., Xm)

but often use every 100th sample, since iters not indep

Learning of Bayes Nets

Four categories of learning problems
- Graph structure may be known/unknown
- Variable values may be fully observed / partly unobserved
Easy case: learn parameters when graph structure is known, and training data is fully observed
Interesting case: graph known, data partly observed
Gruesome case: graph structure unknown, data partly unobserved

Learning CPTs from Fully Observed Data

Example: Consider learning the parameter:

MLE

MLE (Max Likelihood Estimate) is:

Remember why?

Maximum likelihood estimate:
In our case:

MAP

Maximum likelihood estimate
MAP estimate

Estimate θ from partly observed data

What if FAHN observed, but not S?
Can't calculate MLE:
Let X be all observed variable values (over all examples)
Let Z be all unobserved variable values
Can’t calculate MLE:
EM seeks the estimate:

EM guaranteed to find local maximum

EM Algorithm - Informally

EM is a general procedure for learning from partly observed data.

Given observed variables X, unobserved Z (X={F,A,H,N}, Z={S} in the example)

Begin with arbitrary choice for parameters θ:

Iterate until convergence:
• E Step: use X, θ to estimate the unobserved Z values
• M Step: use X values and estimated Z values to derive a better θ

Guaranteed to find local maximum.

Each iteration increases:

EM Algorithm - Precisely

EM is a general procedure for learning from partly observed data.

Given observed variables X, unobserved Z (X={F,A,H,N}, Z={S})

Define

Iterate until convergence:
• E Step: Use X and current θ to calculate P(Z|X,θ)
• M Step: Replace current θ by θ ← argmax Q(θ'|θ)

Example using EM

EM seeks the estimate:

here, observed X = {F,A,H,N}, unobserved Z = {S}

E Step: Use X, θ, to Calculate P(Z|X,θ)

Observed X = {F,A,H,N}, Unobserved Z = {S}

How? Bayes net inference problem.

Recall that MLE was:

More generally, Given observed set X, unobserved set Z of boolean values:

Using Unlabeled Data to Help Train Naïve Bayes Classifier

How can we learn Bayes Net graph structure?

In general case, open problem

can require lots of data (else high risk of overfitting)
can use Bayesian priors, or other kinds of prior assumptions about graph structure to constrain search

One key result:

Chow-Liu algorithm: finds “best” tree-structured network
What’s best?
- suppose P(X) is true distribution, T(X) is our tree-structured network, where X =
- Chow-Liu minimizes Kullback-Leibler divergence.

Kullback-Leibler Divergence

KL(P(X) || T(X)) is a measure of the difference between distribution P(X) and T(X)
It is asymetric, always greater or equal to 0
It is 0 iff P(X) = T(X)

Chow-Liu Algorithm

Key result: To minimize KL(P || T) over possible tree networks T approximating true P, it suffices to find the tree network T that maximizes the sum of mutual informations over its edges.

for each pair of variables A,B, use data to estimate P(A,B), P(A), and P(B)
for each pair A, B calculate mutual information

calculate the maximum spanning tree over the set of variables, using edge weights I(A,B) — given N vars, this costs only O(N2) time
add arrows to edges to form a directed-acyclic graph
learn the CPD’s for this graph

E.g. Greedy Algorithm to find Max-Spanning Tree

Note that for each iteration, we need to check no loop is formed becaused we need to get a tree structure.

Also note that there should be many other edges in the graph like A—F, here is just a simple representation.

Chow-Liu for improving Naïve Bayes

Can we use structure learning to modify Naïve Bayes cond. indep. assumptions?

If X2 = living in Pittburg, X3 = attending CMU, the assumption of cond. indep. is not perfect fit. By adding dependencies between X variables, and prune using Chow-Liu Algorithm, the dependencies between X variables can be represented and learnt.

Resources

Bishop Ch. 8

2017-11-10 20:19

4008

tomml

TOC