Hubert Wang

I am
Hubert Wang

Wechat Official Account
Find fun things here!

Machine Learning 05: Graphical Models

Course Notes of Professor Tom Mitchell Machine Learning Course @ CMU, 2017

Graphical Models

  • Key Ideas:
    • Conditional Independence assumptions useful to reduce # of parameters
    • but Naive Bayes is extreme!
    • Graphical models express sets of conditional independence assumptions via graph structure
    • Graph structure plus associated parameters define joint probability distribution over sets of parameters
  • Why Grapical Models:
    • xxx

Conditional Independence

P(thunder | rain, light) = P(thunder | light)

Margin Independence

(Any i, j) P(X = x_i | Y = y_j) = P(X = x_i)

(Any i, j) P(Y = y_j | X = x_i) = P(Y = y_j)

Bayesian Networks

Bayes Nets are convinient representation for encoding dependencies / conditional independence.

Bayesian Networks Definition

How many parameters do we need to represent the diagram?


  • 4 for P(W | L, R) = P(W | Pa), as shown in W column in the diagram
  • 2 for
  • 2
  • 2
  • 1 for P(StormClouds | ?) = P(StormClouds)

Bayesian Networks Joint Distribution

The joint distribution over all variables:

P(X1 ... Xn) = ∏_i P(Xi | Pa(Xi))
in which, Pa(X) = Parents = Immediat Parents

So the joint distribution of the graph is:


What can we say about conditional independencies in a Bayes Net?

Each node is conditionally independent of its non-descendents, given only its immediate parents.

In this case, for each node Xi actually describes P(Xi | Pa(Xi)).

According to chain rule:

P(S,L,R,T,W) = P(S) P(L|S) P(R|S,L) P(T|S,L,R) P(W|S,L,R,T)

But in a Bayes Net:

P(X1 ... Xn) = ∏_i P(Xi | Pa(Xi))

P(S,L,R,T,W) = P(S) P(L|S) P(R|S) P(T|L) P(W|L,R)
P(S=1,L=0,R=1,T=0,W=1) = P(S=1) P(L=0|S=1) P(R=1|S=1) P(T=0|L=0) P(W=1|L=0,R=1)

The last item P(W=1|L=0,R=1)=0.2 according to the table above.

Learning a Bayes Net

Bayes Network for X1…4 with NO assumed conditional independencies

P() = P(X1)P(X2|X1)P(X3|X1,X2)P(X4|X1,X2,X3)

Bayes Network for Naive Bayes

P(Y|X1...Xn) = P(X1...Xn|Y) P(Y) / P(X1...Xn)
             = P(X1|Y) P(X2|Y) ... P(Xn|Y) P(Y)

What if variables are mix of dicrete and real values?

Bayes Network for a Hidden Markov Model

Bayesian Networks Definition

A Bayes network represents the joint probability distribution over a collection of random variables.

A Bayes network is a directed acyclic graph and a set of conditional probability distributions (CPD’s)

  • Each node denotes a random variable
  • Edges denote dependencies
  • For each node Xi its CPD defines P(Xi | Pa(Xi))
  • The joint distribution over all variables is defined to be:

Conditional Independence, Revisited

  • We said:
    • Each node is conditionally independent of its non-descendents, given its immediate parents.
  • Does this rule give us all of the conditional independence relations implied by the Bayes network?
    • No!
    • E.g., X1 and X4 are conditionally indep given {X2, X3}
    • But X1 and X4 not conditionally indep given X3
    • For this, we need to understand D-separation

Easy Network 1: Head to Tail

Easy Network 2: Tail to Tail

Easy Network 3: Head to Head

X and Y are conditionally independent given Z, if and only if X and Y are D-seperated by Z.

Suppose we have three sets of random variables: X, Y and Z

X and Y are D-separated by Z (and therefore conditionally indep, given Z) iff every path from every variable in X to every variable in Y is blocked.

A path from variable A to variable B is blocked by Z if it includes a node such that either:

  1. arrows on the path meet either head-to-tail or tail-to-tail at the node and this node is in Z


  1. the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, is in Z



  1. X1 indep of X3 given X2? Y
  2. X3 indep of X1 given X2? Y
  3. X4 indep of X1 given X2? Y
  4. X4 indep of X1 given X3? N
  5. X4 indep of X1 given {X3, X2}? Y
  6. X4 indep of X1 given {}? Y

Another Exercise:

  1. a indep of b given c? N
  2. a indep of b given f? Y
  3. a indep of b given {}? Y

A simpler way to judge D-seperate (cond. indep. / marg. indep.): Ancestral / Moralize / Disorient / Delete Givens

Markov Blanket

It is actually a "sufficient condition rule": saying that to compute xi, we need and only need to consider the nodes in purple.

Why Markov Blanket is Useful for Inference

According to Markov Blanket, we can get:

P(h|a,c,h,g) = P(h|c,j)

Solve the equation as following:

Probabilities in Bayes Nets

All following parts in this section are based on this graph:

Prob. of joint assignment: easy

Suppose we are interested in joint assignment <F=f,A=a,S=s,H=h,N=n>

P(f,a,s,h,n) = P(f)P(a)P(s|f,a)P(h|s)P(n|s)

Prob. of marginal probabilities P(Xi): not so easy

How to calculate P(N=n)?

P(N=1) = ΣfΣaΣhΣs P(N=1,f,a,h,s)

In this equation, we need calculate 16 values to get the marginal probabilities.

Generating a sample from joint distribution: easy

How can we generate random samples drawn according to P(F,A,S,H,N)?

Any 21th century programming language has a primitive function you can call to generate random variables between 0 ~ 1 following uniform distribution.

To generate a random sample for roots of network ( F or A ):

  1. let θ = P(F=1) # look up from CPD
  2. r = random number drawn uniformly between 0 and 1
  3. if r<θ then output 1, else 0

To generate a random sample for S, given F, A:

  1. let θ = P(S=1|F=f,A=a) # look up from CPD
  2. r = random number drawn uniformly between 0 and 1
  3. if r<θ then output 1, else 0

Note we can estimate anything!

e.g., P(N=n) by generating many samples from joint distribution, then count the fraction of samples for which N=n

Similarly, for anything else we care about: P(F=1|H=1, N=0)

Gibbs Sampling

A commonly used method in research.

Goal: Directly sample conditional distributions P(X1,…,Xn | Xn+1, ..., Xm)


  • start with arbitrary initial values for unobserved X1(0) ,…,Xn(0) (and the observed Xn+1, ..., Xm)
  • iterate for s=0 to a big number:


Eventually (after burn-in), the collection of samples will constitute a sample of the true P(X1,…,Xn | Xn+1, ..., Xm)

but often use every 100th sample, since iters not indep

Learning of Bayes Nets

  • Four categories of learning problems
    • Graph structure may be known/unknown
    • Variable values may be fully observed / partly unobserved
  • Easy case: learn parameters when graph structure is known, and training data is fully observed
  • Interesting case: graph known, data partly observed
  • Gruesome case: graph structure unknown, data partly unobserved

Learning CPTs from Fully Observed Data

  • Example: Consider learning the parameter:


  • MLE (Max Likelihood Estimate) is:

Remember why?

  • Maximum likelihood estimate:
  • In our case:


  • Maximum likelihood estimate
  • MAP estimate

Estimate θ from partly observed data

  • What if FAHN observed, but not S?
  • Can't calculate MLE:
  • Let X be all observed variable values (over all examples)
  • Let Z be all unobserved variable values
  • Can’t calculate MLE:
  • EM seeks the estimate:

EM guaranteed to find local maximum

EM Algorithm - Informally

EM is a general procedure for learning from partly observed data.

Given observed variables X, unobserved Z (X={F,A,H,N}, Z={S} in the example)

Begin with arbitrary choice for parameters θ:

Iterate until convergence:
• E Step: use X, θ to estimate the unobserved Z values
• M Step: use X values and estimated Z values to derive a better θ

Guaranteed to find local maximum.

Each iteration increases:

EM Algorithm - Precisely

EM is a general procedure for learning from partly observed data.

Given observed variables X, unobserved Z (X={F,A,H,N}, Z={S})


Iterate until convergence:
• E Step: Use X and current θ to calculate P(Z|X,θ)
• M Step: Replace current θ by θ ← argmax Q(θ'|θ)

Example using EM

EM seeks the estimate:

here, observed X = {F,A,H,N}, unobserved Z = {S}

E Step: Use X, θ, to Calculate P(Z|X,θ)

Observed X = {F,A,H,N}, Unobserved Z = {S}

How? Bayes net inference problem.

Recall that MLE was:

More generally, Given observed set X, unobserved set Z of boolean values:

Using Unlabeled Data to Help Train Naïve Bayes Classifier

How can we learn Bayes Net graph structure?

In general case, open problem

  • can require lots of data (else high risk of overfitting)
  • can use Bayesian priors, or other kinds of prior assumptions about graph structure to constrain search

One key result:

  • Chow-Liu algorithm: finds “best” tree-structured network
  • What’s best?
    • suppose P(X) is true distribution, T(X) is our tree-structured network, where X =
    • Chow-Liu minimizes Kullback-Leibler divergence.

Kullback-Leibler Divergence

  • KL(P(X) || T(X)) is a measure of the difference between distribution P(X) and T(X)
  • It is asymetric, always greater or equal to 0
  • It is 0 iff P(X) = T(X)

Chow-Liu Algorithm

Key result: To minimize KL(P || T) over possible tree networks T approximating true P, it suffices to find the tree network T that maximizes the sum of mutual informations over its edges.

  1. for each pair of variables A,B, use data to estimate P(A,B), P(A), and P(B)
  2. for each pair A, B calculate mutual information


  1. calculate the maximum spanning tree over the set of variables, using edge weights I(A,B) — given N vars, this costs only O(N2) time
  2. add arrows to edges to form a directed-acyclic graph
  3. learn the CPD’s for this graph

E.g. Greedy Algorithm to find Max-Spanning Tree

Note that for each iteration, we need to check no loop is formed becaused we need to get a tree structure.

Also note that there should be many other edges in the graph like A—F, here is just a simple representation.

Chow-Liu for improving Naïve Bayes

Can we use structure learning to modify Naïve Bayes cond. indep. assumptions?

If X2 = living in Pittburg, X3 = attending CMU, the assumption of cond. indep. is not perfect fit. By adding dependencies between X variables, and prune using Chow-Liu Algorithm, the dependencies between X variables can be represented and learnt.


  1. Bishop Ch. 8
Write a Comment