Machine Learning 03: Gaussian Naïve Bayes and Logistic Regression
Course notes for Professor Tom Mitchell's Machine Learning course at CMU
Naïve Bayes with Continuous X
In order to train a Naïve Bayes classifier with continuous X, we must estimate the mean and standard deviation of each of these Gaussians, and we must also estimate the priors on Y.
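Concretely, the parameters to estimate are (standard definitions; the double subscript ik denotes feature X_i and class y_k):

$$\mu_{ik} = \mathbb{E}[X_i \mid Y = y_k], \qquad \sigma_{ik}^2 = \mathbb{E}\!\left[(X_i - \mu_{ik})^2 \mid Y = y_k\right], \qquad \pi_k = P(Y = y_k).$$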
Again, we can use either maximum likelihood estimates (MLE) or maximum a posteriori (MAP) estimates for these parameters. The maximum likelihood estimator for μ_ik is:
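$$\hat{\mu}_{ik} \;=\; \frac{\sum_j X_i^j\,\delta(Y^j = y_k)}{\sum_j \delta(Y^j = y_k)}$$

where j indexes the training examples, X_i^j is the value of X_i in the j-th example, and δ(Y^j = y_k) is 1 if the j-th example has label y_k and 0 otherwise; i.e., the sample mean of X_i over the examples of class y_k.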
And the maximum likelihood estimator for σ_ik^2 is:
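$$\hat{\sigma}_{ik}^2 \;=\; \frac{\sum_j \left(X_i^j - \hat{\mu}_{ik}\right)^2 \delta(Y^j = y_k)}{\sum_j \delta(Y^j = y_k)}$$

i.e., the sample variance of X_i over the examples of class y_k, using the estimated mean μ̂_ik.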
This maximum likelihood estimator is biased, so the minimum variance unbiased estimator (MVUE) is sometimes used instead. It is:
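$$\hat{\sigma}_{ik}^2 \;=\; \frac{\sum_j \left(X_i^j - \hat{\mu}_{ik}\right)^2 \delta(Y^j = y_k)}{\left(\sum_j \delta(Y^j = y_k)\right) - 1}$$

the same expression with the denominator reduced by one (the familiar n − 1 correction for sample variance).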
Gaussian Naive Bayes
Consider a GNB based on the following modeling assumptions:

- Y is boolean, governed by a Bernoulli distribution, with parameter π = P(Y = 1)
- X = ⟨X1 … Xn⟩, where each Xi is a continuous random variable
- For each Xi, P(Xi|Y = yk) is a Gaussian distribution of the form N(μik, σi)
- For all i and j ≠ i, Xi and Xj are conditionally independent given Y
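As a concrete illustration of these assumptions, here is a minimal NumPy sketch of GNB training and prediction. The function names and layout are illustrative, not from the notes, and the sketch estimates a separate variance per class and feature (σ_ik, matching the estimators above) rather than the class-independent σ_i of the assumption list:

```python
import numpy as np

def fit_gnb(X, y):
    """X: (m, n) array of real-valued features; y: (m,) array of 0/1 labels."""
    params = {}
    for k in (0, 1):
        Xk = X[y == k]
        params[k] = {
            "prior": len(Xk) / len(X),       # MLE of P(Y = k)
            "mean": Xk.mean(axis=0),         # MLE of mu_ik for each feature i
            "var": Xk.var(axis=0, ddof=1),   # MVUE of sigma_ik^2 (the n - 1 denominator)
        }
    return params

def predict_proba_gnb(params, x):
    """Return P(Y = 1 | x) via Bayes rule with conditionally independent Gaussians."""
    log_post = {}
    for k, p in params.items():
        # log P(Y = k) + sum_i log N(x_i; mu_ik, sigma_ik^2)
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * p["var"])
                                + (x - p["mean"]) ** 2 / p["var"])
        log_post[k] = np.log(p["prior"]) + log_lik
    # normalize in log space for numerical stability
    m = max(log_post.values())
    posts = {k: np.exp(v - m) for k, v in log_post.items()}
    return posts[1] / (posts[0] + posts[1])
```

With a shared per-feature variance σ_i, as in the assumption list, one would instead pool the squared deviations across both classes.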
Derivation of GNB
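In outline (a sketch of the standard argument): by Bayes rule and the conditional independence assumption,

$$P(Y = y_k \mid X_1 \ldots X_n) \;=\; \frac{P(Y = y_k)\prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j)\prod_i P(X_i \mid Y = y_j)}$$

so a new example is assigned the most probable label,

$$Y \leftarrow \arg\max_{y_k}\; P(Y = y_k)\prod_i P(X_i \mid Y = y_k),$$

with each P(X_i | Y = y_k) given by the Gaussian N(μik, σi) assumed above.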
Logistic Regression
Idea:
- Naïve Bayes allows computing P(Y|X) by learning P(Y) and P(X|Y)
- Why not learn P(Y|X) directly?
Consider learning f: X -> Y, where:

- X is a vector of real-valued features, ⟨X1 … Xn⟩
- Y is boolean
- assume all Xi are conditionally independent given Y
- model P(Xi | Y = yk) as Gaussian N(μik, σi)
- model P(Y) as Bernoulli(π)

What does that imply about the form of P(Y|X)?
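A sketch of how the form falls out: apply Bayes rule, divide through by the numerator, and substitute the Bernoulli prior and the Gaussian class-conditionals (which share σ_i across classes); the exponent then becomes linear in the X_i:

$$P(Y = 1 \mid X) \;=\; \frac{1}{1 + \exp\!\left(\ln\frac{1-\pi}{\pi} + \sum_i \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}\,X_i + \sum_i \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}\right)}$$

That is, the GNB assumptions imply

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\!\left(w_0 + \sum_i w_i X_i\right)}, \qquad w_i = \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2},$$

with w_0 collecting the X-independent terms: exactly the parametric form used by logistic regression, a linear function of X passed through a logistic (sigmoid) function.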
Logistic regression more generally: Softmax
Logistic regression:
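One common way to write the two-class model (sign conventions vary between texts; the GNB-derived expression above matches this form after negating the weights):

$$P(Y = 1 \mid X, W) = \frac{1}{1 + \exp\!\big(-(w_0 + \sum_{i=1}^{n} w_i X_i)\big)}, \qquad P(Y = 0 \mid X, W) = 1 - P(Y = 1 \mid X, W)$$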
Softmax:
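When Y ranges over K discrete values y_1 … y_K, each class gets its own weight vector and the probabilities are normalized by the softmax function:

$$P(Y = y_k \mid X, W) \;=\; \frac{\exp\!\left(w_{k0} + \sum_{i=1}^{n} w_{ki} X_i\right)}{\sum_{j=1}^{K} \exp\!\left(w_{j0} + \sum_{i=1}^{n} w_{ji} X_i\right)}$$

With K = 2 and one class's weights fixed at zero, this reduces to the two-class logistic form above.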
Training Logistic Regression: MCLE
For training data D = {⟨X^1, Y^1⟩, …, ⟨X^L, Y^L⟩}, the maximum conditional likelihood estimate (MCLE) chooses W to maximize the conditional probability of the observed labels given the inputs; equivalently, the goal function we want to minimize is the negative conditional log-likelihood:
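In symbols (here l indexes training examples, and the superscript is an index rather than a power):

$$W_{MCLE} = \arg\max_W \prod_l P(Y^l \mid X^l, W) = \arg\max_W \sum_l \ln P(Y^l \mid X^l, W)$$

so the quantity to minimize is

$$J(W) = -\sum_l \Big[\, Y^l \ln P(Y = 1 \mid X^l, W) + (1 - Y^l)\,\ln P(Y = 0 \mid X^l, W) \Big].$$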
Since J(W) is convex, gradient descent reaches the global optimum: calculate the partial derivative with respect to each w_i, and iteratively update the w_i.
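With the logistic form above, the gradient takes a simple shape (a standard result; P̂ is the model's current estimate, and taking X_0^l = 1 folds the intercept w_0 into the same formula):

$$\frac{\partial J(W)}{\partial w_i} = -\sum_l X_i^l \Big( Y^l - \hat{P}(Y = 1 \mid X^l, W) \Big)$$

giving the gradient-descent update with step size η:

$$w_i \leftarrow w_i + \eta \sum_l X_i^l \Big( Y^l - \hat{P}(Y = 1 \mid X^l, W) \Big)$$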
The P̂ in the update is the logistic-regression estimate of P(Y = 1 | X, W), computed with the equations above. The rule also makes intuitive sense: w is pushed in the direction that shrinks the gap between the observed label Y and the predicted probability P̂(Y = 1 | X, W).
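A minimal NumPy sketch of this training loop (the function names, learning rate, and iteration count are illustrative choices, not from the notes):

```python
import numpy as np

def sigmoid(z):
    # P(Y = 1 | X, W) under the logistic model
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_mcle(X, y, eta=0.1, n_iters=1000):
    """Batch gradient descent on the negative conditional log-likelihood.
    X: (m, n) real-valued features; y: (m,) labels in {0, 1}."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])   # prepend X_0 = 1 so w[0] is the intercept w_0
    w = np.zeros(n + 1)
    for _ in range(n_iters):
        p_hat = sigmoid(Xb @ w)            # current estimates P_hat(Y = 1 | X^l, W)
        grad = Xb.T @ (y - p_hat)          # sum_l X_i^l (Y^l - P_hat), one entry per w_i
        w += eta * grad / m                # step in the direction that raises the likelihood
    return w
```

Dividing the summed gradient by m only rescales the effective step size relative to the summed update rule above.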
MAP for logistic regression
- For MAP, we need to define a prior on W
- Given W = <w1 ... wn>, assume the prior P(w_i) = N(0, σ) for each w_i, i.e., a zero-mean Gaussian prior on each weight
- A kind of Occam's razor (simplest is best) prior
- Helps avoid very large weights and overfitting
The process of deriving the MAP estimate closely parallels the MCLE derivation above. Note that compared with MCLE, the only difference is an added penalty term at the end, often called a regularization term, which helps avoid overfitting by penalizing large weights and shrinking them toward zero.
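Concretely (a standard formulation, with λ = 1/σ² coming from the variance of the Gaussian prior):

$$W_{MAP} = \arg\max_W \Big[ \ln P(W) + \sum_l \ln P(Y^l \mid X^l, W) \Big] \;\;\equiv\;\; \arg\min_W \Big[ J(W) + \frac{\lambda}{2} \sum_i w_i^2 \Big]$$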
MLE vs MAP in logistic regression
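Side by side, the two update rules differ only in the shrinkage term contributed by the prior (a sketch under the conventions above):

$$\text{MCLE:} \quad w_i \leftarrow w_i + \eta \sum_l X_i^l \Big( Y^l - \hat{P}(Y = 1 \mid X^l, W) \Big)$$

$$\text{MAP:} \quad\; w_i \leftarrow w_i - \eta \lambda w_i + \eta \sum_l X_i^l \Big( Y^l - \hat{P}(Y = 1 \mid X^l, W) \Big)$$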
Generative vs. Discriminative Classifiers
Training classifiers involves estimating f: X -> Y, or P(Y|X)
- Generative classifiers (e.g., Naïve Bayes)
- Assume some functional form for P(X|Y), P(X)
- Estimate parameters of P(X|Y), P(X) directly from training data
- Use Bayes rule to calculate P(Y|X= xi)
- Number of parameters: 4n + 1
- Convergence rates faster, because parameters are uncoupled. n = O(log d)
- Discriminative classifiers (e.g., Logistic regression)
- Assume some functional form for P(Y|X)
- Estimate parameters of P(Y|X) directly from training data
- Number of parameters: n + 1
- Convergence rates slower, because parameters are coupled. n = O(d)
Resources
- Ng & Jordan, NB vs. LR, 2002
- Naive Bayes and Logistic Regression, Mitchell