Machine Learning 07: Kernel Regression and SVM
Course notes from Professor Tom Mitchell's Machine Learning course @ CMU, 2017
Regression
Wish to learn f: X —> Y, where Y is real-valued, given training data { <x_1, y_1>, …, <x_m, y_m> }
Approach:
- choose some parameterized form for P(Y|X; θ)
( θ is the vector of parameters)
- derive learning algorithm as MCLE or MAP estimate for θ
1. Choose parameterized form for P(Y|X; θ)
Let’s assume Y is some deterministic f(X), plus random noise
y = f(x) + ε, where ε ~ N(0, σ²)
Therefore Y is a random variable that follows the distribution
p(y|x) = N(f(x), σ²)
and the expected value of y for any given x is f(x)
2. Derive learning algorithm as MCLE or MAP estimate for θ
Consider Linear Regression
p(y|x; W) = N(w0 + w1x, σ^2)
How can we learn W from the training data?
Learn the Maximum Conditional Likelihood Estimate!
Under the Gaussian noise assumption, the MCLE of W is the W that minimizes the sum of squared errors on the training data.
If we assume a zero-mean Gaussian prior on each w_i, the MAP estimate additionally penalizes large weights with a term λ Σ_i w_i². This is also called Ridge Regression (regularized least squares).
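Written out, this gives the two estimates below (a reconstruction of the standard derivation, with l indexing training examples and λ determined by the noise and prior variances):

```latex
W_{MCLE} = \arg\max_W \prod_l P(y_l \mid x_l; W)
         = \arg\min_W \sum_l \big( y_l - (w_0 + w_1 x_l) \big)^2

W_{MAP}  = \arg\max_W \Big[ \ln P(W) + \sum_l \ln P(y_l \mid x_l; W) \Big]
         = \arg\min_W \Big[ \lambda \sum_i w_i^2 + \sum_l \big( y_l - (w_0 + w_1 x_l) \big)^2 \Big]
```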
But what if y=f(x) is non-linear?
Let's start by learning Kernel Functions
Kernel Functions
- Kernel functions provide a way to manipulate data as though it were projected into a higher dimensional space, by operating on it in its original space
- This leads to efficient algorithms that learn
- linear regression in implicit high-dimensional spaces, corresponding to non-linear regression in the actual input space
- linear decision surfaces in implicit high-dimensional spaces, corresponding to non-linear classifiers in the actual input space
- And is a key component of algorithms such as
- Support Vector Machines
- kernel regression
- kernel PCA
- kernel CCA
Regularized Linear Regression
Wish to learn f: X —> Y, where X = <X1, …, Xn> and Y is real-valued
Note that in this course, we use
Vectors, Data Points, Inner Products
We can draw x and w in a coordinate system. If u1 is the unit vector [1, 0] and u2 is the unit vector [0, 1], then w can be written as a combination of u1 and u2. In the same way, if we use two training examples x1 and x2 in place of u1 and u2, then w can be represented by those two training examples: w = α1 x1 + α2 x2
This means the weight vector we are going to learn can be represented as a combination of the training examples; in other words, w lies in the space spanned by the training examples <x1, …, xm>.
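A minimal numerical sketch of this idea (hypothetical data), recovering the coefficients α on the training examples for a given w:

```python
import numpy as np

# Two training examples in 2D (hypothetical values); rows of X are examples.
X = np.array([[1.0, 2.0],
              [3.0, 1.0]])

# Some weight vector we would like to express in terms of the examples.
w = np.array([0.5, -1.5])

# Solve w = X^T alpha for the example coefficients alpha.
alpha = np.linalg.solve(X.T, w)

print(alpha)                         # coefficients on x1 and x2
print(np.allclose(X.T @ alpha, w))   # True: w = alpha1*x1 + alpha2*x2
```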
Primal form and Dual form
With the idea above, we can transform the primal form into the dual form, which is expressed only in terms of the training examples x, with no explicit w.
Key idea: Dual form expresses linear regression f(x) in terms of dot products with training examples.
-> more efficient computation when # features per example > # examples
-> turns training and application of the learned model into dot products, which allows using kernel functions
The derivation of the dual form is simple: take the fact that w can be represented in terms of the training examples, w = Σ_j α_j x_j, and substitute it into f(x) = <x, w>, giving f(x) = Σ_j α_j <x, x_j>
Note that XX^T is an m x m matrix whose (i, j) entry is the dot product <x_i, x_j> between training examples (the Gram matrix).
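A minimal sketch of the dual form in code, assuming a regularized (ridge) objective with strength lam: with the linear kernel XX^T this is exactly the dual of regularized linear regression, and swapping in another kernel gives non-linear regression.

```python
import numpy as np

def linear_kernel(X1, X2):
    # Gram matrix of pairwise dot products <x_i, x_j>.
    return X1 @ X2.T

def fit_dual(X, y, lam=1.0, kernel=linear_kernel):
    # Dual solution: alpha = (K + lam*I)^-1 y, with K = XX^T for the linear kernel.
    K = kernel(X, X)                      # m x m Gram matrix
    m = len(y)
    return np.linalg.solve(K + lam * np.eye(m), y)

def predict_dual(X_train, alpha, X_new, kernel=linear_kernel):
    # f(x) = sum_j alpha_j k(x, x_j): prediction uses only dot products with training examples.
    return kernel(X_new, X_train) @ alpha

# Tiny hypothetical example
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 1.1, 1.9, 3.2])
alpha = fit_dual(X, y, lam=0.1)
print(predict_dual(X, alpha, np.array([[1.5]])))
```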
Kernel
- Many learning tasks are framed as optimization problems
- Primal and Dual formulations of optimization problems
- Dual version framed in terms of dot products between x’s
- Kernel functions k(x,y) allow calculating dot products <Φ(x),Φ(y)> without bothering to project x into Φ(x)
- Leads to major efficiencies, and ability to use very high dimensional (virtual) feature spaces
Example: Quadratic Kernel
Suppose we have data originally in 2D, x = (x1, x2), but project it into 3D using Φ(x) = (x1², √2 x1 x2, x2²).
We can then use the kernel function K(x, z) = <x, z>² to calculate inner products in the projected 3D space purely in terms of operations in the 2D space, since <Φ(x), Φ(z)> = (x1 z1 + x2 z2)² = <x, z>². (It avoids ever performing the 3D computation!)
And we can use it to train and apply our regression function, never leaving the 2D space.
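A quick numerical check of this equivalence (hypothetical points):

```python
import numpy as np

def phi(x):
    # Explicit 2D -> 3D projection for the quadratic kernel.
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def quadratic_kernel(x, z):
    # Computes <phi(x), phi(z)> without leaving the 2D space.
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(z)))    # inner product in the 3D space
print(quadratic_kernel(x, z))    # same value, computed entirely in 2D
```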
Some Common Mercer Kernels
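The examples themselves were on a slide that did not survive extraction; for reference, the usual ones are (σ and the degree d are kernel hyperparameters):

```latex
\text{Linear:}\quad K(x, z) = \langle x, z \rangle \\
\text{Polynomial:}\quad K(x, z) = (1 + \langle x, z \rangle)^d \\
\text{Gaussian (RBF):}\quad K(x, z) = \exp\!\left(-\frac{\lVert x - z \rVert^2}{2\sigma^2}\right)
```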
Properties of Kernels
Theorem (Mercer):
K is a kernel if and only if:
- K is symmetric
- For any set of points x1, x2, …, xm, and for any a1, a2, …, am ∈ R, we have Σ_i Σ_j a_i a_j K(x_i, x_j) ≥ 0 (i.e., the Gram matrix is positive semi-definite)
Kernel Closure Properties
Easily create new kernels using basic ones!
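The specific rules were shown on a slide; the standard closure properties are that, for kernels K1 and K2 and a constant c > 0, each of the following is again a kernel:

```latex
K(x, z) = K_1(x, z) + K_2(x, z) \\
K(x, z) = c\, K_1(x, z) \\
K(x, z) = K_1(x, z)\, K_2(x, z)
```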
Simple Kernel Based Classifier
Consider finding the centres of mass of positive and negative examples and classifying a test point by measuring which is closest
We can express this classifier purely as a function of kernel evaluations, as sketched below.
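A sketch of this classifier in code (hypothetical data): expanding the squared distances to the two class means in feature space leaves only kernel evaluations, plus a bias b built from within-class kernel averages.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    # Gaussian kernel matrix between two sets of points.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mean_classifier(X_pos, X_neg, X_new, kernel=rbf_kernel):
    # Sign of: (similarity to positive class mean) - (similarity to negative class mean),
    # expressed entirely through kernel evaluations.
    k_pos = kernel(X_new, X_pos).mean(axis=1)
    k_neg = kernel(X_new, X_neg).mean(axis=1)
    b = 0.5 * (kernel(X_neg, X_neg).mean() - kernel(X_pos, X_pos).mean())
    return np.sign(k_pos - k_neg + b)

# Tiny hypothetical dataset
X_pos = np.array([[1.0, 1.0], [1.5, 0.5]])
X_neg = np.array([[-1.0, -1.0], [-0.5, -1.5]])
print(mean_classifier(X_pos, X_neg, np.array([[0.8, 0.9], [-1.2, -0.7]])))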
Support Vector Machines
- new loss function (maximize margin)
- use kernels to achieve non-linear boundaries
The decision surface is determined by the margin: the margin is the distance from the decision surface to the closest training points, and the SVM chooses the surface that maximizes it.
How do we get the value of the margin in the following diagram?
Note that w is always perpendicular to the decision surface, as shown below.
Setting the two margin boundaries to w·x + b = +1 and w·x + b = -1, the margin comes out to 2/||w||, and the classifier is obtained by maximizing this margin, as shown in the blue rectangle above.
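Reconstructing that step: with the margin boundaries scaled to ±1, the margin follows directly, and the "blue rectangle" optimization is the standard hard-margin primal:

```latex
w \cdot x + b = +1 \ \text{ and } \ w \cdot x + b = -1
\quad\Rightarrow\quad \text{margin} = \frac{2}{\lVert w \rVert}

\min_{w, b} \ \tfrac{1}{2} \lVert w \rVert^2
\quad \text{s.t.} \quad y_l \,(w \cdot x_l + b) \ge 1 \ \ \forall l
```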
A linear hyperplane is fully defined by its “support vectors”:
- moving the other points a little doesn’t affect the decision boundary
- we only need to store the support vectors to predict labels of new points
How many support vectors in linearly separable case, given d dimensions? <= d+1
SVM Decision Surface using Gaussian Kernel
Although an SVM can only produce a linear decision surface in the original 2D space, a kernel implicitly maps the data into another space (such as the Gaussian kernel's feature space above); in these cases, the decision surface in the original space can be very complex.
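For reference, the kernel SVM decision function has the standard form below, where the sum runs only over the support vectors:

```latex
f(x) = \operatorname{sign}\!\left( \sum_{l \in SV} \alpha_l \, y_l \, K(x_l, x) + b \right)
```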
SVM with Soft Margins
What if the examples are not perfectly separable? For example:
Allow "error" in classification.
The idea is to add a "penalty" term to the minimization objective for every example that lies inside the margin or even on the wrong side of the hyperplane.
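Concretely, the standard soft-margin objective introduces slack variables ξ_l and a penalty weight C (a reconstruction of the usual formulation):

```latex
\min_{w, b, \xi} \ \tfrac{1}{2} \lVert w \rVert^2 + C \sum_l \xi_l
\quad \text{s.t.} \quad y_l \,(w \cdot x_l + b) \ge 1 - \xi_l, \ \ \xi_l \ge 0 \ \ \forall l
```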
SVM Soft Margin Decision Surface using Gaussian Kernel
Comparing the performance of SVMs and Logistic Regression: if no kernel is used, the SVM and LR perform very similarly, because both give a linear decision boundary. But if you use a kernel with the SVM, it usually performs better, since it can produce a non-linear decision surface.
SVM Summary
- Objective: maximize margin between decision surface and data
- Primal and dual formulations
- dual represents classifier decision in terms of support vectors
- Kernel SVM’s
- learn a linear decision surface in a high-dimensional space while working in the original low-dimensional space
- Handling noisy data: soft margin “slack variables”
- again primal and dual forms
- SVM algorithm: Quadratic Program optimization
- single global optimum
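For reference, the quadratic program in question is the standard soft-margin dual (reconstructed, not copied from the original slides):

```latex
\max_{\alpha} \ \sum_l \alpha_l - \tfrac{1}{2} \sum_{l, l'} \alpha_l \alpha_{l'} \, y_l y_{l'} \, K(x_l, x_{l'})
\quad \text{s.t.} \quad 0 \le \alpha_l \le C, \quad \sum_l \alpha_l y_l = 0
```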