Machine Learning 09: Learning Representations and Unifying Machine Learning

Principal Components Analysis (PCA)
Independent Components Analysis (ICA)
Canonical Correlation Analysis (CCA)
Matrix Factorization
TBD: The Unified Loss Function of Logistic Regression, AdaBoost, and SVM

Course notes for Professor Tom Mitchell's Machine Learning course @ CMU, 2017

Principal Components Analysis (PCA)

Idea:
- Given data points in d-dimensional space, project into lower dimensional space while preserving as much information as possible
- E.g. find best planar approximation to 3D data
- E.g. find best planar approximation to 10^4 D data
- In particular, choose an orthogonal projection that minimizes the squared error in reconstructing original data
Like auto-encoding neural networks, learn representation of input data that can best reconstruct it
PCA:
- Learned encoding is linear function of inputs
- No local minimum problems when training!
- Given d-dimensional data X, learns a d-dimensional representation, where
  - the dimensions are orthogonal
  - the top k dimensions are the k-dimensional linear re-representation that minimizes reconstruction error (sum of squared errors)

PCA: Find Projections to Minimize Reconstruction Error

Assume data is set of d-dimensional vectors, where nth vector is x^n = <x_1^n, x_2^n, ..., x_d^n>

We can represent these in terms of any d orthogonal basis vectors u_1 ... u_d: x^n = SUM_{i=1}^d z_i^n u_i

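Note that we get zero error if M = d, i.e. if all d components are kept. When only M < d components are kept, all of the reconstruction error is due to the d - M missing components.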
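The expansion and the zero-error remark can be checked numerically. This is a minimal sketch, assuming NumPy, made-up random data, and unit-length (orthonormal) basis vectors stored as the columns of U. The helper name reconstruction_error is hypothetical and not from the course.

```python
import numpy as np

def reconstruction_error(X, U, M):
    """Sum of squared errors when each row x^n is rebuilt from its first M coefficients."""
    Z = X @ U                        # z_i^n = u_i . x^n (valid because U is orthonormal)
    X_hat = Z[:, :M] @ U[:, :M].T    # keep only the first M components
    return float(np.sum((X - X_hat) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                    # N = 50 points, d = 4 (made-up data)
U, _ = np.linalg.qr(rng.normal(size=(4, 4)))    # columns form an orthonormal basis

# Error for M = 0, 1, ..., d: it shrinks as components are added and vanishes at M = d.
errors = [reconstruction_error(X, U, M) for M in range(5)]
```

For any M, the error should equal the sum of the squared coefficients of the dropped components, which is why PCA looks for the basis whose dropped components carry the least energy.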

PCA Algorithm 1

X ← create N x d data matrix, with one row vector x^n per data point
X ← subtract mean x̄ from each row vector x^n in X
Σ ← covariance matrix of X
Find eigenvectors and eigenvalues of Σ
PCs ← the M eigenvectors with largest eigenvalues
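The steps of Algorithm 1 can be sketched in NumPy as follows. This is an illustration under assumptions rather than the course's reference code: the function name pca_eig and the toy data are made up, and np.linalg.eigh is used because the covariance matrix is symmetric.

```python
import numpy as np

def pca_eig(X, M):
    """PCA via eigendecomposition of the covariance matrix (N x d input, rows are points)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                            # subtract the mean from each row
    cov = np.cov(Xc, rowvar=False)           # d x d covariance matrix
    evals, evecs = np.linalg.eigh(cov)       # eigenvalues come back in ascending order
    order = np.argsort(evals)[::-1][:M]      # indices of the M largest eigenvalues
    return mean, evecs[:, order], evals[order]

# Toy data: points spread mainly along the direction (0.6, 0.8), plus small noise.
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
direction = np.array([[0.6, 0.8]])
X = 10 * t * direction + 0.1 * rng.normal(size=(200, 2)) + 5.0

mean, pcs, evals = pca_eig(X, M=1)
Z = (X - mean) @ pcs                         # 1-dimensional encoding of each point
```

Here the single principal component in pcs should point along (0.6, 0.8), up to sign.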

PCA Example

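What if the data has very many dimensions, as with images? 10^4 dimensions → a 10^4 x 10^4 covariance matrix with 10^8 entries!

SVD can solve this: the singular value decomposition of the mean-centered data matrix X yields the principal components directly, without ever forming Σ.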

SVD
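A minimal sketch of PCA through the SVD, assuming NumPy. The function name pca_svd and the random data are illustrative only. The right singular vectors of the centered data matrix are the eigenvectors of the covariance matrix, and the squared singular values divided by N - 1 are its eigenvalues.

```python
import numpy as np

def pca_svd(X, M):
    """PCA from the SVD of the centered data matrix, never building the d x d covariance."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    # Thin SVD: Xc = U @ diag(s) @ Vt, singular values sorted in descending order.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:M].T                       # d x M, columns are principal components
    variances = s[:M] ** 2 / (X.shape[0] - 1)   # matching eigenvalues of the covariance
    return components, variances

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                    # small made-up data set for checking
components, variances = pca_svd(X, M=3)

# Cross-check against the covariance eigenvalues (only feasible here because d is small).
top_evals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1][:3]
```

With N points in d dimensions and N much smaller than d, the thin SVD works on an N x d matrix, so the 10^8-entry covariance matrix is never stored.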

Independent Components Analysis (ICA)

Find a linear transformation x = V · s for which the coefficients s = (s1, s2, …, sD)^T are statistically independent: p(s1, s2, ..., sD) = p1(s1) p2(s2) ... pD(sD).

Algorithmically, we need to identify the matrix V and coefficients s such that, under the condition x = V · s, the mutual information between s1, s2, …, sD is minimized.
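To make the idea concrete, here is a toy two-dimensional sketch, assuming NumPy. It does not minimize mutual information directly. Instead it swaps in a common proxy: whiten the data, then search for the rotation whose outputs are most non-Gaussian, measured by absolute excess kurtosis. The function names and the mixing matrix are made up for illustration.

```python
import numpy as np

def whiten(X):
    """Center X and rescale it so its covariance is the identity."""
    Xc = X - X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ evecs / np.sqrt(evals)

def excess_kurtosis(y):
    y = (y - y.mean()) / y.std()
    return float(np.mean(y ** 4) - 3.0)      # zero for a Gaussian

def ica_2d(X, n_angles=180):
    """Toy ICA for two sources: grid search over rotations of the whitened data."""
    Z = whiten(X)
    best_score, best_Y = -np.inf, Z
    for theta in np.linspace(0.0, np.pi / 2, n_angles, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        Y = Z @ np.array([[c, -s], [s, c]])
        score = abs(excess_kurtosis(Y[:, 0])) + abs(excess_kurtosis(Y[:, 1]))
        if score > best_score:
            best_score, best_Y = score, Y
    return best_Y

rng = np.random.default_rng(0)
S = rng.uniform(-1, 1, size=(5000, 2))       # independent, non-Gaussian sources
V = np.array([[1.0, 0.5], [0.3, 1.0]])       # hypothetical mixing matrix
X = S @ V.T                                  # each row is x = V . s
Y = ica_2d(X)                                # recovered sources, up to order, sign, scale
```

ICA can recover the sources only up to permutation, sign, and scale, so the check is that each column of S is strongly correlated with some column of Y.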

PCA finds directions of maximum variation
- Intuitively, draw an ellipse around the data; its axes give the directions, and the length of each axis arrow gives the amount of variation along that direction.
ICA would find directions most "aligned" with data

Canonical Correlation Analysis (CCA)

Idea: learn a shared representation across datasets.

E.g. when several people are thinking "bottle", there should be some similar representations in their brains. So we can use CCA to analyze the shared representations.
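A minimal sketch of CCA for two data sets measured on the same N trials, assuming NumPy. It uses one standard formulation (whiten each view, then take the SVD of the whitened cross-covariance), which may differ from what was shown in lecture. The function names and the synthetic two-view data are made up. X and Y stand in for, say, recordings from two people.

```python
import numpy as np

def inv_sqrt(C, eps=1e-8):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    evals, evecs = np.linalg.eigh(C)
    return evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T

def cca(X, Y, k=1):
    """Return projections A, B and the top-k canonical correlations."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx, Cyy = Xc.T @ Xc / (n - 1), Yc.T @ Yc / (n - 1)
    Cxy = Xc.T @ Yc / (n - 1)
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)   # singular values = canonical correlations
    return Wx @ U[:, :k], Wy @ Vt[:k].T, s[:k]

# Two views driven by one shared latent signal z, plus independent noise.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = z @ np.array([[1.0, -2.0, 0.5]]) + 0.1 * rng.normal(size=(500, 3))
Y = z @ np.array([[0.7, 1.5]]) + 0.1 * rng.normal(size=(500, 2))

A, B, corrs = cca(X, Y, k=1)
u = (X - X.mean(axis=0)) @ A[:, 0]            # shared representation seen from view X
v = (Y - Y.mean(axis=0)) @ B[:, 0]            # shared representation seen from view Y
```

The projected signals u and v should be almost perfectly correlated, because both views are driven by the same latent z.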

Matrix Factorization

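Compare the components learned by matrix factorization (here NNSE, Non-Negative Sparse Embedding) with those from PCA: in the NNSE components, words with similar meanings are grouped together.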

PCA components
- well, long, if, year, watch
- plan, engine, e, rock, very
- get, no, features, music, via
- features, by, links, free, down
- works, sound, video, building, section
NNSE components
- inhibitor, inhibitors, antagonists, receptors, inhibition
- bristol, thames, southampton, brighton, poole
- delhi, india, bombay, chennai, madras
- pundits, forecasters, proponents, commentators, observers
- nosy, averse, leery, unsympathetic, snotty
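As a rough illustration of non-negative factorization, the sketch below uses plain non-negative matrix factorization (NMF) with multiplicative updates, assuming NumPy. This is a simpler stand-in, not NNSE itself, which additionally imposes a sparsity penalty. The function name nmf and the synthetic matrix are made up. In the word setting, V would be a words x features matrix and each column of W a component.

```python
import numpy as np

def nmf(V, k, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (n x m) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(float(np.linalg.norm(V - W @ H)))   # Frobenius reconstruction error
    return W, H, errors

# Synthetic non-negative matrix with exact rank 2 (stand-in for real word statistics).
rng = np.random.default_rng(1)
V = rng.random((20, 2)) @ rng.random((2, 30))
W, H, errors = nmf(V, k=2)
```

Because the factors cannot cancel each other with negative entries, each component tends to describe an additive "part" of the data, which is one intuition for why such components are easier to read than PCA components.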

TBD: The Unified Loss Function of Logistic Regression, AdaBoost, and SVM

SO COOL!!! [to be added soon]

2017-12-14 20:01
