Machine Learning 09: Learning Representations and Unifying Machine Learning
Course Notes of Professor Tom Mitchell Machine Learning Course @ CMU, 2017
Principal Components Analysis (PCA)
- Idea:
- Given data points in d-dimensional space, project into lower dimensional space while preserving as much information as possible
- E.g. find best planar approximation to 3D data
- E.g. find best planar approximation to 10^4 D data
- In particular, choose an orthogonal projection that minimizes the squared error in reconstructing original data
- Like auto-encoding neural networks, learn representation of input data that can best reconstruct it
- PCA:
- Learned encoding is linear function of inputs
- No local minimum problems when training!
- Given d-dimensional data X, learns d-dimensional representation, where
- the dimensions are orthogonal
- top k dimensions are the k-dimensional linear re-representation that minimizes reconstruction error (sum of squared errors)
PCA: Find Projections to Minimize Reconstruction Error
Assume the data is a set of d-dimensional vectors, where the nth vector is x^n = <x_1^n, x_2^n, ..., x_d^n>
We can represent these exactly in terms of any d orthogonal basis vectors u_1 ... u_d:
x^n = SUM_{i=1}^d z_i^n u_i
To reduce dimensionality, keep only the top M < d components, approximating x^n ≈ SUM_{i=1}^M z_i^n u_i; the reconstruction error is the squared error contributed by the discarded components, summed over all data points.
Note we get zero error if M = d, so all error is due to missing components.
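For reference, the mean squared reconstruction error of the top-M projection can be written out as below. This is the standard PCA result, assuming mean-centered data and unit-length eigenvectors u_i of the covariance matrix sorted by decreasing eigenvalue λ_i; the final equality (error = sum of discarded eigenvalues) is not stated explicitly in the notes above.

```latex
E_M
  = \frac{1}{N}\sum_{n=1}^{N}\Big\| x^n - \sum_{i=1}^{M} z_i^n u_i \Big\|^2
  = \sum_{i=M+1}^{d} \lambda_i
```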
PCA Algorithm 1
- X ← create N x d data matrix, with one row vector x^n per data point
- X ← subtract mean x̄ from each row vector x^n in X
- Σ ← covariance matrix of X
- Find eigenvectors and eigenvalues of Σ
- PCs ← the M eigenvectors with largest eigenvalues
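A minimal numpy sketch of this procedure (the function name and return format are my own, not the course's reference code):

```python
import numpy as np

def pca(X, M):
    """PCA via eigendecomposition of the covariance matrix.

    X : (N, d) data matrix, one row per data point
    M : number of principal components to keep
    Returns (components, Z, mean): the top-M eigenvectors as rows,
    the coefficients z_i^n with shape (N, M), and the data mean.
    """
    mean = X.mean(axis=0)
    Xc = X - mean                              # subtract mean from each row
    cov = np.cov(Xc, rowvar=False)             # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # symmetric matrix -> eigh
    order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
    components = eigvecs[:, order[:M]].T       # top-M eigenvectors, shape (M, d)
    Z = Xc @ components.T                      # coefficients, shape (N, M)
    return components, Z, mean

# Reconstruction from the top-M components: X_hat = Z @ components + mean
```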
PCA Example
What if the data is high-dimensional, like images? 10^4 dimensions means a covariance matrix with 10^8 entries!
SVD can solve this.
SVD
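SVD of the centered data matrix yields the same principal directions without ever forming the d x d covariance matrix. A sketch of this, with my own naming (this is also, roughly, what scikit-learn's PCA does internally):

```python
import numpy as np

def pca_svd(X, M):
    """PCA via SVD of the centered data matrix; avoids building the
    d x d covariance matrix, which matters when d is large (e.g. images)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Xc = U diag(S) Vt: rows of Vt are the principal directions, and
    # S**2 / N gives the covariance eigenvalues (up to the 1/N vs 1/(N-1) convention).
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:M]                  # top-M principal directions, shape (M, d)
    Z = Xc @ components.T                # equivalently U[:, :M] * S[:M]
    eigvals = (S[:M] ** 2) / X.shape[0]
    return components, Z, eigvals, mean
```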
Independent Components Analysis (ICA)
Find a linear transformation x = V · s for which the coefficients s = (s1, s2, ..., sD)^T are statistically independent: p(s1, s2, ..., sD) = p1(s1) p2(s2) ... pD(sD).
Algorithmically, we need to identify the matrix V and coefficients s such that, under the condition x = V · s, the mutual information between s1, s2, ..., sD is minimized.
- PCA finds directions of maximum variation
- Practically, imagine drawing an ellipse around the data: PCA finds the directions and lengths of its axes.
- ICA would find directions most "aligned" with data
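To make the contrast concrete, here is a toy sketch using scikit-learn; the two source signals and the mixing matrix are invented for illustration and are not from the lecture:

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

# Toy blind-source-separation setup: two independent sources, linearly mixed.
rng = np.random.RandomState(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                           # sinusoidal source
s2 = np.sign(np.sin(3 * t))                  # square-wave source
S = np.c_[s1, s2] + 0.05 * rng.standard_normal((2000, 2))
A = np.array([[1.0, 0.5], [0.5, 2.0]])       # mixing matrix (plays the role of V)
X = S @ A.T                                  # observed mixtures x = V . s

pca_dirs = PCA(n_components=2).fit(X).components_            # max-variance directions
ica_sources = FastICA(n_components=2, random_state=0).fit_transform(X)
# PCA's directions are orthogonal axes of maximum variance; ICA instead
# recovers (up to scale and permutation) the independent sources s1, s2.
```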
Canonical Correlation Analysis (CCA)
Idea: learn a shared representation across datasets.
E.g. when several people think of "bottle", there should be similar representations in their brains, so we can use CCA to analyze the shared representation.
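A minimal sketch with scikit-learn's CCA, under the invented assumption that we have two views of the same 100 stimuli (e.g. recordings from two subjects) that share a low-dimensional latent signal:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Hypothetical data: two views of the same stimuli, driven by a shared latent signal.
rng = np.random.RandomState(0)
latent = rng.standard_normal((100, 3))       # the shared representation
view_a = latent @ rng.standard_normal((3, 50)) + 0.1 * rng.standard_normal((100, 50))
view_b = latent @ rng.standard_normal((3, 40)) + 0.1 * rng.standard_normal((100, 40))

cca = CCA(n_components=3)
cca.fit(view_a, view_b)
a_shared, b_shared = cca.transform(view_a, view_b)   # maximally correlated projections
# Corresponding columns of a_shared and b_shared are highly correlated:
# they act as the shared representation recovered from the two datasets.
```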
Matrix Factorization
Comparing the components found by matrix factorization (the NNSE components below) to PCA's, words with similar meanings get grouped together; a toy factorization sketch follows the component lists below.
- PCA components
- well, long, if, year, watch
- plan, engine, e, rock, very
- get, no, features, music, via
- features, by, links, free, down
- works, sound, video, building, section
- NNSE components
- inhibitor, inhibitors, antagonists, receptors, inhibition
- bristol, thames, southampton, brighton, poole
- delhi, india, bombay, chennai, madras
- pundits, forecasters, proponents, commentators, observers
- nosy, averse, leery, unsympathetic, snotty
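A toy sketch of the general idea, using scikit-learn's NMF on an invented word-by-context count matrix. This is only a stand-in for NNSE (which additionally enforces sparsity on the embeddings); the vocabulary and counts are made up for illustration:

```python
import numpy as np
from sklearn.decomposition import NMF

# Invented word-by-context count matrix (rows: words, columns: contexts).
words = ["delhi", "chennai", "bombay", "inhibitor", "receptor", "antagonist"]
counts = np.array([
    [5, 4, 0, 0],    # city-like contexts
    [4, 5, 0, 0],
    [5, 5, 1, 0],
    [0, 0, 6, 5],    # biochemistry-like contexts
    [0, 1, 5, 6],
    [0, 0, 6, 4],
], dtype=float)

# Factor counts ≈ W @ H with non-negative word loadings W and components H.
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(counts)
H = nmf.components_

# Read each component off as its top-loading words; with real co-occurrence
# data, these groups resemble the NNSE component lists above.
for k in range(2):
    top = np.argsort(W[:, k])[::-1][:3]
    print("component", k, ":", [words[i] for i in top])
```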
TBD: The Unified Loss Function of Logistic Regression, AdaBoost, and SVM
SO COOL!!! [to be added soon]