# Machine Learning 09: Learning Representations and Unifying Machine Learning

Course notes from Professor Tom Mitchell's Machine Learning course @ CMU, 2017

## Principal Component Analysis (PCA)

- Idea:
  - Given data points in d-dimensional space, project them into a lower-dimensional space while preserving as much information as possible
    - E.g., find the best planar approximation to 3D data
    - E.g., find the best planar approximation to 10^4-D data
  - In particular, choose an orthogonal projection that __minimizes the squared error in reconstructing the original data__
- Like auto-encoding neural networks, PCA learns a representation of the input data that can best **reconstruct** it:
  - The learned encoding is a linear function of the inputs
  - No local-minimum problems when training!
  - Given d-dimensional data X, PCA learns a d-dimensional representation, where
    - the dimensions are orthogonal
    - the top k dimensions are the k-dimensional linear re-representation that minimizes **reconstruction** error (sum of squared errors)

### PCA: Find Projections to Minimize Reconstruction Error

Assume the data is a set of N d-dimensional vectors, where the nth vector is `x^n = <x_1^n, x_2^n, ..., x_d^n>`.

We can represent these in terms of any d orthogonal basis vectors `u_1, ..., u_d`:

`x^n = SUM_{i=1}^d z_i^n u_i`

To compress, keep only the first M < d components, approximating each point as `x̂^n = SUM_{i=1}^M z_i^n u_i`, and choose the basis to minimize the reconstruction error `SUM_n ||x^n - x̂^n||^2`.

Note we get zero error if M = d, so all error is due to the missing components.

### PCA Algorithm 1

- X ← create N x d data matrix, with one row vector x^n per data point
- X ← subtract the mean x̄ from each row vector x^n in X
- Σ ← covariance matrix of X
- Find the eigenvectors and eigenvalues of Σ
- PCs ← the M eigenvectors with the largest eigenvalues
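The steps above can be sketched in numpy; a minimal version, with function and variable names chosen here for illustration:

```python
import numpy as np

def pca(X, M):
    """PCA via eigendecomposition of the covariance matrix.

    X : (N, d) data matrix, one row vector x^n per data point.
    M : number of principal components to keep.
    Returns (U, Z, X_hat): the top-M eigenvectors as columns of U (d, M),
    the projected coordinates Z (N, M), and the reconstruction X_hat (N, d).
    """
    mean = X.mean(axis=0)
    Xc = X - mean                            # subtract mean from each row
    cov = Xc.T @ Xc / X.shape[0]             # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:M]    # indices of the M largest
    U = eigvecs[:, order]                    # principal components
    Z = Xc @ U                               # coordinates z_i^n
    X_hat = Z @ U.T + mean                   # reconstruction from M components
    return U, Z, X_hat
```

As the notes say, taking M = d gives zero reconstruction error, which makes a handy sanity check.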

__PCA Example__

*What about high-dimensional data like images?* With 10^4 dimensions, the covariance matrix has 10^8 entries!

SVD can solve this.

### SVD
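A sketch of how SVD sidesteps the huge covariance matrix, assuming (as is standard) that the right singular vectors of the centered data are the eigenvectors of its covariance; the function name is illustrative:

```python
import numpy as np

def pca_svd(X, M):
    """PCA via thin SVD of the centered data matrix.

    Never forms the d x d covariance matrix: for centered X_c with shape
    (N, d), X_c = U S Vt, and the rows of Vt are the eigenvectors of
    X_c^T X_c, ordered by decreasing singular value.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # thin SVD
    components = Vt[:M]                                # top-M directions
    Z = Xc @ components.T                              # projected coordinates
    return components, Z
```

For N data points in 10^4 dimensions, the thin SVD works with the N x d matrix directly, so memory scales with the data rather than with d^2.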

## Independent Component Analysis (ICA)

Find a linear transformation `x = V · s` for which the coefficients s = (s1, s2, ..., sD)^T are statistically independent: `p(s1, s2, ..., sD) = p1(s1) p2(s2) ... pD(sD)`.

Algorithmically, we need to identify the matrix V and coefficients s such that, under the condition `x = V · s`, the mutual information between s1, s2, ..., sD is minimized.
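One common way to do this in practice is scikit-learn's FastICA, which optimizes non-Gaussianity, a criterion closely related to minimizing mutual information (an assumption here, not the exact algorithm from the lecture). A minimal sketch on two synthetic sources:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                       # independent source 1: sinusoid
s2 = np.sign(np.sin(3 * t))              # independent source 2: square wave
S = np.c_[s1, s2]                        # rows are samples of s

V = np.array([[1.0, 0.5],                # mixing matrix (illustrative)
              [0.5, 2.0]])
X = S @ V.T                              # observed mixtures: x = V · s

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)             # recovered sources (up to sign/scale/order)
```

ICA can only recover the sources up to permutation, sign, and scale, so any check has to match each recovered component to its closest true source.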

- PCA finds the directions of maximum variation
  - Practically: draw the ellipse of the data, then find the direction and magnitude of its axes.
- ICA finds the directions most "aligned" with the data

## Canonical Correlation Analysis (CCA)

Idea: learn a shared representation across datasets.

E.g., when several people are thinking "bottle", some representations in their brains should be similar, so we can use CCA to analyze the shared representations.

## Matrix Factorization

Comparing the results of this to PCA, the words with similar meanings are grouped together.

- PCA components
  - well, long, if, year, watch
  - plan, engine, e, rock, very
  - get, no, features, music, via
  - features, by, links, free, down
  - works, sound, video, building, section

- NNSE components
  - inhibitor, inhibitors, antagonists, receptors, inhibition
  - bristol, thames, southampton, brighton, poole
  - delhi, india, bombay, chennai, madras
  - pundits, forecasters, proponents, commentators, observers
  - nosy, averse, leery, unsympathetic, snotty
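The grouping effect can be sketched with a non-negative factorization. NNSE itself is non-negative *sparse* embedding; scikit-learn's `NMF` is a related but simpler non-negative factorization used here only to illustrate why non-negativity yields interpretable components. The toy word-context counts below are made up for illustration:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy word-context count matrix: rows are words, columns are contexts.
# Values and row labels are illustrative, not real corpus counts.
X = np.array([
    [5, 4, 0, 0],   # "inhibitor"   (biomedical contexts)
    [4, 5, 0, 0],   # "antagonist"  (biomedical contexts)
    [0, 0, 5, 4],   # "delhi"       (Indian-city contexts)
    [0, 0, 4, 5],   # "chennai"     (Indian-city contexts)
], dtype=float)

nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = nmf.fit_transform(X)    # word loadings on the two components
H = nmf.components_         # component-by-context matrix
```

Because W and H are non-negative, each word loads mostly on one component, so the rows of W cluster the biomedical words together and the city names together, mirroring the NNSE component lists above.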

## TBD: The Unified Loss Function of Logistic Regression, AdaBoost, and SVM

SO COOL!!! [to be added soon]