# Machine Learning 09: Learning Representations and Unifying Machine Learning

Course notes of Professor Tom Mitchell's Machine Learning course @ CMU, 2017

## Principal Components Analysis (PCA)

• Idea:
• Given data points in d-dimensional space, project into a lower-dimensional space while preserving as much information as possible
• E.g., find the best planar approximation to 3D data
• E.g., find the best planar approximation to 10^4-dimensional data
• In particular, choose an orthogonal projection that minimizes the squared error in reconstructing the original data (see the objective sketched after this list)
• Like auto-encoding neural networks, learn a representation of the input data that can best reconstruct it
• PCA:
• The learned encoding is a linear function of the inputs
• No local minimum problems when training!
• Given d-dimensional data X, learns a d-dimensional representation, where
• the dimensions are orthogonal
• the top k dimensions are the k-dimensional linear re-representation that minimizes reconstruction error (sum of squared errors)
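One way to write this objective out (my notation, consistent with the reconstruction-error section below: x̄ is the data mean, u_i the orthonormal directions, z_i^n the coordinates of point n):

```latex
% PCA objective: pick orthonormal directions u_1, ..., u_k that minimize
% the squared reconstruction error over the N data points x^1, ..., x^N
\min_{u_1, \dots, u_k} \; \sum_{n=1}^{N} \left\| x^n - \hat{x}^n \right\|^2,
\qquad
\hat{x}^n = \bar{x} + \sum_{i=1}^{k} z_i^n \, u_i,
\qquad
z_i^n = u_i^\top (x^n - \bar{x}),
\qquad
u_i^\top u_j = \delta_{ij}
```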

### PCA: Find Projections to Minimize Reconstruction Error

Assume the data is a set of d-dimensional vectors, where the nth vector is x^n = <x_1^n, x_2^n, ..., x_d^n>.

We can represent these in terms of any d orthogonal basis vectors u_1, ..., u_d:

x^n = SUM_{i=1}^d z_i^n u_i

To reduce dimensionality, keep only the top M < d components; the reconstruction error then comes entirely from the discarded components. Note we get zero error if M = d, so all error is due to missing components.

### PCA Algorithm 1

1. X ← create the N × d data matrix, with one row vector x^n per data point
2. X ← subtract the mean x̄ from each row vector x^n in X
3. Σ ← covariance matrix of X
4. Find the eigenvectors and eigenvalues of Σ
5. PCs ← the M eigenvectors with the largest eigenvalues
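A minimal numpy sketch of these five steps (the function name `pca`, the variable names, and the use of `np.linalg.eigh` are my own choices, not from the notes):

```python
import numpy as np

def pca(X, M):
    """Return the top-M principal components and the projected data.

    X : (N, d) data matrix, one row per data point
    M : number of components to keep
    """
    # Step 2: subtract the mean from each row
    x_bar = X.mean(axis=0)
    X_centered = X - x_bar

    # Step 3: covariance matrix of the centered data (d x d)
    cov = np.cov(X_centered, rowvar=False)

    # Step 4: eigenvectors and eigenvalues (eigh, since cov is symmetric)
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Step 5: keep the M eigenvectors with the largest eigenvalues
    order = np.argsort(eigvals)[::-1][:M]
    components = eigvecs[:, order]          # (d, M)

    # Coordinates z^n of each point in the principal subspace
    Z = X_centered @ components             # (N, M)
    return components, Z, x_bar

# Example usage on random data
X = np.random.randn(100, 5)
components, Z, x_bar = pca(X, M=2)
X_reconstructed = x_bar + Z @ components.T   # best rank-2 reconstruction
```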

### PCA Example

What if the data is very high-dimensional, like images? 10^4 dimensions → a 10^4 × 10^4 = 10^8-entry covariance matrix!

SVD can solve this.

### SVD

The singular value decomposition X = U S V^T of the (mean-centered) data matrix gives the principal directions directly as the columns of V, without explicitly forming the d × d covariance matrix.
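A sketch of PCA via SVD under the same assumptions (rows of X are data points; `np.linalg.svd` with `full_matrices=False` is my choice of routine):

```python
import numpy as np

def pca_svd(X, M):
    """PCA via SVD: avoids building the d x d covariance matrix."""
    x_bar = X.mean(axis=0)
    X_centered = X - x_bar

    # Economy-size SVD: X_centered = U @ np.diag(S) @ Vt
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

    components = Vt[:M].T            # top-M principal directions (d, M)
    Z = X_centered @ components      # projected coordinates (N, M)

    # Eigenvalues of the covariance matrix, recovered from singular values
    eigvals = (S ** 2) / (X.shape[0] - 1)
    return components, Z, eigvals[:M]
```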

## Independent Components Analysis (ICA)

Find a linear transformation x = V · s for which the coefficients s = (s1, s2, ..., sD)^T are statistically independent: p(s1, s2, ..., sD) = p1(s1) p2(s2) ... pD(sD).

Algorithmically, we need to identify the matrix V and the coefficients s such that, under the constraint x = V · s, the mutual information between s1, s2, ..., sD is minimized.

• PCA finds directions of maximum variation
• Practically, draw the ellipse around the data, then find the directions and magnitudes of its axes.
• ICA would find directions most "aligned" with data
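As an illustration (not from the notes), a minimal sketch using scikit-learn's FastICA to unmix two synthetic sources, with PCA fit alongside for contrast:

```python
import numpy as np
from sklearn.decomposition import FastICA, PCA

# Two independent non-Gaussian sources, linearly mixed
rng = np.random.RandomState(0)
t = np.linspace(0, 8, 2000)
s1 = np.sign(np.sin(3 * t))             # square wave
s2 = rng.laplace(size=t.shape)          # heavy-tailed noise
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5], [0.5, 2.0]])  # mixing matrix (plays the role of V)
X = S @ A.T                             # observed signals, x = V · s

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)            # estimated independent components

pca = PCA(n_components=2).fit(X)        # for comparison: directions of max variance
```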

## Canonical Correlation Analysis (CCA)

Idea: learning a shared representation across datasets.

E.g., when several people are thinking of "bottle", there should be similar representations in their brains, so we can use CCA to analyze the shared representations across subjects.
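A minimal sketch with scikit-learn's CCA (the data here is made up: X and Y stand for two subjects' recordings of the same shared latent "thoughts"):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.RandomState(0)
latent = rng.randn(200, 2)                                   # shared latent signal
X = latent @ rng.randn(2, 20) + 0.1 * rng.randn(200, 20)     # subject 1's view
Y = latent @ rng.randn(2, 30) + 0.1 * rng.randn(200, 30)     # subject 2's view

cca = CCA(n_components=2)
X_c, Y_c = cca.fit_transform(X, Y)       # maximally correlated projections
corr = [np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1] for i in range(2)]
print(corr)
```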

## Matrix Factorization

Comparing the components found by matrix factorization (here, non-negative sparse embedding, NNSE) with those found by PCA, the NNSE components group words with similar meanings:

• PCA components
• well, long, if, year, watch
• plan, engine, e, rock, very
• get, no, features, music, via
• features, by, links, free, down
• works, sound, video, building, section
• NNSE components
• inhibitor, inhibitors, antagonists, receptors, inhibition
• bristol, thames, southampton, brighton, poole
• delhi, india, bombay, chennai, madras
• pundits, forecasters, proponents, commentators, observers
• nosy, averse, leery, unsympathetic, snotty
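NNSE is based on a sparse, non-negative factorization of word statistics; as a rough sketch of the non-negative factorization idea (scikit-learn's NMF is a related but simpler method, and the word-feature matrix here is synthetic):

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy word-by-feature count matrix (rows: words, columns: context features)
rng = np.random.RandomState(0)
counts = rng.poisson(lam=1.0, size=(50, 200)).astype(float)
words = [f"word_{i}" for i in range(50)]

nmf = NMF(n_components=5, init="nndsvd", max_iter=500)
W = nmf.fit_transform(counts)     # non-negative word embeddings (50 x 5)
H = nmf.components_               # latent components over features (5 x 200)

# Show the top words for each component, analogous to the lists above
for k in range(W.shape[1]):
    top = np.argsort(W[:, k])[::-1][:5]
    print(f"component {k}:", [words[i] for i in top])
```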

## TBD: The Unified Loss Function of Logistic Regression, AdaBoost, and SVM

SO COOL!!! [to be added soon]
