Hubert Wang

Machine Learning 09: Learning Representations and Unify Machine Learning

Course notes for Professor Tom Mitchell's Machine Learning course @ CMU, 2017

Principal Component Analysis (PCA)

  • Idea:
    • Given data points in d-dimensional space, project into lower dimensional space while preserving as much information as possible
    • E.g. find best planar approximation to 3D data
    • E.g. find best planar approximation to 10^4 D data
    • In particular, choose an orthogonal projection that minimizes the squared error in reconstructing original data
  • Like auto-encoding neural networks, learn representation of input data that can best reconstruct it
  • PCA:
    • Learned encoding is linear function of inputs
    • No local minimum problems when training!
    • Given d-dimensional data X, learns d-dimensional representation, where
      • the dimensions are orthogonal
      • top k dimensions are the k-dimensional linear re-representation that minimizes reconstruction error (sum of squared errors)

PCA: Find Projections to Minimize Reconstruction Error

Assume the data is a set of d-dimensional vectors, where the nth vector is x^n = <x_1^n, x_2^n, ..., x_d^n>

We can represent these in terms of any d orthogonal vectors u_1 ... u_d: x^n = SUM_{i=1}^d z_i^n u_i

Suppose we keep only the first M ≤ d terms of this sum. Note we get zero error if M = d, so all error is due to the missing components u_{M+1} ... u_d.
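
The missing-components argument can be written out as follows. This is a sketch under two assumptions that the notes do not state explicitly: the data have already been mean-centered, and the u_i are orthonormal, so that z_i^n = u_i^T x^n. The symbol S (the scatter matrix, proportional to the covariance matrix) is introduced here for illustration.

```latex
\begin{aligned}
\hat{x}^n &= \sum_{i=1}^{M} z_i^n u_i, \qquad z_i^n = u_i^\top x^n \\
E_M &= \sum_{n=1}^{N} \lVert x^n - \hat{x}^n \rVert^2
     = \sum_{n=1}^{N} \sum_{i=M+1}^{d} \left( z_i^n \right)^2
     = \sum_{i=M+1}^{d} u_i^\top S \, u_i,
\qquad S = \sum_{n=1}^{N} x^n (x^n)^\top
\end{aligned}
```

If the u_i are chosen as eigenvectors of S, each term u_i^T S u_i is the corresponding eigenvalue, so E_M is the sum of the eigenvalues of the discarded directions. That is why keeping the eigenvectors with the largest eigenvalues minimizes the error.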

PCA Algorithm 1

  1. X ← Create N x d data matrix, with one row vector x^n per data point
  2. X ← subtract the mean vector x̄ from each row vector x^n in X
  3. Σ ← covariance matrix of X
  4. Find eigenvectors and eigenvalues of Σ
  5. PC’s ← the M eigenvectors with largest eigenvalues
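
The algorithm above can be sketched in NumPy roughly as follows. This is a minimal illustration, not code from the course: the function name `pca_eig`, the toy data, and the choice of the unbiased (N - 1) covariance estimate are all assumptions made for this example.

```python
import numpy as np

def pca_eig(X, M):
    """Top-M principal components of X (N x d, one data point per row)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                            # subtract the mean from each row
    cov = Xc.T @ Xc / (X.shape[0] - 1)       # d x d sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: cov is symmetric
    order = np.argsort(eigvals)[::-1][:M]    # indices of the M largest eigenvalues
    return mean, eigvecs[:, order], eigvals[order]

# Toy data (made up): 3-D points that lie close to a 2-D plane.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
A = np.array([[2.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
X = latent @ A + 0.01 * rng.normal(size=(200, 3))

mean, U, lam = pca_eig(X, M=2)
Z = (X - mean) @ U              # encode: coordinates along the components
X_hat = mean + Z @ U.T          # decode: reconstruct from M components
mse = np.mean((X - X_hat) ** 2) # small, since the data are nearly planar
```

The columns of `U` are orthonormal, so encoding and decoding are just matrix products with `U` and its transpose.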

PCA Example

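What if the data are high-dimensional, like images? With 10^4 dimensions, the covariance matrix has 10^4 x 10^4 = 10^8 entries!

SVD can solve this: take the singular value decomposition of the mean-centered data matrix X directly, without ever forming Σ. The right singular vectors of X are the principal components.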
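
A rough NumPy sketch of the SVD route, assuming far fewer data points than dimensions. The function name `pca_svd` and the random data are made up for illustration; the d x d covariance matrix is never built.

```python
import numpy as np

def pca_svd(X, M):
    """Top-M principal components of X (N x d) via a thin SVD of the centered data."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean
    # Thin SVD: Xc = U_ @ diag(S) @ Vt. Rows of Vt are eigenvectors of Xc.T @ Xc,
    # and singular values come back sorted from largest to smallest.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = S[:M] ** 2 / (X.shape[0] - 1)   # matching covariance eigenvalues
    return mean, Vt[:M].T, eigvals

# Made-up data: N = 20 points in d = 1000 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 1000))
mean, U, lam = pca_svd(X, M=5)

# Sanity check against the small N x N Gram matrix, which shares the
# nonzero eigenvalues of the (much larger) covariance matrix.
Xc = X - mean
gram_eigs = np.linalg.eigvalsh(Xc @ Xc.T / (X.shape[0] - 1))[::-1][:5]
```

Here the thin SVD works with a 20 x 1000 matrix, whereas the covariance route would need a 1000 x 1000 matrix.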


Independent Components Analysis (ICA)

Find a linear transformation x = V · s, for which coefficients s = (s1, s2, …, sD)^T are statistically independent: p(s1, s2, ..., sD) = p1(s1) p2(s2) ... pD(sD).

Algorithmically, we need to identify the matrix V and coefficients s such that, under the condition x = V · s, the mutual information between s1, s2, …, sD is minimized.

  • PCA finds directions of maximum variation
    • Intuitively, draw an ellipse around the data; the principal components are the axes of the ellipse, each with a direction and a magnitude (the variance along it).
  • ICA would find directions most "aligned" with data
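
To make the contrast concrete, here is a toy two-dimensional ICA sketch. It does not minimize mutual information directly. Instead it assumes a common stand-in: whiten the data, then search for the rotation whose outputs are least Gaussian, measured by excess kurtosis. The function names, the mixing matrix `V`, and the uniform sources are all made up for this example.

```python
import numpy as np

def whiten(X):
    """Center X and rescale it so its sample covariance is the identity."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    return (Xc @ eigvecs) / np.sqrt(eigvals)

def excess_kurtosis(y):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    y = (y - y.mean()) / y.std()
    return np.mean(y ** 4) - 3.0

def ica_2d(X, n_angles=180):
    """Toy 2-D ICA: whiten, then grid-search the least-Gaussian rotation."""
    Z = whiten(X)
    best_score, best_Y = -np.inf, Z
    # After whitening, the sources are a rotation away (up to order and sign),
    # so angles in [0, pi/2) are enough.
    for theta in np.linspace(0.0, np.pi / 2, n_angles, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        Y = Z @ np.array([[c, -s], [s, c]])
        score = abs(excess_kurtosis(Y[:, 0])) + abs(excess_kurtosis(Y[:, 1]))
        if score > best_score:
            best_score, best_Y = score, Y
    return best_Y

# Made-up example: two independent uniform sources, mixed as x = V s.
rng = np.random.default_rng(0)
S = rng.uniform(-1.0, 1.0, size=(5000, 2))
V = np.array([[1.0, 0.5], [0.3, 1.0]])
X = S @ V.T                      # each row is x = V s for one sample
Y = ica_2d(X)

# Absolute correlation between recovered components (rows) and true sources (columns).
cross = np.abs(np.corrcoef(Y.T, S.T)[:2, 2:])
```

The recovered components match the true sources only up to order, sign, and scale, which is why the check uses absolute correlations. PCA on the same data would return the axes of the covariance ellipse, which generally do not line up with the columns of `V`.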

Canonical Correlation Analysis (CCA)

Idea: learn a shared representation across datasets.

E.g. when several people are thinking "bottle", some of the representations in their brains should be similar. So we can use CCA to analyze the shared representations.
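
A small NumPy sketch of linear CCA on two views that share one hidden signal. It assumes the standard formulation (whiten each view, then take the SVD of the whitened cross-covariance), which may differ from how the course presents it. The function names, the regularizer `reg`, and the synthetic data are hypothetical.

```python
import numpy as np

def inv_sqrt(C):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, Q = np.linalg.eigh(C)
    return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

def cca(X, Y, reg=1e-8):
    """Return projection matrices A, B and the canonical correlations rho."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])  # small ridge for stability
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, rho, Vt = np.linalg.svd(Kx @ Cxy @ Ky, full_matrices=False)
    return Kx @ U, Ky @ Vt.T, rho

# Made-up data: two "subjects" observe the same latent signal z through
# different linear maps, plus independent noise.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = z @ np.array([[1.0, -0.5, 0.8]]) + 0.1 * rng.normal(size=(500, 3))
Y = z @ np.array([[0.7, 0.2, -1.0, 0.4]]) + 0.1 * rng.normal(size=(500, 4))

A, B, rho = cca(X, Y)
x_proj = (X - X.mean(axis=0)) @ A[:, 0]   # first canonical variate of X
y_proj = (Y - Y.mean(axis=0)) @ B[:, 0]   # first canonical variate of Y
corr = np.corrcoef(x_proj, y_proj)[0, 1]
```

The first pair of projections picks out the shared signal, so their correlation is high even though the two views have different dimensions and different mixing.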

Matrix Factorization

Compared with PCA, the components learned by NNSE (Non-Negative Sparse Embedding) group words with similar meanings. The top words of a few components from each method:

  • PCA components
    • well, long, if, year, watch
    • plan, engine, e, rock, very
    • get, no, features, music, via
    • features, by, links, free, down
    • works, sound, video, building, section
  • NNSE components
    • inhibitor, inhibitors, antagonists, receptors, inhibition
    • bristol, thames, southampton, brighton, poole
    • delhi, india, bombay, chennai, madras
    • pundits, forecasters, proponents, commentators, observers
    • nosy, averse, leery, unsympathetic, snotty

TBD: The Unified Loss Function of Logistic Regression, AdaBoost, and SVM

SO COOL!!! [to be added soon]
