Machine Learning 04: DNN, CNN, RNN, and Representation Learning

Why not Logistic Regression?
Multilayer Networks of Sigmoid Units
Example: Learn probabilistic XOR
A Convolutional Neural Net for Handwritten Digit recognition: LeNet
Convolution layer, Maxpooling layer
Dealing with Overfitting by Validation Set
Batch Normalization
Learning Hidden Layer Representations
- Learnt Hidden Unit Weights in Face Recognition
Learning Distributed Representations for Words
What is cross entropy?
Recurrent Neural Networks
- Gradient Clipping: Managing exploding gradients
- Bi-directional Recurrent Neural Networks
  - Deep Bidirectional Recurrent Network
Long Short Term Memory (LSTM) Unit
Gated Recurrent Units (GRUs)
Programming Frameworks for Deep Nets

Course Notes of Professor Tom Mitchell Machine Learning Course @ CMU, 2017

Why not Logistic Regression?

We like logistic regression, but:

how would it perform when trying to learn P(image contains Hillary Clinton | pixel values X1, X2 ... X10,000)?
what Xi image features to use? edges? color blotches? generic face? subwindows? lighting invariant properties? position independent? SIFT features?

Deep nets: learn the features automatically!

Multilayer Networks of Sigmoid Units

E.g. 1 Speech Recognition

This is a multilayer network of sigmoid units for speech recognition. The last layer uses a one-hot encoding, with one output unit per category, to classify the input. On the right side of the figure, the decision surface is highly nonlinear, unlike the linear decision surface of logistic regression.

E.g. 2 ALVINN self-driving car

4 hidden units in the hidden layer
Fully connected: each output unit connects to every hidden unit
One-hot encoding of the 30 output units in the training data

Sigmoid Unit

Rectified Linear Unit (ReLU)

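Compared with the sigmoid unit, the only difference is the activation function *f*: ReLU replaces the sigmoid with a thresholded linear output, f(z) = max(0, z). Note that a single ReLU unit still gives a linear decision boundary; nonlinear decision surfaces arise only when units are combined in layers.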
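As a minimal sketch of the two units, the code below applies each activation to the same net input w·x + b. The helper name `unit` and the example weights are made up for illustration and do not come from the lecture.

```python
import numpy as np

def sigmoid(z):
    # Squashes the net input into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Passes positive net input through unchanged, outputs 0 otherwise.
    return np.maximum(0.0, z)

def unit(x, w, b, f):
    """One unit: activation f applied to the net input w.x + b."""
    return f(np.dot(w, x) + b)

# Hypothetical input and weights; the net input is 0.5 - 0.5 + 0.1 = 0.1.
x = np.array([1.0, -2.0])
w = np.array([0.5, 0.25])
b = 0.1

sigmoid_out = unit(x, w, b, sigmoid)
relu_out = unit(x, w, b, relu)
```

Both units compute the same net input; only *f* differs.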

Many types of parameterized units

Sigmoid units
ReLU
Leaky ReLU (fixed non-zero slope for input<0)
Parametric ReLU (trainable slope)
Max Pool
Inner Product
GRUs
LSTMs
Matrix multiply
.... no end in sight
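To make the two ReLU variants in the list concrete, here is a small sketch. The default slope of 0.01 is an assumed, commonly used value, and `prelu_grad_slope` is a hypothetical helper name; for Parametric ReLU the slope becomes a trained parameter, so its gradient is needed as well.

```python
import numpy as np

def leaky_relu(z, slope=0.01):
    # Identity for z > 0, a small fixed slope for z <= 0.
    return np.where(z > 0, z, slope * z)

def prelu_grad_slope(z):
    # Derivative of the output with respect to the slope parameter:
    # z where the negative branch is active, 0 elsewhere.
    return np.where(z > 0, 0.0, z)

z = np.array([-2.0, 3.0])
out = leaky_relu(z, slope=0.1)      # negative input is scaled, positive kept
grad = prelu_grad_slope(z)
```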

Training Deep Nets

Choose loss function J(θ) to optimize

+ sum of squared errors for continuous y: Σ (y – h(x; θ))^2
+ maximize conditional likelihood: Σ log P(y|x; θ)
+ MAP estimate: log P(θ) + Σ log P(y|x; θ)
+ 0/1 loss, the number of classification errors: Σ δ(y ≠ h(x; θ)). Not a good choice because it is not smooth
+ ...
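The loss functions listed above can be sketched in a few lines for a binary classifier. The function names and the three example predictions are illustrative assumptions; `p` stands for the model's P(Y=1 | x; θ).

```python
import numpy as np

def sum_squared_error(y, y_hat):
    return np.sum((y - y_hat) ** 2)

def negative_log_likelihood(y, p):
    # Minimizing this is the same as maximizing the conditional likelihood.
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def zero_one_loss(y, p, threshold=0.5):
    # Counts misclassified examples; piecewise constant in p, so its
    # gradient is zero almost everywhere.
    return int(np.sum((p >= threshold).astype(int) != y))

y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.4])

sse = sum_squared_error(y, p)
nll = negative_log_likelihood(y, p)
errors = zero_one_loss(y, p)
```

Nudging `p` slightly changes `sse` and `nll` but usually leaves `errors` unchanged, which is why the 0/1 loss gives gradient descent nothing to follow.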

Design network architecture

+ Network of layers (ReLU’s, sigmoid, convolutions, ...)
+ Widths of layers
+ Fully or partly interconnected
+ ...

Training algorithm

+ Derive gradient formulas
+ Choose gradient descent method, including stopping condition
+ Experiment with alternative architectures
+ Dropout

Example: Learn probabilistic XOR

Given boolean Y, X1, X2 learn P(Y|X1,X2), where
- P(Y=0 | X1 = X2) = 0.9
- P(Y=1 | X1 ≠ X2) = 0.9
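Can we learn this with logistic regression?
- No, the problem is not linearly separable.
- Plot the four input points on x1, x2 axes to see this: no single line separates the X1 = X2 points from the X1 ≠ X2 points.
What can we do?
- Add a hidden layer that takes x1 and x2 as inputs and feeds its outputs to the output unit for y.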
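To see that one hidden layer is enough, the sketch below sets the weights of a 2-2-1 sigmoid network by hand instead of training it. The weights are an illustrative choice, not the lecture's solution: one hidden unit approximates OR, the other approximates AND, and the output unit maps their difference to log-odds of ±log 9, which corresponds to probabilities 0.9 and 0.1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prob_xor_net(x1, x2):
    """Returns P(Y=1 | x1, x2) for a hand-set 2-2-1 sigmoid network."""
    s = x1 + x2
    h_or = sigmoid(20.0 * s - 10.0)    # close to 1 if at least one input is 1
    h_and = sigmoid(20.0 * s - 30.0)   # close to 1 only if both inputs are 1
    logit = np.log(9.0)                # log-odds of probability 0.9
    # h_or - h_and is close to 1 exactly when x1 != x2.
    return sigmoid(2.0 * logit * (h_or - h_and) - logit)

table = {(a, b): prob_xor_net(a, b) for a in (0, 1) for b in (0, 1)}
```

The entries of `table` are close to 0.9 when the inputs differ and close to 0.1 when they agree, matching the target distribution.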

Feedforward Process

Gradient Calculation with Chain Rule

Loss function to be minimized: negative log likelihood

J(θ) = E[−log P(Y=y | X=x; θ)]

use chain rule:

In which:

Back Propagation Process
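The lecture's own derivation is not reproduced in these notes, so the following is only a sketch under assumed notation: a network with one hidden layer of sigmoid units (weights `W1`, `b1`), one sigmoid output unit (weights `w2`, `b2`), and the negative log likelihood loss for a single example. The chain rule is applied from the output back to the hidden layer, and one gradient entry is checked against a finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    W1, b1, w2, b2 = params
    h = sigmoid(W1 @ x + b1)     # hidden activations
    p = sigmoid(w2 @ h + b2)     # P(Y=1 | x)
    return h, p

def loss(params, x, y):
    _, p = forward(params, x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def backprop(params, x, y):
    W1, b1, w2, b2 = params
    h, p = forward(params, x)
    d_out = p - y                      # dJ/d(net input of the output unit)
    d_hid = d_out * w2 * h * (1 - h)   # chain rule through the hidden sigmoids
    # Gradients in the same order as params: W1, b1, w2, b2.
    return np.outer(d_hid, x), d_hid, d_out * h, d_out

rng = np.random.default_rng(0)
params = [rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=3), 0.1]
x, y = np.array([1.0, 0.0]), 1.0
gW1, gb1, gw2, gb2 = backprop(params, x, y)

# Finite-difference check on a single weight, W1[0, 0].
eps = 1e-6
bumped = [q.copy() if isinstance(q, np.ndarray) else q for q in params]
bumped[0][0, 0] += eps
numeric = (loss(bumped, x, y) - loss(params, x, y)) / eps
```

If the analytic and numeric values agree to several decimal places, the chain-rule expressions are very likely correct for that parameter.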

A Convolutional Neural Net for Handwritten Digit recognition: LeNet

Convolution layer, Maxpooling layer

A detailed guide of forward propagation and back propagation.
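A minimal NumPy sketch of the forward pass of the two layer types follows, assuming a single-channel image, stride 1, no padding, and non-overlapping pooling windows. LeNet itself uses several feature maps per layer, which this sketch omits.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding).
    Like most CNN code, this is cross-correlation: the kernel is not flipped."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(fmap, size=2):
    """Non-overlapping size x size max pooling; ragged edges are dropped."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    trimmed = fmap[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])   # responds to diagonal change
fmap = conv2d_valid(image, kernel)             # shape (4, 4)
pooled = max_pool2d(fmap)                      # shape (2, 2)
```

The same kernel weights are reused at every image position, which is what keeps the parameter count of a convolution layer small.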

Dealing with Overfitting by Validation Set

Our learning algorithm involves a parameter n=number of gradient descent iterations

How do we choose n to optimize future error or loss?

Separate available data into training and validation sets
Use the training set to perform gradient descent
n ← number of iterations that minimizes validation set error

This gives an unbiased estimate of the optimal n. The validation error at that n is still an optimistically biased estimate of the true error, because n was chosen to minimize it.
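The procedure can be sketched on a toy problem. Everything specific here is assumed for illustration: synthetic linear-regression data, a 30/30 train/validation split, a fixed learning rate, and a cap of 500 iterations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + rng.normal(scale=1.0, size=60)

# Separate available data into training and validation sets.
X_train, y_train = X[:30], y[:30]
X_val, y_val = X[30:], y[30:]

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

w = np.zeros(20)
lr, max_iters = 0.01, 500
val_errors = []
for n in range(1, max_iters + 1):
    # Gradient descent uses the training set only.
    grad = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= lr * grad
    # The validation set is only used to score each iteration count.
    val_errors.append(mse(w, X_val, y_val))

best_n = int(np.argmin(val_errors)) + 1   # iteration count with lowest validation error
```

In practice one would also keep a copy of the weights at `best_n`, and report the final error on a third, untouched test set.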

Batch Normalization

Key idea: add batch normalization (BN) layers to the network. For each minibatch, a BN layer normalizes each feature to mean 0 and variance 1, then applies a trainable scaling γ and offset β.
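A sketch of the training-time forward pass of such a layer is shown below. The small constant `eps` is an assumed numerical-stability term, and the running statistics used at inference time are left out.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x has shape (batch_size, num_features)."""
    mean = x.mean(axis=0)                    # per-feature minibatch mean
    var = x.var(axis=0)                      # per-feature minibatch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # mean 0, variance about 1
    return gamma * x_hat + beta              # trainable scale and offset

x = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
out = batch_norm_forward(x, gamma=np.ones(2), beta=np.zeros(2))
```

With γ = 1 and β = 0 both columns end up on the same scale even though the raw features differ by a factor of 10.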

[Figure: impact of Batch Normalization on MNIST net. Source: Quora]

Batch normalization potentially helps in two ways: faster learning and higher overall accuracy. The improved method also allows you to use a higher learning rate, potentially providing another boost in speed. A detailed intro of BN.

Learning Hidden Layer Representations

Note that the hidden layer acts like a 3-bit code for the 8 inputs: 1-0-0, 0-0-1, 0-1-0, 1-1-1, 0-0-0, 0-1-1, 1-0-1, 1-1-0.

What happens if we have 9 inputs?

Note that hidden unit values are real-valued, not boolean, so the hidden layer has more representational power than a 3-bit code.

Learnt Hidden Unit Weights in Face Recognition

Learning pose from face pictures.

Learning Distributed Representations for Words

also called “word embeddings”
word2vec is one commonly used embedding
based on the skip-gram model

Key idea: given a word sequence w_1 w_2 … w_T, train a network to predict surrounding words. For each word w_t, predict w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}.

e.g., “the dog jumped over the fence in order to get to ...” and “the cat jumped off the window ledge in order to ...”
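The training examples implied by this idea can be generated as (center word, context word) pairs. The sketch below assumes a window of 2 on each side, as in the description above; `skipgram_pairs` is a hypothetical helper name, and the network that consumes the pairs is not shown.

```python
def skipgram_pairs(words, window=2):
    """Return (center, context) pairs for every word and each neighbor
    within `window` positions on either side."""
    pairs = []
    for t, center in enumerate(words):
        for offset in range(-window, window + 1):
            c = t + offset
            if offset != 0 and 0 <= c < len(words):
                pairs.append((center, words[c]))
    return pairs

sentence = "the dog jumped over the fence".split()
pairs = skipgram_pairs(sentence)
jumped_pairs = [pair for pair in pairs if pair[0] == "jumped"]
```

Words that occur in similar contexts, such as "dog" and "cat" in the two example sentences, produce similar pairs and are therefore pushed toward similar embeddings.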

Word2Vec Word Embeddings

Skip-gram Word Embeddings

Learning Representations for Words and Relations

NELL (Never Ending Language Learner) is learning to read the web, building a large knowledge graph (Yang & Mitchell, 2017).

What is cross entropy?

Our negative log likelihood loss is also called cross entropy. Why?
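The lecture's answer is not recorded in these notes. One standard explanation, offered here as a sketch, is that the cross entropy H(p, q) = −Σ_k p_k log q_k between the one-hot "true" distribution p for a labeled example and the network's predicted distribution q reduces to −log q_y, which is exactly that example's negative log likelihood. The probabilities below are made up.

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log q_k; terms with p_k = 0 contribute 0."""
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

q = np.array([0.7, 0.2, 0.1])   # hypothetical predicted class probabilities
label = 0                       # observed class
p = np.eye(3)[label]            # one-hot distribution for the observed label

ce = cross_entropy(p, q)
nll = -np.log(q[label])         # negative log likelihood of the same example
```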

Recurrent Neural Networks

Many tasks involve sequential data
- predict stock price at time t+1 based on prices at t, t-1, t-2, …
- translate sentences (word sequences) from Spanish to English
- transcribe speech (sound sequences) to text (word sequences)
Key idea: recurrent network uses (part of) its state at t as input for t+1

Note that in the diagram, the weight matrix W is shared across all time steps (parameter sharing). When back propagation is used to compute the gradient, the same factor is multiplied in at every step, which causes vanishing and/or exploding gradients (e.g., 0.9 × 0.9 × 0.9 × … × 0.9 ≈ 0).
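The sketch below shows parameter sharing and the repeated-product effect. Only W is named in the notes; the input weights `U`, the bias `b`, the tanh nonlinearity, and all sizes are assumptions made for illustration.

```python
import numpy as np

def rnn_forward(xs, W, U, b, h0):
    """Apply the same W, U, b at every time step."""
    h, states = h0, []
    for x in xs:
        h = np.tanh(W @ h + U @ x + b)   # state at t feeds the step at t+1
        states.append(h)
    return states

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))             # 5 time steps, 3 features each
W = rng.normal(scale=0.5, size=(4, 4))
U = rng.normal(scale=0.5, size=(4, 3))
states = rnn_forward(xs, W, U, np.zeros(4), np.zeros(4))

# A factor slightly below or above 1, repeated over 100 steps:
shrink = 0.9 ** 100    # far below 1: the gradient vanishes
grow = 1.1 ** 100      # far above 1: the gradient explodes
```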

Gradient Clipping: Managing exploding gradients

Simply clip the gradient when it explodes.
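The notes do not say which form of clipping was used in lecture. One common variant, sketched here as an assumption, rescales the whole gradient vector whenever its norm exceeds a threshold, so the update keeps its direction but not its size.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad to have norm max_norm if it is larger; otherwise return it unchanged."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])            # norm 50
clipped = clip_by_norm(g, 5.0)        # same direction, norm 5
small = clip_by_norm(np.array([0.3, 0.4]), 5.0)   # below threshold, untouched
```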

Bi-directional Recurrent Neural Networks

Key idea: processing of word at position t can depend on following words too, not just preceding words.

Each hidden unit h takes two inputs: the current input x and the previous hidden unit (representing the previous time step).
In the bi-directional RNN, a layer g is added that takes input from the "future" (the output of the next g unit).
The output o is determined by both h and g, so it contains both "past" and "future" information.
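The h, g, and o layers described above can be sketched as two independent recurrences over the sequence, one in each direction. The weight names, the tanh nonlinearity, the linear output, and all sizes are assumptions for illustration.

```python
import numpy as np

def birnn_forward(xs, Wh, Uh, Wg, Ug, Vh, Vg):
    T, n = len(xs), Wh.shape[0]
    h = np.zeros((T, n))
    g = np.zeros((T, n))
    prev = np.zeros(n)
    for t in range(T):                      # h: left to right, carries the past
        prev = np.tanh(Wh @ prev + Uh @ xs[t])
        h[t] = prev
    nxt = np.zeros(n)
    for t in reversed(range(T)):            # g: right to left, carries the future
        nxt = np.tanh(Wg @ nxt + Ug @ xs[t])
        g[t] = nxt
    return h @ Vh.T + g @ Vg.T              # o[t] combines h[t] and g[t]

rng = np.random.default_rng(0)
T, d, n, k = 4, 2, 3, 2                     # steps, input, hidden, output sizes
xs = rng.normal(size=(T, d))
Wh, Wg = rng.normal(scale=0.5, size=(2, n, n))
Uh, Ug = rng.normal(scale=0.5, size=(2, n, d))
Vh, Vg = rng.normal(scale=0.5, size=(2, k, n))
o = birnn_forward(xs, Wh, Uh, Wg, Ug, Vh, Vg)

# Changing only the last input also changes the first output, via g.
xs_changed = xs.copy()
xs_changed[-1] += 1.0
o_changed = birnn_forward(xs_changed, Wh, Uh, Wg, Ug, Vh, Vg)
```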

Deep Bidirectional Recurrent Network

In [Irsoy & Cardie, 2014], a deep bidirectional recurrent network performs better on opinion mining tasks than a shallow RNN.

Long Short Term Memory (LSTM) Unit

In the plain RNN above, gradients vanish after many steps (e.g., 100 steps later), so the network cannot make use of information from long ago. The LSTM addresses this problem by adding several gates to the hidden layer units.
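A single LSTM step in the widely used formulation with forget, input, and output gates can be sketched as follows. The notes do not specify which variant the lecture used, so the gate layout and the packing of all weights into one matrix `W` are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One time step. W has shape (4n, n + d) and b has shape (4n,),
    where n is the hidden size and d is the input size."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[:n])                # forget gate: how much old cell state to keep
    i = sigmoid(z[n:2 * n])           # input gate: how much new content to write
    o = sigmoid(z[2 * n:3 * n])       # output gate: how much cell state to expose
    c_tilde = np.tanh(z[3 * n:])      # candidate cell content
    c = f * c_prev + i * c_tilde      # additive cell-state update
    h = o * np.tanh(c)
    return h, c

n, d = 3, 2
x = np.array([0.5, -1.0])
h_prev = np.zeros(n)
c_prev = np.array([0.3, -0.7, 1.2])

# With the forget gate saturated open and the input gate closed,
# the cell state is carried forward unchanged.
W = np.zeros((4 * n, n + d))
b = np.concatenate([50.0 * np.ones(n), -50.0 * np.ones(n), np.zeros(2 * n)])
h, c = lstm_step(x, h_prev, c_prev, W, b)
```

Because the cell state is updated additively, an open forget gate lets information and gradient flow across many steps without the repeated shrinking seen in the plain RNN.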