# Hubert Wang

# Machine Learning 04: DNN, CNN, RNN, and Representation Learning

Course Notes of Professor Tom Mitchell Machine Learning Course @ CMU, 2017

## Why not Logistic Regression?

We like logistic regression, but:

1. How would it perform when trying to learn P(image contains Hillary Clinton | pixel values X1, X2 ... X10,000)?
2. Which image features Xi should we use: edges? color blotches? a generic face? subwindows? lighting-invariant properties? position-independent features? SIFT features?

Deep nets: learn the features automatically!

## Multilayer Networks of Sigmoid Units

### E.g. 1 Speech Recognition

This is a multilayer network of sigmoid units used for speech recognition. The last layer uses a 1-hot vector encoding to classify the input into one category. The resulting decision surface is very complicated, unlike the linear decision surface of logistic regression.

### E.g. 2 ALVINN self-driving car

• 4 hidden units in the hidden layer
• fully connected, output units connecting to all hidden layer units
• One-hot encoding of the 30 output units in the training data

### Rectified Linear Unit (ReLU)

Compared with the sigmoid unit, the only difference is the activation function *f*: ReLU replaces the sigmoid with a thresholded linear output, f(z) = max(0, z). Note that a single ReLU unit still gives a linear decision boundary!

### Many types of parameterized units

• Sigmoid units
• ReLU
• Leaky ReLU (fixed non-zero slope for input<0)
• Parametric ReLU (trainable slope)
• Max Pool
• Inner Product
• GRU’s
• LSTM’s
• Matrix multiply
• .... no end in sight
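
The first four unit types in the list differ only in their activation function. A minimal sketch in plain Python follows; the function names, the default slope of 0.01, and the value `alpha=0.25` are illustrative assumptions, not values from the course.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: squashes z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified linear unit: passes positive inputs, outputs 0 otherwise."""
    return max(0.0, z)

def leaky_relu(z, slope=0.01):
    """Leaky ReLU: fixed non-zero slope for z < 0 (0.01 is an assumed default)."""
    return z if z > 0 else slope * z

def parametric_relu(z, alpha):
    """Parametric ReLU: same form as leaky ReLU, but alpha is a trainable parameter."""
    return z if z > 0 else alpha * z

# Compare the four activations on a negative, zero, and positive input.
for z in (-2.0, 0.0, 3.0):
    print(z, sigmoid(z), relu(z), leaky_relu(z), parametric_relu(z, alpha=0.25))
```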

### Training Deep Nets

1. Choose a loss function J(θ) to optimize

   + sum of squared errors for continuous y: Σ (y – h(x; θ))^2
   + maximize conditional likelihood: Σ log P(y|x; θ)
   + MAP estimate: Σ log P(y|x; θ) + log P(θ)
   + 0/1 loss, the sum of classification errors: Σ δ(y ≠ h(x; θ)). Not a good choice because it is not smooth
   + ...

2. Design the network architecture

   + Network of layers (ReLU’s, sigmoid, convolutions, ...)
   + Widths of layers
   + Fully or partly interconnected
   + ...

3. Choose the training algorithm

   + Derive gradient formulas
   + Choose gradient descent method, including stopping condition
   + Experiment with alternative architectures
   + Dropout
## Example: Learn probabilistic XOR

• Given boolean Y, X1, X2, learn P(Y|X1,X2), where
  • P(Y=0 | X1 = X2) = 0.9
  • P(Y=1 | X1 ≠ X2) = 0.9
• Can we learn this with logistic regression?
  • No, it's not a linear problem. Plotting the four input points on x1, x2 axes shows that no single line separates the X1 = X2 points from the X1 ≠ X2 points.
• What can we do?
  • Add a hidden layer that takes x1 and x2 as inputs and feeds its outputs to the output unit y.
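
To see that one hidden layer is enough to represent this distribution, here is a minimal sketch of a 2-2-1 sigmoid network. The weights are set by hand for illustration (they are not learned and not taken from the course): one hidden unit approximates OR, the other approximates AND, and the output logit is chosen so that the sigmoid returns about 0.9 or 0.1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_y1(x1, x2):
    """P(Y=1 | x1, x2) from a hand-set 2-2-1 sigmoid network (illustrative weights)."""
    h_or = sigmoid(20 * (x1 + x2) - 10)    # close to 1 if at least one input is 1
    h_and = sigmoid(20 * (x1 + x2) - 30)   # close to 1 only if both inputs are 1
    # h_or - h_and is close to 1 exactly when x1 != x2.
    # sigmoid(ln 9) = 0.9 and sigmoid(-ln 9) = 0.1, so scale the logit accordingly.
    logit = -math.log(9) + 2 * math.log(9) * (h_or - h_and)
    return sigmoid(logit)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(p_y1(x1, x2), 3))
```

In practice these weights would be found by gradient descent rather than set by hand; the point here is only that such weights exist.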

### Gradient Calculation with Chain Rule

Loss function to be minimized: negative log likelihood

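J(θ) = Σ −log P(Y=y | X=x; θ)

The gradient of J with respect to each weight is obtained with the chain rule, working backward from the output unit to the hidden units.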
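
As a sketch of how the chain rule plays out, the code below assumes a network with two inputs, two sigmoid hidden units, and one sigmoid output unit, and computes the gradient of the negative log likelihood for a single example. The parameter values and the names `forward`, `loss`, and `gradients` are hypothetical; the analytic gradient is compared against a finite-difference estimate as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Return hidden activations h and output probability p = P(Y=1 | x)."""
    W, b, v, c = params
    h = [sigmoid(W[j][0] * x[0] + W[j][1] * x[1] + b[j]) for j in range(2)]
    p = sigmoid(v[0] * h[0] + v[1] * h[1] + c)
    return h, p

def loss(params, x, y):
    """Negative log likelihood of the observed label y for one example."""
    _, p = forward(params, x)
    return -math.log(p if y == 1 else 1.0 - p)

def gradients(params, x, y):
    """Chain rule: output pre-activation -> output weights -> hidden units -> input weights."""
    W, b, v, c = params
    h, p = forward(params, x)
    d_out = p - y                                   # dJ/d(output pre-activation)
    grad_v = [d_out * h[j] for j in range(2)]       # dJ/dv_j
    grad_c = d_out                                  # dJ/dc
    # dJ/d(hidden pre-activation j) = d_out * v_j * h_j * (1 - h_j)
    d_hid = [d_out * v[j] * h[j] * (1.0 - h[j]) for j in range(2)]
    grad_W = [[d_hid[j] * x[i] for i in range(2)] for j in range(2)]
    grad_b = d_hid
    return grad_W, grad_b, grad_v, grad_c

# Arbitrary illustrative parameters and a single training example.
params = ([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1], [1.0, -1.5], 0.05)
x, y = (1.0, 0.0), 1
grad_W, grad_b, grad_v, grad_c = gradients(params, x, y)

# Finite-difference check on the weight W[0][0].
eps = 1e-6
W_plus = [[0.5 + eps, -0.3], [0.8, 0.2]]
numeric = (loss((W_plus, params[1], params[2], params[3]), x, y) - loss(params, x, y)) / eps
print(grad_W[0][0], numeric)
```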

## Convolution layer, Maxpooling layer

A detailed guide of forward propagation and back propagation.

## Dealing with Overfitting by Validation Set

Our learning algorithm involves a parameter n = number of gradient descent iterations.

How do we choose n to optimize future error or loss?

• Separate the available data into a training set and a validation set
• Use the training set to perform gradient descent
• n ← number of iterations that optimizes validation set error

This gives an unbiased estimate of the optimal n (but the validation error at that n is still an optimistically biased estimate of the true error).
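
The procedure above can be sketched as follows. The model (a one-parameter linear fit y ≈ w·x), the toy data, the learning rate, and the name `choose_n` are all assumptions made for illustration.

```python
def mse(w, data):
    """Mean squared error of the model y = w * x on a list of (x, y) pairs."""
    return sum((y - w * x) ** 2 for x, y in data) / len(data)

def choose_n(train, val, max_iters=200, lr=0.01):
    """Run gradient descent on the training set; return the iteration count
    with the lowest validation error, plus the full validation curve."""
    w = 0.0
    val_errors = []
    for _ in range(max_iters):
        grad = sum(-2 * x * (y - w * x) for x, y in train) / len(train)
        w -= lr * grad
        val_errors.append(mse(w, val))   # validation data is never used for the update
    best_n = min(range(max_iters), key=lambda i: val_errors[i]) + 1
    return best_n, val_errors

# Toy data (made up for this sketch).
train = [(1.0, 2.3), (2.0, 3.6), (3.0, 6.5), (4.0, 7.7)]
val = [(1.5, 2.9), (2.5, 4.8), (3.5, 6.6)]
best_n, val_errors = choose_n(train, val)
print(best_n, val_errors[best_n - 1])
```

On this toy data the validation error starts rising again well before the training loss has converged, so the selected n is smaller than `max_iters`.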

## Batch Normalization

Key idea: add batch normalization (BN) layers to the network. For each minibatch, a BN layer normalizes each feature to have mean 0 and variance 1, then applies a scaling γ and an offset β, both of which are trainable parameters.

Figure: impact of Batch Normalization on an MNIST net (source: Quora).

Batch normalization potentially helps in two ways: faster learning and higher overall accuracy. The improved method also allows you to use a higher learning rate, potentially providing another boost in speed. A detailed intro of BN.
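
A minimal sketch of the forward pass of a BN layer at training time, written in plain Python over a minibatch stored as a list of rows. The small constant `eps` for numerical stability and the function name `batch_norm` are assumptions; running statistics for inference are left out.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the minibatch, then scale by gamma and shift by beta."""
    n = len(batch)
    d = len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i in range(n):
            x_hat = (col[i] - mean) / math.sqrt(var + eps)   # mean 0, variance ~1
            out[i][j] = gamma[j] * x_hat + beta[j]           # trainable scale and offset
    return out

# Two features on very different scales (made-up values).
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
print(normalized)
```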

## Learning Hidden Layer Representations

Note that the three hidden units act roughly like a 3-bit code, one code per input: 1-0-0, 0-0-1, 0-1-0, 1-1-1, 0-0-0, 0-1-1, 1-0-1, 1-1-0.

What happens if we have 9 inputs?

Note that hidden layer values are real-valued, not boolean, so the hidden layer has more representational power than a 3-bit code.

### Learnt Hidden Unit Weights in Face Recognition

Learning pose from face pictures: Link

## Learning Distributed Representations for Words

• also called “word embeddings”
• word2vec is one commonly used embedding
• based on skip gram model

Key idea: given a word sequence w1 w2 … wT, train a network to predict the surrounding words. For each word wt, predict wt-2, wt-1, wt+1, wt+2.

e.g., “the dog jumped over the fence in order to get to..” and “the cat jumped off the window ledge in order to ...”
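
The training examples for such a model are (center word, context word) pairs. A minimal sketch of how they could be generated with a window of 2 follows; the function name `skipgram_pairs` is hypothetical and this is not the word2vec implementation itself.

```python
def skipgram_pairs(words, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for t, center in enumerate(words):
        for offset in range(-window, window + 1):
            if offset != 0 and 0 <= t + offset < len(words):
                pairs.append((center, words[t + offset]))
    return pairs

sentence = "the dog jumped over the fence".split()
pairs = skipgram_pairs(sentence)
for center, context in pairs:
    print(center, "->", context)
```

Words near the start or end of the sequence simply get fewer context words.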

### Learning Representations for Words and Relations

NELL (Never Ending Language Learner) is learning to read the web, building a large knowledge graph (Yang & Mitchell, 2017).

## What is cross entropy?

Our negative log likelihood loss is also called cross entropy. Why?
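
One way to see it: the cross entropy between a distribution p and a distribution q is H(p, q) = −Σ p(k) log q(k). If p is the observed label written as a one-hot distribution and q is the network's predicted distribution, every term but one vanishes and H(p, q) = −log q(observed label), which is the negative log likelihood of that example. The sketch below checks this numerically; the predicted probabilities are made up.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_k p[k] * log q[k]; terms with p[k] = 0 contribute nothing."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

predicted = [0.7, 0.2, 0.1]   # hypothetical network output P(y | x; theta)
one_hot = [0.0, 1.0, 0.0]     # the observed label is class 1

nll = -math.log(predicted[1])
print(cross_entropy(one_hot, predicted), nll)
```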

## Recurrent Neural Networks

• Many tasks involve sequential data
  • predict stock price at time t+1 based on prices at t, t-1, t-2, …
  • translate sentences (word sequences) from Spanish to English
  • transcribe speech (sound sequences) to text (word sequences)
• Key idea: recurrent network uses (part of) its state at t as input for t+1

Note that in the diagram, the weight W is shared across all time steps (parameter sharing). When back propagation computes the gradient through many time steps, the same factors are multiplied together repeatedly, which causes vanishing and/or exploding gradients (e.g. 0.9 × 0.9 × 0.9 × … × 0.9 ≈ 0).

A simple remedy for exploding gradients is to clip them.
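
A small numeric sketch of both effects and of clipping. The factor 1.1 for the exploding case, the threshold `max_norm=5.0`, and the choice of clipping by the gradient's L2 norm are assumptions; clipping each component separately is another common variant.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; leave small gradients unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

vanishing = 0.9 ** 100    # repeated factor below 1 shrinks toward 0
exploding = 1.1 ** 100    # repeated factor above 1 blows up
print(vanishing, exploding)

clipped = clip_by_norm([300.0, -400.0], max_norm=5.0)
print(clipped)            # same direction as the original gradient, norm 5
```

Clipping keeps the direction of the update and only limits its size. It does nothing for vanishing gradients, which need architectural changes.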

### Bi-directional Recurrent Neural Networks

Key idea: processing of word at position t can depend on following words too, not just preceding words.

• Each hidden layer unit h accepts two inputs: one from the input x and the other from the previous hidden layer unit (representing the previous time step).
• In the bi-directional RNN, a g layer is added whose units accept input from the "future" (the output of the next g unit).
• The output o is determined by both h and g, so it contains both "past" and "future" information.
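
A minimal sketch of this forward pass with scalar states. It assumes tanh activations, a linear output, hand-picked weights, and that each g unit also reads the input x at its own position; none of these details are given in the notes, and the name `birnn` is hypothetical.

```python
import math

def birnn(xs, w_xh=0.5, w_hh=0.8, w_xg=0.5, w_gg=0.8, w_ho=1.0, w_go=1.0):
    """Return outputs o[t] that combine a forward state h[t] and a backward state g[t]."""
    T = len(xs)
    h = [0.0] * T
    g = [0.0] * T
    prev = 0.0
    for t in range(T):                    # h runs from past to future
        prev = math.tanh(w_xh * xs[t] + w_hh * prev)
        h[t] = prev
    nxt = 0.0
    for t in reversed(range(T)):          # g runs from future to past
        nxt = math.tanh(w_xg * xs[t] + w_gg * nxt)
        g[t] = nxt
    return [w_ho * h[t] + w_go * g[t] for t in range(T)]

out_a = birnn([1.0, 0.0, 0.0])
out_b = birnn([1.0, 0.0, 5.0])   # only the last input differs
print(out_a[0], out_b[0])        # the first output changes because of g
```

With only the h layer, the output at position 0 would be identical for both sequences, since it could not see the later input.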

#### Deep Bidirectional Recurrent Network

In [Irsoy & Cardie, 2014], a deep bidirectional recurrent network performs better on opinion mining tasks than a shallow RNN.

## Long Short Term Memory (LSTM) Unit

In the previous RNN, the gradient vanishes after many steps (e.g. 100 steps later), so the network cannot make use of information from long ago. LSTM addresses this problem by adding several gates to the hidden layer units.
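
The notes do not list the gates, so the sketch below assumes the commonly used formulation with forget, input, and output gates plus a candidate value, reduced to scalar states for readability. The parameter values are hand-picked to show a cell state surviving 100 steps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b
    f = sigmoid(pre("forget"))            # how much of the old cell state to keep
    i = sigmoid(pre("input"))             # how much of the new candidate to write
    o = sigmoid(pre("output"))            # how much of the cell state to expose
    c_tilde = math.tanh(pre("candidate"))
    c = f * c_prev + i * c_tilde          # additive update of the cell state
    h = o * math.tanh(c)
    return h, c

# Illustrative parameters: forget gate almost open, input gate almost closed.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = 0.0, 1.0
for _ in range(100):
    h, c = lstm_step(0.0, h, c, params)
print(c)   # still close to 1 after 100 steps, unlike 0.9 ** 100
```

Because the cell state is updated additively and the forget gate can stay close to 1, information (and gradient) can pass through many steps without shrinking the way a repeated product of 0.9 does.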