Convolutional Neural Networks
July 2019
Taught by Andrew Ng
Module 1: Recurrent Neural Networks
Why sequence models
Used in NLP, speech recognition, music generation, sentiment classification, language translation, video activity recognition, name entity recognition
Input or output is a sequence
Notation
Want to find names
x: Harry Potter and Hermione Granger invented a new spell
x^<1> x^<2> ... x^<t> ... x^<9>
T_x=9
y: 1 1 0 1 1 0 0 0 0
y^<1> y^<2> .... y^<t> ... y^<9>
T_y=9
9 sets of features
X^(i)<t> T_x^(i)
y^(i)<t> T_y^(i)
How to represent invidual words in the sentence? Come up with a vocabulary/dictionary (e.g. every word in complete dictionary). e.g. 10,000 words.
You can use one-hot representation. Each word > vector of 0's but with a 1 at the index of the word in the dictionary.
If you come across a word which is not in your dictionary you create a fake word called <unk> (for example) and assigned to that.
Recurrent Neural Network Model
Why not a standard network?
- Inputs and outputs can be different length in different examples
- Doesn't share features learned across different positions of text (e.g. Learn Harry at position 1 is a persons name then apply that to Harry at position 10)
Take first word and feed it into a neural network layer
a^<0> -> x^<1> -> y^^<1> and pass activation function a^<1> -> x^<2> -> y^^<2>
Initialize a^<0> randomly or all 0's.
Some people use a loop to denote the recurrent network.
- Uses same parameters W_ax at each time step for the x input.
- Uses same parameters W_aa at each time step for the activation output.
- Uses same parameters W_ya at each time step for the predicted output.
Don't use future words e.g. y^<3> only 'knows' about y^<1> and y^<2> and not y^<4>. This is a problem because the sentence 'He said, "Teddy Roosevelt was a great President"' compared with 'He said, "Teddy bears are on sale!"'. Need Bidirectional Neural Networks (BRNN).
Forward propagation
a^<0> = [0,...0]
a^<1> = g(W_aa * a^<0> + W_ax * x^<1> + b_a) <- mostly tanh sometimes RelU
yhat^<1> = g(W_ya * a^<1> + b_y) -< sigmoid (if binary classification)
a^<t> = g(W_aa * a^<t-1> + W_ax * x^<t> + b_a) [1]
y^<t> = g(W_ya * a^<t> + b_y)
[1] -> (for simply notification) a^<t> = g(W_a[a^<t-1>, x^<t>] + b_a);
- where W_a is [W_aa | W_ax ] e.g. if W_aa is (100, 100) and W_ax is (100, 10000) then W_a is (100, 10100)
- and [a^<t-1>, x^<t>] = [ a^<t-1> / x^<t> ] e.g. if a^<t-1> is (100) and x^<t> is 10000) then [a^<t-1>, x^<t>] is (10100)
- [W_aa | W_ax] * [a^<t-1> / x^<t> ] = W_aa * a^<t-1> + W_ax * x^<t>
Backpropagation through time
L^<1> ... L^<ty>
y^<1> ... y^>ty>. Need Wyby which is used at every layer.
a^<0> a^<1> ... a^<tx>. Need parameters Waba which is used at every layer.
x^<1> ... x^<tx>
Element-wise loss L^<t>(y^^<y>, y^<t>) = -y^<t> log y^^<t> - (1 - W ) . Logistic regression loss or cross-entropy loss [binary classification loss].
L(y^, y) = sum t=1:Ty L^<t>(y^^<y>, y^<t>)
Backpropagation through time means it goes through the sequential layers.
Different types of RNNs
Tx = Ty (input same size as output). Many-to-many.
Machine translation. Read sentence the output. Can be different lengths. Many-to-many.
Sentiment classification. x = test. y = 0/1 1...5). Output y at last time step. Many-to-one.
One-to-one.
Music generation. x = integer e.g. genre. Output music (notes). Feed output to next time step.
Language model and sequence generation
What is language modelling?
Speech recognition. e,g, 'apple and pear salad' or 'apple and pair salad'. Output the probability of these.
P(sentence). P(y^<1>, ..., y^<ty>). Probability of sequence of words.
Training set: large corpus of English text.
e.g. 'Cats average 15 hours of sleep a day.';
Tokenize the sentence. e,g, ohe, indices and token for <EOS> end of sentence. Could also add period to vocabulary.
Some words may not be in corpus -> <UNK> unknown word.
Probability of the first word (softmax). Probability of second word with correct first word. Probability of third word with correct first two words.
Softmax loss function: L(yhat^<t>, y^<t>) = - sum (y_i^<t> log yhat_i^<t>
Overall loss: L=sum(L^<t>(y^<t>, y<t>).
Predict the chance of the next word.
Sampling novel sequences
Sample from the sequence to generate novel sequences.
Sample yhat^<1> according to softmax distribution. Pass yhat^<1> to yhat^<2>.
You could reject any <UNK>.
Character level language model: create a vocabulary for [a,b,c,..., ,]
Don't have to worry about unknown word tokens.
However, much longer sequences and don't capture dependencies.
Vanishing gradients with RNN
The cat, which already ate ..., was full
The cats, which already ate ..., were full
The RNN should capture the long-term dependencies.
So many layers it can be difficult to update weights in an earlier layer based on vanishing gradients.
Exploding gradients can be an easy but you should see NaNs. You can use gradient clipping.
Gated Recurrent Unit (GRU)
Cho et al 2014 https://arxiv.org/abs/1409.1259
Chung et al 2014 https://arxiv.org/abs/1412.3555
The activation of an RNN unit
a^<t> = g(W_a[a^<t-1>, x^<t>] + b_a).
c = memory cell. c tilde is candidate.
To remember if cat was singular or plural.
c^<t> = a^<t>
At every time step going to update c^<t> as
c*<t> = tanh(W_c[c^<t-1>, x^<t>] + b_c)
Update gate (0-1) = Gamma_u = sigmoid (W_u[c^<t-1>, x^<t>] + b_u). Often close to 0 or 1.
e.g. cat c^<t> = 1; cats c^<t> = 2...
c^<t> = gamma_u * c*^<t> + (1 - gamma_u) * c^<t-1>.
Help with vanishing gradients.
Full GRU
c*^<t> = tanh(W_c[c^<t-1>, x^<t>] + b_c)
gamma_U = sigmoid(W_u[c^<t-1>, x^<t>] + b_u)
c^<t> = gamma_u * c^<t> + (1 - gamma_u) * c^<t-1>
Long Short Term Memory (LSTM)
Hochreiter and Schmidhuber (1997)
ctidle^<t> = tanh(W_c[a^<t-1>, x^<t>] + b_c)
gamma_u = sigmoid(W_u[a^<t-1>, x^<t>] + b_u) - update gate
gamma_f = sigmoid(W_f[gamma_f[a^<t-1>, x^<t>] + b_f) - forget gate
gamma_o = sigmoid(W_o[a^<t-1>, x^<t>] + b_o) - output gate
c^<t> = gamma_u * ctidle^<t> + gamma_f * c^<t-1>
a^<t> = gamma_o * c^<t>
Good at memorizing values for a long time.
can add in c^<t-1> into gate is a peephole connection.
Bidirectional RNN (BRNN)
Take information from earlier and later in the network.
Forward component a^<1>, ...
Add a backward layer. a^<4>,...
yhat^<t> = g(W_y[a^forward<t>, a^backward<t>] + b_y)
Deep RNNs
For learning complex function you can stack RNNs.
a^[1]<0>
a^[2]<0>
a^[l]<t>
You can have deep layers after the RNN layers before outputting yhat.
Assignment
https://www.wyzant.com/resources/lessons/math/calculus/derivative_proofs/tanx
Dinosaur name dataset: https://uwwwycpxoqvhatzktdbxog.coursera-apps.org/files/Week%201/Dinosaur%20Island%20--%20Character-level%20language%20model/dinos.txt
Shakespeare poems: https://uwwwycpxoqvhatzktdbxog.coursera-apps.org/files/Week%201/Dinosaur%20Island%20--%20Character-level%20language%20model/shakespeare.txt
LSTM text generation: https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py
Ji-Sung Kim, 2016, deepjazz
Jon Gillick, Kevin Tang and Robert Keller, 2009. Learning Jazz Grammars
Robert Keller and David Morrison, 2007, A Grammatical Approach to Automatic Improvisation
François Pachet, 1999, Surprising Harmonies
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Module 2: Word Representation
Word Representation
One-hot representation
I want a glass of orange ___ juice
I want a glass of apple ___ juice
Model doesn't know orange and apple is related as inner-product is 0.
Featurized representation:
Each word is associated with a gender. e.g. -1 for man, 1 for woman
How royal a word is
Age
Food: high scores for apples and oranges
e.g. represent the word man by 300 vector (of features) (embeddings).
You can convert this 300 d vector into 2d space e.g. t-SNE (van der Maaten and Hinton, 2008). e.g. create clusters.
word gets embedded into multi-dimensional data.
Using word embeddings
Detect peoples names (named entity recognition).
Robert Lin is (an apple farmer/a durian cultivator). Word embeddings relate apple farmer to durian cultivator. You can obtain data of ~100 bn words to help with this.
Transfer learning and word embeddings
1. Learn word embeddings from large text corpus (1-100 Bn words) / download pre-trained embedding online.
2. Transfer embedding to new task with smaller training set (say 100k words).
3. Continue to fine-tune the word embeddings with new data (optional)
encoding/embedding are similar
Properties of word embeddings
Mikolov et al (2013) Linguistic regularities in continuous space word representations (Word2Vec)
Man is to Woman as King is to ___ ? (queen)
e_man - e_woman = [-2, 0, 0, 0...]
e_king - e_queen = [-2, 0, 0, 0...]
e_man - e_woman ~= e_king - e_w
Find word in w: argmax w Sim(e_w, e_king - e_man + e_woman)
Cosine similarity: sim(u, v) = u^T v / (||u||_2 ||v||_2)
This is the cosine angle between two vector
Can also use euclidian distance (||u - v||^2)
Man: Woman as Boy: Girl
Ottawa: Canada as Nairobi: Kenya
Embedding matrix
300 x 10,000 matrix (E)
OHE is 10,000 x 1 (O)
This keeps the embedding corresponding to the word
E * O_j = e_j - embedding for word j
Learning Word Embeddings: Word2vec & GloVe
Neural language model Bengio et al 2003
Feed all e_j into a NN layer (W^[1], b^[w]) -> softmax (W^[2], b^[2])
Stack embedding layers.
Predict next work given last 4 words (hyperparamter).
Other contexts:
- 4 words on left and right
- Last 1 word
Word2vec
Skip-grams - have a context and a target. Given a context (c) word randomly choose a target (t). (skip a few words to the left and right)
o_c * E = e_c -> softmax -> y^
Softmax p(t|c) = (e^theta_t.T * e_c) / (sum e^theta_j.T * e_c)
theta_t = parameter associated with target t
L(y^, y) = - sum y_i * log(y^_i). (log-likelihood)
The softmax model is expensive (denominator)
Use a hierachial softmax: first 5k words or second 5k words -> first 2.5k words or second 2.5k words. This scales with the log of vocab size.
When sampling context c (the, of, a, and) these words occur frequently.
Negative Sampling
Create targets of 1's and 0's
Orange juice -> 1 (occurs)
Orange king -> 0 (does not occur)
Use k (number of words) as 2-5 for large datasets and 5-20 for smaller datasets. With ever positive example you will have k negative examples.
Logistic regression
p(y = 1 | c, t) = sigmoid(theta_t.T * e_c)
The algorithm is fairly cheap.
How do you generate the negative examples?
Sampled words proportional to power of work to the 3/4 (in between uniform and observed distribution).
GloVe word vectors
x_ij = # times i appears in conject of j
i ~=t, j ~=t.
a count of words that are close to each other.
minimize (theta_t^T e_j + b_i + b_j^' -log X_ij)^2
t c
skip _ij = 0 as it goes to infinity. This is mostly for stop words (weighted).
theta_i and e_j are symmetric.
e_w^(final) = (e_w + theta+w) / 2
Note the features are not interpretable.
Sentiment Classification
x (text) y star (1-5)
get embedding vector of each word in text -> Average the embedding vectors -> Softmax (1-5) - y^
'lacking in good taste'. can seem positive with the use of good.
Can feed to embedding vectors into a RNN (many-to-one).
Debiasing word embeddings
Issues of
Man: Computer Programmer
Woman: Homemaker
Can reflect gender, ethnicity, age biases.
Learned a word embedding.
1. Identify bias direction
e_he - e_she
e_male - e_female
...
Take average of these and put on scatter plot (bias)
2. Neutralize: for every word that is not defintional (e.g. grandmother), project to get rid of bias (onto axis).
3. Equalize pairs (grandmother - grandfather).
Module 3: Various sequence to sequence architectures
Basic Models
Translate Jane visite l'Afrique en Septembre -> Jane is visiting Africa in September
Encoder network -> Decoder network (language model)
Works for image captioning (convert image to a caption).
Conv net (learn encoding) -> RNN
Picking the most likely sentence
Machine translation as building a conditional language model
Starts off with input sentence instead of all 0's (as in RNNs)
P(y^(1) | x^<(1)) where y in English and X is another language
Same sample words at random
argmax P(y^<1>)
Beam search instead of greedy search. Greedy search is pick the most likely first word then most likely second word.
P (Jane is going | x) > P(Jane is visiting | x)
Beam search
Best/Most likely English sentence.
10,000 word dictionary.
P(y^<1>|x)
B = 3 (beam width). Consider 3 words at a time. (in, Jane, September)
Probability of second word given 3 first words P(y^<2> | X, "in"). Pair of the first two words that are most likely
P(y^<1>, y^<2>|x) = p(y^<1>|x) * p(y^<2> | x, y^<1>)
Do the same for the other two words ("Jane", "September")
Get next 3 most likely: "in September", "Jane is", "Jane visits"
Consider 3rd word: P(y^<3> | x, "in September")
Hopefully gets "Jane visits Africa in September. <EOS>"
B = 1 (greedy search)
Refinements to Beam Search
Length normalization
arg max sum P(y^<t>|x, y^<1>,...,y^<t-1>)
Most words are tiny (small number). Take log instead
arg max sum log P ... (more numerical stable) add more terms becomes more negative.
Normalize by number of words in translations (reduce penalty of longer translations)
1/T_y^alpha sum lop P (can use alpha to scale this)
Possible use for sentences of up to length 30.
How do you choose a larger beam width B? If B is large get better result (consider a lot a words) but is slower
Smaller beam width (worse but faster).
B~=10, 100 (production), 1000, 3000 (research).
Error analysis in beam search
Is it model or beam search causing error?
See if human translation y^* is better than machine translation
P(y^*|x) > P(y^|x) - beam search chose y*. Beam search is at fault
P(y^*|x) < P(y^|x) - RNN is at fault.
Look at mistakes in the dev set. Count the cases where beam search was at fault and count the cases where RNN was at fault.
BLEU score
There may be multiple translations
Humans can provide a couple of references
BLEU (bilingual evaluation understudy)
See if the word in the machine translation appears in the reference (precision). But can be hope if the word is repeated.
Give credit up to maximum number of times the work occurs.
You can do score on bigrams.
P_1 = sum unigram count clip (unigram) / sum unigram count (unigram)
P_n as ngram version.
max score is 1.0
Combined BLEU score: Bp * exp (1/4 sum Pn)
BP (brevity penalty). Easy to score high on short sentences.
Attention Model Intuition
The problem of long sentences. Encoder - Decoder works well for < 40 word sentences.
Attention model maintains skill for longer sentences
Bi-drectional RNN (learn rich set of features) a^<0>. Use another RNN for the translation S^<0> into S^<1> which will output the word.
Use attention weights i.e. look at words close by. Alpha^<1,1>, alpha^<1,2>, ..., alpha^<2,1>, alpha^<2,2>,... -> c -> S^<n>
Attention Model
a^<t> = (a forward ^<t>, a backward ^<t'>)
alpha^<t,t> -> c (context) -> S (state) ^<1> -> y^<c>
c is a weigthed sum of the features (by the alpha). > 0 and sum to 1.
c^<1> = sum alpha^<1, t'> a^<t'>.
alpha^<t, t'> = amount of attention that y^<t> should pay to a^<t'>.
y^<1> -> S^<2>
a^<t, t'> = exp(e^<t,t'>) / sum( exp(e^<t, t,>))
s^<t-1>, a^<t'> -> e^<t, t'>
Tx * Ty (quadratic cost).
Can use to normalize date standards
Visualization of alpha^<t, t'>.
Speech recognition
audio clip (x) -> transcript (y)
preprocess to get spectrum
Phonemes: (units of sounds) de kwik brown
300-3000 hours transcribed
Could use attention model.
CTC (connectionist temporal classification)
Input time steps is large 10 s (1,000 inputs).
Output is ttt_h_eee___ ___qqq__. This is correct for the start of "the quick brown fox". Then it removes repeated characters.
Trigger word detection
e.g. "okay google".
Before trigger word set targets = to 0 and after set targets = to 1. (either single or multiple)
Imbalanced training set.
Assignment
You can download background noises from free online sources.