Convolutional Neural Networks

July 2019

Coursea: Sequence Models

Module 1: Recurrent Neural Networks

Why sequence models

Used in NLP, speech recognition, music generation, sentiment classification, language translation, video activity recognition, name entity recognition

Input or output is a sequence

Notation

Want to find names

x: Harry Potter and Hermione Granger invented a new spell

x^<1> x^<2> ... x^<t> ... x^<9>

T_x=9

y: 1 1 0 1 1 0 0 0 0

y^<1> y^<2> .... y^<t> ... y^<9>

T_y=9

9 sets of features

X^(i)<t> T_x^(i)

y^(i)<t> T_y^(i)

How to represent invidual words in the sentence? Come up with a vocabulary/dictionary (e.g. every word in complete dictionary). e.g. 10,000 words.

You can use one-hot representation. Each word > vector of 0's but with a 1 at the index of the word in the dictionary.

If you come across a word which is not in your dictionary you create a fake word called <unk> (for example) and assigned to that.

Recurrent Neural Network Model

Why not a standard network?

Inputs and outputs can be different length in different examples
Doesn't share features learned across different positions of text (e.g. Learn Harry at position 1 is a persons name then apply that to Harry at position 10)

Take first word and feed it into a neural network layer

a^<0> -> x^<1> -> y^^<1> and pass activation function a^<1> -> x^<2> -> y^^<2>

Initialize a^<0> randomly or all 0's.

Some people use a loop to denote the recurrent network.

Uses same parameters W_ax at each time step for the x input.
Uses same parameters W_aa at each time step for the activation output.
Uses same parameters W_ya at each time step for the predicted output.

Don't use future words e.g. y^<3> only 'knows' about y^<1> and y^<2> and not y^<4>. This is a problem because the sentence 'He said, "Teddy Roosevelt was a great President"' compared with 'He said, "Teddy bears are on sale!"'. Need Bidirectional Neural Networks (BRNN).

Forward propagation

a^<0> = [0,...0]

a^<1> = g(W_aa * a^<0> + W_ax * x^<1> + b_a) <- mostly tanh sometimes RelU

yhat^<1> = g(W_ya * a^<1> + b_y) -< sigmoid (if binary classification)

a^<t> = g(W_aa * a^<t-1> + W_ax * x^<t> + b_a) [1]

y^<t> = g(W_ya * a^<t> + b_y)

[1] -> (for simply notification) a^<t> = g(W_a[a^<t-1>, x^<t>] + b_a);

where W_a is [W_aa | W_ax ] e.g. if W_aa is (100, 100) and W_ax is (100, 10000) then W_a is (100, 10100)
and [a^<t-1>, x^<t>] = [ a^<t-1> / x^<t> ] e.g. if a^<t-1> is (100) and x^<t> is 10000) then [a^<t-1>, x^<t>] is (10100)
[W_aa | W_ax] * [a^<t-1> / x^<t> ] = W_aa * a^<t-1> + W_ax * x^<t>

Backpropagation through time

L^<1> ... L^<ty>

y^<1> ... y^>ty>. Need Wyby which is used at every layer.

a^<0> a^<1> ... a^<tx>. Need parameters Waba which is used at every layer.

x^<1> ... x^<tx>

Element-wise loss L^<t>(y^^<y>, y^<t>) = -y^<t> log y^^<t> - (1 - W ) . Logistic regression loss or cross-entropy loss [binary classification loss].

L(y^, y) = sum t=1:Ty L^<t>(y^^<y>, y^<t>)

Backpropagation through time means it goes through the sequential layers.

Different types of RNNs

Tx = Ty (input same size as output). Many-to-many.

Machine translation. Read sentence the output. Can be different lengths. Many-to-many.

Sentiment classification. x = test. y = 0/1 1...5). Output y at last time step. Many-to-one.

One-to-one.

Music generation. x = integer e.g. genre. Output music (notes). Feed output to next time step.

Language model and sequence generation

What is language modelling?

Speech recognition. e,g, 'apple and pear salad' or 'apple and pair salad'. Output the probability of these.

P(sentence). P(y^<1>, ..., y^<ty>). Probability of sequence of words.

Training set: large corpus of English text.

e.g. 'Cats average 15 hours of sleep a day.';

Tokenize the sentence. e,g, ohe, indices and token for <EOS> end of sentence. Could also add period to vocabulary.

Some words may not be in corpus -> <UNK> unknown word.

Probability of the first word (softmax). Probability of second word with correct first word. Probability of third word with correct first two words.

Softmax loss function: L(yhat^<t>, y^<t>) = - sum (y_i^<t> log yhat_i^<t>

Overall loss: L=sum(L^<t>(y^<t>, y<t>).

Predict the chance of the next word.

Sampling novel sequences

Sample from the sequence to generate novel sequences.

Sample yhat^<1> according to softmax distribution. Pass yhat^<1> to yhat^<2>.

You could reject any <UNK>.

Character level language model: create a vocabulary for [a,b,c,..., ,]

Don't have to worry about unknown word tokens.

However, much longer sequences and don't capture dependencies.

Vanishing gradients with RNN

The cat, which already ate ..., was full

The cats, which already ate ..., were full

The RNN should capture the long-term dependencies.

So many layers it can be difficult to update weights in an earlier layer based on vanishing gradients.

Exploding gradients can be an easy but you should see NaNs. You can use gradient clipping.

Gated Recurrent Unit (GRU)

Cho et al 2014 https://arxiv.org/abs/1409.1259

Chung et al 2014 https://arxiv.org/abs/1412.3555

The activation of an RNN unit

a^<t> = g(W_a[a^<t-1>, x^<t>] + b_a).

c = memory cell. c tilde is candidate.

To remember if cat was singular or plural.

c^<t> = a^<t>

At every time step going to update c^<t> as

c*<t> = tanh(W_c[c^<t-1>, x^<t>] + b_c)

Update gate (0-1) = Gamma_u = sigmoid (W_u[c^<t-1>, x^<t>] + b_u). Often close to 0 or 1.

e.g. cat c^<t> = 1; cats c^<t> = 2...

c^<t> = gamma_u * c*^<t> + (1 - gamma_u) * c^<t-1>.

Help with vanishing gradients.

Full GRU

c*^<t> = tanh(W_c[c^<t-1>, x^<t>] + b_c)

gamma_U = sigmoid(W_u[c^<t-1>, x^<t>] + b_u)

c^<t> = gamma_u * c^<t> + (1 - gamma_u) * c^<t-1>

Long Short Term Memory (LSTM)

Hochreiter and Schmidhuber (1997)

ctidle^<t> = tanh(W_c[a^<t-1>, x^<t>] + b_c)

gamma_u = sigmoid(W_u[a^<t-1>, x^<t>] + b_u) - update gate

gamma_f = sigmoid(W_f[gamma_f[a^<t-1>, x^<t>] + b_f) - forget gate

gamma_o = sigmoid(W_o[a^<t-1>, x^<t>] + b_o) - output gate

c^<t> = gamma_u * ctidle^<t> + gamma_f * c^<t-1>

a^<t> = gamma_o * c^<t>

Good at memorizing values for a long time.

can add in c^<t-1> into gate is a peephole connection.

Bidirectional RNN (BRNN)

Take information from earlier and later in the network.

Forward component a^<1>, ...

Add a backward layer. a^<4>,...

yhat^<t> = g(W_y[a^forward<t>, a^backward<t>] + b_y)

Deep RNNs

For learning complex function you can stack RNNs.

a^[1]<0>

a^[2]<0>

a^[l]<t>

You can have deep layers after the RNN layers before outputting yhat.

Assignment

https://www.wyzant.com/resources/lessons/math/calculus/derivative_proofs/tanx

Dinosaur name dataset: https://uwwwycpxoqvhatzktdbxog.coursera-apps.org/files/Week%201/Dinosaur%20Island%20--%20Character-level%20language%20model/dinos.txt

Shakespeare poems: https://uwwwycpxoqvhatzktdbxog.coursera-apps.org/files/Week%201/Dinosaur%20Island%20--%20Character-level%20language%20model/shakespeare.txt

LSTM text generation: https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py

Ji-Sung Kim, 2016, deepjazz

Jon Gillick, Kevin Tang and Robert Keller, 2009. Learning Jazz Grammars

Robert Keller and David Morrison, 2007, A Grammatical Approach to Automatic Improvisation

François Pachet, 1999, Surprising Harmonies

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

http://jalammar.github.io/illustrated-bert/

https://arxiv.org/abs/1511.08308

Module 2: Word Representation

Word Representation

One-hot representation

I want a glass of orange ___ juice

I want a glass of apple ___ juice

Model doesn't know orange and apple is related as inner-product is 0.

Featurized representation:

Each word is associated with a gender. e.g. -1 for man, 1 for woman

How royal a word is

Age

Food: high scores for apples and oranges

e.g. represent the word man by 300 vector (of features) (embeddings).

You can convert this 300 d vector into 2d space e.g. t-SNE (van der Maaten and Hinton, 2008). e.g. create clusters.

word gets embedded into multi-dimensional data.

Using word embeddings

Detect peoples names (named entity recognition).

Robert Lin is (an apple farmer/a durian cultivator). Word embeddings relate apple farmer to durian cultivator. You can obtain data of ~100 bn words to help with this.

Transfer learning and word embeddings

1. Learn word embeddings from large text corpus (1-100 Bn words) / download pre-trained embedding online.

2. Transfer embedding to new task with smaller training set (say 100k words).

3. Continue to fine-tune the word embeddings with new data (optional)

encoding/embedding are similar

Properties of word embeddings

Mikolov et al (2013) Linguistic regularities in continuous space word representations (Word2Vec)

Man is to Woman as King is to ___ ? (queen)

e_man - e_woman = [-2, 0, 0, 0...]

e_king - e_queen = [-2, 0, 0, 0...]

e_man - e_woman ~= e_king - e_w

Find word in w: argmax w Sim(e_w, e_king - e_man + e_woman)

Cosine similarity: sim(u, v) = u^T v / (||u||_2 ||v||_2)

This is the cosine angle between two vector

Can also use euclidian distance (||u - v||^2)

Man: Woman as Boy: Girl

Ottawa: Canada as Nairobi: Kenya

Embedding matrix

300 x 10,000 matrix (E)

OHE is 10,000 x 1 (O)

This keeps the embedding corresponding to the word

E * O_j = e_j - embedding for word j

Embedding layer in Keras

Learning Word Embeddings: Word2vec & GloVe

Neural language model Bengio et al 2003

Feed all e_j into a NN layer (W^[1], b^[w]) -> softmax (W^[2], b^[2])

Stack embedding layers.

Predict next work given last 4 words (hyperparamter).

Other contexts:

4 words on left and right
Last 1 word

Word2vec

Skip-grams - have a context and a target. Given a context (c) word randomly choose a target (t). (skip a few words to the left and right)

o_c * E = e_c -> softmax -> y^

Softmax p(t|c) = (e^theta_t.T * e_c) / (sum e^theta_j.T * e_c)

theta_t = parameter associated with target t

L(y^, y) = - sum y_i * log(y^_i). (log-likelihood)

The softmax model is expensive (denominator)

Use a hierachial softmax: first 5k words or second 5k words -> first 2.5k words or second 2.5k words. This scales with the log of vocab size.

When sampling context c (the, of, a, and) these words occur frequently.

Negative Sampling

Create targets of 1's and 0's

Orange juice -> 1 (occurs)

Orange king -> 0 (does not occur)

Use k (number of words) as 2-5 for large datasets and 5-20 for smaller datasets. With ever positive example you will have k negative examples.

Logistic regression

p(y = 1 | c, t) = sigmoid(theta_t.T * e_c)

The algorithm is fairly cheap.

How do you generate the negative examples?

Sampled words proportional to power of work to the 3/4 (in between uniform and observed distribution).

GloVe word vectors

pennington et al 2014

x_ij = # times i appears in conject of j

i ~=t, j ~=t.

a count of words that are close to each other.

minimize (theta_t^T e_j + b_i + b_j^' -log X_ij)^2

t c

skip _ij = 0 as it goes to infinity. This is mostly for stop words (weighted).

theta_i and e_j are symmetric.

e_w^(final) = (e_w + theta+w) / 2

Note the features are not interpretable.

Sentiment Classification

x (text) y star (1-5)

get embedding vector of each word in text -> Average the embedding vectors -> Softmax (1-5) - y^

'lacking in good taste'. can seem positive with the use of good.

Can feed to embedding vectors into a RNN (many-to-one).

Debiasing word embeddings

bolukbasi et al 2016

Issues of

Man: Computer Programmer

Woman: Homemaker

Can reflect gender, ethnicity, age biases.

Learned a word embedding.

1. Identify bias direction

e_he - e_she

e_male - e_female

...

Take average of these and put on scatter plot (bias)

2. Neutralize: for every word that is not defintional (e.g. grandmother), project to get rid of bias (onto axis).

3. Equalize pairs (grandmother - grandfather).

Module 3: Various sequence to sequence architectures

Basic Models

Sutskever et al 2014

Cho et al 2014

Translate Jane visite l'Afrique en Septembre -> Jane is visiting Africa in September

Encoder network -> Decoder network (language model)

Works for image captioning (convert image to a caption).

Mao et al 2014

Vinyals et al 2014

Karpathy and Fei Fei (2015)

Conv net (learn encoding) -> RNN

Picking the most likely sentence

Machine translation as building a conditional language model

Starts off with input sentence instead of all 0's (as in RNNs)

P(y^(1) | x^<(1)) where y in English and X is another language

Same sample words at random

argmax P(y^<1>)

Beam search instead of greedy search. Greedy search is pick the most likely first word then most likely second word.

P (Jane is going | x) > P(Jane is visiting | x)

Beam search

Best/Most likely English sentence.

10,000 word dictionary.

P(y^<1>|x)

B = 3 (beam width). Consider 3 words at a time. (in, Jane, September)

Probability of second word given 3 first words P(y^<2> | X, "in"). Pair of the first two words that are most likely

P(y^<1>, y^<2>|x) = p(y^<1>|x) * p(y^<2> | x, y^<1>)

Do the same for the other two words ("Jane", "September")

Get next 3 most likely: "in September", "Jane is", "Jane visits"

Consider 3rd word: P(y^<3> | x, "in September")

Hopefully gets "Jane visits Africa in September. <EOS>"

B = 1 (greedy search)

Refinements to Beam Search

Length normalization

arg max sum P(y^<t>|x, y^<1>,...,y^<t-1>)

Most words are tiny (small number). Take log instead

arg max sum log P ... (more numerical stable) add more terms becomes more negative.

Normalize by number of words in translations (reduce penalty of longer translations)

1/T_y^alpha sum lop P (can use alpha to scale this)

Possible use for sentences of up to length 30.

How do you choose a larger beam width B? If B is large get better result (consider a lot a words) but is slower

Smaller beam width (worse but faster).

B~=10, 100 (production), 1000, 3000 (research).

Error analysis in beam search

Is it model or beam search causing error?

See if human translation y^* is better than machine translation

P(y^*|x) > P(y^|x) - beam search chose y*. Beam search is at fault

P(y^*|x) < P(y^|x) - RNN is at fault.

Look at mistakes in the dev set. Count the cases where beam search was at fault and count the cases where RNN was at fault.

BLEU score

There may be multiple translations

Papineni et al 2002

Humans can provide a couple of references

BLEU (bilingual evaluation understudy)

See if the word in the machine translation appears in the reference (precision). But can be hope if the word is repeated.

Give credit up to maximum number of times the work occurs.

You can do score on bigrams.

P_1 = sum unigram count clip (unigram) / sum unigram count (unigram)

P_n as ngram version.

max score is 1.0

Combined BLEU score: Bp * exp (1/4 sum Pn)

BP (brevity penalty). Easy to score high on short sentences.

Attention Model Intuition

The problem of long sentences. Encoder - Decoder works well for < 40 word sentences.

Attention model maintains skill for longer sentences

Bahdanau et al 2014

Bi-drectional RNN (learn rich set of features) a^<0>. Use another RNN for the translation S^<0> into S^<1> which will output the word.

Use attention weights i.e. look at words close by. Alpha^<1,1>, alpha^<1,2>, ..., alpha^<2,1>, alpha^<2,2>,... -> c -> S^<n>

Attention Model

a^<t> = (a forward ^<t>, a backward ^<t'>)

alpha^<t,t> -> c (context) -> S (state) ^<1> -> y^<c>

c is a weigthed sum of the features (by the alpha). > 0 and sum to 1.

c^<1> = sum alpha^<1, t'> a^<t'>.

alpha^<t, t'> = amount of attention that y^<t> should pay to a^<t'>.

y^<1> -> S^<2>

Xu et al (2015)

a^<t, t'> = exp(e^<t,t'>) / sum( exp(e^<t, t,>))

s^<t-1>, a^<t'> -> e^<t, t'>

Tx * Ty (quadratic cost).

Can use to normalize date standards

Visualization of alpha^<t, t'>.

Speech recognition

audio clip (x) -> transcript (y)

preprocess to get spectrum

Phonemes: (units of sounds) de kwik brown

300-3000 hours transcribed

Could use attention model.

CTC (connectionist temporal classification)

Graves et al (2006)

Input time steps is large 10 s (1,000 inputs).

Output is ttt_h_eee___ ___qqq__. This is correct for the start of "the quick brown fox". Then it removes repeated characters.

Trigger word detection

e.g. "okay google".

Before trigger word set targets = to 0 and after set targets = to 1. (either single or multiple)