transformer

Notes:

Transformer 

Summary of transformer architecture

In the encoder part (left)

1. the input sequence is converted to word embeddings

2. position information (positional encoding) is added to the input word embeddings

3. in the input sequence, every word's attention to every other word (preceding or succeeding) is calculated

   Note these attentions (self-attention) can be calculated for all words in parallel (a minimal sketch follows this list)
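
A minimal NumPy sketch of encoder steps 1-3 (embedding lookup, sinusoidal positional encoding, and self-attention over the whole sequence at once). The toy vocabulary size, model dimension, and token ids are assumptions for illustration, and the learned query/key/value projections and multi-head split are omitted to keep it short.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, shape (seq_len, d_model)."""
    pos = np.arange(seq_len)[:, None]                      # positions 0 .. seq_len-1
    i = np.arange(d_model)[None, :]                        # embedding dimensions
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def self_attention(x):
    """Scaled dot-product self-attention: every position attends to every
    other position (preceding or succeeding), and the matrix products
    compute all positions' attention in parallel."""
    d_k = x.shape[-1]
    scores = x @ x.T / np.sqrt(d_k)                        # (seq_len, seq_len) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over key positions
    return weights @ x                                     # weighted sum of values

vocab_size, d_model = 100, 16                              # toy sizes (assumed)
embedding_table = np.random.randn(vocab_size, d_model)
tokens = np.array([5, 23, 42, 7])                          # a toy input sequence of token ids
x = embedding_table[tokens]                                # step 1: word embeddings, (4, d_model)
x = x + positional_encoding(len(tokens), d_model)          # step 2: add positional encoding
encoder_output = self_attention(x)                         # step 3: self-attention, all positions at once
```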

   

In the decoder part (right)

4. the output sequence up to position k serves as the input to the decoder for predicting position k+1 (the decoder's output)

   Start of Sequence (SOS) -> output word 1

   SOS, word 1 -> word 2

   SOS, word 1, word 2 -> word 3

   SOS, word 1, word 2 ... word k -> word k+1

   ...

   SOS, word 1, word 2 ... word k ... word n-1 -> last word n 

   SOS, word 1, word 2 ... word k ... word n-1, last word n -> End Of Sequence (EOS)

   This is done by masking the output sequence from position k+1 onwards to the end (see the causal-mask sketch after this list).

5. position information (positional encoding) is added to each masked sequence, as in step 2

6. in each masked sequence, every word attends to the other words; only preceding words are attended to, because succeeding words are masked (masked self-attention)

   Note an output sequence of length n yields n+1 masked sequences (prefixes) as inputs to the decoder

   every masked sequence's attention can be calculated independently, so all prefixes can be processed in parallel during training (the causal-mask sketch after this list shows this in one pass)

7. the output of step 6 then attends to the output of the encoder (encoder-decoder attention, also called cross-attention)

   every word of the step 6 output acts as a query against the encoder's output from step 3

   this allows every word in the decoder's sequence to attend to the whole input sequence, so the input sequence's information is incorporated (a cross-attention sketch follows this list)

8. the output of the decoder is passed through a linear layer and a softmax to produce a probability distribution over the vocabulary for the next word (see the final sketch after this list)
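
A minimal sketch of the causal (look-ahead) mask behind steps 4 and 6; the function names and toy sizes are assumptions for illustration. Instead of literally building n+1 separate masked sequences, one (seq_len, seq_len) mask lets a single pass compute every prefix's masked self-attention at once: row k of the mask corresponds to predicting position k+1 from SOS .. word k.

```python
import numpy as np

def causal_mask(seq_len):
    """Entry (i, j) is True when position i may attend to position j (j <= i)."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_self_attention(x):
    """Self-attention where each position sees only itself and preceding positions."""
    d_k = x.shape[-1]
    scores = x @ x.T / np.sqrt(d_k)                        # (seq_len, seq_len) attention logits
    scores = np.where(causal_mask(len(x)), scores, -1e9)   # block attention to succeeding positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over the visible positions
    return weights @ x

print(causal_mask(4).astype(int))                          # lower-triangular matrix of ones
```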
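A minimal sketch of the encoder-decoder (cross) attention in step 7; the names and shapes are assumptions. Queries come from the decoder's masked self-attention output (step 6), while keys and values come from the encoder's output (step 3), so every decoder position can read information from the whole input sequence.

```python
import numpy as np

def cross_attention(decoder_states, encoder_output):
    """decoder_states: (tgt_len, d_model) from step 6; encoder_output: (src_len, d_model) from step 3."""
    d_k = decoder_states.shape[-1]
    q = decoder_states                                     # queries from the decoder
    k = v = encoder_output                                 # keys and values from the encoder
    scores = q @ k.T / np.sqrt(d_k)                        # (tgt_len, src_len) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over encoder positions
    return weights @ v                                     # (tgt_len, d_model)
```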
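A minimal sketch tying steps 4, 5, 6, 7, and 8 together as a greedy decoding loop at inference time. It reuses the positional_encoding, masked_self_attention, and cross_attention helpers from the sketches above; the special token ids, toy sizes, output weight matrix, and greedy stopping rule are assumptions for illustration.

```python
import numpy as np

vocab_size, d_model = 100, 16                              # toy sizes (assumed)
SOS, EOS = 0, 1                                            # assumed special token ids
W_out = np.random.randn(d_model, vocab_size)               # final linear layer (step 8)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def greedy_decode(encoder_output, embedding_table, max_len=10):
    tokens = [SOS]                                         # Start of Sequence
    for _ in range(max_len):
        x = embedding_table[np.array(tokens)]              # embed the prefix SOS, word 1 .. word k
        x = x + positional_encoding(len(tokens), d_model)  # step 5: add position information
        h = masked_self_attention(x)                       # step 6: masked self-attention
        h = cross_attention(h, encoder_output)             # step 7: attend to the encoder output
        probs = softmax(h[-1] @ W_out)                     # step 8: linear layer + softmax at the last position
        next_token = int(np.argmax(probs))                 # greedy choice of word k+1
        if next_token == EOS:                              # stop at End Of Sequence
            break
        tokens.append(next_token)                          # step 4: feed the longer prefix back in
    return tokens[1:]                                      # drop SOS

# e.g. greedy_decode(encoder_output, embedding_table) with the toy encoder sketch above
```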