Notes:
Summary of transformer architecture
In the encoder part (left)
1. the input sequence is converted to word embeddings
2. positional encodings are added to the input word embeddings to inject position information
3. for every word in the input sequence, its attention to every other word (preceding or succeeding) is calculated
Note these self-attention scores for all words can be calculated in parallel (a minimal sketch follows)
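Below is a minimal NumPy sketch of steps 1-3, written just for these notes (not taken from any library). Sinusoidal positional encodings are added to the word embeddings, then self-attention over all positions is computed in one matrix product. For simplicity the embeddings themselves serve as query, key and value here; the learned projections are described in the Key/Query/Value section further down. Names such as sinusoidal_encoding, d_model and the toy sizes are assumptions made for this example.

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """Sin/cos positional encoding pattern from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]                       # (seq_len, 1)
    i = np.arange(d_model)[None, :]                         # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))  # (seq_len, d_model)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)                 # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model = 5, 16
embeddings = np.random.randn(seq_len, d_model)              # step 1: word embeddings
x = embeddings + sinusoidal_encoding(seq_len, d_model)      # step 2: add position info

# Step 3: self-attention for the whole sequence in one matrix product,
# i.e. every position attends to every other position in parallel.
scores = x @ x.T / np.sqrt(d_model)                         # (seq_len, seq_len)
weights = softmax(scores, axis=-1)                          # each row sums to 1
attended = weights @ x                                      # (seq_len, d_model)
print(attended.shape)                                       # (5, 16)
```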
In the decoder part (right)
4. the output sequence up to position k serves as the input to the decoder for predicting position k+1 (the decoder's output)
Start of Sequence (SOS) -> output word 1
SOS, word 1 -> word 2
SOS, word 1, word 2 -> word 3
SOS, word 1, word 2 ... word k -> word k+1
...
SOS, word 1, word 2 ... word k ... word n-1 -> last word n
SOS, word 1, word 2 ... word k ... word n-1, last word n -> End Of Sequence (EOS)
This is done by masking the output sequence from position k+1 onwards to the end (see the causal-mask sketch after this list).
5. positional encodings are added to each of the masked sequences
6. every word embedding in a masked sequence attends to the other words; because succeeding words are masked, only preceding words are attended to.
Note an output sequence of length n yields n+1 masked sequences as inputs to the decoder
every masked sequence's attention can be calculated independently, so they can be processed in parallel during training.
7. the output of step 6 further attends to the output of the encoder (cross-attention)
every word representation in the step 6 sequence attends to the encoder's output from step 3
this allows every word in the decoder's sequence to attend to the encoder's sequence, so the input sequence's information is incorporated.
8. the output of the decoder is passed through a linear layer and a softmax to generate the prediction.
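The sketch below shows how the masking in steps 4 and 6 is typically realized: a causal (look-ahead) mask sets the attention scores for all succeeding positions to -inf before the softmax, so their weights become exactly zero and all the "masked sequences" can be handled in a single pass. This is a toy illustration with made-up sizes, assuming plain dot-product scores, not a full decoder implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n, d = 4, 8                                     # toy sizes chosen for the example
dec_x = np.random.randn(n, d)                   # decoder inputs (shifted output sequence)

scores = dec_x @ dec_x.T / np.sqrt(d)           # (n, n) raw attention scores
causal_mask = np.triu(np.ones((n, n)), k=1)     # 1s above the diagonal = "future" positions
scores = np.where(causal_mask == 1, -np.inf, scores)

weights = softmax(scores, axis=-1)
print(np.round(weights, 2))
# Row k has non-zero weights only for columns 0..k, i.e. the preceding words.
```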
About Key, Query and Value in self-attention
Basic Idea of Key, Query, and Value.
Query: Represents the element that is currently seeking information or context.
Key: Represents the elements that provide information or context.
Value: Contains the actual information or data that is retrieved based on the attention scores between the query and key.
How It Works
Linear Projections:
In practice, the input data (often the embeddings of words or tokens) is transformed into three different vectors through learned linear projections. These vectors are the queries, keys, and values.
Calculating Attention Scores:
Dot Product: The attention score is computed by taking the dot product of the query vector with each key vector. This determines how much focus should be given to each key based on the current query.
Softmax: The dot products (scores) are then passed through a softmax function to normalize them into a probability distribution. This helps in weighing the importance of each key relative to the query.
Weighted Sum: The attention weights obtained from the softmax step are used to compute a weighted sum of the value vectors. This weighted sum is the output of the attention mechanism.
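A small sketch of the steps above, assuming the standard scaled dot-product formulation (the scores are also divided by the square root of the key dimension, which the summary above omits). W_q, W_k and W_v stand in for the learned projection matrices; the dimensions and random inputs are invented for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, d_k = 5, 16, 8
x = np.random.randn(seq_len, d_model)          # token embeddings (with position info added)

# Linear projections: the same input, three different learned matrices.
W_q = np.random.randn(d_model, d_k)
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_k)
Q, K, V = x @ W_q, x @ W_k, x @ W_v            # (seq_len, d_k) each

scores = Q @ K.T / np.sqrt(d_k)                # dot product of each query with each key
weights = softmax(scores, axis=-1)             # normalize into a distribution per query
output = weights @ V                           # weighted sum of the value vectors
print(output.shape)                            # (5, 8)
```

Each row of weights sums to 1, so the output for each query is a convex combination of the value vectors.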
Why It Works
By using the query-key mechanism, the attention mechanism dynamically selects relevant pieces of information (values) based on the context provided by the query. This allows the model to adaptively focus on different parts of the input sequence depending on the current context.
The key, value and query are just logical concepts. Each vector type (key, query, value) can capture different aspects of the input sequence. For example, queries might capture contextual focus, keys might encode position-specific features, and values might contain detailed information to be aggregated. This specialization helps in better learning and representation of complex patterns in the data.
They represent different projection spaces of the data, and allow training to learn a separate projection matrix for each space.
Consider an example in machine translation where you want to translate a sentence from English to French.
Query: The current word in the French sentence being generated.
Key: The words in the English sentence.
Value: The same words in the English sentence, but as transformed representations (e.g., embeddings) carrying the content to be retrieved.
The attention mechanism will help determine which words in the English sentence are relevant to the current word being generated in French. For instance, if you’re translating the word "cat," the attention might focus more on the word "cat" in the English sentence if it’s contextually relevant.
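A toy sketch of this translation setup: the queries come from the decoder (French) side and the keys and values from the encoder (English) side, so each French position gathers a weighted mix of the English representations. The sentence lengths, dimensions and random vectors are invented purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_k = 8
english_len, french_len = 6, 4                 # English source vs. partial French output
enc_out = np.random.randn(english_len, d_k)    # encoder output: plays the role of keys and values
dec_state = np.random.randn(french_len, d_k)   # decoder states: play the role of queries

weights = softmax(dec_state @ enc_out.T / np.sqrt(d_k), axis=-1)  # (french_len, english_len)
context = weights @ enc_out                    # each French position gets a mix of English info
print(weights.shape, context.shape)            # (4, 6) (4, 8)
```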