Image generation has been successfully cast as an autoregressive sequence generation or transformation problem. Recent work has shown that self-attention is an effective way of modeling textual sequences. In this work, we generalize a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood.
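Concretely, "tractable likelihood" refers to the standard autoregressive factorization: the image is flattened into a sequence of intensities and the joint probability is written as a product of per-position conditionals, each modeled with self-attention. A minimal sketch of that factorization (the notation here is illustrative, not taken verbatim from the paper):

```latex
% Autoregressive factorization of the image likelihood (illustrative notation).
% x = (x_1, ..., x_n) is the image flattened into a sequence of intensities.
\[
  p(x) \;=\; \prod_{t=1}^{n} p\!\left(x_t \mid x_1, \dots, x_{t-1}\right)
\]
% Each conditional p(x_t | x_{<t}) is produced by the model, so the exact
% log-likelihood is simply the sum of the per-position log-probabilities.
```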
The input is run through the six layers of the encoder stack, and the final encoder output is then sent to the encoder-decoder Multi-Head Attention layer of every decoder. The Masked Multi-Head Attention layer takes as input the output of the previous decoder block (or, in the first block, the words generated so far). In this way, the decoders take into account both the words from previous time steps and the context of each word produced by the encoding process.
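The sketch below illustrates that data flow for a single decoder block, using plain NumPy and a single attention head. The function names, shapes, and dimensions are assumptions made for the example, and residual connections, layer normalization, the feed-forward sublayer, and the multi-head split are omitted for brevity.

```python
# Illustrative sketch (not the paper's code) of how one decoder block combines
# masked self-attention over the words generated so far with attention over
# the encoder stack's final output.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention: scores -> (optionally masked) softmax -> weighted sum.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # block attention to future positions
    return softmax(scores) @ v

def decoder_block(dec_inputs, enc_outputs):
    t = dec_inputs.shape[0]
    causal_mask = np.tril(np.ones((t, t), dtype=bool))   # each step sees only earlier steps
    # 1) Masked self-attention: the decoder attends to the words produced so far.
    x = attention(dec_inputs, dec_inputs, dec_inputs, causal_mask)
    # 2) Encoder-decoder attention: queries come from the decoder, while keys
    #    and values come from the encoder stack's final output.
    x = attention(x, enc_outputs, enc_outputs)
    return x   # residual/norm and feed-forward sublayers omitted

# Toy usage: 4 encoded source tokens, 3 target tokens generated so far, model dim 8.
enc_outputs = np.random.randn(4, 8)
dec_inputs  = np.random.randn(3, 8)
print(decoder_block(dec_inputs, enc_outputs).shape)   # (3, 8)
```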
The decoder stack produces an output vector, which a linear transformation turns into a logits vector with one entry per word in the vocabulary. This logits vector is then passed through a softmax function, which converts it into a probability distribution telling us how likely each vocabulary word is to be the next word in the generated sentence; the highest-probability word is typically taken as the next word.
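A minimal sketch of that output head follows. The tiny vocabulary, the model dimension, and the random weights are made up purely for illustration.

```python
# Project the decoder's final vector to vocabulary-sized logits, then apply
# softmax to obtain next-word probabilities (all values here are toy examples).
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "<eos>"]
d_model = 8

decoder_output = rng.standard_normal(d_model)          # final decoder vector for this step
W = rng.standard_normal((d_model, len(vocab)))         # linear projection to vocabulary size
logits = decoder_output @ W                            # one logit per vocabulary word

probs = np.exp(logits - logits.max())
probs /= probs.sum()                                   # softmax: probabilities sum to 1

next_word = vocab[int(np.argmax(probs))]               # greedy choice of the next word
print(dict(zip(vocab, probs.round(3))), "->", next_word)
```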