The paper "Attention is All You Need" is one of the most influential papers in the field of natural language processing and deep learning. The architecture, Transformer, was introduced to replace recurrent neural networks and LSTMs, demonstrating superior performance in sequence-to-sequence tasks like machine translation. It significantly improved the BLEU score on the WMT 2014 English-to-German dataset, reaching 28.4 BLEU, compared to 23.7 BLEU by previous state-of-the-art models.
The Transformer architecture consists of an encoder-decoder framework, both built from self-attention mechanisms. The encoder and decoder each stack six layers; the base model has about 65 million parameters, and the larger "big" variant about 213 million. The novelty of the Transformer lies in its self-attention mechanism, which allows each position in the input to attend to every other position, capturing long-range dependencies more effectively than RNN-based models. Multi-head attention lets the model attend to several kinds of relationships in parallel, enriching the learned representations.
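As a rough illustration of how scaled dot-product attention and multi-head attention fit together, here is a minimal NumPy sketch. The toy dimensions, random projection matrices, and function names are placeholders chosen for this example, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (batch, seq_q, seq_k)
    weights = softmax(scores, axis=-1)                 # each query attends to all keys
    return weights @ V                                 # (batch, seq_q, d_v)

# Toy setup: batch of 1, sequence of 4 tokens, model width 8 split into 2 heads.
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4, 8))

num_heads, d_model = 2, 8
d_k = d_model // num_heads

# Hypothetical per-head projection matrices (the paper learns W^Q, W^K, W^V, W^O).
W_q = rng.normal(size=(num_heads, d_model, d_k))
W_k = rng.normal(size=(num_heads, d_model, d_k))
W_v = rng.normal(size=(num_heads, d_model, d_k))
W_o = rng.normal(size=(num_heads * d_k, d_model))

# Run each head independently, then concatenate and project back to d_model.
heads = [scaled_dot_product_attention(x @ W_q[h], x @ W_k[h], x @ W_v[h])
         for h in range(num_heads)]
out = np.concatenate(heads, axis=-1) @ W_o  # (1, 4, 8)
print(out.shape)
```

Because every query attends to every key in a single matrix product, the whole sequence is processed at once rather than token by token.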
Positional encodings were introduced to retain information about the order of tokens in a sequence, since the self-attention mechanism does not inherently account for word order. The architecture also includes position-wise feed-forward layers, residual connections, and layer normalization to stabilize training.
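The paper defines sinusoidal encodings PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The sketch below follows that formula; the helper name and toy sizes are chosen here for illustration.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]         # even dimension indices
    angles = pos / np.power(10000, i / d_model)   # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding is added to the token embeddings before the first layer.
pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```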
The models were trained on 8 NVIDIA P100 GPUs; the base model took roughly 12 hours (100,000 steps) and the big model about 3.5 days (300,000 steps). Training was efficient because the model processes all positions of a sequence in parallel, in contrast to the step-by-step processing of RNNs and LSTMs. This parallelization, made possible by self-attention, is one of the key reasons for the speedup and allowed the Transformer to scale to large datasets efficiently.
The authors used several techniques to improve generalization and reduce overfitting. These include dropout (rate P_drop = 0.1 for the base model), applied to the output of each sub-layer and to the sums of the embeddings and positional encodings, and label smoothing with ε = 0.1. Label smoothing hurts perplexity, since the model learns to be more uncertain, but improves accuracy and BLEU score.
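As a sketch of what label smoothing does to the training targets, the snippet below replaces a one-hot target with a smoothed distribution. One common formulation spreads ε over the incorrect classes, as here; the helper names and toy vocabulary are assumptions for illustration, not the paper's code.

```python
import numpy as np

def label_smoothed_targets(target_ids, vocab_size, eps=0.1):
    # Correct class gets probability 1 - eps; the remaining eps is
    # spread uniformly over the other vocab_size - 1 classes.
    n = len(target_ids)
    smooth = np.full((n, vocab_size), eps / (vocab_size - 1))
    smooth[np.arange(n), target_ids] = 1.0 - eps
    return smooth

def cross_entropy(pred_probs, target_dist):
    # Mean cross-entropy between smoothed targets and predicted probabilities.
    return -np.mean(np.sum(target_dist * np.log(pred_probs + 1e-9), axis=-1))

# Toy example: 3 target tokens drawn from a vocabulary of 6 symbols.
targets = np.array([2, 0, 5])
smoothed = label_smoothed_targets(targets, vocab_size=6, eps=0.1)
print(smoothed.sum(axis=-1))  # each row still sums to 1.0
```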
The paper is a major milestone in machine learning and sequence modeling, showing that neither recurrence nor convolution is necessary for capturing long-range dependencies in sequence-to-sequence tasks. The Transformer has become the foundation for many later models, such as BERT, GPT, and T5. The original paper is available on arXiv (arXiv:1706.03762).