The paper "Improving Language Understanding by Generative Pre-training" is one of the most influential in the field of NLP. The architecture, GPT, builds on the Transformer decoder and was trained on vast amounts of text data. The model has 117 million parameters and applies unsupervised pre-training followed by supervised fine-tuning, improving tasks like text generation and classification.
GPT uses multi-head causal self-attention to ensure that each token attends only to previous ones. The model adds learned positional embeddings to the token embeddings and consists of multiple decoder layers, each with self-attention and a feed-forward network (a minimal sketch of the attention step appears below). During fine-tuning, the model also uses an auxiliary loss: the next-word prediction (language modeling) objective is added to the supervised objective computed from the whole input sequence.
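The sketch below illustrates multi-head causal self-attention in PyTorch; the module and hyperparameter names (`CausalSelfAttention`, `d_model`, `n_heads`, `max_len`) are illustrative choices, not the paper's.

```python
import math
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    """Minimal sketch of multi-head causal self-attention (illustrative)."""

    def __init__(self, d_model: int, n_heads: int, max_len: int = 512):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Lower-triangular mask: position t may attend to positions <= t only.
        mask = torch.tril(torch.ones(max_len, max_len)).view(1, 1, max_len, max_len)
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (B, T, C) -> (B, n_heads, T, head_dim)
            return t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(C // self.n_heads)
        # Mask out future positions before the softmax.
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = att.softmax(dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.out(y)
```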
GPT employs GELU activations and uses residual connections and layer normalization, with the decoder operating autoregressively. It was pre-trained on the BooksCorpus dataset and fine-tuned on task-specific labeled datasets, showing significant improvements on tasks such as textual entailment, question answering, semantic similarity, and text classification.
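The fine-tuning objective can be sketched as the supervised task loss plus a weighted auxiliary language-modeling loss (the paper weights the auxiliary term with 0.5). The function below is a hypothetical illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def finetune_loss(task_logits: torch.Tensor,
                  task_labels: torch.Tensor,
                  lm_logits: torch.Tensor,
                  tokens: torch.Tensor,
                  aux_weight: float = 0.5) -> torch.Tensor:
    """Illustrative fine-tuning objective: supervised task loss plus a
    weighted auxiliary language-modeling loss.

    task_logits: (batch, n_classes)       classifier outputs
    task_labels: (batch,)                 gold labels
    lm_logits:   (batch, seq_len, vocab)  next-token predictions
    tokens:      (batch, seq_len)         input token ids
    """
    task_loss = F.cross_entropy(task_logits, task_labels)
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1, :].reshape(-1, lm_logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
    return task_loss + aux_weight * lm_loss
```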
The paper is one of the most impactful in advancing the use of generative models for NLP tasks. GPT is decoder-based because it uses only the Transformer decoder blocks in its architecture. The standard Transformer has both encoder and decoder components, but GPT uses the decoder portion exclusively. The decoder attends to previously generated tokens via causal (autoregressive) self-attention, which ensures that the model can only attend to earlier tokens and not future ones. This allows GPT to generate text sequentially, making it an effective model for text generation and language modeling.
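Because generation is autoregressive, decoding proceeds one token at a time, feeding each prediction back as input. The loop below is an illustrative greedy-decoding sketch (the `model` and `generate` names are hypothetical), not the paper's sampling procedure.

```python
import torch

@torch.no_grad()
def generate(model, tokens: torch.Tensor, n_new: int) -> torch.Tensor:
    """Greedy autoregressive decoding (illustrative sketch).

    model:  any callable mapping (batch, seq_len) token ids to
            (batch, seq_len, vocab) logits, with causal attention inside
    tokens: (batch, seq_len) prompt token ids
    n_new:  number of tokens to generate
    """
    for _ in range(n_new):
        logits = model(tokens)
        # Take the prediction at the last position and append it.
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_id], dim=1)
    return tokens
```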