GPT-2 was introduced in the paper "Language Models are Unsupervised Multitask Learners." It scales up the original GPT architecture significantly, with up to 1.5 billion parameters and a much larger training corpus. GPT-2 is a decoder-only Transformer pre-trained on vast amounts of internet text; unlike the original GPT, it is evaluated directly on downstream tasks without task-specific fine-tuning.
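To make "decoder-only" concrete, here is a minimal sketch (not the released implementation) of the causal self-attention at the heart of each GPT-2 block, written in plain PyTorch. The causal mask is what restricts every position to attend only to earlier tokens, turning the Transformer into a left-to-right language model.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_*: (d_model, d_model) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / d_k ** 0.5
    # Causal mask: position i may only attend to positions <= i, which makes
    # the model autoregressive (it predicts each token from its left context).
    mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy usage: 5 tokens with a 16-dimensional embedding.
d = 16
x = torch.randn(5, d)
out = causal_self_attention(x, torch.randn(d, d), torch.randn(d, d), torch.randn(d, d))
print(out.shape)  # torch.Size([5, 16])
```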
The key idea is that GPT-2 learns to perform many tasks without task-specific training: it is an unsupervised multitask learner that can generate coherent text and handle tasks such as translation and summarization with no explicit supervision for those tasks.
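The task behaviour is induced purely by the prompt. A sketch of zero-shot summarization, assuming the released GPT-2 checkpoints via the Hugging Face `transformers` library (not part of the paper); the "TL;DR:" cue and top-k=2 sampling follow the paper's summarization setup.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

article = "..."  # any news article or long passage
prompt = article + "\nTL;DR:"  # the paper's cue for zero-shot summarization

inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=900)
outputs = model.generate(**inputs, max_new_tokens=60, do_sample=True, top_k=2)

# Decode only the newly generated tokens (everything after the prompt).
summary = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True)
print(summary)
```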
WebText is the dataset used to train GPT-2. It is a large corpus built by scraping the outbound links shared on Reddit, keeping only links from posts that received at least 3 karma. This filtering serves as a simple heuristic for quality and yields text covering a wide range of topics.
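The WebText pipeline itself was not released, but the filtering idea is simple. The sketch below is purely illustrative: `fetch_reddit_submissions` and `extract_text` are hypothetical helpers standing in for a Reddit crawl and an HTML-to-text extractor.

```python
def build_webtext_like_corpus(fetch_reddit_submissions, extract_text, min_karma=3):
    """Collect text from outbound links whose Reddit submissions got >= min_karma."""
    corpus = []
    for submission in fetch_reddit_submissions():
        if submission["karma"] < min_karma:     # the karma threshold acts as a
            continue                            # lightweight quality filter
        text = extract_text(submission["url"])  # scrape the linked page's text
        if text:
            corpus.append(text)
    return corpus
```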
Key aspects discussed in the paper include:
GPT-2's ability to perform various NLP tasks in a zero-shot setting (without task-specific fine-tuning).
The finding that scaling up model size improves performance across tasks like text generation, summarization, and translation (see the size-comparison sketch below).
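A quick way to make the scaling trend concrete is to compare the four released model sizes. This sketch assumes the Hugging Face `transformers` checkpoints, whose names roughly correspond to the four sizes studied in the paper (about 117M, 345M, 762M, and 1.5B parameters); note that downloading all four is several gigabytes.

```python
from transformers import GPT2LMHeadModel

# Count parameters for each released GPT-2 size.
for name in ["gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl"]:
    model = GPT2LMHeadModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```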
The paper is available here: Language Models are Unsupervised Multitask Learners (openai.com)