The paper "BERT: Pretraining of Deep Bidirectional Transformers for Language Understanding" (Devlin et al., 2018) introduces BERT (Bidirectional Encoder Representations from Transformers), which revolutionized NLP by leveraging bidirectional context during pre-training. BERT uses Masked Language Modeling (MLM) as a key pre-training task: random tokens in the input are masked, and the model learns to predict them based on surrounding tokens, allowing it to understand context from both sides of a word. BERT also uses a Next Sentence Prediction (NSP) task to improve sentence-level understanding.
Bidirectionality: While GPT is an autoregressive model that processes text left-to-right (unidirectional), BERT processes text bidirectionally, capturing both left and right contexts simultaneously.
Masked Language Modeling (MLM) vs. Causal Language Modeling (CLM): In GPT, the model predicts the next word in a sequence (causal modeling), while BERT randomly masks words during training and predicts them based on both previous and future tokens.
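To make the contrast concrete, here is a toy sketch (not from the paper) showing how the two objectives turn the same sentence into training targets: causal LM pairs each prefix with the next token, while masked LM hides a token and asks the model to recover it from context on both sides. The sentence and the masked position are arbitrary choices for illustration.

```python
# Toy contrast between causal LM (GPT-style) and masked LM (BERT-style) targets.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Causal LM: predict each token from the tokens to its left only.
clm_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g. (["the"], "cat"), (["the", "cat"], "sat"), ...

# Masked LM: hide one token and predict it using context from both sides.
masked_index = 2                      # arbitrary choice for illustration
mlm_input = list(tokens)
mlm_input[masked_index] = "[MASK]"    # ["the", "cat", "[MASK]", "on", "the", "mat"]
mlm_target = tokens[masked_index]     # "sat"

print(clm_pairs)
print(mlm_input, "->", mlm_target)
```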
Masked Language Model (MLM): During pre-training, 15% of the tokens in each input sequence are selected at random; of these, 80% are replaced with a [MASK] token, 10% with a random token, and 10% are left unchanged. The model is trained to predict the original token at each selected position using context from both directions.
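A minimal sketch of this masking procedure is below, assuming whitespace tokens and a tiny made-up vocabulary rather than WordPiece subword IDs; the function name mask_tokens and the vocab list are hypothetical, but the 15% selection rate and the 80/10/10 split follow the paper.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, mask_token="[MASK]"):
    # Hypothetical helper: applies BERT-style masking to a list of tokens.
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:       # ~15% of positions are selected
            labels.append(tok)                # model must recover the original token
            r = random.random()
            if r < 0.8:
                inputs.append(mask_token)             # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(vocab))   # 10%: replace with a random token
            else:
                inputs.append(tok)                    # 10%: keep the original token
        else:
            inputs.append(tok)
            labels.append(None)               # no prediction at unselected positions
    return inputs, labels

# "my dog is hairy" is the example sentence from the paper's appendix;
# the vocabulary here is a made-up stand-in.
print(mask_tokens("my dog is hairy".split(), ["my", "dog", "is", "hairy", "cat", "apple"]))
```

The 10% random and 10% unchanged cases exist because [MASK] never appears during fine-tuning; keeping some selected tokens intact or corrupted forces the model to maintain a useful contextual representation for every input token, not just the masked symbol.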
Next Sentence Prediction (NSP): BERT also learns relationships between sentences. Given a pair of sentences A and B, the model predicts whether B actually follows A in the original text; during pre-training, B is the true next sentence 50% of the time and a random sentence from the corpus otherwise.
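The sketch below shows how an NSP training pair could be assembled under those 50/50 odds; make_nsp_example is a hypothetical helper (not the paper's actual data pipeline), and the sample sentences are borrowed from the paper's NSP illustration.

```python
import random

def make_nsp_example(doc_sentences, corpus_sentences, i):
    """Hypothetical helper: build one NSP pair. Sentences are token lists;
    `i` indexes sentence A within its document."""
    sent_a = doc_sentences[i]
    if random.random() < 0.5:
        sent_b, label = doc_sentences[i + 1], "IsNext"               # true next sentence
    else:
        sent_b, label = random.choice(corpus_sentences), "NotNext"   # random sentence
    # BERT packs the pair into a single input with special tokens:
    packed = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    return packed, label

doc = [["the", "man", "went", "to", "the", "store"],
       ["he", "bought", "a", "gallon", "of", "milk"]]
corpus = doc + [["penguins", "are", "flightless", "birds"]]
print(make_nsp_example(doc, corpus, 0))
```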
GPT uses a stack of Transformer decoder layers, whereas BERT uses only Transformer encoder layers.
Causal masking in GPT ensures that each token can attend only to itself and the tokens that precede it, whereas BERT's bidirectional self-attention lets every token attend to every other token in the sequence.
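A small sketch of the two attention patterns follows, using boolean masks where True marks the positions a query is allowed to attend to; the sequence length and the NumPy representation are just for illustration.

```python
import numpy as np

seq_len = 5  # arbitrary length for illustration

# GPT-style causal mask: each position may attend only to itself and
# earlier positions (lower-triangular pattern).
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# BERT-style bidirectional mask: every position may attend to every position
# (in practice only padding tokens would be masked out).
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)

print(causal_mask.astype(int))
print(bidirectional_mask.astype(int))
```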
In essence, BERT is optimized for deep bidirectional understanding of language, while GPT focuses on autoregressive text generation. The original BERT paper is available on arXiv as arXiv:1810.04805.