GPT introduction

GPT3

GPT3 is a decoder-only model (see below for its architecture compared to the original Transformer).

It uses only the decoder part of the Transformer architecture.
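For illustration, a minimal decoder-only block looks like the sketch below (PyTorch; the hidden size, number of heads, and normalization placement are illustrative assumptions, not GPT3's exact configuration):

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    # One decoder-only Transformer block: masked self-attention + feed-forward.
    # No cross-attention, unlike the decoder of the original Transformer.
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: position i may only attend to positions <= i.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        a, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + a)
        return self.norm2(x + self.ff(x))
```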

It is trained with a 'next word prediction' objective.

Given a question as the initial input sequence, it outputs the answer word by word through the decoder.

Each predicted word is appended to the current input sequence to form the new input for predicting the next word.
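A minimal sketch of this decoding loop, assuming a hypothetical `model(input_ids)` that returns next-token logits and a tokenizer with `encode`, `decode`, and an `eos_token_id` (these names are assumptions, not a specific library's API):

```python
import torch

def generate(model, tokenizer, question, max_new_tokens=100):
    # Encode the question as the initial input sequence.
    ids = tokenizer.encode(question)                  # list of token ids (assumed API)
    input_ids = torch.tensor([ids])                   # shape: (1, seq_len)

    for _ in range(max_new_tokens):
        logits = model(input_ids)                     # (1, seq_len, vocab_size), assumed API
        next_id = logits[0, -1].argmax()              # greedy pick of the next word
        if next_id.item() == tokenizer.eos_token_id:  # stop at End of Sequence (EOS)
            break
        # Append the prediction to the input for the next step.
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

    return tokenizer.decode(input_ids[0].tolist())    # assumed API
```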

This is unsupervised (self-supervised) training: the next word itself serves as the label, so no manual annotation is needed.

It uses large text datasets, such as Reddit Q&A, to train the decoder model.
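Concretely, 'next word prediction' is a cross-entropy loss between the model's output at each position and the word that actually follows it. A sketch, again with a hypothetical `model` returning logits over the vocabulary:

```python
import torch.nn.functional as F

def lm_loss(model, token_ids):
    # token_ids: (batch, seq_len) tensor of tokenized raw text
    inputs  = token_ids[:, :-1]               # the model sees words 0 .. n-1
    targets = token_ids[:, 1:]                # and must predict words 1 .. n
    logits  = model(inputs)                   # (batch, seq_len-1, vocab_size), assumed API
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # one prediction per position
        targets.reshape(-1),                  # the word that actually follows
    )
```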

It always predicts only the next word, but generation continues until the whole answer sequence has been produced, i.e. until an End of Sequence (EOS) token is predicted.

This pre-trained GPT3 serves as the initial Language Model (LM).

From GPT3 to ChatGPT

There are three major steps:

1. Supervised fine tuning (SFT)

  Use queries and responses written by users / labelers to fine-tune the pre-trained GPT3 model.
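Conceptually, SFT reuses the same next-word loss, but on query + response pairs, and typically only the response tokens contribute to the loss. A sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def sft_loss(model, query_ids, response_ids):
    # Concatenate the query and the labeler-written response into one sequence.
    token_ids = torch.cat([query_ids, response_ids], dim=1)  # (batch, q_len + r_len)
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                                   # assumed API

    # Mask out targets that are still part of the query, so only
    # response tokens contribute to the loss (a common choice, assumed here).
    targets = targets.clone()
    targets[:, : query_ids.size(1) - 1] = -100               # ignored index below

    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```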

  

2. Reward model (RM)

For a given query, the fine-tuned SFT model generates several different responses by using different decoding strategies (e.g. top-k, nucleus, greedy).
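For illustration, here is how these strategies pick the next token from the same logits vector (a sketch; the `k` and `p` values are the usual hyperparameters, chosen arbitrarily here):

```python
import torch

def pick_next_token(logits, strategy="greedy", k=50, p=0.9):
    # logits: (vocab_size,) scores for the next token from the SFT model
    if strategy == "greedy":
        return logits.argmax()

    if strategy == "topk":
        # Sample only among the k highest-scoring tokens.
        values, indices = torch.topk(logits, k)
        return indices[torch.multinomial(torch.softmax(values, dim=-1), 1)].squeeze()

    if strategy == "nucleus":
        # Sample from the smallest set of tokens whose total probability exceeds p.
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        keep = torch.cumsum(probs, dim=-1) <= p
        keep[0] = True                        # always keep the most likely token
        probs = probs * keep
        return sorted_idx[torch.multinomial(probs / probs.sum(), 1)].squeeze()
```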

A human then reviews the responses and ranks them.

The queries, responses, and ranks are used to train another model, i.e. the reward model.

The output of the Reward Model is a scalar reward representing human preference.

The input is the query + response pair, instead of only the query.

The Reward Model can be based on a clone of the SFT model or another fine-tuned GPT model. 

The reason is that the Reward Model needs to be as capable as the initial GPT model in order to understand the query + response pair.

Once trained, the RM is able to evaluate the quality (reward) of query + response pairs.
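A common way to turn the human rankings into a training signal is a pairwise loss: for each pair where one response was ranked above another, the scalar reward of the preferred query + response pair is pushed above that of the rejected one. A sketch of this standard formulation (an assumption about the exact loss used):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, query_ids, better_ids, worse_ids):
    # The Reward Model maps a (query + response) sequence to a single scalar.
    r_better = reward_model(torch.cat([query_ids, better_ids], dim=1))  # (batch,), assumed API
    r_worse  = reward_model(torch.cat([query_ids, worse_ids],  dim=1))  # (batch,)

    # Push the reward of the human-preferred response above the rejected one:
    # loss = -log sigmoid(r_better - r_worse)
    return -F.logsigmoid(r_better - r_worse).mean()
```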

3. Fine tuning with Reinforcement Learning (RL)

This step fine-tunes part of the parameters of the SFT model using a policy-gradient RL algorithm, namely Proximal Policy Optimization (PPO).

Tuning all the parameters is prohibitively expensive, so for a large model usually only a subset of the parameters is tuned.

In this step, unseen input sequences are fed into a clone of the SFT model from step 1.

The input sequences and the generated responses are passed to the Reward Model from step 2 to evaluate the quality of the responses.

The reward output by the Reward Model is then used to fine-tune the SFT model.

This allows the SFT model to incorporate more human-like characteristics, or other behaviors we want it to learn.


The SFT model being tuned is sometimes referred to as a policy; this is where Reinforcement Learning with PPO comes in.
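Very roughly, one RL iteration works as follows: the policy generates a response to an unseen query, the Reward Model scores it, and a KL penalty against the frozen SFT model keeps the policy from drifting too far from it. The sketch below shows only this reward-shaping idea, not a full PPO implementation; the KL coefficient and the per-sequence log-probabilities are assumptions about a typical setup:

```python
def rlhf_reward(policy_logprob, sft_logprob, rm_score, kl_coef=0.1):
    # policy_logprob / sft_logprob: log-probability of the generated response
    # under the tuned policy and under the frozen SFT model, respectively.
    # rm_score: scalar reward from the Reward Model for the query + response pair.
    kl_penalty = policy_logprob - sft_logprob      # approximate per-sample KL term
    return rm_score - kl_coef * kl_penalty         # the reward PPO actually maximizes

# One iteration, in outline:
#   1. sample responses to unseen queries from the policy (the SFT clone)
#   2. compute rlhf_reward(...) for each query + response pair
#   3. update the tunable subset of policy parameters with PPO so that
#      high-reward responses become more likely
```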

The whole pipeline of reward modelling and RL fine-tuning is called Reinforcement Learning from Human Feedback (RLHF).