Fine-tuned LAnguage Net (FLAN) is an instruction-tuned NLP model that performs well in zero-shot learning. It was introduced by Google AI on October 6, 2021.
FLAN is a zero-shot learner: a 137B-parameter Base LM model is instruction tuned to build FLAN, which solves general NLP tasks rather than a single specific task.
Need for better Large Zero-shot Models
Large Language Models (LMs) like GPT-3 are good at few-shot learning but much weaker at zero-shot learning. Large LMs acquire broad, general knowledge as they scale.
Fine-tuning is one of the methods to unlock this huge store of knowledge in LMs and apply it to real-world tasks. Fine-tuning is a process in which we take a pre-trained model (e.g., BERT, T5) and train it further on specific labelled data for a specific application. But fine-tuning requires labelled examples and a separate set of stored model weights for each task, which is not practical for large LMs.
DATASET
Cluster
Creating an instruction-tuning dataset from scratch would be resource-intensive, so existing datasets from the research community were transformed into an instructional format. 62 public TensorFlow text datasets, spanning both language understanding and language generation tasks, were combined into a single mixture. Each dataset is assigned to one of 12 task clusters, where all datasets in a given cluster share the same task type (for example, natural language inference, translation, or summarization).
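As a rough illustration (not the authors' pipeline), the public datasets can be pulled from TensorFlow Datasets and tagged with a task cluster; the cluster assignments and dataset names below are a small, hypothetical subset of the 62 datasets and 12 clusters used in the paper.

```python
# Illustrative sketch only: a handful of public TensorFlow Datasets grouped
# by task cluster. The real mixture uses 62 datasets across 12 clusters.
import tensorflow_datasets as tfds

TASK_CLUSTERS = {  # hypothetical subset of the paper's clusters
    "natural_language_inference": ["snli", "anli"],
    "sentiment": ["imdb_reviews"],
    "summarization": ["xsum"],
}

def load_cluster(cluster_name):
    """Load the training split of every dataset in one task cluster."""
    return {name: tfds.load(name, split="train")
            for name in TASK_CLUSTERS[cluster_name]}

sentiment_datasets = load_cluster("sentiment")
```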
Template
Templates are used to transform existing datasets into an instructional format. For each dataset, ten unique templates that use natural language instructions to describe the task are manually composed. While most of the ten templates describe the original task, to increase diversity, up to three templates per dataset "turn the task around" (e.g., for sentiment classification, templates asking the model to generate a movie review are included).
A pretrained language model (Base LM) is instruction-tuned on the mixture of all datasets, with examples in each dataset formatted via a randomly selected instruction template for that dataset.
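A minimal sketch of this formatting step is shown below; the templates, field names, and example are invented for illustration and are not the paper's actual templates.

```python
# Hypothetical NLI templates, including one "turned around" template that
# asks the model to generate text instead of classifying it.
import random

NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis?",
    "{premise}\nBased on the paragraph above, can we conclude that "
    "\"{hypothesis}\"?",
    "Write a sentence that follows from: \"{premise}\"",
]

def format_example(example):
    """Render one dataset example with a randomly selected template."""
    template = random.choice(NLI_TEMPLATES)
    return template.format(**example)

print(format_example({"premise": "The dog is sleeping on the porch.",
                      "hypothesis": "An animal is resting."}))
```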
Training
Base LM Model - Architecture & Pretraining
A dense left-to-right, decoder-only transformer language model with 137B parameters, called Base LM, was selected and instruction tuned to form the FLAN model. Base LM is pretrained on a collection of web documents (including those containing computer code), dialog data, and Wikipedia, tokenized into 2.81T BPE tokens with a 32k vocabulary using the SentencePiece library.
Around 10% of the pretraining data was non-English. This dataset is not as clean as the GPT-3 training set and also contains a mixture of dialog and code, so the zero- and few-shot performance of this pretrained LM on NLP tasks is expected to be slightly lower. Base LM was also previously used for program synthesis (Austin et al., 2021).
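For reference, a 32k BPE vocabulary can be built with the SentencePiece library roughly as follows; the corpus file and all settings other than the vocabulary size and BPE model type are assumptions, not details from the paper.

```python
import sentencepiece as spm

# Train a 32k-vocabulary BPE tokenizer (corpus path is hypothetical).
spm.SentencePieceTrainer.train(
    input="pretraining_corpus.txt",
    model_prefix="base_lm_bpe",
    vocab_size=32000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="base_lm_bpe.model")
print(sp.encode("FLAN is instruction tuned on many tasks.", out_type=str))
```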
Instruction Tuning
FLAN is the instruction-tuned version of Base LM. Instruction tuning (IT) is the process of fine-tuning a language model on a collection of datasets described via instructions. IT boosts zero-shot performance on unseen tasks.
The IT pipeline mixes all datasets and randomly samples from each one. To balance the different dataset sizes, the number of training examples per dataset is limited to 30k.
The examples-proportional mixing scheme (Raffel et al., 2020) with a maximum mixing rate of 3,000 is adopted.
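A small sketch of this mixing scheme follows, under the assumption that sampling weights are proportional to the capped dataset sizes; the dataset names and sizes are placeholders.

```python
MAX_EXAMPLES_PER_DATASET = 30_000   # per-dataset training example limit
MIXING_RATE_MAX = 3_000             # cap used when computing sampling weights

# Placeholder dataset sizes after the 30k per-dataset limit.
dataset_sizes = {"dataset_a": 25_000, "dataset_b": 30_000, "dataset_c": 900}

capped = {name: min(size, MIXING_RATE_MAX)
          for name, size in dataset_sizes.items()}
total = sum(capped.values())
mixing_weights = {name: size / total for name, size in capped.items()}

# Probability of drawing the next training example from each dataset.
print(mixing_weights)
```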
All models are fine-tuned for 30k gradient steps with a batch size of 8,192 tokens using the Adafactor optimizer with a learning rate of 3e-5. The input and target sequence lengths are 1,024 and 256 tokens, respectively.
Packing is used to combine multiple training examples into a single sequence. Inputs are separated from targets using a special EOS token.
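A minimal sketch of packing under these settings is given below; the token ids, the EOS id, and the greedy packing strategy are assumptions, not the authors' implementation.

```python
EOS_ID = 1        # assumed id of the special EOS token
MAX_LEN = 1024    # input sequence length used during fine-tuning

def pack(examples, max_len=MAX_LEN):
    """Greedily pack (input_ids, target_ids) pairs into sequences of token
    ids, separating each input from its target with the EOS token."""
    packed, current = [], []
    for input_ids, target_ids in examples:
        segment = input_ids + [EOS_ID] + target_ids
        if current and len(current) + len(segment) > max_len:
            packed.append(current)
            current = []
        current.extend(segment)
    if current:
        packed.append(current)
    return packed

print(pack([([5, 6, 7], [8]), ([9, 10], [11, 12])], max_len=8))
```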
For all evaluations, results are reported on the final checkpoint trained for 30k steps.
Evaluation
To compare FLAN's performance with other models, established benchmark datasets were used. To evaluate FLAN on c task clusters, c models are instruction tuned, where each model holds out a different task cluster for evaluation.
For example, training on one question-answering (QA) dataset might help the model do better on another QA dataset, so the results would be skewed if the training and evaluation datasets share the same task type. Hence, all datasets are grouped into clusters by task type, and for each evaluation not just the evaluation dataset's own training split but the entire task cluster it belongs to is held out.
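The split can be pictured as in the sketch below; the cluster names and dataset lists are illustrative placeholders, not the paper's full grouping.

```python
TASK_CLUSTERS = {  # illustrative subset of the 12 task clusters
    "natural_language_inference": ["anli", "rte", "cb"],
    "closed_book_qa": ["arc", "nq", "triviaqa"],
    "sentiment": ["imdb", "sst2"],
}

def leave_one_cluster_out(held_out_cluster):
    """Return (training datasets, evaluation datasets) for one held-out cluster."""
    train = [d for cluster, datasets in TASK_CLUSTERS.items()
             if cluster != held_out_cluster for d in datasets]
    evaluation = TASK_CLUSTERS[held_out_cluster]
    return train, evaluation

train_sets, eval_sets = leave_one_cluster_out("natural_language_inference")
# NLI data is never seen during instruction tuning, so NLI evaluation is zero-shot.
```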
Inference
The output space for a given task is either:
classification (one of several classes), or
generation (free text)
As FLAN is an instruction-tuned version of a decoder-only language model, it naturally responds in free text, and so no further modifications are needed for generation tasks.
For classification tasks, prior work used a rank classification approach, considering only a fixed set of outputs (e.g., "yes" and "no") and picking the more probable one. A drawback is that the probability mass for an answer may be spread across many ways of phrasing it (e.g., a large number of alternative ways of saying "yes" may lower the probability mass assigned to "yes"). FLAN therefore adds an options suffix: the token OPTIONS is appended to the end of a classification prompt, along with a list of the output classes for that task, so the model knows which choices are desired.
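A sketch of what such a prompt could look like is shown below; the exact wording and layout of the suffix are assumptions.

```python
def add_options_suffix(prompt, classes):
    """Append the OPTIONS token and candidate classes to a classification prompt."""
    return prompt + "\nOPTIONS:\n" + "\n".join(f"- {c}" for c in classes)

print(add_options_suffix(
    "Movie review: the plot was thin, but the acting was superb.\n"
    "Is this review positive or negative?",
    ["positive", "negative"],
))
```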
Results
IT is very effective on tasks naturally verbalized as instructions (e.g., NLI, QA, translation, struct-to-text) and less effective on tasks directly formulated as language modeling, where instructions would be largely redundant (e.g., commonsense reasoning and coreference resolution tasks formatted as finishing an incomplete sentence or paragraph).
FLAN surpasses zero-shot 175B GPT-3 on 20 of 25 datasets. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of instruction-tuning datasets and model scale are key to the success of instruction tuning.
Super Scaling
Model scale is very important for a model to benefit from instruction tuning. At smaller scales, the FLAN technique actually degrades performance; only at larger scales is the model able to generalize from the instructions in the training data to unseen tasks. This is because models that are too small do not have enough parameters to learn a large number of tasks.
The effect of instruction tuning (IT) is evaluated on models of size 422M, 2B, 8B, 68B, and 137B parameters. For small-scale models, learning the ∼40 tasks used during IT fills the entire model capacity, causing these models to perform worse on new tasks. For the larger scale models, IT fills up some model capacity but also teaches these models the ability to follow instructions, allowing them to generalize to new tasks with the remaining capacity.
Though FLAN is not the first model trained on a set of instructions, it is the first in which the instruction-tuning technique is applied at this scale, clearly improving the model's ability to generalize.