Large Language Models Are Human-Level Prompt Engineers

ICLR 2023

Yongchao Zhou*, Andrei Ioan Muresanu*, Ziwen Han*, Keiran Paster, Silviu Pitis, Harris Chan, Jimmy Ba

[ ArXiv | GitHub | Colab | Demo ]

Abstract

By conditioning on natural language instructions, large language models (LLMs) have displayed impressive capabilities as general-purpose computers. However, task performance depends significantly on the quality of the prompt used to steer the model, and the most effective prompts have been handcrafted by humans. Inspired by classical program synthesis and the human approach to prompt engineering, we propose Automatic Prompt Engineer (APE) for automatic instruction generation and selection. In our method, we treat the instruction as the “program,” optimized by searching over a pool of instruction candidates proposed by an LLM in order to maximize a chosen score function. To evaluate the quality of the selected instruction, we evaluate the zero-shot performance of another LLM following the selected instruction. Extensive experiments show that our automatically generated instructions outperform the prior LLM baseline by a large margin and achieve better or comparable performance to the instructions generated by human annotators on 24/24 Instruction Induction tasks and 17/21 curated BIG-Bench tasks. We conduct extensive qualitative and quantitative analyses to explore the performance of APE. We show that APE-engineered prompts can be applied to improve few-shot learning performance by simply prepending them to standard in-context learning prompts, to find better zero-shot chain-of-thought prompts, as well as to steer models toward truthfulness and/or informativeness.

How does APE work?

Our method, Automatic Prompt Engineer (APE), automatically generates instructions for a task that is specified via output demonstrations: it generates several instruction candidates, either via direct inference or a recursive process based on semantic similarity, executes them using the target model, and selects the most appropriate instruction based on computed evaluation scores.
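Concretely, the search can be summarized as a propose-score-select loop. Below is a minimal Python sketch of that loop; llm_generate and llm_execute are hypothetical stand-ins for calls to the proposal and target LLMs, and the proposal prompt loosely paraphrases the paper's forward-mode template rather than reproducing it exactly.

from typing import Callable, List, Tuple

def ape_select(
    demos: List[Tuple[str, str]],              # (input, output) demonstrations
    llm_generate: Callable[[str], List[str]],  # proposal LLM: prompt -> candidate instructions
    llm_execute: Callable[[str, str], str],    # target LLM: (instruction, input) -> output
) -> str:
    """Propose instruction candidates, score each by zero-shot
    execution accuracy on the demos, and return the best one."""
    demo_text = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)
    proposal_prompt = (
        "I gave a friend an instruction. Based on the instruction they produced "
        f"the following input-output pairs:\n\n{demo_text}\n\nThe instruction was:"
    )
    candidates = llm_generate(proposal_prompt)

    def score(instruction: str) -> float:
        # Execution accuracy: the fraction of demonstrations the target
        # model answers correctly when steered by this instruction.
        hits = sum(llm_execute(instruction, x).strip() == y.strip() for x, y in demos)
        return hits / len(demos)

    return max(candidates, key=score)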

Because the search space is infinitely large, finding the right instruction can be extremely difficult. APE therefore uses large language models as inference models to propose candidates. We consider two approaches to generating high-quality candidates, namely forward mode generation and reverse mode generation.
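The two modes differ in where the instruction slot sits in the proposal prompt. The templates below are illustrative paraphrases of the paper's wording, not its exact strings; [DEMOS] stands for the formatted input-output pairs and [INSERT] marks the infill slot.

# Forward mode: the instruction is completed at the *end* of the prompt,
# so any ordinary left-to-right LLM can generate it.
FORWARD_TEMPLATE = (
    "I gave a friend an instruction. Based on the instruction they produced "
    "the following input-output pairs:\n\n[DEMOS]\n\nThe instruction was"
)

# Reverse mode: the instruction slot sits in the *middle* of the prompt,
# which requires a model with infilling capability, but lets the
# demonstrations appear after the instruction, matching how the
# instruction is actually used at test time.
REVERSE_TEMPLATE = (
    "I instructed my friend to [INSERT]. Based on the instruction they "
    "produced the following input-output pairs:\n\n[DEMOS]"
)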

Results

We examine how APE can guide LLMs toward desired behaviors from three perspectives: zero-shot performance, few-shot in-context learning performance, and truthfulness.

Instruction Induction

Zero-shot test accuracy on 24 Instruction Induction tasks. APE achieves human-level performance on 24 out of 24 tasks.

Few-shot in-context test accuracy on 24 Instruction Induction tasks. APE improves the few-shot in-context learning performance on 21 out of 24 tasks.
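As noted in the abstract, the few-shot gains come from simply prepending the selected instruction to a standard in-context learning prompt. A minimal sketch of that construction follows; the exact formatting of the shots is an assumption, not the paper's.

from typing import List, Tuple

def build_few_shot_prompt(
    instruction: str, demos: List[Tuple[str, str]], query: str
) -> str:
    """Prepend an APE-selected instruction to a standard
    in-context learning prompt (illustrative format)."""
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)
    return f"{instruction}\n\n{shots}\n\nInput: {query}\nOutput:"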

BIG-Bench

We curate 21 tasks from BIG-Bench with human-engineered prompt baselines, including nine tasks from BIG-Bench Hard (Suzgun et al., 2022). We call this dataset BIG-Bench Instruction Induction (BBII).

APE is able to improve or match human prompts in zero-shot performance on 17/21 tasks.


Zero-shot CoT

APE discovers a better general chain-of-thought prompt than the human-engineered "Let's think step by step" from Kojima et al. (2022): "Let's work this out in a step by step way to be sure we have the right answer."

This prompt, used to elicit chain-of-thought reasoning, improves performance on MultiArith (Roy & Roth, 2016) from 78.7 to 82.0 and on GSM8K (Cobbe et al., 2021) from 40.7 to 43.0, placing it #1 on the leaderboard as of 2022.
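In the style of Kojima et al. (2022), the trigger phrase is simply appended after the question to elicit a reasoning chain. A short sketch (the second-stage answer-extraction call used in that work is omitted here):

# Zero-shot CoT prompting: append a trigger phrase after the question.
HUMAN_TRIGGER = "Let's think step by step."
APE_TRIGGER = (
    "Let's work this out in a step by step way to be sure we have the right answer."
)

def zero_shot_cot_prompt(question: str, trigger: str = APE_TRIGGER) -> str:
    return f"Q: {question}\nA: {trigger}"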


TruthfulQA

(a) Average test performance of APE instructions: the percentage of answers that were either true (% True), informative (% Info), or both (% True + % Info).

(b) %True-%Info trade-off: the %True-%Info frontier computed on test data with the top 10 instructions selected by each metric.

We compare the APE prompt with the human prompt from Lin et al. (2022). Figure (a) shows that APE instructions outperform the human-engineered prompt on all three metrics. Figure (b) investigates the trade-off between truthfulness and informativeness using the top 10 candidates ranked by different metrics; the APE instructions tend to target the two ends of this %True-%Info Pareto frontier.
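To make the frontier in figure (b) concrete, here is a minimal sketch of Pareto filtering over (%True, %Info) scores; the helper and data layout are assumptions for illustration, not the paper's code.

from typing import Dict, List, Tuple

def pareto_frontier(scores: Dict[str, Tuple[float, float]]) -> List[str]:
    """Keep instructions that are not dominated: no other instruction is
    at least as truthful *and* at least as informative, and strictly
    better on one of the two."""
    frontier = []
    for name, (t, i) in scores.items():
        dominated = any(
            t2 >= t and i2 >= i and (t2 > t or i2 > i)
            for other, (t2, i2) in scores.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier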

BibTeX

@article{zhou2022large,
  title={Large Language Models Are Human-Level Prompt Engineers},
  author={Yongchao Zhou and Andrei Ioan Muresanu and Ziwen Han and Keiran Paster and Silviu Pitis and Harris Chan and Jimmy Ba},
  year={2022},
  eprint={2211.01910},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

Acknowledgement

We would like to thank Or Honovich and Michael Zhang for their help and valuable feedback. JB was supported by NSERC Grant [2020-06904], CIFAR AI Chairs program, Google Research Scholar Program and Amazon Research Award. KP was supported by NSERC PGS-D. SP was supported by NSERC CGS-D. HC was supported by NSERC CGS-D and RBC Graduate Fellowship. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute for Artificial Intelligence.