Evoke: Evoking Critical Thinking Abilities in LLMs via Reviewer-Author Prompt Editing

Abstract

Large language models (LLMs) have made impressive progress in natural language processing. These models rely on proper human instructions (or prompts) to generate suitable responses. However, the potential of LLMs is not fully harnessed by commonly used prompting methods: many human-in-the-loop algorithms employ ad-hoc procedures for prompt selection, while automatic prompt generation approaches essentially search the space of all possible prompts randomly and inefficiently. We propose Evoke, an automatic prompt refinement framework. In Evoke, there are two instances of the same LLM: one acts as a reviewer (LLM-Reviewer) and scores the current prompt; the other acts as an author (LLM-Author) and edits the prompt, taking into account the edit history and the reviewer's feedback. This author-reviewer feedback loop ensures that the prompt is refined in each iteration. We further integrate a data selection approach into Evoke, where only the hard samples are exposed to the LLM. The hard samples matter more because the LLM can develop a deeper understanding of the task from them, while the model may already know how to solve the easier cases. Experimental results show that Evoke significantly outperforms existing methods. For instance, on the challenging task of logical fallacy detection, Evoke scores above 80, while all other baseline methods struggle to reach 20.
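To make the data selection idea concrete, here is a minimal sketch of hard-sample filtering, assuming access to a per-sample confidence score under the current prompt; `select_hard_samples` and `model_confidence` are hypothetical names, not the paper's actual interface.

```python
# Hypothetical sketch of Evoke's hard-sample selection: only examples the
# model finds difficult are exposed to the LLM during prompt refinement.

def select_hard_samples(samples, model_confidence, threshold=0.5):
    """Keep samples whose confidence under the current prompt is below `threshold`.

    `model_confidence(sample)` is an assumed callable returning a value in [0, 1].
    """
    return [s for s in samples if model_confidence(s) < threshold]
```

The intuition: easy cases the model already solves contribute little signal, while hard cases push the LLM toward a deeper understanding of the task.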

[Figure: Comparison between hand-crafted prompts and Evoke-generated prompts]

How does Evoke work?

The workflow comprises three steps. First, the LLM-Author edits prompts from previous iterations, taking into account past edits and the feedback from the LLM-Reviewer. Second, the LLM-Reviewer scores the revised prompts, and the top-n candidates with the highest scores are retained for subsequent iterations. Finally, the task accuracy of each retained prompt is computed. Throughout, the LLM-Reviewer employs a memory module that stores past edits, the prompts, and the task accuracy of those prompts.
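As a rough illustration, here is a minimal sketch of this loop, assuming a generic `llm(prompt) -> str` completion function and a small labeled dev set; the helper names (`author_edit`, `reviewer_score`, `task_accuracy`) and the prompt templates are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the Evoke author-reviewer loop, assuming `llm` is a
# text-in/text-out completion function. All templates are illustrative.

def author_edit(llm, prompt, memory):
    """LLM-Author proposes a revised prompt given past edits and feedback."""
    history = "\n".join(
        f"prompt: {p} | reviewer score: {s} | accuracy: {a}" for p, s, a in memory
    )
    return llm(
        "You are a prompt author. Improve the current prompt.\n"
        f"Edit history and reviewer feedback:\n{history}\n"
        f"Current prompt: {prompt}\nRevised prompt:"
    )

def reviewer_score(llm, prompt):
    """LLM-Reviewer rates the prompt; assumes the model replies with a bare number."""
    reply = llm(f"You are a prompt reviewer. Score this prompt from 0 to 10:\n{prompt}\nScore:")
    return float(reply.strip())

def task_accuracy(llm, prompt, dev_set):
    """Fraction of labeled dev examples the prompt answers correctly."""
    correct = sum(llm(f"{prompt}\nInput: {x}\nAnswer:").strip() == y for x, y in dev_set)
    return correct / len(dev_set)

def evoke_loop(llm, seed_prompts, dev_set, iterations=5, top_n=3):
    """Run the author-reviewer refinement loop and return the best prompt found."""
    memory = []  # (prompt, reviewer score, task accuracy) triples
    prompts = list(seed_prompts)
    for _ in range(iterations):
        # Step 1: the LLM-Author revises each prompt using the memory of past edits.
        candidates = [author_edit(llm, p, memory) for p in prompts]
        # Step 2: the LLM-Reviewer scores the revisions; keep the top-n.
        scored = sorted(((reviewer_score(llm, p), p) for p in candidates), reverse=True)
        top = scored[:top_n]
        # Step 3: compute task accuracy for the retained prompts and record it.
        for score, p in top:
            memory.append((p, score, task_accuracy(llm, p, dev_set)))
        prompts = [p for _, p in top]
    return max(memory, key=lambda t: t[2])[0]  # highest task accuracy wins
```

Keeping the full (prompt, score, accuracy) history in memory is what lets the LLM-Author reason about which past edits helped, rather than searching the prompt space blindly.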

Results

We evaluate how Evoke enhances the performance of LLMs across a variety of tasks, including:

1) Instruction Induction

2) BIG-Bench

3) Adversarial SST-2 and QQP

4) Named Entity Recognition

For example, on the challenging logical fallacy detection task from BBII, Evoke scores above 80, while both APE and human-written prompts score below 20.

This is because Evoke is adept at conceptualizing the core definition of a task, decomposing a complex task into smaller subtasks, and curating relevant demonstrations accompanied by detailed explanations. 

To demonstrate the power of Evoke, we show the generated prompt for logical fallacy detection below.

We observe that Evoke significantly outperforms all baselines on all tasks, and the performance gain is most pronounced on adversarially constructed datasets.

The tasks above are all sentence-level classification tasks, e.g., deciding whether a sentence expresses positive or negative sentiment. Here, we show that Evoke can also handle more fine-grained tasks, such as token-level named entity recognition.