Large language models (LLMs) today primarily self-improve via two mechanisms: self-reflection for context updates, and reinforcement learning (RL) for weight updates. We propose Evolutionary System Prompt Learning (E-SPL), a method for jointly improving model contexts and model weights:
In each RL iteration, E-SPL samples trajectories under multiple system prompts in parallel.
It applies RL updates to LLM weights conditioned on system prompts, and evolutionary updates to system prompts via mutation and crossover, two genetic operators based on LLM self-reflection.
Each system prompt is assigned a TrueSkill rating for evolutionary selection, updated from relative performance within each RL iteration.
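The loop above can be sketched in miniature. Everything here is our own toy stand-in, not the paper's implementation: `toy_score` replaces mean trajectory reward under a system prompt, an Elo-style pairwise update replaces TrueSkill, the RL gradient step is elided, and `mutate` is an arbitrary string edit.

```python
def toy_score(prompt, problem):
    # Toy stand-in for the mean trajectory reward earned under a system prompt.
    return len(set(prompt) & set(problem)) / (len(set(problem)) or 1)

def elo_update(ratings, scores, k=32.0):
    # Elo-style stand-in for the paper's TrueSkill ratings: each pair of
    # prompts in the iteration is treated as a match decided by mean score.
    for a in scores:
        for b in scores:
            if a == b:
                continue
            expected = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400))
            outcome = 1.0 if scores[a] > scores[b] else 0.5 if scores[a] == scores[b] else 0.0
            ratings[a] += k * (outcome - expected)
    return ratings

def espl_iteration(prompts, ratings, problems, mutate):
    # 1. "Sample trajectories" under every system prompt on the same batch.
    scores = {p: sum(toy_score(p, q) for q in problems) / len(problems)
              for p in prompts}
    # 2. (The RL gradient step on weights, conditioned on each system
    #    prompt, would happen here; it is elided in this sketch.)
    # 3. Evolutionary selection: update ratings from relative performance,
    #    keep the best prompts, replace the worst with a child of the best.
    ratings = elo_update(ratings, scores)
    ranked = sorted(prompts, key=lambda p: ratings[p], reverse=True)
    child = mutate(ranked[0])
    ratings[child] = ratings[ranked[0]]  # child inherits its parent's rating
    return ranked[:-1] + [child], ratings
```

A real implementation would replace `toy_score` with rollouts from the policy and `elo_update` with proper TrueSkill inference; the population-update structure is the point of the sketch.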
Evolutionary System Prompt Learning (E-SPL) jointly optimizes model contexts and model weights to enhance LLM self-improvement. Evolution updates system prompts; RL updates weights. The learned system prompts can encode declarative knowledge via articulated principles and strategies, while RL gradients can further hone the model’s procedural knowledge for reliable execution.
Mutation operator in E-SPL. The highest-performing prompt in each iteration undergoes LLM self-reflection on group-wise agent trajectories and their outcomes. An LLM-generated diff edits the parent into a child system prompt, removing ineffective rules and converting observed mistakes into improved declarative instructions, yielding a new prompt that enters the evolutionary population.
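A minimal sketch of the mutation operator. The SEARCH/REPLACE diff format and all function names are our inventions for illustration (the paper does not specify its edit format), and `llm` is any text-in/text-out sampling function, e.g. a call to a sampling server.

```python
def build_mutation_context(parent_prompt, trajectories):
    # Construct the self-reflection context from the best prompt's
    # group-wise trajectories; each entry is (problem, trace, reward).
    report = "\n".join(
        f"[reward={r:+.1f}] problem: {q}\ntrace: {t}" for q, t, r in trajectories
    )
    return (
        "You are editing a system prompt based on the rollouts below.\n"
        f"CURRENT SYSTEM PROMPT:\n{parent_prompt}\n\nROLLOUTS:\n{report}\n\n"
        "Remove rules that did not help and turn observed mistakes into new "
        "declarative instructions. Reply with edits, one per line, formatted "
        "as: SEARCH<<<old text>>>REPLACE<<<new text>>>"
    )

def apply_edits(parent_prompt, llm_reply):
    # Apply the LLM-generated diff to the parent, yielding the child prompt.
    child = parent_prompt
    for line in llm_reply.splitlines():
        if line.startswith("SEARCH<<<") and line.endswith(">>>") \
                and ">>>REPLACE<<<" in line:
            old, new = line[len("SEARCH<<<"):-len(">>>")].split(">>>REPLACE<<<", 1)
            child = child.replace(old, new)
    return child

def mutate(parent_prompt, trajectories, llm):
    # One mutation step: reflect on rollouts, then edit the parent prompt.
    return apply_edits(parent_prompt,
                       llm(build_mutation_context(parent_prompt, trajectories)))
```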
Crossover operator in E-SPL. System prompts are compared based on their problem-wise performance within the current RL batch. Guided by these differential strengths and weaknesses, an LLM self-reflection process selectively recombines the most effective complementary segments from multiple parent prompts, yielding a new child prompt that enters the evolutionary population.
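Crossover differs from mutation only in how the reflection context is built. A hedged sketch, again with hypothetical names: `problem_scores[prompt][problem]` is assumed to hold the reward each prompt earned on each problem in the current RL batch, and `llm` is a generic text-in/text-out sampling function.

```python
def build_crossover_context(parents, problem_scores):
    # Lay out problem-wise performance so the reflection can see each
    # parent's differential strengths and weaknesses.
    problems = sorted({q for s in problem_scores.values() for q in s})
    table = "\n".join(
        q + " | " + " | ".join(
            f"{p[:20]}: {problem_scores[p].get(q, 0.0):.2f}" for p in parents
        )
        for q in problems
    )
    return (
        "These system prompts were scored on the same problems:\n\n"
        + "\n\n".join(f"PROMPT {i}:\n{p}" for i, p in enumerate(parents))
        + f"\n\nPER-PROBLEM REWARDS:\n{table}\n\n"
        "Recombine the segments responsible for each prompt's wins into one "
        "child prompt. Reply with the child prompt only."
    )

def crossover(parents, problem_scores, llm):
    # The child prompt enters the evolutionary population directly.
    return llm(build_crossover_context(parents, problem_scores)).strip()
```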
Self-reflection is key to genetic operators: Both mutation and crossover are LLM-based self-reflection processes that diagnose failures and synthesize structured edits to produce children for the population; they differ only in how the self-reflection context is constructed. Mutation reflects on agent trajectories from a single best-performing system prompt to repair its weaknesses, while crossover reflects on comparisons across multiple system prompts in their problem-wise performance to recombine complementary segments.
During RL, E-SPL creates an evolutionary tree of system prompts by reusing the data already generated by RL. Each genetic operator (mutation or crossover) requires only a sampling server for LLM self-reflection, with a different context-construction strategy per operator, and can run concurrently with RL gradient updates.
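Because the genetic operators only issue a sampling-server call on data the RL loop has already produced, they can overlap with the gradient update. A sketch of that overlap, with all function names hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def rl_step_with_evolution(rl_batch, gradient_step, run_genetic_ops):
    # `rl_batch` holds the trajectories already generated this iteration;
    # the self-reflection request runs in the background while the trainer
    # performs the weight update on the same data.
    with ThreadPoolExecutor(max_workers=1) as pool:
        reflection = pool.submit(run_genetic_ops, rl_batch)  # LLM self-reflection
        gradient_step(rl_batch)                              # RL weight update
        return reflection.result()  # child prompt(s) join the population
```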
Discovered strategies in learned system prompts for solving math problems. These explicit behavior specifications include: useful heuristics and tips for various categories of problems, self-verification strategies such as checking for consistency and plausibility, a list of common failure modes to avoid, etc. Note that RL is done under diverse system prompts and does not overfit to any particular one.
An example of the system prompt discovered by E-SPL for RL with search engines. Here, the discovered system prompt specifies a workflow that includes: generative self-verification, detailed instructions on tool use, critical formatting rules, heuristics for determining the reliability of various information sources, actionable guidance for avoiding common failure modes, etc.
Evolution + RL (our E-SPL method) performs best compared to self-reflection (prompting only), evolution (prompting only), and RL (weight update only). The results show that coupling RL with evolution on system prompts unlocks a synergistic form of self-improvement that neither approach achieves in isolation.
Examples below illustrate E-SPL’s effectiveness: high-level strategies are specified by the system prompt, while RL fine-tunes execution. Self-rewrite of the system prompt induces more systematic behavior changes than weight updates alone.
An example of how the RL model learns to utilize learned system prompts (numbered G1 ... G30) to solve problems. In this example, the RL model persistently explores many potential directions, repeating phrases like "Alternatively", "Another idea", and "But let's think", before simplifying the problem to a form of modular arithmetic and recognizing, per the system prompt [G22], that it should apply the Chinese Remainder Theorem (CRT) in that situation.
This example illustrates how the RL model uses a self-verification strategy learned in the system prompt to solve a problem from the BeyondAIME test set: after initially concluding "So the answer is 2", the model explicitly cites this strategy [G26] from the learned system prompt and follows its exact instruction to perform sanity checks, starting from the small instances n=2 and n=3 before moving on to the general case, and saying "Wait, careful" when it identifies its own mistake. The model then self-corrects and concludes with "So the answer is 432. This is the intended solution", which is correct.
The discovered system prompts sometimes include heuristics that are not always true but are still helpful for problem-solving. Unlike RL updates to model weights, the learned system prompts are interpretable and can therefore be monitored and corrected (either by humans, or by further verification, formalization, or scalable oversight). Here we showcase two subtle mistakes in two discovered principles, and what their corrected versions should be.
The fundamental reason why E-SPL can work is that the base LLMs already have strong in-context learning and instruction-following abilities. The RL process in itself does not fully take advantage of this: RL performs trial and error conditioned on an exogenous piece of context, without having the model think about how it could directly re-program itself by rewriting its own system instructions. E-SPL is a step towards augmenting RL with self-rewrite, jointly evolving a population of system prompts for self-conditioning.
Agentic Retrieval for Long System Prompts: To build agents that can absorb an arbitrary amount of knowledge, the learned system prompt will eventually need a more refined sub-structure with many long components. A natural solution is to store certain components of the system prompt in the local file-system, and have the model access them in an agentic manner using a terminal and grep commands.
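That file-based layout can be sketched as follows; the helper names and the one-file-per-component `.md` convention are our assumptions, and `grep_components` is a minimal stand-in for the grep command the agent would invoke through a terminal tool.

```python
import pathlib
import re

def store_components(root, components):
    # Write each system-prompt component to its own file on the local
    # file-system (hypothetical layout: one markdown file per component).
    root = pathlib.Path(root)
    root.mkdir(parents=True, exist_ok=True)
    for name, text in components.items():
        (root / f"{name}.md").write_text(text)

def grep_components(root, pattern):
    # Minimal grep the agent could run over its stored prompt components:
    # return (filename, line) pairs whose line matches the regex pattern.
    hits = []
    for path in sorted(pathlib.Path(root).glob("*.md")):
        for line in path.read_text().splitlines():
            if re.search(pattern, line):
                hits.append((path.name, line))
    return hits
```

The model would then pull only the matching components into context on demand, rather than carrying the full long system prompt at every step.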
Self-Write as an RL Problem: So far, the LLM used for self-reflection and self-rewrite does not itself self-improve, so it may eventually become a bottleneck. It is possible to extend RL to the entire self-reflection and self-write process, but this would require additional research.
Self-Referential Recursive Self-Improvement: Our current context update and weight update algorithms are fundamentally limited; ultimately, the model should be allowed to make paradigm-level changes to its own learning algorithm. This can be achieved by allowing the model to directly modify its own training script and weights, thereby realizing fully self-referential self-improvement (Schmidhuber, 2003). A system capable of such root-level self-modification, unlocking sustained, open-ended, unbounded improvements in its own capabilities, is also called Seed AI (Yudkowsky, 2007).
@article{e-spl,
title={Unifying Evolutionary Prompt Search and Reinforcement Learning for LLM Self-Improvement},
author={Zhang, Lunjun and Chen, Ryan and Stadie, Bradly C},
journal={arXiv preprint arXiv:2602.14697},
year={2026}
}