Dreaming in Code for Curriculum Learning in Open-Ended Worlds
Konstantinos Mitsides , Maxence Faldor , Antoine Cully
Imperial College London
Preprint. Under Review.
Open-ended learning frames intelligence as emerging from continual interaction with an ever-expanding space of environments. While recent advances have utilized foundation models to programmatically generate diverse environments, these approaches often focus on discovering isolated behaviors rather than orchestrating sustained progression. In complex open-ended worlds, the large combinatorial space of possible challenges makes it difficult for agents to discover sequences of experiences that remain consistently learnable. To address this, we propose Dreaming in Code (DiCode), a framework in which foundation models synthesize executable environment code to scaffold learning toward increasing competence. In DiCode, “dreaming” takes the form of materializing code-level variations of the world. We instantiate DiCode in Craftax, a challenging open-ended benchmark characterized by rich mechanics and long-horizon progression. Empirically, DiCode enables agents to acquire long-horizon skills, achieving a 16% improvement in mean return over the strongest baseline and non-zero success on late-game combat tasks where prior methods fail. Our results suggest that code-level environment design provides a practical mechanism for curriculum control, enabling the construction of intermediate environments (or levels) that bridge competence gaps in open-ended worlds.
DiCode maintains a structured archive of all generated levels. To ensure the agent develops a broad repertoire of skills, the selection mechanism explicitly promotes lineage diversity, prioritizing unexplored branches over already-mastered paths.
Using a Learnability Score, the selector identifies "parent" levels that are currently in the agent's Zone of Proximal Development. This ensures the curriculum always branches off from the frontier of the agent's current capabilities.
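A minimal sketch of this selection rule is given below, under two assumptions: learnability is measured as p(1 − p) for a level's empirical success rate p (which peaks at p = 0.5, the middle of the Zone of Proximal Development), and lineage diversity is approximated by preferring levels with fewer existing children. The `Level` fields and function names are illustrative, not the exact DiCode implementation.

```python
from dataclasses import dataclass

@dataclass
class Level:
    level_id: int
    lineage: tuple        # ancestor level ids, root first
    success_rate: float   # empirical success rate p of the current agent on this level
    children: int = 0     # number of levels already branched from this one

def learnability(level: Level) -> float:
    # Peaks at p = 0.5: levels the agent sometimes solves and sometimes fails.
    p = level.success_rate
    return p * (1.0 - p)

def select_parent(archive: list[Level]) -> Level:
    # Prefer learnable levels; break ties toward less-expanded lineages
    # (fewer children = a branch of the curriculum tree not yet explored).
    return max(archive, key=lambda lvl: (learnability(lvl), -lvl.children))
```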
To scaffold progress toward the target environment, DiCode employs a process we term "dreaming": using a foundation model to imagine the next training scenario, tailored to the agent's current skill frontier, and to materialize it as executable level code.
Unlike standard UED methods, which typically only rearrange map tiles at random, DiCode programmatically specifies the entire environment logic: it defines the world topology, redefines interaction rules and progression logic, and sets explicit objectives for each level.
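To make this concrete, here is a hypothetical example of what a generated level's code might look like. The class name, methods, and rules are illustrative assumptions for this page, not the actual Craftax or DiCode interface.

```python
import random

# Hypothetical generated level: names and structure are illustrative only.
class IronGearUpLevel:
    """Scaffolded level: spawn next to a furnace with free starting resources."""

    size = 16  # world topology: a small 16x16 grid

    def make_world(self, seed: int):
        rng = random.Random(seed)
        grid = [["grass" for _ in range(self.size)] for _ in range(self.size)]
        grid[8][8] = "furnace"                       # pre-built workstation
        for _ in range(6):                           # scatter iron ore nearby
            grid[rng.randrange(self.size)][rng.randrange(self.size)] = "iron_ore"
        return grid

    def initial_inventory(self):
        return {"wood": 5, "stone": 3}               # free starting resources

    def interaction_rules(self):
        # Redefined progression logic: smelting iron needs no coal in this level.
        return {"smelt_iron": {"requires": ["iron_ore"], "at": "furnace"}}

    def objective(self, inventory: dict) -> bool:
        # Level-specific goal: craft at least one piece of iron armour.
        return inventory.get("iron_armour", 0) >= 1
```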
Training occurs on a stratified batch. Every update mixes experience from three sources (see the sketch after this list):
20% Target Environment: To ensure grounded progress on the real task.
Newly Generated Levels: To introduce the next logical challenges.
Archived Levels: Replayed via Prioritized Level Replay (PLR) to prevent forgetting.
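The snippet below sketches how such a batch could be assembled, assuming each source is a list of level identifiers. The 20% target share comes from the text above; the split between new and archived levels is an illustrative assumption, not the exact DiCode ratio.

```python
import random

def build_training_batch(target_levels, new_levels, archive, batch_size=32, seed=0):
    """Mix levels from the three sources described above (sketch, not the exact split)."""
    rng = random.Random(seed)
    n_target = int(0.20 * batch_size)                # 20% grounded target-environment episodes
    n_new = min(len(new_levels), batch_size // 4)    # illustrative share for fresh challenges
    n_replay = batch_size - n_target - n_new         # remainder replayed from the archive

    batch = (
        rng.choices(target_levels, k=n_target)
        + rng.choices(new_levels, k=n_new)
        + rng.choices(archive, k=n_replay)
    )
    rng.shuffle(batch)
    return batch
```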
Before training, every generated level undergoes a Compilation Check to filter out invalid code.
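Such a check can be as simple as attempting to compile and execute the generated source before it reaches the training loop. The sketch below is an assumed implementation using Python's built-in compile and exec, not necessarily the exact filter used in DiCode.

```python
def passes_compilation_check(level_source: str) -> bool:
    """Return True if the generated level code compiles and defines at least one class."""
    try:
        code = compile(level_source, filename="<generated_level>", mode="exec")
        namespace: dict = {}
        exec(code, namespace)                  # run module-level definitions only
    except Exception:
        return False                           # syntax errors, missing names, etc.
    # Require at least one class definition to instantiate as a level (assumed convention).
    return any(isinstance(obj, type) for obj in namespace.values())
```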
Baselines struggle to prepare for danger. DiCode teaches the agent to "gear up" first, achieving 45% success on Iron Armour (vs. 14% for baselines). This defensive preparation is a crucial prerequisite for survival.
Because agents are better equipped, they survive the transition to harder floors significantly more often. DiCode agents reach the Gnomish Mines (Floor 2) in 30% of episodes, compared to just 9% for the strongest baseline.
Most critically, "dreaming" specific combat scenarios unlocks tasks that remain effectively impossible for standard RL and UED methods. While baselines score 0% on the Gnome Warrior and Gnome Archer tasks, DiCode achieves 11% and 9% success respectively, demonstrating the acquisition of complex, long-horizon combat skills.
We provide complete, interactive HTML archives for 5 independent training runs. These files allow you to inspect the full evolutionary history of the curriculum, including the exact level description, foundation model reasoning, Python code, and agent performance for every single generated level.
Instruction: Select a Run below to launch the interactive archive.
Early levels (e.g., Level 112) act as "training wheels," providing pre-built workstations and free resources to bypass tedious prerequisites. As the agent improves, the model removes this help, introducing higher mob pressure (Level 287) and eventually forcing the agent to descend into the Gnomish Mines (Level 532), a deep exploration bottleneck rarely reached by standard agents.
The inset (Levels 112 to 143) illustrates a specific pedagogical behavior: Scaffolding Removal. Upon detecting high competence, the model edits the Python code to strip away the free resources (red text) while expanding the objective (green text). This forces the agent to learn how to gather and craft on its own.
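Purely as a hypothetical illustration of such an edit (the function bodies below are invented for this example and are not taken from the released archives), a scaffolding-removal step might look like the following, mirroring the red (removed help) and green (expanded objective) portions of the diff:

```python
# Level 112: scaffolded version (hypothetical reconstruction).
def initial_inventory_scaffolded():
    return {"wood": 5, "stone": 3}          # free resources handed to the agent

def objective_scaffolded(inventory):
    return inventory.get("iron_pickaxe", 0) >= 1

# Level 143: scaffolding removed and objective expanded (hypothetical).
def initial_inventory_unassisted():
    return {}                               # the agent must now gather everything itself

def objective_expanded(inventory):
    return (inventory.get("iron_pickaxe", 0) >= 1
            and inventory.get("iron_armour", 0) >= 1)
```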
This adaptive pressure keeps the agent in the "Zone of Proximal Development." Throughout the entire run, the average success rate on the training batch remains stable at around 0.5, keeping the challenge matched to the agent's current capabilities: neither too boring nor too frustrating.
Note: To see the complete descriptions of the levels shown above, open one of the run archives and search by the level numbers.