Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search
Nicola Dainese*, Matteo Merler*, Minttu Alakuijala, Pekka Marttinen
Aalto University
* Denotes equal contribution.
Abstract
In this work we consider Code World Models, world models generated by a Large Language Model (LLM) in the form of Python code for model-based Reinforcement Learning (RL). Calling code instead of LLMs for planning has the advantages of being precise, reliable, interpretable, and extremely efficient. However, writing appropriate Code World Models requires the ability to understand complex instructions, to generate exact code with non-trivial logic, and to self-debug a long program with feedback from unit tests and environment trajectories. To address these challenges, we propose Generate, Improve and Fix with Monte Carlo Tree Search (GIF-MCTS), a new code generation strategy for LLMs. To test our approach, we introduce the Code World Models Benchmark (CWMB), a suite of program synthesis and planning tasks comprising 18 diverse RL environments paired with corresponding textual descriptions and curated trajectories. GIF-MCTS surpasses all baselines on the CWMB and two other benchmarks, and we show that the Code World Models synthesized with it can be successfully used for planning, resulting in model-based RL agents with greatly improved sample efficiency and inference speed.
Code World Models (CWM) Framework
Given the description of an environment and a task, we use an LLM guided by the GIF-MCTS method to iteratively generate and refine a candidate CWM. The candidate's correctness is evaluated by checking whether it correctly predicts a set of trajectories collected from the true environment. If the model fails to predict some transitions, the fraction of correct predictions and other feedback are returned to the LLM and the cycle repeats. Once all transitions are matched, or the computational budget is exhausted, the best CWM found is returned and used to solve the task via model-based planning.
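The evaluation step can be pictured as a simple scoring loop. Below is a minimal sketch, assuming a candidate model exposing a step(state, action) function and trajectories stored as (state, action, reward, next_state, done) tuples; the names and interface are illustrative assumptions, not the paper's actual code.

# Hypothetical sketch of the CWM evaluation loop. A candidate model is
# scored by the fraction of recorded environment transitions it
# reproduces exactly; mismatches and exceptions become LLM feedback.
def evaluate_cwm(cwm_step, trajectories):
    """cwm_step(state, action) -> (next_state, reward, done)
    trajectories: list of (state, action, reward, next_state, done)."""
    correct = 0
    errors = []
    for state, action, reward, next_state, done in trajectories:
        try:
            pred_state, pred_reward, pred_done = cwm_step(state, action)
        except Exception as exc:  # buggy code: report the error as feedback
            errors.append(f"step({state}, {action}) raised {exc!r}")
            continue
        if (pred_state, pred_reward, pred_done) == (next_state, reward, done):
            correct += 1
        else:
            errors.append(
                f"step({state}, {action}) predicted "
                f"{(pred_state, pred_reward, pred_done)}, "
                f"expected {(next_state, reward, done)}"
            )
    accuracy = correct / len(trajectories)
    return accuracy, errors  # accuracy scores the node; errors feed the next prompt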
Generate, Improve and Fix with Monte Carlo Tree Search (GIF-MCTS)
Example of a GIF-MCTS tree for generating a CWM. Starting from the root of the tree, every action taken corresponds to 1) prompting the LLM to generate, improve, or fix a CWM, 2) parsing the LLM completion, and 3) evaluating the CWM's correctness using the available environment trajectories as unit tests (shown as a percentage inside each node). On buggy nodes we allow only fix actions, for up to f sequential attempts, and replace the actual value with a temporary one (shown in red); on healthy nodes we allow only generate and improve actions. All action prompts are exemplified on the right. The total number of fix attempts f is a hyperparameter, set to three both in this figure and in our method.
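To make the tree policy concrete, here is a hedged sketch of how node expansion could be restricted by node type, using a standard UCT selection rule; the class and field names are our own illustration, not the released implementation.

import math

F_MAX = 3  # maximum sequential fix attempts, as in the figure above

# Simplified node: healthy nodes expand with generate/improve actions,
# buggy nodes only admit fix actions up to F_MAX consecutive attempts.
class Node:
    def __init__(self, code, value, buggy, parent=None):
        self.code = code          # candidate CWM source
        self.value = value        # fraction of transitions predicted correctly
        self.buggy = buggy        # True if the code crashed on the unit tests
        self.parent = parent
        self.children = []
        self.visits = 1
        self.fix_attempts = 0

    def allowed_actions(self):
        if self.buggy:
            return ["fix"] if self.fix_attempts < F_MAX else []
        return ["generate", "improve"]

def uct_select(node, c=1.0):
    """Pick the child maximizing the standard UCB1 score."""
    return max(
        node.children,
        key=lambda ch: ch.value + c * math.sqrt(math.log(node.visits) / ch.visits),
    )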
Experiments
Coding performance of GIF-MCTS on APPS
Code World Model synthesis on the Code World Models Benchmark (CWMB)
Examples of Code World Models
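To give a flavor of what a synthesized model looks like, here is a toy example in the same spirit: a hand-written one-dimensional corridor environment invented purely for illustration, not one of the benchmark environments or an actual LLM output. The model is plain Python, so a planner can query it exactly and cheaply.

# Toy, hand-written example of a Code World Model (illustrative only).
class CorridorCWM:
    """Agent starts at cell 0 and must reach cell `goal`.
    Actions: 0 = move left, 1 = move right."""

    def __init__(self, goal=5):
        self.goal = goal

    def step(self, state, action):
        next_state = state + 1 if action == 1 else max(state - 1, 0)
        done = next_state == self.goal
        reward = 1.0 if done else -0.1  # small per-step penalty
        return next_state, reward, done

# A planner can roll out candidate action sequences against the model:
model = CorridorCWM()
state, total = 0, 0.0
for action in [1, 1, 1, 1, 1]:
    state, reward, done = model.step(state, action)
    total += reward
print(state, total, done)  # final state 5, return of about 0.6, done True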
Citation
@misc{dainese2024generatingcodeworldmodels,
title={Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search},
author={Nicola Dainese and Matteo Merler and Minttu Alakuijala and Pekka Marttinen},
year={2024},
eprint={2405.15383},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2405.15383},
}
Acknowledgments
This work was supported by the Research Council of Finland (Flagship programme: Finnish Center for Artificial Intelligence FCAI, and grants 352986, 358246) and EU (H2020 grant 101016775 and NextGenerationEU). We acknowledge CSC for awarding this project access to the LUMI supercomputer, owned by the EuroHPC Joint Undertaking, hosted by CSC (Finland) and the LUMI consortium through Finland.