Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, Roberta Raileanu
The MLGym Framework
The ability to evaluate and develop LLM agents on machine learning research tasks is essential to advancing the capabilities of AI research agents. To that end, we’re releasing MLGym, a unified framework that enables researchers to easily implement and experiment with different training algorithms, such as reinforcement learning (RL), for large language model agents. The framework is designed for AI research assistants, allowing them to tackle complex machine learning research tasks. By providing a flexible platform to develop, test, and deploy agents, MLGym lets researchers evaluate any model without the need for custom development.
We’re also introducing MLGym-bench, a benchmark that consists of 13 AI research tasks that span diverse domains such as computer vision, natural language processing, reinforcement learning, and game theory. The benchmark offers a flexible evaluation framework that can accommodate various artifacts such as models, algorithms, or predictions, providing a more adaptable evaluation environment compared to other benchmarks with rigid protocols. It is designed to allow researchers to easily add new tasks, models, and agents to expand its scope. This flexibility enables researchers to tailor their evaluations to specific needs and goals, making it an ideal tool for advancing machine learning research.
MLGym Agents can improve on a given AI Research task by generating new ideas and hypotheses, implementing ML methods, training models, running experiments, analyzing the results, and iterating through this process multiple times.
With MLGym and MLGym-bench, researchers can now focus on developing novel agentic orchestrations, implementing new training algorithms, or generating synthetic data at scale to advance the capabilities of AI Research Agents. We believe this is a crucial step towards building AI systems that can accelerate scientific discovery. By sharing our framework and benchmark, we hope to provide a foundation for future research, enabling the community to better understand the capabilities and limitations of AI Research Agents.
Capability Levels for AI Research Agents
Accelerating scientific discovery has been a long-standing ambition in artificial intelligence (AI) research, with early initiatives like the Oak Ridge Applied Artificial Intelligence Project in 1979 exploring the use of AI in scientific research. More recent explorations, enabled by advances in foundation models, provide a proof of concept of a fully automated pipeline for end-to-end paper generation.
In the future, we envision AI Research Agents capable of independently conducting literature search, generating scientific hypotheses, designing experiments, implementing new methods, analyzing results, disseminating findings by writing scientific papers, and applying this research in products, thus assisting with all parts of the research process. Such agents should be capable of working fully autonomously or of being guided by human supervision, taking user feedback into account.
To evaluate the capabilities of frontier AI Agents to accelerate AI Research, we propose a hierarchical framework consisting of six levels, each representing a distinct degree of autonomy and scientific contribution. MLGym-Bench focuses on Level 1 capabilities, but can also be extended to other levels.
MLGym-Bench
MLGym-Bench contains a set of diverse open-ended AI research tasks from a wide range of domains such as CV, NLP, RL, game theory, and logical reasoning.
Novelty of MLGym
MLGym is the first agentic framework for AI Research Agents based on a Gym interface, and thus separates the agent from the environment. This allows easy integration and training with RL algorithms, opening a new research avenue on different training methods for AI Research Agents. MLGym-Bench is also the only benchmark to include algorithmic tasks such as RL, game theory, and logical reasoning.
In contrast with many existing benchmarks for AI Research Agents, MLGym includes diverse open-ended research tasks spanning a wide range of domains, comes with an agentic harness, and allows for flexible evaluation artifacts such as model checkpoints, RL algorithms, or code representing game theoretical strategies.
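To make the agent-environment separation concrete, here is a minimal sketch of what a Gym-style research environment loop looks like. The class and method names below are illustrative toys, not the actual MLGym API; a real environment would run shell commands, training jobs, and evaluation scripts behind `step`.

```python
# Hypothetical sketch of a Gym-style agent-environment loop.
# Names are illustrative and do not reflect the actual MLGym API.

class ToyResearchEnv:
    """Toy environment: the state is a score the agent tries to maximize."""

    def reset(self):
        self.score = 0.0
        return {"task": "improve baseline", "score": self.score}

    def step(self, action):
        # A real environment would execute commands, run training, etc.
        if action == "improve":
            self.score += 1.0
        observation = {"task": "improve baseline", "score": self.score}
        reward = self.score
        done = self.score >= 3.0
        return observation, reward, done

env = ToyResearchEnv()
obs = env.reset()
done = False
while not done:
    action = "improve"  # a real agent would choose actions via an LLM policy
    obs, reward, done = env.step(action)
print(obs["score"])  # 3.0
```

Because the agent only interacts with the environment through `reset` and `step`, any RL training loop can be wrapped around it without changing the environment itself.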
Evaluating Performance
We propose to use performance profile curves and AUP scores to compare the performance of multiple agents on MLGym tasks. This allows us to compare relative performance gains across both agents and tasks. We measure the agents' best attempt and best submission out of 4 trials.
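The idea behind these metrics can be sketched as follows: for each task, compare every agent's score against the best score achieved on that task, and plot the fraction of tasks on which an agent is within a factor τ of the best. This is a minimal sketch in the style of Dolan-Moré performance profiles; the exact normalization used in the MLGym paper may differ, and the scores below are made up for illustration.

```python
# Sketch of performance-profile curves and an area-under-profile (AUP)
# score, in the style of Dolan-More profiles. Exact details may differ
# from the MLGym paper; the per-task scores here are illustrative only.
scores = {
    "agent_A": [0.9, 0.8, 0.7],  # per-task scores, higher is better
    "agent_B": [0.6, 0.9, 0.5],
}

def performance_profile(scores, taus):
    """Fraction of tasks on which each agent is within a factor tau
    of the best agent's score on that task."""
    n_tasks = len(next(iter(scores.values())))
    best = [max(s[t] for s in scores.values()) for t in range(n_tasks)]
    profiles = {}
    for agent, s in scores.items():
        ratios = [best[t] / s[t] for t in range(n_tasks)]
        profiles[agent] = [sum(r <= tau for r in ratios) / n_tasks
                           for tau in taus]
    return profiles

def aup(profile, taus):
    """Area under the profile curve via the trapezoidal rule."""
    return sum((taus[i + 1] - taus[i]) * (profile[i] + profile[i + 1]) / 2
               for i in range(len(taus) - 1))

taus = [1.0 + 0.01 * i for i in range(101)]  # tau in [1, 2]
profiles = performance_profile(scores, taus)
aup_scores = {a: aup(p, taus) for a, p in profiles.items()}
```

An agent whose profile curve rises faster (it is close to the best on more tasks at small τ) accumulates a larger area, so AUP gives a single aggregate number while the profile itself shows relative gains across tasks.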
We find that OpenAI O1-preview is the best-performing model on aggregate across our set of tasks for both best attempt and best submission, with Gemini 1.5 Pro and Claude-3.5-Sonnet being close behind.
Computational Cost
While OpenAI O1-Preview is the best-performing model, it is also the most computationally expensive by a wide margin. In contrast, Gemini-1.5-Pro and Claude-3.5-Sonnet are much more cost-effective while still reaching performance not far from OpenAI O1’s.
Gemini-1.5-Pro strikes the best balance between performance and cost on MLGym-Bench, being the cheapest model to run (9x cheaper than O1) while achieving 99% of O1’s AUP.
While relatively cheap to run, GPT-4o and Llama-3.1-405b perform worse than the other models.
Agent Behavior Analysis
Our analysis highlights a structured approach to solving AI Research tasks: agents begin by familiarizing themselves with the environment and the task, run multiple iterations of inspecting the code, editing it, training models, and evaluating them, and often conclude with a submission.
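One way to surface this structure is to bucket an agent's actions into phases and count them. The sketch below is purely hypothetical: the command names, phase mapping, and trajectory are made up for illustration and are not taken from actual MLGym logs.

```python
from collections import Counter

# Hypothetical sketch of phase analysis: bucket each command an agent
# issues into a phase, then count phase occurrences. The mapping and
# trajectory below are illustrative, not real MLGym data.
PHASE = {
    "ls": "explore",
    "cat": "inspect",
    "edit": "edit",
    "python train.py": "train",
    "python evaluate.py": "evaluate",
    "submit": "submit",
}

trajectory = [
    "ls", "cat",                                      # get familiar
    "edit", "python train.py", "python evaluate.py",  # iteration 1
    "edit", "python train.py", "python evaluate.py",  # iteration 2
    "submit",                                         # final submission
]

phase_counts = Counter(PHASE[cmd] for cmd in trajectory)
```

Aggregating such counts across trials makes the explore-edit-train-evaluate-submit pattern described above directly visible.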
Check out the paper for more analysis of the agents’ behaviors on MLGym.
Demos: Language Modeling
Watch the MLGym Agent improve on a language modeling task through idea generation, implementation, experimentation, and iteration.
Demos: Reinforcement Learning
Watch the MLGym Agent improve on a reinforcement learning task.
Demos: Game Theory
Watch the MLGym Agent improve on a game theory task.
Demos: Image Classification
Watch the MLGym Agent improve on an image classification task.
References
If you find our work helpful, please consider citing it using the following:
@misc{nathani2025mlgymnewframeworkbenchmark,
title={MLGym: A New Framework and Benchmark for Advancing AI Research Agents},
author={Deepak Nathani and Lovish Madaan and Nicholas Roberts and Nikolay Bashlykov and Ajay Menon and Vincent Moens and Amar Budhiraja and Despoina Magka and Vladislav Vorotilov and Gaurav Chaurasia and Dieuwke Hupkes and Ricardo Silveira Cabral and Tatiana Shavrina and Jakob Foerster and Yoram Bachrach and William Yang Wang and Roberta Raileanu},
year={2025},
eprint={2502.14499},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.14499},
}
For questions and comments, please contact Deepak Nathani or Roberta Raileanu at:
dnathani[at]ucsb.edu, raileanu[at]meta.com
We also have an MLGym Discord channel!