https://icml.cc/virtual/2025/workshop/39979
Ballroom C, West Building, Vancouver Convention Center
July 18th, 2025 @ Vancouver, Canada (Hybrid).
Track 1: Automated Proof Engineering (APE-Bench)
Background:
Recent progress in large language models (LLMs) has shown promise in formal theorem proving, yet existing benchmarks remain limited to isolated, static proof tasks, failing to capture the iterative, engineering-intensive workflows of real-world formal mathematics libraries. Automated Proof Engineering (APE) aims to automate proof engineering tasks such as feature addition, proof refactoring, and bug fixing using LLMs. APE-Bench I is the first realistic benchmark built from real-world commit histories of Mathlib4, featuring diverse file-level tasks described in natural language and verified via a hybrid approach combining the Lean compiler and LLM-as-a-Judge. This benchmark lays the foundation for developing agentic workflows in proof engineering, with future benchmarks targeting multi-file coordination, project-scale verification, and autonomous agents capable of planning, editing, and repairing formal libraries.
Task: Each task is defined as a transformation from an input pair to a target code diff:
Input:
full_instruction: A natural language command describing the intended change
content_before: A full Lean source file before the edit
Target Output:
gold_diff: A unified diff representing the ground truth edit made by a human developer
The system must generate a unified diff that, when applied to content_before, produces a syntactically valid and semantically correct Lean file according to the instruction.
The gold_diff is provided only for model development and debugging. It must not be used in any way for test-time prediction. Any such use constitutes data leakage.
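For concreteness, the sketch below shows one way a predicted unified diff could be applied to content_before during development. Using the system patch utility and the temporary file name "Target.lean" are assumptions made purely for illustration; the official evaluation harness may apply diffs differently.

    import pathlib
    import subprocess
    import tempfile

    def apply_unified_diff(content_before: str, diff_text: str) -> str:
        """Apply a unified diff to the original file text and return the patched text."""
        with tempfile.TemporaryDirectory() as tmp:
            src = pathlib.Path(tmp) / "Target.lean"  # placeholder file name
            src.write_text(content_before)
            # `patch` reads the diff from stdin and edits the named file in place.
            subprocess.run(
                ["patch", str(src)],
                input=diff_text.encode(),
                check=True,
                cwd=tmp,
            )
            return src.read_text()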
Data Structure:
The dataset file contains:
full_instruction: Natural language description of the intended modification
content_before: Full Lean source file before the edit
gold_diff: Ground truth unified diff
Tasks are drawn from real Git commits in Mathlib4 between 2023 and 2025. They cover diverse file-level proof engineering actions, from small fixes to substantial additions and refactorings.
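As a sketch, the records could be read as follows, assuming the dataset is distributed as JSON Lines with the three fields listed above; consult the starter kit for the actual file name and format.

    import json

    def load_tasks(path: str):
        """Yield task records from a JSON Lines file (assumed distribution format)."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                yield {
                    "full_instruction": record["full_instruction"],
                    "content_before": record["content_before"],
                    "gold_diff": record["gold_diff"],  # for development/debugging only
                }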
Participants may use:
Hosted LLM APIs (e.g., GPT-4, Claude, Gemini)
Self-trained models or fine-tuned language models
Tool-augmented pipelines (e.g., compiler feedback, diff repair)
Multi-step agentic systems with memory, search, or retrieval
Scoring Criteria:
Each test-time patch is evaluated using a two-stage process:
Syntactic Verification
The patch is applied to the Lean file
The resulting file must compile successfully under the correct Mathlib4 version
Semantic Evaluation
A large language model determines whether the patch correctly fulfills the instruction
The evaluation uses an instruction-aware judge with structured prompting and multi-sample voting
A task is marked successful only if both checks pass.
Semantic evaluation on the private leaderboard will use an enhanced judging mechanism, building on the method described in the APE-Bench I paper. It incorporates stricter consistency checks and additional robustness enhancements to ensure fairness and accuracy.
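As an illustration only, the two-stage check might look like the following sketch. It is not the official harness: the lake env lean invocation assumes a Mathlib4 checkout pinned to the task's commit, and the judge callable stands in for the organizers' instruction-aware, multi-sample judging.

    import subprocess

    def syntactic_check(lean_file: str, mathlib_dir: str) -> bool:
        """Return True if the patched file compiles under the pinned Mathlib4 checkout."""
        result = subprocess.run(
            ["lake", "env", "lean", lean_file],
            cwd=mathlib_dir,
            capture_output=True,
            text=True,
        )
        return result.returncode == 0

    def semantic_check(instruction: str, diff_text: str, judge, n_samples: int = 3) -> bool:
        """Majority vote over several judge samples; `judge` is any callable that
        returns "pass" or "fail" for an (instruction, diff) pair."""
        votes = [judge(instruction, diff_text) for _ in range(n_samples)]
        return sum(v == "pass" for v in votes) > n_samples // 2

    def evaluate(instruction: str, patched_file: str, diff_text: str,
                 mathlib_dir: str, judge) -> bool:
        # A task is successful only if both the syntactic and semantic checks pass.
        return (syntactic_check(patched_file, mathlib_dir)
                and semantic_check(instruction, diff_text, judge))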
Leaderboards:
To provide feedback during development and ensure rigorous final evaluation, this competition uses both a public and a private leaderboard:
Public Leaderboard The public leaderboard is computed on a fixed subset of 160 tasks from the test set and is updated based on your most recent valid submission. It is designed to give participants timely feedback during development and to help guide improvements. At most one leaderboard update is allowed per team per day.
Private Leaderboard The private leaderboard determines the final ranking and competition results. It is based on a separate hidden portion of the test set and uses a more rigorous evaluation protocol, including strengthened semantic judging. The private leaderboard is not visible during the competition. It will be revealed only once, after the final evaluation is complete.
Participants are expected to develop robust systems that generalize well to unseen instructions and task variations. The private evaluation is designed to minimize the risk of overfitting to spurious patterns in the test inputs and to reflect real-world performance more faithfully.
Both leaderboards will be hosted and maintained on Codabench. [Leaderboard link will be updated here]
Submission Instructions:
Participants must submit a Python-based solution that can generate a valid diff for each test task by programmatically invoking large language models.
Your submission should include a file named submission.py, which defines a class Submission that follows the specified interface.
Your code must call a model via an API to generate diffs from the provided inputs. This can be:
A commercial LLM API such as OpenAI GPT-4, Claude, or Gemini
A custom-trained model deployed by you and made accessible via an HTTP API
You are responsible for providing the base URL, authentication keys (e.g., API tokens), and any other necessary configurations to make the API callable from the evaluation environment.
The evaluation system will execute your submission.py code, which should include the logic to call your target model via API and return the predicted diffs.
A starter template will be provided with the required class structure and example usage of common APIs.
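The sketch below shows one possible shape for submission.py. The class name Submission comes from the rules above, but the method name, its signature, and the OpenAI-style client are placeholders; follow the interface defined in the official starter template.

    import os
    from openai import OpenAI  # any HTTP-accessible model client works equally well

    class Submission:
        def __init__(self):
            # Credentials must be configured so the evaluation environment can
            # reach your model's API (commercial or self-hosted).
            self.client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

        def predict(self, full_instruction: str, content_before: str) -> str:
            """Return a unified diff implementing the instruction (placeholder method name)."""
            prompt = (
                "You are editing a Lean 4 file from Mathlib4.\n"
                f"Instruction: {full_instruction}\n"
                f"Current file:\n{content_before}\n"
                "Reply with a unified diff only."
            )
            response = self.client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content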
Submission Sites: https://www.codabench.org/competitions/8357/
Reference:
[1] Xin, H., Li, L., Jin, X., Fleuriot, J., & Li, W. (2025). APE-Bench I: Towards File-level Automated Proof Engineering of Formal Math Libraries. arXiv preprint arXiv:2504.19110.
Track 2: SeePhys
Background:
While mathematical reasoning has become a cornerstone of LLM research, reasoning in the natural sciences, especially physics, remains understudied. On the one hand, physics reasoning inherits much of its methodology from mathematical reasoning; on the other hand, it connects reasoning to real-world modeling, particularly by combining diagrams with symbolic expressions. SeePhys provides a wide-spectrum benchmark of physics questions ranging from middle school to PhD qualifying exams, covering 7 fundamental domains of the physics discipline and featuring 2,000 rigorously curated questions paired with 2,200+ diverse diagrams (e.g., Feynman diagrams, circuit schematics).
Task Description:
Given a textual description of a physics problem along with its corresponding diagram (e.g., schematic, circuit diagram, free-body diagram), generate a textual answer.
Data:
Each data point includes the following fields:
index: Identifier of the data point
question: Text of the question
image_path: List of images
img_category: Type of diagram, such as circuit diagrams and coordinate systems
vision_relevance: Indicates whether the image contains information necessary to solve the problem
language: English/Chinese
level: Knowledge level
sig_figs: Number of significant figures in the answer
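The record below is purely illustrative: the field names follow the list above, but every value is made up, and the exact value conventions may differ in the released data.

    example_record = {
        "index": 42,
        "question": "Find the equivalent resistance between terminals A and B.",
        "image_path": ["images/circuit_042.png"],
        "img_category": "circuit diagram",
        "vision_relevance": True,
        "language": "English",
        "level": "undergraduate",
        "sig_figs": 3,
    }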
Evaluation Rules:
The SeePhys task evaluates models on the accuracy of the final answer. First, an LLM is used to extract the answer from each response; the extracted answer is then compared with the ground truth to compute the matching rate.
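A simplified sketch of this scoring logic is given below. Here extract_answer stands in for the organizers' LLM-based answer extractor and answers_match for their matching rule (for example, numeric comparison to the stated number of significant figures); both are assumptions for illustration.

    def accuracy(responses, ground_truths, extract_answer, answers_match) -> float:
        """Fraction of responses whose extracted final answer matches the ground truth."""
        correct = 0
        for response, truth in zip(responses, ground_truths):
            predicted = extract_answer(response)  # LLM-based extraction in the real pipeline
            if answers_match(predicted, truth):
                correct += 1
        return correct / len(ground_truths)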
Leaderboards:
In this competition, we provide both a public and a private leaderboard for a better participation experience:
Public leaderboard: Team rankings on the public leaderboard are computed on a mixed testmini dataset containing 200 data points. The purpose of this public leaderboard is to help participants debug and tune their systems.
Private leaderboard: The competition awards are based solely on team rankings in the private leaderboard. Each team's highest score on the private leaderboard will be taken as its final result.
Submission Instructions:
We require participants to use local multimodal models or APIs to generate predictions on the test data. You can use gpt-4o or other large multimodal models.
For the public leaderboard, you should submit predictions in 'prediction.json' (auto-evaluated on Codabench).
For the private leaderboard, you should submit predictions in 'final_prediction.json' (results released on July 7, Anywhere on Earth).
During the competition, you can make up to 5 submissions a day, with a maximum of 100 submissions to the public & private leaderboards.
Please download the Starting Kit for more instructions.
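As a minimal sketch, predictions could be written out as follows; the exact schema of prediction.json is defined by the Starting Kit, and the index-to-answer mapping used here is an assumption.

    import json

    def write_predictions(predictions: dict, path: str = "prediction.json") -> None:
        """`predictions` maps each task index to the model's textual answer (assumed schema)."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump(predictions, f, ensure_ascii=False, indent=2)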
Submission Sites: https://www.codabench.org/competitions/7925/
Reference:
[1] Xiang, K., Li, H., Zhang, T. J., Huang, Y., Liu, Z., Qu, P., ... & Liang, X. (2025). SeePhys: Does Seeing Help Thinking? Benchmarking Vision-Based Physics Reasoning. arXiv preprint arXiv:2505.19099.
Important Dates:
Submission Open (Both Track 1 & Track 2): May 21st, 2025
Submission Deadline: July 1st, 2025, AoE
Challenge Result Announcement: July 7th, 2025, AoE
Paper Decision Notification: July 7th, 2025, AoE
Awards:
Track 1
1st place on private leaderboard: 1,000 USD; invited talk at the workshop; an official certificate
2nd place on private leaderboard: 550 USD
3rd place on private leaderboard: 450 USD
Track 2
1st place on private leaderboard: 1,000 USD; invited talk at the workshop; an official certificate
2nd place on private leaderboard: 550 USD
3rd place on private leaderboard: 450 USD
NOTE: Virtual talks will be held via Zoom.
Participants may take part in one or both tracks.
Submitted code must use a machine learning model to generate predictions. Submission of hard-coded prediction results is prohibited.
Users may work in teams of up to 6 people.
Note: The top-3 teams of each track should make their training and testing code publicly available for verification after the testing submission deadline.
Postdoctoral Fellow at ETH AI Center
PhD Student at HKUST (GZ)
Professor at Sun Yat-sen University/MBZUAI
Principal Applied Scientist at AWS
Moonshot AI
PhD Student at U. Edinburgh
PhD Student at SYSU
Undergraduate Student at SYSU