ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions 🔗 (best paper!)
Chuanyang Jin ⋅ Binze Li ⋅ Haopeng Xie ⋅ Cathy Mengying Fang ⋅ Tianjian Li ⋅ Shayne Longpre ⋅ Hongxiang Gu ⋅ Maximillian Chen ⋅ Tianmin Shu
Transferability for General Reasoning: An Automated Curriculum for Multi-Domain LLM RL 🔗 (best paper!)
Yongjin Yang ⋅ Jiarui Liu ⋅ Yinghui He ⋅ Lechen Zhang ⋅ Bernhard Schölkopf ⋅ Zhijing Jin
Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision 🔗 (best paper!)
Yinghui He ⋅ Simran Kaur ⋅ Adithya Bhaskar ⋅ Yongjin Yang ⋅ Jiarui Liu ⋅ Narutatsu Ri ⋅ Liam Fowl ⋅ Abhishek Panigrahi ⋅ Danqi Chen ⋅ Sanjeev Arora
An Adoption-Aware Crop Recommender from Farmer World Feedback 🔗 (best paper!)
Vairaaj Bindal
Reinforcement Learning from Rich Feedback with Distributional DAgger 🔗
Rishabh Agrawal ⋅ Jacob Fein-Ashley ⋅ Paria Rashidinejad
Bootstrapping Language and Numerical Feedback for Reinforcement Learning in LLMs 🔗
Ang Li ⋅ Yifei Wang ⋅ Zhihang Yuan ⋅ Stefanie Jegelka ⋅ Yisen Wang
ECHO: Terminal Agents Learn World Models for Free 🔗
Vaishnavi Shrivastava ⋅ Ahmed Awadallah ⋅ Dimitris Papailiopoulos
Right in the Right Way: Combining Verifiable Rewards with Human Demonstrations 🔗
Mehul Damani ⋅ Isha Puri ⋅ Idan Shenfeld ⋅ Jacob Andreas
EMA Policy Gradient: Taming Reinforcement Learning for LLMs with EMA Anchor and Top-k KL 🔗
Lunjun Zhang ⋅ Jimmy Ba
Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents 🔗
Changdae Oh ⋅ Wendi Li ⋅ Seongheon Park ⋅ Min-Hsuan Yeh ⋅ Tanwi Mallick ⋅ Sharon Li
When Is World Feedback Transferable? A Convergence Gate in Contrastive Reinforcement Learning 🔗
Bruce C Xu ⋅ Jay J Park ⋅ Vivek Buch
Single-Step Initialization for Exploratory Parallel Rollouts in Diffusion LLMs 🔗
Dongjae Jeon ⋅ Bumjun Kim ⋅ Mingyu Kim ⋅ Albert No
Reinforcing VLAs in Task-Agnostic World Models 🔗
Yucen Wang ⋅ Rui Yu ⋅ Fengming Zhang ⋅ Junjie Lu ⋅ Xinyao Qin ⋅ Tianxiang Zhang ⋅ Kaixin Wang ⋅ Li Zhao
Coverage Cliffs in Learning from Logged World Feedback 🔗
Pauline Bourigault
Robust Exploration through Generative Replay 🔗
Inhyuck Song ⋅ Taeyoung Yun ⋅ Jaewoo Lee ⋅ Jinkyoo Park
Coherent Off-Policy Improvement of Large Behaviour Models with Learned Rewards 🔗
Christian Scherer ⋅ Joe Watson ⋅ Daniel Palenicek ⋅ Theo Gruner ⋅ Ingmar Posner ⋅ Jan Peters
What Drives Interactive Improvement from Feedback? 🔗
Bartłomiej Cupiał ⋅ Jan Łojek ⋅ Mikołaj Garstecki ⋅ Szymon J Pobłocki ⋅ Alicja Ziarko ⋅ Piotr Milos
Learning to Ask May Be Better Than Learning to Answer 🔗
Asmi Kumar ⋅ Chuang Gan ⋅ Zhang-Wei Hong
MAVRL: Learning Reward Functions from Multiple Feedback Types with Amortized Variational Inference 🔗
Raphaël Baur ⋅ Yannick Metz ⋅ Maria Gkoulta ⋅ Mennatallah El-Assady ⋅ Giorgia Ramponi ⋅ Thomas Kleine Buening
Yun Li ⋅ Ehsan Javanmardi ⋅ Yidu Zhang ⋅ Simon Thompson ⋅ Qunli Zhang ⋅ Zifan Zeng ⋅ Shiming Liu ⋅ Peng Wang ⋅ Zixuan Guo ⋅ Manabu Tsukada
Self-Improving World Modelling with Latent Actions 🔗
Yifu QIU ⋅ Zheng Zhao ⋅ Waylon Li ⋅ Yftah Ziser ⋅ Anna Korhonen ⋅ Shay Cohen ⋅ Edoardo Ponti
Sunshine Jiang ⋅ John Marangola ⋅ David Zhang ⋅ Raghuram Kowdeed ⋅ Ruiyang Luo ⋅ Nitish Dashora ⋅ Richard Li ⋅ Pulkit Agrawal ⋅ Zhang-Wei Hong
Nibraas Khan ⋅ Gordon Wichern ⋅ Christopher R Laughman
Training Language Agents to Learn from Experience 🔗
Yuval Shalev ⋅ Zifeng Ding ⋅ Mateja Jamnik
Sikata Sengupta ⋅ Guangyi Liu ⋅ Omer Gottesman ⋅ Joseph Durham ⋅ Michael Kearns ⋅ Aaron Roth ⋅ Michael Caldara
FutureSim: Replaying World Events to Evaluate Adaptive Agents 🔗
Shashwat Goel ⋅ Nikhil Chandak ⋅ Arvindh Arun ⋅ Ameya Pandurang Prabhu ⋅ Steffen Staab ⋅ Moritz Hardt ⋅ Maksym Andriushchenko ⋅ Jonas Geiping
Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models 🔗
Samy Jelassi ⋅ Mujin Kwun ⋅ Rosie Zhao ⋅ Yuanzhi Li ⋅ Nicolò Fusi ⋅ Yilun Du ⋅ Sham Kakade ⋅ Carles Domingo i Enrich
Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR 🔗
Soeun Kim ⋅ Albert No
MapShift: Controlled Post-Intervention Evaluation for Embodied World Models 🔗
Aarav Sinha
Current-Inference Transformer Policies for Robust AUV Navigation from World Feedback 🔗
Seihun Kim ⋅ Joongheon Kim
RL Excursions during Pre-training: How early is too early for on-policy learning? 🔗
Rachit Bansal ⋅ Clara Mohri ⋅ Tian Qin ⋅ David Alvarez-Melis ⋅ Sham Kakade
When Internal Feedback Goodharts: SAE Rewards Fail to Improve Robot Success 🔗
Joy Z. Yang ⋅ Socrates Osorio
RAGEN-2: Reasoning Collapse in Agentic RL 🔗
Zihan (Zenus) Wang ⋅ Chi Gui ⋅ Xing Jin ⋅ Qineng Wang ⋅ Licheng Liu ⋅ Kangrui Wang ⋅ Shiqi Chen ⋅ Linjie Li ⋅ Zhengyuan Yang ⋅ Pingyue Zhang ⋅ Yiping Lu ⋅ Jiajun Wu ⋅ Li Fei-Fei ⋅ Lijuan Wang ⋅ Yejin Choi ⋅ Manling Li
Adaptive Action Chunking Strategy from World Feedback in Mixed Traffic 🔗
Hongki Kim ⋅ Sangeun Park ⋅ Minhae Kwon
Workflow-R1: Group Sub-sequence Policy Optimization for Multi-turn Workflow Construction 🔗
Mingze Kong ⋅ Zikun Qu ⋅ Zhongquan Zhou ⋅ Pengyu Liang ⋅ Xiang Li ⋅ Zhiwei Shang ⋅ Zhi Hong ⋅ Kaiyu Huang ⋅ Zhiyong Wang ⋅ Zhongxiang Dai
R Raghav ⋅ Mohammed Farag
Code2World: A GUI World Model via Renderable Code Generation 🔗
Yuhao Zheng ⋅ Li'an Zhong ⋅ Yi Wang ⋅ Rui Dai ⋅ Kaikui Liu ⋅ Xiangxiang Chu ⋅ Linyuan Lü ⋅ Phil Torr ⋅ Kevin Qinghong Lin
Towards Budget-Aware Agents: Can LLM Agents Know What They Will Spend? 🔗
Yuxiang Lin ⋅ Zihan (Zenus) Wang ⋅ Mengyang Liu ⋅ Yuxuan Shan ⋅ Longju Bai ⋅ Junyao Zhang ⋅ Xing Jin ⋅ Boshan Chen ⋅ Jinyan Su ⋅ Xingyao Wang ⋅ Jiaxin Pei ⋅ Manling Li
Critic-Guided Reinforcement Unlearning in Text-to-Image Diffusion 🔗
Mykola Vysotskyi ⋅ Zakhar Kohut ⋅ Mariia Shpir ⋅ Taras Rumezhak ⋅ Volodymyr Karpiv
Counterfactual Transport Flows for Offline Conservative Trajectory Refinement 🔗
Lena Krieger ⋅ Xuan Zhao ⋅ Zhuo Cao ⋅ Qin Wang ⋅ Hanno Scharr ⋅ Ira Assent
The Interplay of Harness Design and Post-Training in LLM Agents 🔗
Kyungmin Kim ⋅ Youngbin Choi ⋅ Seoyeon Lee ⋅ Suhyeon Jun ⋅ Dongwoo Kim ⋅ Sangdon Park
Real-world Reinforcement Learning from Suboptimal Interventions 🔗
Yinuo Zhao ⋅ Huiqian Jin ⋅ Lechun Jiang ⋅ Xinyi Zhang ⋅ Kun Wu ⋅ Pei Ren ⋅ Zhiyuan Xu ⋅ Zhengping Che ⋅ Junjie Ji ⋅ Lei Sun ⋅ Dapeng Wu ⋅ Chi Harold Liu ⋅ Jian Tang
Ethan Y Wang ⋅ Aayan Alwani
Aligning Language Models from User Interactions 🔗
Thomas Kleine Buening ⋅ Jonas Hübotter ⋅ Barna Pasztor ⋅ Idan Shenfeld ⋅ Giorgia Ramponi ⋅ Andreas Krause
Rank-Then-Act: Reward-Free Control from Frame-Order Progress 🔗
Yuriy Maksyuta ⋅ George Bredis ⋅ Ruslan Rakhimov ⋅ Daniil Gavrilov
Making Execution Time a Trainable Reward for Code Generation 🔗
Pierre Chambon ⋅ Kunhao Zheng ⋅ Juliette Decugis ⋅ Benoît Sagot ⋅ Gabriel Synnaeve
Unified Noise Steering for Efficient Human-Guided VLA Adaptation 🔗
Junjie Lu ⋅ Xinyao Qin ⋅ Yuhua Jiang ⋅ Tianxiang Zhang ⋅ Xiaoyu Chen ⋅ Kaixin Wang ⋅ Chuheng Zhang ⋅ Bin Liang ⋅ Jun Yang ⋅ Min Xu ⋅ Li Zhao
Yanwei CUI ⋅ Xing Zhang ⋅ Guanghui Wang ⋅ Peiyang He
Yuanzheng Zhu ⋅ Mingyu Zhao ⋅ Chinmay Savadikar ⋅ Han Li ⋅ Shuang Xie ⋅ Alberto Castelo ⋅ Tianfu Wu ⋅ Lingyun Wang
Mahalanobis-Guided Latent OOD Detection for Hybrid RL-ES Control in Time-Varying Systems 🔗
Shaifalee Saxena ⋅ Alexander Scheinker
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models 🔗
Wen Huang ⋅ Haoran Sun ⋅ Yongjian Guo ⋅ Yunxuan Ma ⋅ Haoran Li ⋅ Jing Long ⋅ Zhouying Mo ⋅ Guan Zhong ⋅ Yucheng Guo ⋅ Shuai Di ⋅ Xiong J Wu
A Predictive Law for On-Policy Self-Distillation From World Feedback 🔗
Tommy He ⋅ Jerome Sieber ⋅ Matteo Saponati
On-Policy Self-Distillation via Prompt Optimization 🔗
Jongho Park ⋅ Donghyun Lee ⋅ Matei Zaharia ⋅ Jason Lee
Learning to Ideate for Scientific Impact 🔗
Shubham Kale ⋅ Aniketh Garikaparthi ⋅ Manasi Patwardhan
Haochen Wu ⋅ Yi Hou ⋅ Shiguang Xie
Learning Stateful Predictive Knowledge From Experience 🔗
Yan Song ⋅ Xidong Feng ⋅ Bo Liu ⋅ Xinyu Cui ⋅ Zichen Liu ⋅ Haotian Fu ⋅ Mengyue Yang ⋅ Cheng Deng ⋅ Jian Zhao ⋅ Jun Wang
Bayesian Preference Learning for Test-Time Steerable Reward Models 🔗
Jiwoo Hong ⋅ Shao Tang ⋅ Zhipeng Wang
Preference Alignment Improves Information Conveyance in Language Models 🔗
Yuwei Cheng ⋅ Weiyi Tian ⋅ Haifeng Xu
Multi-Rollout On-Policy Distillation via Peer Successes and Failures 🔗
Weichen Yu ⋅ Xiaomin Li ⋅ Yizhou Zhao ⋅ Xiaoze Liu ⋅ Ruowang Zhang ⋅ Haixin Wang ⋅ Yinyi Luo ⋅ Chen Wu ⋅ Gaurav Mittal ⋅ Matt Fredrikson ⋅ Yu Hu
Tasha Pais ⋅ Richard E.L. Higgins
Junze Ye ⋅ Jiayi Cheng ⋅ Lu Miao ⋅ Michal Mankowski ⋅ Jose Blanchet ⋅ Mohsen Bayati
Learning Visually-Grounded Active View Selection for Embodied Question Answering 🔗
Juil Koo ⋅ Daehyeon Choi ⋅ Sangwoo Youn ⋅ Phillip Y Lee ⋅ Minhyuk Sung
Encoder-Adapted Sim-to-Real Transfer of Simulation-Trained Diffusion Policies for Robot Manipulation 🔗
Chanhyuk Jung ⋅ Dasom Ahn ⋅ Sungkeun Yoo ⋅ Byoung Chul Ko
Can We Really Learn One Representation to Optimize All Rewards? 🔗
Chongyi Zheng ⋅ Royina Karegoudra Jayanth ⋅ Benjamin Eysenbach
Guilin Zhang ⋅ Summer Sun ⋅ Shahryar Sarkani ⋅ John M Fossaceca
Retroactive Advantage Correction: Closed-Form V-Trace Bias Correction for Delay-Aware RLHF 🔗
Arnav Raj
SignalBench: Comparing Dense Feedback Methods for Long-Horizon Agents 🔗
Sergio Hernández-Gutiérrez ⋅ Matteo Merler ⋅ Ilze Amanda Auzina ⋅ Joschka Strüber ⋅ Ameya Pandurang Prabhu ⋅ Matthias Bethge
MOPD: Multi-Teacher On-Policy Distillation for Capability Integration in LLM Post-Training 🔗
Wenhan Ma ⋅ Jianyu Wei ⋅ Liang Zhao ⋅ Hailin Zhang ⋅ Bangjun Xiao ⋅ Lei Li ⋅ Qibin Yang ⋅ Bofei Gao ⋅ Yudong Wang ⋅ Rang Li ⋅ Jinhao Dong ⋅ Fuli Luo ⋅ Zhifang Sui
Bridging Physical Reasoning and Task Generalization via Visual Action Outcome Reasoning Alignment 🔗
Han J Ko ⋅ Jr-Jen Chen ⋅ Haobo Yuan ⋅ Hsin-Ying Lee ⋅ Tiancheng SHEN ⋅ Ming-Hsuan Yang ⋅ Yu-Chiang Wang
Learning from World Feedback: Why Model Uncertainty Fails as a Risk Signal in Model-Based RL 🔗
Zhaohui Wang
The Role of Feedback Alignment in Self-Distillation 🔗
Semih Kara ⋅ Oguzhan Ersoy
AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems 🔗
Sumeet Motwani ⋅ Chuan Du ⋅ Aleksandar Petrov ⋅ Christopher E Davis ⋅ Phil Torr ⋅ Antonio R Papania-Davis ⋅ Weishi Yan
EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models 🔗
Perry Dong ⋅ Kuo-Han Hung ⋅ Tian Gao ⋅ Dorsa Sadigh ⋅ Chelsea Finn
Manoj Saravanan
ToolSelf: Unifying Task Execution and Self-Reconfiguration via Tool-Driven Emergent Adaptation 🔗
Jingqi Zhou ⋅ Sheng Wang ⋅ Dezhao Deng ⋅ Junwen Lu ⋅ Junwei Su ⋅ Qintong Li ⋅ Jiahui Gao ⋅ Hao Wu ⋅ Jiyue Jiang ⋅ Lingpeng Kong ⋅ Dunhong Jin ⋅ Chuan Wu
World Feedback for Clinical Agents: Diagnosing RL in FHIR Environments 🔗
Ananya Mantravadi ⋅ Harshit Rajgarhia ⋅ Prasanna Desikan ⋅ Abhishek Mukherji
When Does Non-Uniform Replay Matter in Reinforcement Learning? 🔗
Michał Korniak ⋅ Mikołaj Czarnecki ⋅ Yarden As ⋅ Piotr Milos ⋅ Pieter Abbeel ⋅ Michal Nauman
Training AI Co-Scientists using Rubric Rewards 🔗
Shashwat Goel ⋅ Rishi Hazra ⋅ Dulhan Jayalath ⋅ Timon Willi ⋅ Parag Jain ⋅ Shen ⋅ Ilias Leontiadis ⋅ Francesco Barbieri ⋅ Yoram Bachrach ⋅ Jonas Geiping ⋅ Chenxi Whitehouse
FlowLPS: Langevin-Proximal Sampling for Measurement-Feedback-Guided Generative Inference 🔗
Jonghyun Park ⋅ Jong Chul YE
PRICE-RL: Selection–Transmission Decomposed Reinforcement Learning for Sequential Biological Design 🔗
Bryan Cheng ⋅ Austin Jin ⋅ Jasper Zhang
Playing with Fire: What Transfers When RL Trains a Language Agent? 🔗
Kaousheik Jayakumar ⋅ Mahesh Ramesh ⋅ Hemanth Ram ⋅ Pavan Thodima ⋅ Ramani Duraiswami ⋅ Dinesh Manocha ⋅ Aniket Rege ⋅ Emmanouil-Vasileios Vlatakis-Gkaragkounis
Learning, Fast and Slow: Towards LLMs That Adapt Continually 🔗
Rishabh Tiwari ⋅ Kusha Sareen ⋅ Lakshya A Agrawal ⋅ Joseph E Gonzalez ⋅ Matei Zaharia ⋅ Kurt Keutzer ⋅ Inderjit Dhillon ⋅ Rishabh Agarwal ⋅ Fnu Devvrit
PubSwap: Public-Data Off-Policy Coordination for Federated RLVR 🔗
Anupam Nayak ⋅ Baris Askin ⋅ Muhammed Ustaomeroglu ⋅ Carlee Joe-Wong ⋅ Gauri Joshi
On Learning to Think with Action Process Reward Models 🔗
Michael Zhang ⋅ Madison Ho
Reward, or the Observation Stream? Auditing World Model Quality in RL on a Turbofan Substrate 🔗
Lucas Thil ⋅ Jesse Read ⋅ Rim Kaddah ⋅ Guillaume Doquet
Austin Jin ⋅ Bryan Cheng ⋅ Jasper Zhang
VLFEEDBACK-EEG: Neural Signals as Implicit Feedback for Vision-Language Model Alignment 🔗
Angela Lopez Cardona ⋅ Sebastian Idesis ⋅ Mireia M Bruns ⋅ Matteo Mazzini ⋅ Joemon Jose ⋅ Sergi Abadal ⋅ Ioannis Arapakis
Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking 🔗
Vaidehi Bagaria ⋅ Nikshep Grampurohit ⋅ Pulkit Verma
Icey S Ai ⋅ Lily Xu