Stochastic control and reinforcement learning have recently witnessed remarkable growth, fueled by advances in dynamic programming, approximate inference, deep learning, and high-performance computing. These methods have demonstrated their ability to tackle complex decision-making problems under uncertainty across a wide range of domains, including robotics, autonomous systems, finance, and healthcare. Despite these successes, several fundamental challenges remain. Key open problems include ensuring robust performance in the presence of noise or adversarial conditions, improving efficiency and scalability, establishing strong theoretical guarantees, and bridging the gap between theoretical advances and real-world deployment.
To address these challenges and foster cross-disciplinary collaboration, this workshop, held in conjunction with ACM SIGMETRICS 2025, aims to bring together researchers and practitioners from control theory, machine learning, operations research, and related disciplines. The program will feature invited talks from leading experts and opportunities for participants to exchange ideas on recent breakthroughs, open problems, and future directions in stochastic control and reinforcement learning.
Organizers: Gauri Joshi (Carnegie Mellon University), Mehrdad Moharrami (University of Iowa), Srinivas Shakkottai (Texas A&M University)
Date: Friday, June 13, 2025
Location: Wang Center at Stony Brook University, Room 102
Schedule at a Glance:
Speaker: Weina Wang (Carnegie Mellon University)
Website: https://www.cs.cmu.edu/~weinaw
Time: 9:00 am - 9:30 am
Abstract: Heterogeneity poses a fundamental challenge for many real-world large-scale decision-making problems but remains largely understudied. In this work, we study the fully heterogeneous setting of a prominent class of such problems, known as weakly coupled Markov decision processes (WCMDPs). Each WCMDP consists of N arms (or subproblems), which have distinct model parameters in the fully heterogeneous setting, leading to the curse of dimensionality when N is large. We show that, under mild assumptions, an efficiently computable policy achieves an $O(1/\sqrt{N})$ optimality gap in the long-run average reward per arm for fully heterogeneous WCMDPs as N becomes large. This is the first asymptotic optimality result for fully heterogeneous average-reward WCMDPs. Our main technical innovation is the construction of projection-based Lyapunov functions that certify the convergence of rewards and costs to an optimal region, even under full heterogeneity.
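To make the object in this abstract concrete: a standard starting point for policies of this kind is the occupancy-measure linear-programming relaxation of the weakly coupled MDP, whose value upper-bounds the optimal long-run average reward per arm. The Python sketch below uses made-up transition kernels, rewards, costs, and budgets, and shows only the generic relaxation, not the projection-based Lyapunov construction of the talk.

    import numpy as np
    from scipy.optimize import linprog

    # Hypothetical small instance: N fully heterogeneous arms, each with S states and A actions.
    rng = np.random.default_rng(0)
    N, S, A, K = 3, 4, 2, 1                          # arms, states, actions, coupling constraints
    P = rng.dirichlet(np.ones(S), size=(N, S, A))    # P[n, s, a, s']: per-arm transition kernels
    r = rng.random((N, S, A))                        # per-arm rewards
    c = rng.random((N, S, A, K))                     # per-arm resource consumption
    b = 0.8 * N * np.ones(K)                         # budgets that scale with N

    # Decision variables: stationary occupancy measures y[n, s, a] >= 0, flattened in (n, s, a) order.
    nvar = N * S * A
    def idx(n, s, a):
        return (n * S + s) * A + a

    obj = -r.reshape(-1)                             # maximize total average reward = minimize -<r, y>

    A_eq, b_eq = [], []
    for n in range(N):                               # per-arm normalization: sum_{s,a} y[n,s,a] = 1
        row = np.zeros(nvar)
        row[n * S * A:(n + 1) * S * A] = 1.0
        A_eq.append(row); b_eq.append(1.0)
    for n in range(N):                               # per-arm stationarity (flow balance)
        for s in range(S):
            row = np.zeros(nvar)
            for a in range(A):
                row[idx(n, s, a)] += 1.0
            for sp in range(S):
                for a in range(A):
                    row[idx(n, sp, a)] -= P[n, sp, a, s]
            A_eq.append(row); b_eq.append(0.0)

    # Coupling constraints tie the arms together: sum_{n,s,a} c[n,s,a,k] y[n,s,a] <= b[k].
    A_ub = c.reshape(nvar, K).T

    res = linprog(obj, A_ub=A_ub, b_ub=b, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * nvar, method="highs")
    assert res.success, res.message
    print("LP upper bound on the optimal average reward per arm:", -res.fun / N)

Asymptotic-optimality results of the kind described in the abstract show how such a relaxation can be turned into a feasible policy whose per-arm reward approaches this bound as N grows.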
Bio: Weina Wang is an Associate Professor in the Computer Science Department at Carnegie Mellon University. Her research lies in the broad area of applied probability, with a focus on decision-making in large stochastic systems. She was a joint postdoctoral research associate in the Coordinated Science Lab at the University of Illinois at Urbana-Champaign, and in the School of ECEE at Arizona State University, from 2016 to 2018. She received her Ph.D. degree in Electrical Engineering from Arizona State University in 2016, and her Bachelor’s degree from the Department of Electronic Engineering at Tsinghua University in 2009. Her dissertation received the Dean’s Dissertation Award in the Ira A. Fulton Schools of Engineering at Arizona State University in 2016. She received the Kenneth C. Sevcik Outstanding Student Paper Award at ACM SIGMETRICS 2016, the Best Paper Award at ACM MobiHoc 2022, an NSF CAREER award in 2022, and the ACM SIGMETRICS Rising Star Research Award in 2023.
Speaker: Atilla Eryilmaz (The Ohio State University)
Website: https://www.atillaeryilmaz.com
Time: 9:30 am - 10:00 am
Abstract: We study a natural policy gradient method based on recurrent neural networks (RNNs) for partially observable Markov decision processes, whereby RNNs are used for policy parameterization and policy evaluation to address the curse of dimensionality in non-Markovian reinforcement learning. We present finite-time and finite-width analyses for both the critic (recurrent temporal difference learning) and the corresponding recurrent natural policy gradient method in the near-initialization regime. Our analysis demonstrates the efficiency of RNNs for problems with short-term memory, with explicit bounds on the required network widths and sample complexity, and points out the challenges in the case of long-term dependencies.
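As a heavily simplified illustration of the recurrent policy parameterization discussed here (the dimensions, initialization, and absence of any training are illustrative choices, not the speakers' setup), the Python sketch below shows an Elman-style RNN policy: the hidden state summarizes the observation history, and the action distribution depends only on that summary.

    import numpy as np

    rng = np.random.default_rng(1)
    obs_dim, hid_dim, n_actions = 5, 16, 3
    W_in = 0.1 * rng.standard_normal((hid_dim, obs_dim))
    W_rec = 0.1 * rng.standard_normal((hid_dim, hid_dim))
    W_out = 0.1 * rng.standard_normal((n_actions, hid_dim))

    def act(obs_history):
        """Sample an action given the full observation history of a POMDP episode."""
        h = np.zeros(hid_dim)
        for o in obs_history:                 # the recurrent state compresses the non-Markovian history
            h = np.tanh(W_in @ o + W_rec @ h)
        logits = W_out @ h
        p = np.exp(logits - logits.max())
        p /= p.sum()                          # softmax policy over actions
        return rng.choice(n_actions, p=p), p

    history = [rng.standard_normal(obs_dim) for _ in range(4)]
    action, probs = act(history)
    print("sampled action:", action, "policy:", np.round(probs, 3))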
Bio: Atilla Eryilmaz received his M.S. and Ph.D. degrees in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign in 2001 and 2005, respectively. Between 2005 and 2007, he worked as a Postdoctoral Associate at the Laboratory for Information and Decision Systems at the Massachusetts Institute of Technology. Since 2007, he has been at The Ohio State University, where he is currently a Professor and the Graduate Studies Chair of the Electrical and Computer Engineering Department. Dr. Eryilmaz's research interests span optimal control of stochastic networks, machine learning, optimization, and information theory.
Speaker: Vijay Subramanian (University of Michigan)
Website: https://subramanian.engin.umich.edu
Time: 10:00 am - 10:30 am
Abstract: Multi-agent systems appear in many engineering and socioeconomic settings, wherein a group of agents or controllers interact with each other in a shared and possibly non-stationary environment, and make sequential decisions based on their own information using a (causal) interaction mechanism. In this talk, we focus on cooperative sequential decision making under uncertainty—a decentralized team, where a fixed finite number of agents act as a team with the common goal of minimizing a long-term cost function. We investigate the general situation where one long-term (objective) cost must be minimized, while maintaining multiple other long-term (constraint) costs within prescribed limits, via a cooperative Multi-Agent Constrained Partially Observable Markov Decision Process (MAC-POMDP) model. Such constrained sequential team decision problems arise in several real-world applications where efficient operation must be balanced with maintaining safe operating margins—such considerations arise in communication networks, traffic management, energy-grid optimization, e-commerce pricing, environmental monitoring, etc. We focus on the discounted cost criterion, and start by establishing general results on Lagrangian duality and the existence of a global saddle-point. Next, we consider decentralized policy-profiles and their mixtures, and establish that when agents mix jointly over their policy-profiles, there is no (Lagrangian) duality gap, and a global saddle-point exists under Slater's condition. However, when agents mix independently over their policy-profiles, we show (through a concrete counterexample) that a non-zero duality gap can exist. Then, we consider coordination policies and their mixtures, and establish that, except for pure coordination policies, they are all equivalent to joint mixtures of decentralized policy-profiles. This equivalence result helps reformulate the original multi-agent constrained optimization problem into a single-agent constrained optimization problem, which is then used to propose a primal-dual framework for model-based optimal control. Finally, we extend the notion of a Multi-Agent Approximate Information State (MA-AIS) to constrained decision making, and formalize MA-AIS based coordination policies and their mixtures. We establish through a concrete counterexample that, in contrast to behavioral coordination policies, MA-AIS based behavioral coordination policies and their mixtures are not equivalent. We also establish approximate optimality of mixtures of MA-AIS based coordination policies, and use this result to guide the development of a data-driven alternative for the aforementioned model-based primal-dual framework.
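The Lagrangian primal-dual machinery at the heart of this abstract is easiest to see in a single-agent, fully observed toy problem. The Python sketch below (a made-up model, deterministic best responses, and a made-up constraint threshold; the MAC-POMDP setting in the talk is far richer) relaxes the constraint with a multiplier, best-responds in the primal with the modified reward r - lam*c, and runs projected dual ascent on the multiplier.

    import numpy as np

    rng = np.random.default_rng(2)
    S, A, gamma, budget = 4, 3, 0.9, 5.0
    P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a, s']
    r = rng.random((S, A))                         # objective reward
    c = rng.random((S, A))                         # constraint cost; want discounted cost <= budget

    def greedy_policy(reward):
        """Value iteration for the unconstrained discounted problem with the given reward."""
        V = np.zeros(S)
        for _ in range(500):
            Q = reward + gamma * P @ V
            V = Q.max(axis=1)
        return Q.argmax(axis=1)

    def discounted_value(policy, signal, s0=0):
        """Exact discounted value of `signal` under a deterministic policy, from state s0."""
        P_pi = P[np.arange(S), policy]
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, signal[np.arange(S), policy])
        return v[s0]

    lam = 0.0
    for k in range(200):
        pi = greedy_policy(r - lam * c)            # primal best response to the current multiplier
        violation = discounted_value(pi, c) - budget
        lam = max(0.0, lam + 0.5 / (k + 1) * violation)   # projected dual (sub)gradient ascent
    print("multiplier:", round(lam, 3),
          "constraint value:", round(discounted_value(pi, c), 3), "budget:", budget)

Note that the primal step here returns deterministic policies, whereas optimal constrained policies may require randomization (or, in the multi-agent setting of the talk, joint mixing over policy-profiles); this is exactly where the duality-gap subtleties discussed in the abstract enter.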
Bio: Vijay Subramanian received the Ph.D. in electrical engineering from the University of Illinois at Urbana-Champaign in 1999. He worked for many years at Motorola Inc. and at the Hamilton Institute, Maynooth, Ireland, and then in the Department of Electrical Engineering and Computer Science at Northwestern University. In fall 2014, he joined the Department of Electrical Engineering and Computer Science at the University of Michigan as an associate professor and was promoted to professor in fall 2024. His research interests are in stochastic analysis, random graphs, multi-agent systems, and game theory (mechanism and information design) with applications to social, economic, and technological networks.
Speaker: Jessica Sorrell (Johns Hopkins University)
Website: https://jess-sorrell.github.io
Time: 11:00 am - 11:30 am
Abstract: Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, both theoretically (where worst-case sample and computational complexities must scale with state space cardinality) and experimentally (where function approximation and policy gradient techniques often scale poorly and suffer from instability and high variance). One line of research attempting to address these difficulties makes the natural assumption that we are given a collection of heuristic base or constituent policies upon which we would like to improve in a scalable manner. In this talk, we will give an algorithm to learn a policy that competes with the max-following policy, which at each state follows the action of whichever constituent policy has the highest value, without prior knowledge of the value functions of the constituent policies. The max-following policy is always at least as good as the best constituent policy, and may be considerably better. In contrast to prior work in similar settings, our theoretical results require only the minimal assumption of an ERM oracle for value function approximation for the constituent policies (and not the global optimal policy or the max-following policy itself) on samplable distributions. We will then discuss the algorithm's experimental effectiveness on several robotic simulation testbeds.
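To pin down the comparator: below is a minimal sketch of the max-following rule, written as if the constituent value functions were handed to us (which the talk's algorithm explicitly does not assume; all names here are illustrative).

    import numpy as np

    def max_following_action(state, constituent_policies, value_fns):
        """Follow the action of whichever constituent policy has the highest value at this state."""
        values = [V(state) for V in value_fns]
        best = int(np.argmax(values))
        return constituent_policies[best](state)

    # Toy usage: two constant constituents on a one-dimensional state.
    policies = [lambda s: 0, lambda s: 1]
    values = [lambda s: -abs(s - 2.0), lambda s: -abs(s + 2.0)]
    print(max_following_action(1.0, policies, values))   # prints 0: constituent 0 has the higher value at s = 1.0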
Bio: Jessica Sorrell is an Assistant Professor in the Department of Computer Science at Johns Hopkins University. Previously, she was a postdoc at the University of Pennsylvania, working with Aaron Roth and Michael Kearns. She completed her PhD in Computer Science at the University of California, San Diego, advised by Daniele Micciancio and Russell Impagliazzo. She is broadly interested in the theoretical foundations of responsible computing, particularly questions related to fairness, privacy, robustness, and reproducibility.
Speaker: Sajad Khodadadian (Virginia Tech)
Time: 11:30 am - 12:00 pm
Abstract: Polyak averaging is a well-known technique for achieving asymptotically optimal convergence in Stochastic Approximation. In this work, we establish the first high-probability bound for general Stochastic Approximation with Polyak Averaging. We take a black-box approach, assuming access to an anytime high-probability bound for a given Stochastic Approximation, and derive tight finite-time bounds for its Polyak-averaged version. Applying our black-box framework to general contractive Stochastic Approximation, we analyze the impact of averaging under various settings.
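A minimal numerical illustration of the effect being analyzed (a made-up contractive stochastic approximation, not the paper's setting): the Polyak average of the iterates is typically much closer to the fixed point than the last iterate.

    import numpy as np

    rng = np.random.default_rng(3)
    theta_star = np.array([1.0, -2.0])
    F = lambda th: 0.5 * (theta_star - th)            # contractive mean field with fixed point theta_star

    theta = np.zeros(2)
    avg = np.zeros(2)
    for k in range(1, 20001):
        noise = 0.5 * rng.standard_normal(2)
        theta = theta + (F(theta) + noise) / k**0.7   # stochastic approximation with decaying step size
        avg += (theta - avg) / k                      # running (Polyak) average of the iterates
    print("last-iterate error:    ", np.linalg.norm(theta - theta_star))
    print("averaged-iterate error:", np.linalg.norm(avg - theta_star))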
Bio: Sajad Khodadadian is an Assistant Professor in the Grado Department of Industrial and Systems Engineering at Virginia Tech. Dr. Khodadadian's research focuses on the design and analysis of reinforcement learning algorithms. He received his PhD in Operations Research from the School of Industrial and Systems Engineering at the Georgia Institute of Technology.
Speaker: Mehrdad Moharrami (University of Iowa)
Time: 12:00 pm - 12:30 pm
Abstract: We consider a constrained Markov decision process (CMDP) setting in which the agent aims to maximize the expected cumulative reward, subject to a constraint that the entropic risk of the total utility exceeds a given threshold. Unlike in the risk-neutral case, standard primal-dual approaches fail to directly yield regret and violation bounds, as value iteration with respect to a combined state-action value function is not applicable in the risk-sensitive setting. To address this, we adopt the Optimized Certainty Equivalent (OCE) representation of the entropic risk measure and reformulate the problem by augmenting the state space with a continuous budget variable. We then propose a primal-dual algorithm tailored to this augmented formulation. In contrast to the standard approach for risk-neutral CMDPs, our method incorporates a truncated dual update to account for the possible absence of strong duality. We show that the proposed algorithm achieves regret of Õ(V_{g,max}·K^{3/4} + √(H^4·S^2·A·log(1/δ))·K^{3/4}) and constraint violation of Õ(V_{g,max}·√(H^3·S^2·A·log(1/δ))·K^{3/4}) with probability at least 1−δ, where S and A denote the cardinalities of the state and action spaces, respectively, H is the episode length, K is the number of episodes, α < 0 is the risk-aversion parameter, and V_{g,max} = (1/|α|)(exp(|α|·H) − 1). To the best of our knowledge, this is the first result establishing sublinear regret and violation bounds for the risk-sensitive CMDP problem.
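A quick numerical sanity check of the OCE representation mentioned in the abstract (an illustration on synthetic Gaussian samples, not part of the paper): for α < 0, the entropic risk (1/α)·log E[exp(α·X)] equals the supremum over η of { η + E[u(X − η)] } with utility u(t) = (exp(α·t) − 1)/α. The Python sketch below checks the two sides agree.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(4)
    alpha = -0.5                                       # risk-aversion parameter (alpha < 0)
    X = rng.normal(loc=3.0, scale=1.0, size=200_000)   # samples of the total utility

    entropic = np.log(np.mean(np.exp(alpha * X))) / alpha

    def neg_oce(eta):
        u = (np.exp(alpha * (X - eta)) - 1.0) / alpha
        return -(eta + np.mean(u))

    oce = -minimize_scalar(neg_oce, bounds=(X.min(), X.max()), method="bounded").fun
    print("entropic risk:", round(entropic, 4), " OCE value:", round(oce, 4))   # the two should agree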
Bio: Mehrdad Moharrami is an Assistant Professor of Computer Science at the University of Iowa. He received his Ph.D. in Electrical Engineering from the University of Michigan, where he was awarded the Rackham Predoctoral Fellowship. Prior to joining Iowa, he was a TRIPODS Postdoctoral Research Fellow at the University of Illinois at Urbana-Champaign, working with Prof. R. Srikant. His research focuses on reinforcement learning, Markov decision processes, and algorithms for learning and economics in networked systems modeled as random graphs.
Speaker: Zhaoran Wang (Northwestern University)
Website: https://zhaoranwang.github.io
Time: 1:30 pm - 2:00 pm
Abstract: We explore two complementary frameworks that harness reinforcement learning (RL) to unlock the full potential of large language models (LLMs) for “reasoning” (rationale generation) and “acting” (decision making).
- The first part introduces RAFA — a prompting mechanism that enables autonomous LLM agents to “reason for the future, act for now” in complex long-horizon decision-making tasks. In particular, we implement the reasoning process as learning and planning in Bayesian-adaptive Markov decision processes (MDPs), where the learning and planning subroutines employ in-context learning to emulate actor-critic updates. By guiding short-term acting with long-term reasoning, RAFA achieves provable regret guarantees while outperforming most existing approaches.
- The second part introduces BRiTE — a training paradigm that enhances the reasoning ability of LLMs. In particular, we view the reasoning process via a latent variable model, which yields a bootstrapping strategy that resembles the expectation-maximization (EM) algorithm. Here, the “E-step” samples rationales from the base model via RL in token-level MDPs, while the “M-step” distills rationales into the base model via supervised learning (SL). Through the RL-SL update, BRiTE enjoys provable convergence guarantees while providing significant empirical improvements, especially on various challenging benchmarks, without requiring expensive human-annotated rationales.
Together, RAFA and BRiTE showcase two distinct yet synergistic applications of RL — at the task level and the token level — for enhancing both the decision-making and rationale-generating abilities of LLMs.
Bio: Zhaoran Wang is an associate professor at Northwestern University, working at the interface of machine learning, statistics, and optimization. He is the recipient of the AISTATS (Artificial Intelligence and Statistics Conference) notable paper award, ASA (American Statistical Association) best student paper in statistical learning and data mining, INFORMS (Institute for Operations Research and the Management Sciences) best student paper finalist in data mining, Microsoft Ph.D. Fellowship, Simons-Berkeley/J.P. Morgan AI Research Fellowship, Amazon Machine Learning Research Award, and NSF CAREER Award.
Speaker: Jian Li (Stony Brook University)
Time: 2:00 pm - 2:30 pm
Abstract: Restless multi-armed bandits (RMABs) are widely used for constrained sequential decision-making, where each arm evolves via a Markov chain and generates scalar rewards. However, specifying a precise reward function is often difficult or infeasible in practice. To address this, we introduce Pref-RMAB, a new RMAB model that relies on pairwise preference feedback rather than scalar rewards. While preference feedback provides less information, we propose DOPL (Direct Online Preference Learning), an algorithm that efficiently explores the environment, collects preference data online, and directly leverages it for decision-making. We prove that DOPL achieves the first sublinear regret bound for RMABs with preference signals. Experiments validate its effectiveness.
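To illustrate the feedback model (not the DOPL algorithm itself): instead of scalar rewards, the learner only sees pairwise comparisons, for example generated by a Bradley-Terry model over latent per-arm rewards. Everything in the sketch below, including the latent rewards and the Bradley-Terry form, is an illustrative assumption.

    import numpy as np

    rng = np.random.default_rng(5)
    latent_reward = np.array([0.2, 0.8, 0.5])      # unknown per-arm mean rewards

    def preference_feedback(i, j):
        """Return 1 if arm i is preferred to arm j in a single noisy comparison."""
        p = np.exp(latent_reward[i]) / (np.exp(latent_reward[i]) + np.exp(latent_reward[j]))
        return int(rng.random() < p)

    wins = sum(preference_feedback(1, 0) for _ in range(10_000))
    print("arm 1 preferred to arm 0 in", wins / 10_000, "of comparisons")   # informative about the reward gap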
Bio: Jian Li is an Assistant Professor of Data Science at Stony Brook University. He received his Ph.D. in Computer Engineering from Texas A&M University in 2016, and his B.E. from Shanghai Jiao Tong University in 2012. His current research interests lie at the intersection of algorithms for reinforcement learning and multi-armed bandits, decentralized/federated learning, and stochastic optimization and control, with applications to next-generation networked systems. He is a recipient of the NSF CAREER Award (2024) and NSF CISE CRII Award (2021).
Speaker: Guannan Qu (Carnegie Mellon University)
Website: http://www.guannanqu.com
Time: 2:30 pm - 3:00 pm
Abstract: In this talk, we focus on a class of sampling-based optimization (SBO) algorithms that have gained remarkable empirical success on optimal control problems in robotics that are non-convex and even discontinuous, often defying the conventional wisdom that such problems are hard. We first reveal that these SBO algorithms are in fact zeroth-order gradient methods for a smoothed version of the objective function, with the smoothing parameter determined by the sampling variance. Then, we provide a characterization of the landscape of the smoothed objective: smoothing renders the landscape more benign by enlarging the convex region around the global minimizer, but at the cost of introducing an optimality gap. Building on this insight, we establish non-asymptotic convergence guarantees for SBO algorithms to a neighborhood of the global minimizer. Further, we propose new SBO algorithms with diffusion-style annealing that lead to the first optimization-based method that optimizes over full-order quadruped dynamics in real time.
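The smoothing connection in the first part of this abstract can be seen in a few lines. The Python sketch below (a toy one-dimensional objective and made-up hyperparameters, not the robotics setting of the talk) runs a sampling-based optimizer that is a zeroth-order gradient method on the Gaussian-smoothed objective f_sigma(x) = E[f(x + sigma*u)].

    import numpy as np

    rng = np.random.default_rng(6)
    f = lambda x: np.abs(x - 1.0) + 0.5 * (x > 2.0)   # non-convex, discontinuous; global minimizer at x = 1

    x, sigma, lr, n = 4.0, 0.5, 0.2, 128
    for _ in range(300):
        u = rng.standard_normal(n)
        # Monte Carlo estimate of grad f_sigma(x) = E[(f(x + sigma*u) - f(x)) * u] / sigma
        grad = np.mean((f(x + sigma * u) - f(x)) * u) / sigma
        x -= lr * grad
    print("final x:", round(x, 3), "(smoothing trades an optimality gap for a more benign landscape)")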
Bio: Guannan Qu has been an Assistant Professor in the Electrical and Computer Engineering Department at Carnegie Mellon University since September 2021. He received his B.S. degree in Electrical Engineering from Tsinghua University in Beijing, China, in 2014, and his Ph.D. in Applied Mathematics from Harvard University in Cambridge, MA, in 2019. He was a CMI and Resnick postdoctoral scholar in the Department of Computing and Mathematical Sciences at the California Institute of Technology from 2019 to 2021. He is the recipient of the NSF CAREER Award, the IEEE Transactions on Smart Grid Top Papers Award, the Caltech Simoudis Discovery Award, the PIMCO Fellowship, the Amazon AI4Science Fellowship, and the IEEE SmartGridComm Best Student Paper Award. His research interests lie in control, optimization, and machine/reinforcement learning with applications to power systems, multi-agent systems, the Internet of Things, smart cities, etc.
Speaker: Nicolas Gast (Inria)
Website: https://polaris.imag.fr/nicolas.gast
Time: 3:30 pm - 4:00 pm
Abstract: Restless bandits and weakly coupled MDPs are classical tools for modeling resource allocation problems (for instance, in scheduling or queueing networks). These problems are computationally hard to solve, and a number of heuristics have been proposed in the literature, based on linear relaxation and mean field control. While some of them date from the 1990s (the famous Whittle index), there has been a surge of recent development in this area. This talk will discuss these recent developments: the what, the how, and the why.
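As a reminder of the flavor of these heuristics: an index policy computes a priority index for each arm from its own (relaxed) subproblem and activates the highest-priority arms within the budget. The sketch below uses a made-up, precomputed index table purely for illustration; computing genuine Whittle indices requires solving the per-arm relaxed problems and checking indexability.

    import numpy as np

    rng = np.random.default_rng(7)
    n_arms, n_states, budget = 10, 4, 3
    index_table = rng.random((n_arms, n_states))   # hypothetical per-(arm, state) priority indices
    states = rng.integers(n_states, size=n_arms)   # current state of each arm

    priorities = index_table[np.arange(n_arms), states]
    active = np.argsort(priorities)[-budget:]      # activate the `budget` arms with the highest indices
    print("activated arms:", sorted(active.tolist()))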
Bio: Nicolas Gast is a research associate at Inria, Grenoble. His research interests focus on stochastic models and stochastic optimization, with a particular emphasis on mean field approximation methods. He is an expert on the use of Stein's method to compute the approximation error of mean field or fluid approximations. The applications of his work span stochastic networks, stochastic optimization (Markov decision processes, online learning, reinforcement learning), and the design of control algorithms for large-scale systems (such as communication networks and electricity networks).
Speaker: Ziv Scully (Cornell University)
Website: https://ziv.codes
Time: 4:00 pm - 4:30 pm
Abstract: Queueing theory is rife with intractable scheduling and dispatching problems. Even seemingly simple problems remain open, especially when they involve multiple servers (e.g. scheduling in the M/G/k) or metrics beyond simple means (e.g. minimizing tail latency). But the last several years have seen significant progress in our understanding of these intractable problems, including solutions for certain asymptotic regimes.
This talk will give an overview of this recent progress. We will discuss what problems have been recently solved, what new tools underlie those recent solutions, what open problems are potentially solvable with the same new tools, and what open problems remain (subjectively!) out of reach without new ideas. Problems discussed will span:
- both scheduling and dispatching,
- both single-server and multiserver systems,
- both known and unknown service times, and
- both mean and tail metrics.
Bio: Ziv Scully is an assistant professor at Cornell ORIE (Operations Research and Information Engineering). Prior to joining Cornell, he completed his PhD in Computer Science at CMU in 2022, advised by Mor Harchol-Balter and Guy Blelloch, then spent a postdoctoral year split between the UC Berkeley Simons Institute, Harvard SEAS, and MIT CSAIL. Ziv researches the theory of decision making under uncertainty and stochastic control, with a particular emphasis on scheduling and load balancing in queueing systems. His work has been recognized by awards from ACM SIGMETRICS (e.g. 2024 Best Paper), IFIP PERFORMANCE (e.g. 2023 Best Paper), and INFORMS (e.g. 2022 Nicholson Competition, first place).