Stochastic control and reinforcement learning have seen rapid growth in recent years, driven by advances in dynamic programming, approximate inference, deep learning, and high-performance computing. These developments have enabled increasingly effective approaches to sequential decision-making under uncertainty, with impactful applications in robotics and autonomous systems, networking and resource allocation, finance, and healthcare. Despite this progress, several core challenges remain open. These include designing algorithms that are robust to noise, distribution shift, and adversarial effects; improving sample efficiency and computational scalability; establishing stronger theoretical guarantees for modern methods; and translating theoretical insights into reliable, deployable systems.
To help address these challenges and catalyze cross-disciplinary collaboration, this workshop, held in conjunction with ACM SIGMETRICS 2026, will bring together researchers and practitioners from stochastic control, reinforcement learning, optimization, operations research, and adjacent areas. The program will feature invited talks by leading speakers and structured opportunities for discussion and community engagement. In particular, we will host panel discussions at the end of each half-day session to facilitate exchange of perspectives on recent breakthroughs, open problems, and promising directions for future research, and to encourage interaction between the speakers and the audience.
Organizers: Gauri Joshi (Carnegie Mellon University), Mehrdad Moharrami (University of Iowa), Srinivas Shakkottai (University of Texas A&M),
Date: Friday, June 12th
Location: Wolverine Room, 3rd Floor, Michigan Union, 530 S State St, Ann Arbor, MI 48109
Schedule at a Glance:
Time: 9:00 am - 9:30 am
Abstract: We study the problem of average reward Multi-Agent Reinforcement Learning (MARL) where agents interact within a network. Each agent in the network conducts local updates with information within its $k$-hop neighborhood and collaboratively maximizes the overall reward of the entire network. We first provide impossibility results in achieving convergence to the global optimal joint policies in our setting. These impossibility results highlight the challenges of decentralized learning with local information in multi-agent systems, namely the issues of \emph{multiagency}, where each agent independently optimizes its policy, and \emph{partial observability}, where agents have limited visibility into the global state. Given these challenges which indicate that decentralized policy optimization can generally be certified only up to stationarity, we provide finite-sample convergence guarantees for a decentralized actor-critic algorithm with linear function approximation that converges to a locally optimal solution with small gradient norms. To overcome such impossibility results, we further show that adding fixed-state entropy regularization reshapes the objective landscape which leads to even stronger global convergence guarantees for our proposed actor-critic algorithm.
Biography: Yashaswini Murthy is a postdoctoral scholar in Computing and Mathematical Sciences at Caltech. She will join the University of Texas at Austin as an Assistant Professor in Operations Research and Industrial Engineering in Fall 2026. She received her Ph.D. in Electrical and Computer Engineering from the University of Illinois Urbana-Champaign. Prior to that, she earned her Bachelor’s and Master’s degrees in Mechanical Engineering from the Indian Institute of Technology Bombay. Her research focuses on long-term decision-making, with an emphasis on average-reward formulations and distributional robustness. Her work has been recognized with several honors, including the Mavis Future Faculty Fellowship and the Joan and Lalit Bahl Fellowship from UIUC, as well as recognition as a Rising Star in EECS and in ISyE–MS&E–IOE.
Time: 9:30 am - 10:00 am
Abstract: Policy gradient (PG) and natural policy gradient (NPG) methods are central tools in modern reinforcement learning, leading to extensive study of their convergence guarantees. While substantial progress has been made, existing analyses either do not provide sharp guarantees (e.g., distribution-free global geometric convergence), or obtain such guarantees only by introducing additional regularization or relying on adaptive stepsizes. A key challenge underlying these limitations is that, in contrast to classical dynamic programming methods such as value iteration and policy iteration, gradient-based approaches are often viewed as departing from Bellman-operator-based algorithm design and analysis. In this talk, we present a new perspective showing that NPG belongs to a unified algorithmic framework, which we term doubly smoothed policy iteration. This framework not only includes NPG, but also captures policy iteration, dual-averaged policy iteration, and general policy dual averaging as special cases. This viewpoint allows us to leverage the contraction and monotonicity properties of the Bellman operator, leading to strong convergence guarantees for a broad class of algorithms through a simpler analysis. Importantly, we believe that viewing gradient-based methods as variants of policy iteration helps clarify their relationship with classical dynamic programming, and may potentially lead to new algorithm design and analysis in other settings, such as average-reward and stochastic shortest path MDPs, as well as stochastic games.
Bio: Zaiwei Chen is an Assistant Professor in the School of Industrial Engineering at Purdue University and a Purdue Solberg Academic Excellence Scholar. He received his Ph.D. from the Georgia Institute of Technology and was a postdoctoral scholar at the California Institute of Technology. His interest is in addressing real-world sequential decision-making challenges by developing a foundational understanding of how to design learning-based algorithms with provable performance guarantees.
Abstract: Reward inference (learning a reward model from human preferences) is a critical intermediate step in Preference-based Reinforcement Learning (PbRL), such as Reinforcement Learning from Human Feedback (RLHF) for fine-tuning Large Language Models (LLMs). In practice, reward inference faces fundamental challenges such as distribution shift, reward model overfitting, and problem misspecification. An alternative approach is direct policy optimization without reward inference, such as Direct Preference Optimization (DPO), which offers a much simpler pipeline but only works in the bandit setting or for deterministic MDPs. This talk introduces new algorithms for stochastic MDPs and general preference models (link functions). The key idea is a sign-based policy perturbation approach based on zeroth-order optimization. We will discuss its applications to unknown link functions and federated RLHF.
Bio: Lei Ying is a Professor in the Electrical Engineering and Computer Science Department at the University of Michigan, Ann Arbor. He is an IEEE Fellow and an Editor-at-Large for the IEEE/ACM Transactions on Networking. His research focuses on the interplay between complex stochastic systems and big data, including reinforcement learning, large-scale communication/computing systems for big-data processing, private data marketplaces, and large-scale graph mining.
Abstract: Reinforcement learning is the method of choice to train models in sampling-based setups with binary outcome feedback, such as navigation, code generation, and mathematical problem solving. In such settings, models implicitly induce a likelihood over correct rollouts. However, we observe that reinforcement learning does not maximize this likelihood, and instead optimizes only a lower-order approximation. Inspired by this observation, we introduce Maximum Likelihood Reinforcement Learning (MaxRL), a sampling-based framework to approximate maximum likelihood using reinforcement learning techniques. MaxRL addresses the challenges of non-differentiable sampling by defining a compute-indexed family of sample-based objectives that interpolate between standard reinforcement learning and exact maximum likelihood as additional sampling compute is allocated. The resulting objectives admit a simple, unbiased policy-gradient estimator and converge to maximum likelihood optimization in the infinite-compute limit. Empirically, we show that MaxRL Pareto-dominates existing methods in all models and tasks we tested, achieving up to 20x test-time scaling efficiency gains compared to its GRPO-trained counterpart. We also observe MaxRL to scale better with additional data and compute. Our results suggest MaxRL is a promising framework for scaling RL training in correctness based settings.
Bio: Andrea Zanette is an Assistant Professor in the ECE department at Carnegie Mellon University, with a courtesy appointment in the Machine Learning Department. His research broadly focuses on foundation models, spanning both theory and practice, with interests including reasoning, alignment, efficiency, and optimization. Prior to joining CMU in Fall 2024, he was a postdoctoral scholar at UC Berkeley, where he collaborated with Martin Wainwright, Peter Bartlett, and Sergey Levine. He earned his PhD from Stanford University under the supervision of Emma Brunskill and Mykel J. Kochenderfer, establishing several fundamental results in reinforcement learning theory. He also collaborated closely with researchers at Microsoft Research and Meta AI.
Time: 11:30 am - 12:00 pm
Abstract: We derive instance-dependent tail bounds for the regret of optimism-based reinforcement learning in finite-horizon tabular Markov decision processes with unknown transition dynamics. We first study a UCBVI-type (model-based) algorithm and characterize the tail distribution of the cumulative regret over all episodes via explicit bounds on the probability of the regret exceeding a specific value, going beyond analyses limited to expected regret or a single high-probability quantile. We analyze two natural exploration-bonus schedules for UCBVI: (i) a schedule dependent on the total number of episodes that explicitly incorporates this total, and (ii) an anytime schedule (independent of the total) that depends only on the current episode index. We then complement the model-based results with an analysis of optimistic Q-learning (model-free) under a schedule dependent on the total number of episodes.
Across both the model-based and model-free settings, we obtain upper bounds on the probability of the regret exceeding a given value, revealing a distinctive two-regime structure: a sub-Gaussian tail starting from an instance-dependent scale up to a transition threshold, followed by a sub-Weibull tail beyond that point. We further derive corresponding instance-dependent bounds on the expected cumulative regret. The proposed algorithms depend on a tuning parameter, alpha, which balances the expected regret and the range over which the regret exhibits sub-Gaussian decay.
Bio: Mehrdad Moharrami is an Assistant Professor in Computer Science at the University of Iowa. Previously, he was a TRIPODS Postdoctoral Research Fellow at the University of Illinois at Urbana Champaign. He received his B.Sc. from the Sharif University of Technology, Iran, in Mathematics, as well as Electrical Engineering. He holds an M.Sc. in Electrical Engineering and an M.Sc. in Mathematics from the University of Michigan. He received his Ph.D. in Electrical Engineering in Winter 2020, for which he was awarded the Rackham Predoctoral Fellowship for an outstanding dissertation. His research interests are Markov decision processes, reinforcement learning, and random graph models for economics, learning, and computation.
Website: https://seanrsinclair.github.io/
Time: 13:30 pm - 14:00 pm
Abstract: We propose an epoch-based reinforcement learning algorithm for infinite-horizon average-cost Markov decision processes (MDPs) that leverages a partial order over a policy class. In this structure, one policy is “informative” for another if data collected under the former policy can be used to counterfactually estimate the performance of the latter policy. Leveraging this partial order, we obtain a regret guarantee that scales with the width of the partial order. Notably, the bound is independent of the state and action space sizes. We illustrate the applicability of this framework across several operations research domains, including inventory control and queueing systems.
Bio: Sean Sinclair is an Assistant Professor in the Department of Industrial Engineering and Management Sciences at Northwestern University. He received his PhD from Cornell University, where he was coadvised by Christina Lee Yu and Siddhartha Banerjee, and was previously a postdoctoral associate at the Massachusetts Institute of Technology, working with Devavrat Shah and Ali Jadbabaie. His research focuses on data-driven sequential decision making, bridging reinforcement learning and operations management. Before graduate school, he served in the Peace Corps teaching in rural Ghana.
Website: https://laixishi.github.io/
Time: 14:00 pm - 14:30 pm
Abstract: Decision-making artificial intelligence (AI) has revolutionized human life ranging from healthcare, daily life, to scientific discovery. However, current AI systems often lack reliability and are highly vulnerable to small changes in complex, interactive, and dynamic environments. My research focuses on achieving both reliability and learning efficiency simultaneously when building AI solutions. These two goals seem conflicting, as enhancing robustness against variability often leads to more complex problems that requires more data and computational resources, at the cost of learning efficiency. But does it have to?
In this talk, I overview my work on building reliable decision-making AI without sacrificing learning efficiency, offering insights into effective optimization problem design for reliable AI. To begin, I will focus on reinforcement learning (RL) — a key framework for sequential decision-making, and demonstrate how distributional robustness can be achieved provably without paying statistical premium (additional training data cost) compared to non-robust counterparts. Next, shifting to decision-making in strategic multi-agent systems, I will demonstrate that incorporating realistic risk preferences—a key feature of human decision-making—enables computational tractability, a benefit not present in traditional models. Finally, I will present some interesting future directions for building reliable, learning-efficient AI solutions for human-centered applications, though agentic and multi-agentic AI systems.
Bio: Laixi Shi is an Assistant Professor of Electrical and Computer Engineering at Johns Hopkins University and is affiliated with the Data Science and Artificial Intelligence Institute. She received her Ph.D. in Electrical and Computer Engineering from Carnegie Mellon University (CMU) in August 2023; her dissertation received the CMU ECE A.G. Milnes Award. Prior to joining Johns Hopkins, she was a postdoctoral fellow in the Department of Computing and Mathematical Sciences at the California Institute of Technology (Caltech). She holds a B.S. in Electronic Engineering from Tsinghua University (2014–2018) and has conducted research with the Google Research Brain Team and Mitsubishi Electric Research Laboratories. Her research focuses on robust and data-efficient decision making at the intersection of data science, optimization, and statistics, with an emphasis on reinforcement learning and inverse problems. Her work spans both theoretical foundations and practical applications. She has been honored with five Rising Star awards across electrical engineering, computer science, machine learning, signal processing, and computational data science. She serves as a Program Committee Member for Learning for Dynamics & Control Conference (L4DC) and area chair for Conference on Parsimony and Learning (CPAL).
Time: 14:30 pm - 15:00 pm
Abstract: Training LLMs as interactive agents for multi-turn decision-making remains challenging, particularly in long-horizon tasks with sparse and delayed rewards, where agents must execute extended sequences of actions before receiving meaningful feedback. Most existing reinforcement learning (RL) approaches model LLM agents as flat policies operating at a single time scale, selecting one action at each turn. In sparse-reward settings, such flat policies must propagate credit across the entire trajectory without explicit temporal abstraction, which often leads to unstable optimization and inefficient credit assignment. We propose HiPER, a novel Hierarchical Plan-Execute RL framework that explicitly separates high-level planning from low-level execution. HiPER factorizes the policy into a high-level planner that proposes subgoals and a low-level executor that carries them out over multiple action steps. To align optimization with this structure, we introduce a key technique called hierarchical advantage estimation (HAE), which carefully assigns credit at both the planning and execution levels. By aggregating returns over the execution of each subgoal and coordinating updates across the two levels, HAE provides an unbiased gradient estimator and provably reduces variance compared to flat generalized advantage estimation. Empirically, HiPER achieves state-of-the-art performance on challenging interactive benchmarks, reaching 97.4\% success on ALFWorld and 83.3\% on WebShop with Qwen2.5-7B-Instruct (+6.6\% and +8.3\% over the best prior method), with especially large gains on long-horizon tasks requiring multiple dependent subtasks. These results highlight the importance of explicit hierarchical decomposition for scalable RL training of multi-turn LLM agents.
Bio: Mingyi Hong is an Associate professor in the Department of Electrical and Computer Engineering at the University of Minnesota, Minneapolis. His research has been focused on developing optimization theory and algorithms for applications in signal processing, machine learning and foundation models. His work has received two IEEE Signal Processing Society Best Paper Awards (2021, 2022), an International Consortium of Chinese Mathematicians Best Paper Award (2020), among others. He is an Amazon Scholar, and he is the recipient of the 2022 Pierre-Simon Laplace Early Career Technical Achievement Award from IEEE, and the 2025 Egon Balas Prize from INFORMS Optimization Society. He is a Fellow of IEEE.
Time: 15:30 pm - 16:00 pm
Abstract: We study a multi-class single-server (M/M/1) queue with non-idling scheduling control and long-run average holding costs, and develop a formulation that connects queueing control to stochastic optimal transport (SOT) through the Basic Adjoint Relationship (BAR). We call this the BAR-SOT. Under a stationary control policy u, the controlled queue-length process is a continuous-time Markov chain whose invariant distribution (πᵘ) satisfies the BAR, i.e., a weak stationarity condition (∑ₓ πᵘ(x) Gᵘ f(x) = 0) for all test functions f. We show that this constraint is the discrete-event, stationary version of Toshio Mikami’s weak Fokker–Planck constraints in SOT, allowing the scheduling problem to be recast as a BAR-constrained SOT in which the Markov transition kernel acts as a self-coupling transport plan for the stationary measure (πᵘ). The resulting optimization is linear in πᵘ with an infinite family of weak constraints, and admits a dual characterization in terms of the average-cost Poisson's equation. This perspective yields a geometric interpretation of optimal scheduling as a gradient-flow-like selection rule driven by discrete potentials, recovers the classical cμ-rule under linear costs, and predicts state-dependent switching surfaces under nonlinear costs. This is joint work with Sharan Srinivasan and Berke Turkay.
Bio: Harsha Honnappa is an Associate Professor in the Edwardson School of Industrial Engineering at Purdue University, where he directs the Stochastic Systems Lab. He is an applied probabilist with interests in the modeling and analysis of stochastic systems, theoretical statistics, stochastic optimization and control. His research is supported by multiple grants from the National Science Foundation, including an NSF CAREER award, the Office of Naval Research, the Purdue Research Foundation, and through the Edwardson School of Industrial Engineering’s Frontiers awards. He is currently Area Editor for Stochastic Models for Operations Research Letters, and Associate Editor at Operations Research and Queueing Systems Journals.
Time: 16:00 pm - 16:30 pm
Abstract: As LLM reasoning performance plateau, improving inference-time compute efficiency is crucial to mitigate overthinking and long thinking traces even for simple queries. Prior approaches including length regularization, adaptive routing, and difficulty-based budget allocation, primarily focus on single-turn settings and fail to address the sequential dependencies inherent in multi-turn reasoning. In this work, we formulate multi-turn reasoning as a sequential compute allocation problem and model it as a multi-objective Markov Decision Process. We propose TAB: Turn-Adaptive Budgets, a budget allocation policy trained via Group Relative Policy Optimization (GRPO) that learns to maximize task accuracy while respecting global per-problem token constraints. Consequently, TAB takes as input the conversation history and learns to adaptively allocate smaller budgets to easier turns and save appropriate number of tokens for the crucial harder reasoning steps. Our experiments on mathematical reasoning benchmarks demonstrate that TAB achieves a superior accuracy-tokens tradeoff saving up to 35% tokens while maintaining accuracy over static and off-the-shelf LLM budget baselines. Further, for systems where a plan of all sub-questions is available apriori, we propose TAB All-SubQ, a budget allocation policy that budgets tokens based on the conversation history and all past and future sub-questions, saving up to 40% tokens over baselines.
Bio: Gauri Joshi is an associate professor in the ECE department at Carnegie Mellon University. Gauri completed her Ph.D. from MIT EECS, and received her B.Tech and M.Tech from the Indian Institute of Technology (IIT) Bombay. Her awards include the MIT Technology Review 35 under 35 Award, ONR Young Investigator and NSF CAREER Award, Best Paper awards at MobiHoc 2022 and SIGMETRICS 2020, and the Institute Gold Medal of IIT Bombay (2010).