Stochastic control and reinforcement learning have seen rapid growth in recent years, driven by advances in dynamic programming, approximate inference, deep learning, and high-performance computing. These developments have enabled increasingly effective approaches to sequential decision-making under uncertainty, with impactful applications in robotics and autonomous systems, networking and resource allocation, finance, and healthcare. Despite this progress, several core challenges remain open. These include designing algorithms that are robust to noise, distribution shift, and adversarial effects; improving sample efficiency and computational scalability; establishing stronger theoretical guarantees for modern methods; and translating theoretical insights into reliable, deployable systems.
To help address these challenges and catalyze cross-disciplinary collaboration, this workshop, held in conjunction with ACM SIGMETRICS 2026, will bring together researchers and practitioners from stochastic control, reinforcement learning, optimization, operations research, and adjacent areas. The program will feature invited talks by leading speakers and structured opportunities for discussion and community engagement. In particular, we will host panel discussions at the end of each half-day session to facilitate exchange of perspectives on recent breakthroughs, open problems, and promising directions for future research, and to encourage interaction between the speakers and the audience.
Organizers: Gauri Joshi (Carnegie Mellon University), Mehrdad Moharrami (University of Iowa), Srinivas Shakkottai (Texas A&M University)
Date: Friday, June 12, 2026
Location: Ann Arbor, Michigan
Abstract: Reward inference (learning a reward model from human preferences) is a critical intermediate step in Preference-based Reinforcement Learning (PbRL), such as Reinforcement Learning from Human Feedback (RLHF) for fine-tuning Large Language Models (LLMs). In practice, reward inference faces fundamental challenges such as distribution shift, reward model overfitting, and problem misspecification. An alternative approach is direct policy optimization without reward inference, such as Direct Preference Optimization (DPO), which offers a much simpler pipeline but only works in the bandit setting or for deterministic MDPs. This talk introduces new algorithms for stochastic MDPs and general preference models (link functions). The key idea is a sign-based policy perturbation approach based on zeroth-order optimization. We will discuss its applications to unknown link functions and federated RLHF.
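To make the core idea concrete, below is a minimal sketch of a sign-based, zeroth-order preference update, assuming a hypothetical two-point comparison oracle compare (standing in for human feedback); the function name, perturbation scale delta, and step size lr are illustrative choices, not the algorithm presented in the talk.

    import numpy as np

    def zeroth_order_preference_step(theta, compare, delta=0.05, lr=0.1):
        # One sign-based zeroth-order update from a single preference query.
        # theta is a NumPy parameter vector; compare(theta_plus, theta_minus)
        # is a hypothetical oracle returning +1 if behavior under theta_plus
        # is preferred, and -1 otherwise.
        s = np.sign(np.random.randn(theta.size))  # random +/-1 perturbation
        winner = compare(theta + delta * s, theta - delta * s)
        # Step toward the preferred perturbation; no reward model is fit.
        return theta + lr * winner * s

The appeal of such a scheme is that preference feedback is consumed directly as a descent signal, sidestepping reward inference entirely.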
Bio: Lei Ying is a Professor in the Electrical Engineering and Computer Science Department at the University of Michigan, Ann Arbor. He is an IEEE Fellow and an Editor-at-Large for the IEEE/ACM Transactions on Networking. His research focuses on the interplay between complex stochastic systems and big data, including reinforcement learning, large-scale communication/computing systems for big-data processing, private data marketplaces, and large-scale graph mining.
Abstract: As LLM reasoning performance plateaus, improving inference-time compute efficiency is crucial to mitigate overthinking, i.e., long reasoning traces even for simple queries. Prior approaches, including length regularization, adaptive routing, and difficulty-based budget allocation, primarily focus on single-turn settings and fail to address the sequential dependencies inherent in multi-turn reasoning. In this work, we formulate multi-turn reasoning as a sequential compute allocation problem and model it as a multi-objective Markov Decision Process. We propose TAB: Turn-Adaptive Budgets, a budget allocation policy trained via Group Relative Policy Optimization (GRPO) that learns to maximize task accuracy while respecting global per-problem token constraints. TAB takes the conversation history as input and learns to adaptively allocate smaller budgets to easier turns while reserving an appropriate number of tokens for the crucial harder reasoning steps. Our experiments on mathematical reasoning benchmarks demonstrate that TAB achieves a superior accuracy-token tradeoff, saving up to 35% of tokens while maintaining accuracy relative to static and off-the-shelf LLM budgeting baselines. Further, for systems where a plan of all sub-questions is available a priori, we propose TAB All-SubQ, a budget allocation policy that budgets tokens based on the conversation history together with all past and future sub-questions, saving up to 40% of tokens over baselines.
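For intuition, here is a minimal sketch of the group-relative training signal used by GRPO, together with a hypothetical reward combining task accuracy with a token-overflow penalty; budget_reward is an illustrative assumption, not TAB's exact objective.

    import numpy as np

    def grpo_advantages(rewards):
        # Group-relative advantages: standardize rewards across a group of
        # rollouts sampled for the same problem (the core GRPO signal).
        r = np.asarray(rewards, dtype=float)
        return (r - r.mean()) / (r.std() + 1e-8)

    def budget_reward(correct, tokens_used, budget, penalty=1.0):
        # Hypothetical shaping: task accuracy minus a penalty for exceeding
        # the global per-problem token budget.
        overflow = max(0.0, tokens_used - budget)
        return float(correct) - penalty * overflow / budget

A budget-allocation policy would then be updated with a clipped policy-gradient loss weighted by these advantages.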
Bio: Gauri Joshi is an associate professor in the ECE department at Carnegie Mellon University. Gauri completed her Ph.D. in EECS at MIT and received her B.Tech and M.Tech from the Indian Institute of Technology (IIT) Bombay. Her awards include the MIT Technology Review 35 Under 35 Award, the ONR Young Investigator Award, the NSF CAREER Award, Best Paper Awards at MobiHoc 2022 and SIGMETRICS 2020, and the Institute Gold Medal of IIT Bombay (2010).
Abstract: Policy gradient (PG) and natural policy gradient (NPG) methods are central tools in modern reinforcement learning, leading to extensive study of their convergence guarantees. While substantial progress has been made, existing analyses either do not provide sharp guarantees (e.g., distribution-free global geometric convergence), or obtain such guarantees only by introducing additional regularization or relying on adaptive stepsizes. A key challenge underlying these limitations is that, in contrast to classical dynamic programming methods such as value iteration and policy iteration, gradient-based approaches are often viewed as departing from Bellman-operator-based algorithm design and analysis. In this talk, we present a new perspective showing that NPG belongs to a unified algorithmic framework, which we term doubly smoothed policy iteration. This framework not only includes NPG, but also captures policy iteration, dual-averaged policy iteration, and general policy dual averaging as special cases. This viewpoint allows us to leverage the contraction and monotonicity properties of the Bellman operator, leading to strong convergence guarantees for a broad class of algorithms through a simpler analysis. Importantly, we believe that viewing gradient-based methods as variants of policy iteration helps clarify their relationship with classical dynamic programming, and may potentially lead to new algorithm design and analysis in other settings, such as average-reward and stochastic shortest path MDPs, as well as stochastic games.
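For orientation, recall the softmax natural policy gradient update, which already reads as a smoothed policy iteration step (the doubly smoothed framework presented in the talk generalizes this form):

    \[
    \pi_{k+1}(a \mid s) \;=\; \frac{\pi_k(a \mid s)\, \exp\big(\eta\, Q^{\pi_k}(s,a)\big)}{\sum_{a'} \pi_k(a' \mid s)\, \exp\big(\eta\, Q^{\pi_k}(s,a')\big)}.
    \]

As the stepsize $\eta \to \infty$, the update collapses to the greedy improvement step of policy iteration, which is one way to see why Bellman-operator contraction and monotonicity arguments remain applicable.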
Bio: Zaiwei Chen is an Assistant Professor in the School of Industrial Engineering at Purdue University and a Purdue Solberg Academic Excellence Scholar. He received his Ph.D. from the Georgia Institute of Technology and was a postdoctoral scholar at the California Institute of Technology. His research interests lie in addressing real-world sequential decision-making challenges by developing a foundational understanding of how to design learning-based algorithms with provable performance guarantees.
Abstract: We study the problem of average-reward Multi-Agent Reinforcement Learning (MARL) where agents interact within a network. Each agent in the network conducts local updates using information from its $k$-hop neighborhood and collaboratively maximizes the overall reward of the entire network. We first provide impossibility results for achieving convergence to the globally optimal joint policies in our setting. These impossibility results highlight the challenges of decentralized learning with local information in multi-agent systems, namely the issues of \emph{multiagency}, where each agent independently optimizes its policy, and \emph{partial observability}, where agents have limited visibility into the global state. Given these challenges, which indicate that decentralized policy optimization can generally be certified only up to stationarity, we provide finite-sample convergence guarantees for a decentralized actor-critic algorithm with linear function approximation that converges to a locally optimal solution with small gradient norms. To overcome these impossibility results, we further show that adding fixed-state entropy regularization reshapes the objective landscape, which leads to even stronger global convergence guarantees for our proposed actor-critic algorithm.
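As a schematic illustration of the final point (the talk's fixed-state variant refines this generic form), entropy regularization replaces the average-reward objective with

    \[
    J_\tau(\pi) \;=\; \liminf_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\pi\!\left[ \sum_{t=1}^{T} \Big( r(s_t, a_t) + \tau\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right],
    \]

where $\mathcal{H}$ denotes Shannon entropy and $\tau > 0$ controls the regularization strength; the added curvature reshapes the landscape so that guarantees can go beyond mere stationarity.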
Bio: Yashaswini Murthy is a postdoctoral scholar in Computing and Mathematical Sciences at Caltech. She will join the University of Texas at Austin as an Assistant Professor in Operations Research and Industrial Engineering in Fall 2026. She received her Ph.D. in Electrical and Computer Engineering from the University of Illinois Urbana-Champaign. Prior to that, she earned her Bachelor’s and Master’s degrees in Mechanical Engineering from the Indian Institute of Technology Bombay. Her research focuses on long-term decision-making, with an emphasis on average-reward formulations and distributional robustness. Her work has been recognized with several honors, including the Mavis Future Faculty Fellowship and the Joan and Lalit Bahl Fellowship from UIUC, as well as recognition as a Rising Star in EECS and in ISyE–MS&E–IOE.
Abstract: Decision-making artificial intelligence (AI) has revolutionized human life, in domains ranging from healthcare and daily life to scientific discovery. However, current AI systems often lack reliability and are highly vulnerable to small changes in complex, interactive, and dynamic environments. My research focuses on achieving both reliability and learning efficiency simultaneously when building AI solutions. These two goals seem conflicting, as enhancing robustness against variability often leads to more complex problems that require more data and computational resources, at the cost of learning efficiency. But does it have to?
In this talk, I give an overview of my work on building reliable decision-making AI without sacrificing learning efficiency, offering insights into effective optimization problem design for reliable AI. To begin, I will focus on reinforcement learning (RL), a key framework for sequential decision-making, and demonstrate how distributional robustness can be achieved provably without paying a statistical premium (additional training data cost) compared to non-robust counterparts. Next, shifting to decision-making in strategic multi-agent systems, I will demonstrate that incorporating realistic risk preferences, a key feature of human decision-making, enables computational tractability, a benefit not present in traditional models. Finally, I will present some interesting future directions for building reliable, learning-efficient AI solutions for human-centered applications through agentic and multi-agent AI systems.
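For concreteness, the distributionally robust MDP objective underlying the first part of the talk optimizes worst-case value over an uncertainty set of transition kernels; the discounted form and the choice of $\mathcal{U}^{\sigma}(P^{0})$ shown here (e.g., a KL or total-variation ball of radius $\sigma$ around a nominal kernel $P^{0}$) are standard choices rather than the talk's specific setup:

    \[
    V^{\pi}_{\mathrm{rob}}(s) \;=\; \inf_{P \in \,\mathcal{U}^{\sigma}(P^{0})} \mathbb{E}^{\pi, P}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s \right].
    \]

Achieving robustness "without paying a statistical premium" then means that learning a near-optimal robust policy requires no more data, in order, than its non-robust counterpart.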
Bio: Laixi Shi is an Assistant Professor of Electrical and Computer Engineering at Johns Hopkins University and is affiliated with the Data Science and Artificial Intelligence Institute. She received her Ph.D. in Electrical and Computer Engineering from Carnegie Mellon University (CMU) in August 2023; her dissertation received the CMU ECE A.G. Milnes Award. Prior to joining Johns Hopkins, she was a postdoctoral fellow in the Department of Computing and Mathematical Sciences at the California Institute of Technology (Caltech). She holds a B.S. in Electronic Engineering from Tsinghua University (2014–2018) and has conducted research with the Google Research Brain Team and Mitsubishi Electric Research Laboratories. Her research focuses on robust and data-efficient decision making at the intersection of data science, optimization, and statistics, with an emphasis on reinforcement learning and inverse problems. Her work spans both theoretical foundations and practical applications. She has been honored with five Rising Star awards across electrical engineering, computer science, machine learning, signal processing, and computational data science. She serves as a Program Committee member for the Learning for Dynamics & Control Conference (L4DC) and as an Area Chair for the Conference on Parsimony and Learning (CPAL).
Website: https://seanrsinclair.github.io/
Abstract: We propose an epoch-based reinforcement learning algorithm for infinite-horizon average-cost Markov decision processes (MDPs) that leverages a partial order over a policy class. In this structure, one policy is “informative” for another if data collected under the former policy can be used to counterfactually estimate the performance of the latter policy. Leveraging this partial order, we obtain a regret guarantee that scales with the width of the partial order. Notably, the bound is independent of the state and action space sizes. We illustrate the applicability of this framework across several operations research domains, including inventory control and queueing systems.
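In symbols, writing $\pi \preceq \pi'$ when $\pi'$ is informative for $\pi$ (trajectories collected under $\pi'$ suffice to counterfactually estimate the average cost of $\pi$), the complexity measure entering the regret bound is the width of the partial order in the standard order-theoretic sense:

    \[
    w(\Pi, \preceq) \;=\; \max\big\{ |C| \,:\, C \subseteq \Pi \text{ is an antichain, i.e., no two policies in } C \text{ are comparable} \big\}.
    \]

It is this width, rather than the sizes of the state and action spaces, that governs the regret.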
Bio: Sean Sinclair is an Assistant Professor in the Department of Industrial Engineering and Management Sciences at Northwestern University. He received his Ph.D. from Cornell University, where he was co-advised by Christina Lee Yu and Siddhartha Banerjee, and was previously a postdoctoral associate at the Massachusetts Institute of Technology, working with Devavrat Shah and Ali Jadbabaie. His research focuses on data-driven sequential decision making, bridging reinforcement learning and operations management. Before graduate school, he served in the Peace Corps, teaching in rural Ghana.
Abstract: We derive instance-dependent tail bounds for the regret of optimism-based reinforcement learning in finite-horizon tabular Markov decision processes with unknown transition dynamics. We first study a UCBVI-type (model-based) algorithm and characterize the tail distribution of the cumulative regret over all episodes via explicit bounds on the probability of the regret exceeding a specific value, going beyond analyses limited to expected regret or a single high-probability quantile. We analyze two natural exploration-bonus schedules for UCBVI: (i) a horizon-dependent schedule that explicitly incorporates the total number of episodes, and (ii) an anytime schedule that depends only on the current episode index. We then complement the model-based results with an analysis of optimistic Q-learning (model-free) under a schedule dependent on the total number of episodes.
Across both the model-based and model-free settings, we obtain upper bounds on the probability of the regret exceeding a given value, revealing a distinctive two-regime structure: a sub-Gaussian tail starting from an instance-dependent scale up to a transition threshold, followed by a sub-Weibull tail beyond that point. We further derive corresponding instance-dependent bounds on the expected cumulative regret. The proposed algorithms depend on a tuning parameter, $\alpha$, which balances the expected regret against the range over which the regret exhibits sub-Gaussian decay.
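Schematically, and suppressing the instance-dependent constants that the talk makes explicit, a two-regime tail bound of this type has the shape

    \[
    \mathbb{P}\big(\mathrm{Regret} \ge x\big) \;\lesssim\;
    \begin{cases}
      \exp\!\big(-x^{2}/c_{1}\big), & x_{\min} \le x \le x_{0}, \\
      \exp\!\big(-(x/c_{2})^{\beta}\big), & x > x_{0},
    \end{cases}
    \]

with sub-Gaussian decay between an instance-dependent scale $x_{\min}$ and a transition threshold $x_{0}$, and sub-Weibull decay with exponent $\beta$ beyond it; here $c_{1}, c_{2}, x_{\min}, x_{0}, \beta$ are schematic placeholders, not the paper's actual quantities.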
Bio: Mehrdad Moharrami is an Assistant Professor in Computer Science at the University of Iowa. Previously, he was a TRIPODS Postdoctoral Research Fellow at the University of Illinois Urbana-Champaign. He received B.Sc. degrees in Mathematics and Electrical Engineering from Sharif University of Technology, Iran. He holds an M.Sc. in Electrical Engineering and an M.Sc. in Mathematics from the University of Michigan. He received his Ph.D. in Electrical Engineering in Winter 2020, for which he was awarded the Rackham Predoctoral Fellowship for an outstanding dissertation. His research interests include Markov decision processes, reinforcement learning, and random graph models for economics, learning, and computation.