Workshop on Reinforcement Learning and Multi-Agent Systems

In recent years, learning techniques such as reinforcement learning have found numerous applications in the optimal control of multi-agent systems, including queueing networks. However, despite the well-developed theory of learning in single-agent settings, the theoretical foundation of learning in multi-agent environments is still evolving. Further research is needed to establish a solid theoretical framework, building on ideas from different research communities, that can spur future breakthroughs. By bringing together individuals working on learning theory, decision-making, and real-world applications of multi-agent systems, the workshop aims to provide a venue and a collaborative environment that encourage the sharing of ideas and the exploration of innovative approaches, so that progress can be made on the real-world problems that motivate this research.

Held in conjunction with Performance 2023, the workshop will feature invited speakers on three topics: 1) reinforcement learning and other learning-based control methodologies, 2) decision-making in multi-agent systems, and 3) learning-based control of queueing networks. The objective is to bring together experts from the different communities working on these topics to present their latest findings, discuss problems at their intersection, and foster collaboration.

Organizers: Mehrdad Moharrami (University of Iowa), Vijay Subramanian (University of Michigan) 

Date: November 17th, 2023

Location: Room 307, Wieboldt Hall, on Northwestern University's downtown campus.

Schedule at a glance:

1- Qiaomin Xie: Faking a Nash Equilibrium in a Markov Game, 9:00 AM - 9:30 AM

2- Amy R. Ward: Learning the Scheduling Policy in a Many Server Queue with Abandonment,  9:30 AM - 10:00 AM

3- Vijay Kamble: Incentives for Exploration at Market Equilibrium, 10:00 AM - 10:30 AM

4- Baris Ata: Singular Control of RBM in an Orthant: A Computational Method Based on Neural Networks, 10:30 AM - 11:00 AM

5- Ermin Wei: Incentivized Federated Learning and Unlearning, 11:00 AM - 11:30 AM

6- Mehrdad Moharrami: Performance Bounds for Policy-Based Average Reward Reinforcement Learning Algorithms, 11:30 AM - 12:00 PM

Lunch Break

7- Yuan Zhong: Instability and stability of parameter agnostic policies in parallel server systems, 1:30 PM - 2:00 PM

8- Vijay Subramanian: Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space, 2:00 PM - 2:30 PM

9- Avrim Blum: Eliciting User Preferences for Personalized Multi-Objective Decision Making, 2:30 PM - 3:00 PM

10- Harsha Honnappa: Towards a Theory of Continuous-Time Reinforcement Learning in General Stochastic Environments, 3:00 PM - 3:30 PM

11- Chang-Han Rhee: Unbiased Gradient Estimation for Steady-State Performance Measures of Markov Chains, 3:30 PM - 4:00 PM

12- R. Srikant: On The Convergence Of Policy Iteration-Based Reinforcement Learning With Monte Carlo Policy Evaluation, 4:00 PM - 4:30 PM

Qiaomin Xie

Faking a Nash Equilibrium in a Markov Game

Co-Authors: Young Wu, Jeremy McMahan, Yiding Chen, Yudong Chen, and Xiaojin Zhu

Time: 9:00 AM - 9:30 AM

Abstract and Bio

Abstract: We study the game modification problem, in which either a benevolent game designer or a malevolent adversary modifies the payoff matrices of a two-player zero-sum Markov game so that a target strategy profile, with value in a target range, becomes the unique Nash equilibrium, while minimizing the modification cost. We provide a characterization of the set of pure or mixed strategy profiles that can be successfully installed as the unique Nash equilibrium, and we propose an efficient linear program with random perturbation that yields a feasible solution with probability one and achieves an objective close to that of the original problem. Our results can be applied to characterize offline data poisoning attacks in multi-agent reinforcement learning, where an attacker may change a data set in an attempt to install a (potentially fictitious) unique Nash equilibrium. We exhibit a linear program to efficiently compute the optimal poisoning attack.
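
To make the optimization concrete, here is a minimal Python sketch (using cvxpy, with made-up data) of a single-stage analogue of the modification problem: a zero-sum matrix game in which a pure target profile (i*, j*) must become a strict saddle point, and hence the unique Nash equilibrium, with value in a prescribed range, at minimum L1 modification cost. The strictness margin below is only a crude stand-in for the random perturbation used in the talk, and the actual formulation covers Markov games and mixed target profiles.

import cvxpy as cp
import numpy as np

def modify_matrix_game(R, i_star, j_star, v_lo, v_hi, margin=1e-3):
    # Cheapest (L1) modification of a zero-sum payoff matrix R (row player maximizes)
    # making the pure profile (i_star, j_star) a strict saddle point -- hence the
    # unique Nash equilibrium -- with value inside [v_lo, v_hi].
    n, m = R.shape
    R_new = cp.Variable((n, m))
    cost = cp.sum(cp.abs(R_new - R))
    cons = [R_new[i_star, j_star] >= v_lo, R_new[i_star, j_star] <= v_hi]
    for i in range(n):
        if i != i_star:  # every other row does strictly worse against column j_star
            cons.append(R_new[i_star, j_star] >= R_new[i, j_star] + margin)
    for j in range(m):
        if j != j_star:  # every other column pays the row player strictly more
            cons.append(R_new[i_star, j_star] <= R_new[i_star, j] - margin)
    problem = cp.Problem(cp.Minimize(cost), cons)
    problem.solve()
    return R_new.value, problem.value

# Toy usage: the original game has no pure saddle point; install (0, 0) with value in [0.5, 1.5].
R = np.array([[1.0, 3.0],
              [2.0, 0.0]])
R_modified, modification_cost = modify_matrix_game(R, i_star=0, j_star=0, v_lo=0.5, v_hi=1.5)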

Bio: Qiaomin Xie is an assistant professor in the Department of Industrial and Systems Engineering (ISyE) at the University of Wisconsin-Madison. She received her Ph.D. in ECE from the University of Illinois Urbana-Champaign in 2016. She was a postdoctoral researcher at MIT (2017-2018) and a visiting assistant professor at Cornell University (2019-2021). She is the recipient of a JP Morgan Faculty Research Award (2021), a Google Systems Research Award (2020), and the UIUC CSL PhD Thesis Award (2017). Her research interests include reinforcement learning, queueing theory, stochastic systems, and game theory, with applications to computer and communication networks.

Amy R. Ward

Learning the Scheduling Policy in a Many Server Queue with Abandonment

Co-Authors: Yueyang Zhong and John R. Birge

Time: 9:30 AM - 10:00 AM

Abstract and Bio

Abstract: The multiclass many server queue with abandonment (specifically, the multiclass GI/GI/N+GI queue) is a canonical model for service systems. One key operational question is how to schedule; that is, how to choose the customer that a newly available server will serve. The scheduling question is of fundamental importance because scheduling determines which customer classes have longer wait times (relative to their patience when waiting for service), and, therefore, more abandonments. However, even though there is much work on scheduling in queueing systems, there is comparatively less work on scheduling in queueing systems when the model primitives (that is, distributional and parameter information regarding the inter-arrival, service, and patience times) are unknown and may be learned.

Our objective is to determine a scheduling policy that minimizes regret, defined as the difference in expected abandonment cost between a proposed policy, which does not have knowledge of the model primitives, and a benchmark policy, which has full knowledge of them. The difficulty is that the state space is very complex: in order for the system to be Markovian, we must track (i) the time elapsed since the last arrival for each class; (ii) the amount of time each customer in service has been in service; and (iii) the amount of time each customer in queue has spent waiting. We propose a Learn-then-Schedule (LTS) policy that first learns and then schedules, and we analyze its performance against the benchmark aμ-rule (which prioritizes classes for service according to the product of their abandonment cost and service rate). The algorithm is composed of a learning phase, during which empirical estimates of the service rates are formed, and an exploitation phase, during which an empirical aμ-rule based on those estimates is applied. We show that the LTS policy has regret of order log T (where T is the system run-time), which is the smallest order achievable.
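
As a rough illustration of the Learn-then-Schedule idea, the Python sketch below (a simplification with made-up data, not the paper's algorithm verbatim) forms empirical service-rate estimates from the service times observed during the learning phase and then, in the exploitation phase, serves waiting classes in decreasing order of the empirical index a_k * mu_hat_k, where a_k is the class-k abandonment cost.

import numpy as np

def lts_priority_order(service_samples, abandonment_costs):
    # Exploitation-phase priority order for the empirical a*mu rule:
    # service_samples[k] holds class-k service times observed in the learning phase,
    # abandonment_costs[k] is the abandonment cost a_k; classes are ranked by a_k * mu_hat_k.
    mu_hat = np.array([1.0 / np.mean(times) for times in service_samples])
    index = np.asarray(abandonment_costs, dtype=float) * mu_hat
    return list(np.argsort(-index))  # largest empirical index first

def next_class_to_serve(queue_lengths, priority_order):
    # A newly free server picks the highest-priority class with someone waiting.
    for k in priority_order:
        if queue_lengths[k] > 0:
            return k
    return None  # idle if all queues are empty

# Toy usage with made-up learning-phase data for two classes.
rng = np.random.default_rng(0)
samples = [rng.exponential(scale=1 / 2.0, size=200),  # class 0: true service rate 2
           rng.exponential(scale=1 / 1.0, size=200)]  # class 1: true service rate 1
order = lts_priority_order(samples, abandonment_costs=[1.0, 3.0])
chosen = next_class_to_serve(queue_lengths=[5, 2], priority_order=order)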

Bio: Amy is a fellow of the INFORMS Manufacturing and Service Operations Management (M&SOM) Society (elected June, 2023).  Amy is the incoming Editor-in-Chief for the journal Operations Research (term begins 1/1/2024), and the outgoing Editor-in-Chief for the journal Operations Research Letters (term 4/1/2021-12/30/2023).  Prior to that, Amy held the position of Chair of the INFORMS Applied Probability Society (term 11/2016-11/2018). 

Vijay Kamble

Incentives for Exploration at Market Equilibrium

Co-Author: Eren Ozbay

Time: 10:00 AM - 10:30 AM  

Abstract and Bio

Abstract: Online marketplaces face an exploration problem: the qualities of new supply units are unknown and must be discovered through customer feedback so that higher quality supply gets prioritized for matching. However, customers are generally myopic and unwilling to participate in exploratory matches. This paper uncovers the role of congestion and the resulting market equilibrium prices in creating incentives for such exploratory behavior among myopic customers. The intuition is that since customers prefer established higher-quality supply units, these units naturally demand higher equilibrium prices due to congestion compared to newer units, effectively incentivizing customers to explore. This paper presents a comprehensive analysis of the extent to which this intuition holds and the extent to which exogenous incentives for exploration are necessary for such markets. 

To investigate this question, we define a novel equilibrium notion for such markets in which the public information about supply units evolves over time. Our main result shows that, under a market regularity condition, the ratio of the equilibrium matching value to the system-optimal value under centralized matching is bounded below by the aggregate congestion level, i.e., the ratio of aggregate demand to aggregate supply. In a nutshell, congested regular markets organically incentivize exploration. Further, for markets with a certain linear information transition structure, the market equilibrium exactly achieves the system-optimal value regardless of the congestion level. Our results inform market designers grappling with the concern of incentivizing exploration in various online marketplaces, shedding light on when interventions may be (un)necessary.
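
As a back-of-the-envelope reading of this bound in the talk's own terms: if aggregate demand equals 80% of aggregate supply, then under the regularity condition the equilibrium matching attains at least 80% of the centralized optimal value, without any exogenous exploration incentives.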

Bio: Vijay Kamble is an Assistant Professor in the Department of Information and Decision Sciences, with a courtesy appointment in the Department of Computer Science, at the University of Illinois Chicago. Before joining UIC in 2017, he was a postdoctoral scholar in the Society and Algorithms Lab at the Management Science and Engineering Department of Stanford University. He obtained his Ph.D. in Electrical Engineering and Computer Sciences from the University of California, Berkeley in 2015. His research is centered on the design and optimization of online platforms and marketplaces, in particular characterizing algorithmic and design tradeoffs at the interfaces between learning, incentives, and the concerns of fairness and ethics in such platforms.

Baris Ata

Singular Control of RBM in an Orthant: A Computational Method Based on Neural Networks

Co-Authors: Michael Harrison and Nian Si 

Time: 10:30 AM - 11:00 AM  

Abstract and Bio

Abstract: Motivated by applications in queueing theory, we consider a class of singular stochastic control problems whose state space is the d-dimensional positive orthant. The original problem is approximated by a drift control problem, to which we apply a recently developed computational method that is feasible for dimensions up to d=30 or more. To show that nearly optimal solutions are obtainable using this method, we present computational results for a variety of examples, including queueing network examples that have appeared previously in the literature.

Bio: Baris Ata is the Sigmund E. Edelstone Distinguished Service Professor of Operations Management at the University of Chicago Booth School of Business. He received a B.S. degree in Industrial Engineering from Bilkent University (Turkey) in 1997, and M.S. degrees in Engineering-Economic Systems and Operations Research (1999), Business Research (2000), Mathematics (2001), and Statistics (2002), as well as a Ph.D. in Operations, Information, and Technology (2003), from Stanford University.

His current research interests include operational aspects of organ transplantation, the criminal justice system, and global health, as well as the theoretical and computational study of stochastic networks and their applications. His research has been recognized by the Best Paper in Service Science Award, INFORMS (2009), the William Pierskalla Best Paper Award, INFORMS (2015), and the Wickham Skinner Best Paper Award, POMS (2019). He is also a recipient of the Manufacturing and Service Operations Management Young Scholar Prize, INFORMS (2015), and the school-wide Emory Williams MBA Teaching Award at Chicago Booth (2021).

Ermin Wei

Incentivized Federated Learning and Unlearning

Co-Authors: Ningning Ding and Randy Berry

Time: 11:00 AM - 11:30 AM  

Abstract and Bio

Abstract: To protect users' right to be forgotten in federated learning, federated unlearning aims to eliminate the impact of leaving users' data on the learned global model. Current research in federated unlearning has mainly concentrated on developing effective and efficient unlearning techniques. However, the issue of incentivizing valuable users to remain engaged and preventing their data from being unlearned is still under-explored, even though it matters greatly for the performance of the unlearned model. This work focuses on the incentive issue and develops an incentive mechanism for federated learning and unlearning. We first characterize the leaving users' impact on the global model accuracy and the number of communication rounds required for unlearning. Building on these results, we propose a four-stage game to capture the interaction and information updates during the learning and unlearning process. A key contribution is to summarize users' multi-dimensional private information into one-dimensional metrics that guide the incentive design.

We further investigate whether allowing federated unlearning is beneficial to the server and the users, compared with a scenario without unlearning. Interestingly, users usually obtain a larger total payoff in the scenario with higher costs, due to the server's excess incentives under information asymmetry. Numerical results demonstrate the necessity of unlearning incentives for retaining valuable leaving users, and also show that our proposed mechanisms decrease the server's cost by up to 53.91% compared to state-of-the-art benchmarks. This is joint work with Ningning Ding and Randy Berry.

Bio: Ermin Wei is an Associate Professor in the Electrical and Computer Engineering Department and the Industrial Engineering and Management Sciences Department of Northwestern University. She completed her PhD in Electrical Engineering and Computer Science at MIT in 2014, advised by Professor Asu Ozdaglar, and also obtained her M.S. there. She received an undergraduate triple degree in Computer Engineering, Finance, and Mathematics, with a minor in German, from the University of Maryland, College Park. Her team won 2nd place in the GO Competition Challenge 1, an electricity grid optimization competition organized by the Department of Energy. Wei's research interests include distributed optimization methods, convex optimization and analysis, smart grid, communication systems, energy networks, and market economic analysis.

Mehrdad Moharrami

Performance Bounds for Policy-Based Average Reward Reinforcement Learning Algorithms

Co-Authors: Yashaswini Murthy and R. Srikant

Time: 11:30 AM - 12:00 PM

Abstract and Bio

Abstract: Many policy-based reinforcement learning (RL) algorithms can be viewed as instantiations of approximate policy iteration (PI), i.e., schemes in which policy improvement and policy evaluation are both performed approximately. In applications where the average reward is the meaningful performance metric, discounted reward formulations are often used with a discount factor close to one, which is equivalent to making the expected horizon very large. However, the corresponding theoretical error bounds scale with the square of the horizon, so even after dividing the total reward by the length of the horizon, the resulting performance bounds for average reward problems diverge. An open problem, therefore, has been to obtain meaningful performance bounds for approximate PI and RL algorithms in the average-reward setting. In this work, we solve this open problem by obtaining the first finite-time error bounds for average-reward MDPs, and we show that the asymptotic error goes to zero in the limit as the policy evaluation and policy improvement errors go to zero.
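
For context, one standard form of the classical discounted approximate policy iteration bound (stated here up to constants, not as the talk's result) is limsup_k ||V^{pi_k} - V*||_infinity <= (delta + 2*gamma*epsilon) / (1 - gamma)^2, where epsilon and delta are the policy evaluation and policy improvement errors and gamma is the discount factor. Reading 1/(1 - gamma) as the effective horizon and normalizing by it still leaves a factor of 1/(1 - gamma), which diverges as gamma approaches one; this is precisely the obstruction that meaningful average-reward bounds must remove.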

Bio: Mehrdad Moharrami is an Assistant Professor in Computer Science at the University of Iowa. Previously, he was a TRIPODS Postdoctoral Research Fellow at the University of Illinois Urbana-Champaign. He received B.Sc. degrees in Mathematics and Electrical Engineering from the Sharif University of Technology, Iran, and holds an M.Sc. in Electrical Engineering and an M.Sc. in Mathematics from the University of Michigan. He received his Ph.D. in Electrical Engineering from the University of Michigan in Winter 2020, for which he was awarded the Rackham Predoctoral Fellowship for an outstanding dissertation. His research interests are Markov decision processes, reinforcement learning, and random graph models for economics, learning, and computation.

Yuan Zhong

Instability and stability of parameter agnostic policies in parallel server systems

Co-Author: Gorkem Unlu

Time: 1:30 PM - 2:00 PM  

Abstract and Bio

Abstract: We analyze the stability properties of the X-Model parallel server system under parameter agnostic policies. These policies are appealing in real-world scenarios as they solely rely on queue size information, eliminating the need for knowledge about system parameters. However, we show that such policies can result in instability. Our analysis focuses on parameter agnostic switching curve policies, wherein each server’s service decision is determined by a non-decreasing function of queue sizes. We demonstrate that even at relatively low system loads, these policies can lead to instability. We explore various classes of parameter agnostic policies and characterize the regions of instability whenever possible. 

Bio: Yuan Zhong is an associate professor of operations management at the University of Chicago Booth School of Business. His main research area is applied probability, focusing on the modeling and analysis of large-scale stochastic systems, with applications to cloud computing, manufacturing, and e-commerce logistics. Prior to Booth, Yuan was an assistant professor in the Department of Industrial Engineering and Operations Research at Columbia University. Yuan received a PhD in operations research from MIT in 2012, and received the best student paper award at the ACM SIGMETRICS conference in 2012.

Vijay Subramanian

Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space

Co-Author: Saghar Adler

Time: 2:00 PM - 2:30 PM    

Abstract and Bio

Abstract: Models of many real-life applications, such as queueing models of communication networks or computing systems, have a countably infinite state space. Algorithmic and learning procedures that have been developed to produce optimal policies mainly focus on finite-state settings and do not directly apply to these models. To overcome this lacuna, in this work we study the problem of optimal control of a family of discrete-time, countable state-space Markov Decision Processes (MDPs) governed by an unknown parameter theta from a parameter set Theta, defined on a countably infinite state space (the d-dimensional lattice of non-negative integers), with a finite action space and an unbounded cost function. We take a Bayesian perspective, with the random unknown parameter theta* generated via a given fixed (full-support) prior distribution on Theta. To optimally control the unknown MDP, we propose an algorithm based on Thompson sampling with dynamically-sized episodes: at the beginning of each episode, the posterior distribution formed via Bayes' rule is used to produce a parameter estimate, which then determines the policy applied during the episode. To ensure the stability of the Markov chain obtained by following the policy chosen for each parameter, we impose ergodicity assumptions. Using this condition and the solution of the average cost Bellman equation, we establish a sub-linear upper bound on the Bayesian regret of our algorithm in terms of the time horizon. Finally, to elucidate the applicability of our algorithm, we consider two different queueing models with unknown dynamics and show that our algorithm can be applied to develop approximately optimal control algorithms.
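
The episodic Thompson sampling loop can be sketched as follows in Python (a toy finite-state, finite-parameter stand-in for the countable-state setting of the talk, with discounted value iteration standing in for the average-cost Bellman solution and doubling episode lengths standing in for the paper's episode schedule): at the start of each episode a parameter is sampled from the current posterior, that parameter's optimal policy is followed for the episode, and the posterior is updated from the observed transitions via Bayes' rule.

import numpy as np

def value_iteration(P, r, gamma=0.95, iters=500):
    # Greedy policy for a finite MDP with transitions P[a, s, s'] and rewards r[s, a].
    S, A = r.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = np.stack([r[:, a] + gamma * P[a] @ V for a in range(A)], axis=1)  # shape (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def thompson_sampling(models, prior, true_idx, horizon, rng):
    # models[theta] = (P, r) is the candidate MDP for parameter theta; models[true_idx] is theta*.
    policies = [value_iteration(P, r) for P, r in models]
    posterior = np.array(prior, dtype=float)
    s, t, episode = 0, 0, 0
    while t < horizon:
        episode += 1
        theta_hat = rng.choice(len(models), p=posterior / posterior.sum())  # sample from posterior
        for _ in range(min(2 ** episode, horizon - t)):  # dynamically sized (doubling) episodes
            a = policies[theta_hat][s]
            p_true = models[true_idx][0][a, s]            # environment evolves under the true parameter
            s_next = rng.choice(len(p_true), p=p_true)
            likelihood = np.array([P[a, s, s_next] for P, _ in models])
            posterior = posterior * likelihood            # Bayes' rule on the observed transition
            posterior = posterior / posterior.sum()
            s, t = s_next, t + 1
    return posterior

# Toy usage: two candidate transition models on two states, shared known reward.
rng = np.random.default_rng(0)
P0 = np.array([[[0.9, 0.1], [0.2, 0.8]],
               [[0.5, 0.5], [0.5, 0.5]]])  # P0[a, s, s'] for candidate theta = 0
P1 = np.array([[[0.2, 0.8], [0.9, 0.1]],
               [[0.5, 0.5], [0.5, 0.5]]])  # candidate theta = 1
r = np.array([[1.0, 0.0],
              [0.0, 1.0]])                 # r[s, a]
posterior = thompson_sampling([(P0, r), (P1, r)], prior=[0.5, 0.5],
                              true_idx=1, horizon=2000, rng=rng)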

Bio: Vijay Subramanian received the Ph.D. degree in electrical engineering from the University of Illinois at Urbana-Champaign, Champaign, IL, USA, in 1999. He worked at Motorola Inc., at the Hamilton Institute, Maynooth, Ireland, for many years, and also in the EECS Department at Northwestern University, Evanston, IL, USA. In Fall 2014, he started in his current position as an Associate Professor in the EECS Department at the University of Michigan, Ann Arbor. For the academic year 2022-2023, he was an Adjunct Research Associate Professor in CSL and ECE at UIUC. His research interests are in stochastic analysis, random graphs, multi-agent systems, and game theory (mechanism and information design), with applications to social, economic, and technological networks.

Avrim Blum

Eliciting User Preferences for Personalized Multi-Objective Decision Making

Co-Authors: Han Shao, Lee Cohen, Yishay Mansour, Aadirupa Saha, and Matthew R. Walter

Time: 2:30 PM - 3:00 PM    

Abstract and Bio

Abstract: In this work, we propose a multi-objective decision making framework that accommodates different user preferences over objectives, where preferences are learned via policy comparisons. Our model consists of a known Markov decision process with a vector-valued reward function, with each user having an unknown preference vector that expresses the relative importance of each objective. The goal is to efficiently compute a near-optimal policy for a given user. We consider two user feedback models: one where the user is provided two policies and returns their preferred policy as feedback, and a second where the user is instead provided two small weighted sets of representative trajectories and selects the preferred set.  In both cases, we propose an algorithm that finds a near-optimal policy for the user using a number of comparison queries that scales quasilinearly in the number of objectives.
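
For a concrete picture of the feedback model, the short Python sketch below illustrates the policy-comparison oracle on a finite candidate set (this is only the feedback model with made-up values; it does not reproduce the query-efficient elicitation algorithm of the talk): the user's hidden preference vector w converts vector-valued policy returns into scalar utilities, and each query reveals which of two policies the user prefers.

import numpy as np

def user_prefers(V1, V2, w):
    # Comparison oracle: True if the user (with hidden preference vector w) prefers policy 1.
    return float(np.dot(w, V1)) > float(np.dot(w, V2))

def best_by_comparisons(policy_values, w):
    # Find the user-optimal policy among a finite candidate set using pairwise comparisons only.
    best = 0
    for k in range(1, len(policy_values)):
        if user_prefers(policy_values[k], policy_values[best], w):
            best = k
    return best

# Toy usage: 3 objectives, 4 candidate policies with known vector-valued returns.
values = [np.array([1.0, 0.0, 0.0]),
          np.array([0.0, 1.0, 0.0]),
          np.array([0.4, 0.4, 0.4]),
          np.array([0.2, 0.2, 0.9])]
w_hidden = np.array([0.2, 0.3, 0.5])             # unknown to the learner
print(best_by_comparisons(values, w_hidden))     # index of the policy the user would rank highest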

Bio: Avrim Blum is Professor and Chief Academic Officer at the Toyota Technological Institute at Chicago (TTIC); prior to this he was on the faculty at Carnegie Mellon University for 25 years. His main research interests are in Machine Learning Theory, Algorithmic Game Theory, Privacy, and Algorithmic Fairness. He has served as Program Chair for the Conference on Learning Theory (COLT), the IEEE Symposium on Foundations of Computer Science (FOCS), and the Innovations in Theoretical Computer Science Conference (ITCS). Blum is a recipient of the AI Journal Classic Paper Award, the ICML/COLT 10-Year Best Paper Award, the ACM Paris Kanellakis Award, the Sloan Fellowship, the NSF National Young Investigator Award, and the Herbert Simon Teaching Award, and he is a Fellow of the ACM.

Harsha Honnappa

Towards a Theory of Continuous-Time Reinforcement Learning in General Stochastic Environments

Co-Authors: Prakash Chakraborty and Samy Tindel

Time: 3:00 PM - 3:30 PM    

Abstract and Bio

Abstract: Reinforcement learning in continuous time is a suitable learning model for agents interacting with a stochastic environment at ultra-high frequency. However, most of the existing literature on continuous time RL focuses on semimartingale or Markovian environments, which do not accurately represent many real-world ultra-high frequency applications. For example, the volatility of stock market returns can be appropriately modeled as fractional Brownian motion (fBM), and controlled stochastic networks with heavy-tailed service in heavy-traffic are well approximated by reflected fBM processes. These are but two examples across a range of applications in science and engineering.

This talk presents our recent results in developing a theoretical framework for modeling exploration processes in general continuous time stochastic environments using rough path theory. In continuous time, exploration becomes more complex since randomized policies at each time-step cannot be used to explore. Instead, actions must be continuously randomized in some way. Relaxed controls, in the sense of continuous curves of probability measures in Wasserstein space, provide a natural approach to modeling a randomized exploration process in continuous time. Therefore, we propose and analyze a pathwise relaxed control framework to model exploration in continuous-time reinforcement learning in general stochastic environments. Specifically, we establish the existence and uniqueness of the value function as a solution of a rough Hamilton-Jacobi-Bellman equation. This is our first step towards establishing a comprehensive theory of continuous-time reinforcement learning. As an immediate application of the analysis of the value function, we use it to characterize the optimal exploration-relaxed control for an entropy-regularized objective, which emulates maximum-entropy objectives in discrete-time reinforcement learning.
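
For orientation, the entropy-regularized objective can be written informally (in the spirit of the exploratory-control literature, not necessarily the talk's exact formulation) as maximizing E[ integral_0^T ( integral_A r(X_t, a) pi_t(da) + lambda * H(pi_t) ) dt ] over relaxed controls pi_t, where H(pi_t) denotes the entropy of the action distribution pi_t and lambda > 0 sets the strength of the exploration bonus; letting lambda tend to zero recovers the unregularized control problem, mirroring maximum-entropy objectives in discrete-time reinforcement learning.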

This talk is based on joint work with Prakash Chakraborty at Penn State and Samy Tindel at Purdue. This project is supported by the National Science Foundation through grant DMS-2153915.

Bio: Harsha Honnappa is an Associate Professor in the School of Industrial Engineering at Purdue University and a J. Tinsley Oden Visiting Faculty Fellow at the Oden Institute at The University of Texas at Austin. His research interests as an applied probabilist encompass stochastic modeling, optimization and control, with applications to machine learning, simulation and statistical inference. His research is supported by the National Science Foundation, including an NSF CAREER award, and the Purdue Research Foundation. He is an editorial board member at Operations Research, Operations Research Letters and Queueing Systems journals.

Chang-Han Rhee

Unbiased Gradient Estimation for Steady-State Performance Measures of Markov Chains

Co-Author: Jeffrey Wang

Time: 3:30 PM - 4:00 PM    

Abstract and Bio

Abstract: In this talk, we will discuss a new approach to unbiased estimation of gradients of the steady-state expectations associated with parameterized families of Markov chains.  Our estimators are based on constructing a coupling between two Markov chains and require only an oracle to evaluate the transition density and its gradient at given data points. Our new estimators are particularly efficient when the Markov chains have long regeneration cycles, which we will illustrate with numerical experiments. We will then explore connections to average-reward reinforcement learning problems.  

Bio: Chang-Han Rhee is an Assistant Professor in Industrial Engineering and Management Sciences at Northwestern University. Before joining Northwestern University, he was a postdoctoral researcher at Centrum Wiskunde & Informatica and Georgia Tech. He received his Ph.D. from Stanford University. His research interests include applied probability, stochastic simulation, experimental design, and the theoretical foundation of machine learning. His research has been recognized with the 2016 INFORMS Simulation Society Outstanding Publication Award, the 2012 Winter Simulation Conference Best Student Paper Award, the 2023 INFORMS George Nicholson Student Paper Competition (2nd place), and the 2013 INFORMS George Nicholson Student Paper Competition (finalist). Since 2022, his research has been supported by the NSF CAREER Award. 

R. Srikant

On The Convergence Of Policy Iteration-Based Reinforcement Learning With Monte Carlo Policy Evaluation

Co-Author: Anna Winnicki

Time: 4:00 PM - 4:30 PM    

Abstract and Bio

Abstract: A common technique in reinforcement learning is to evaluate the value function of a given policy from Monte Carlo simulations and then use the estimated value function to obtain a new policy that is greedy with respect to it. A well-known, longstanding open problem in this context is to prove the convergence of such a scheme when the value function of a policy is estimated from data collected from a single sample path obtained by implementing the policy. We present a solution to the open problem by showing that a first-visit version of such a policy iteration scheme indeed converges to the optimal policy, provided that the policy improvement step uses lookahead rather than a simple greedy improvement. We provide results for the original open problem in the tabular setting and present extensions to the function approximation setting, where we show that the policy resulting from the algorithm performs close to the optimal policy within a function approximation error. Joint work with Anna Winnicki.
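
A toy tabular sketch in Python of the scheme under discussion (an episodic example with made-up dynamics, not the paper's construction or proof setting): each iteration evaluates the current policy by first-visit Monte Carlo from a single sampled trajectory and then improves the policy greedily with respect to an H-step lookahead applied to the Monte Carlo estimate, rather than a one-step greedy improvement.

import numpy as np

def run_episode(policy, P, r, terminal, rng, max_steps=500):
    # Roll out one trajectory of the current policy; record (state, action, reward) until absorption.
    s, path = 0, []
    for _ in range(max_steps):
        if s == terminal:
            break
        a = policy[s]
        path.append((s, a, r[s, a]))
        s = rng.choice(P.shape[-1], p=P[a, s])
    return path

def first_visit_mc(path, n_states, gamma):
    # First-visit Monte Carlo estimate of V^pi from a single trajectory.
    returns = np.zeros(len(path))
    G = 0.0
    for t in range(len(path) - 1, -1, -1):      # backward pass: discounted return from each step
        G = path[t][2] + gamma * G
        returns[t] = G
    V, seen = np.zeros(n_states), set()
    for t, (s, _, _) in enumerate(path):
        if s not in seen:                       # keep the return from the first visit only
            seen.add(s)
            V[s] = returns[t]
    return V

def lookahead_improvement(V, P, r, gamma, H):
    # Greedy policy w.r.t. an H-step lookahead: apply the Bellman operator H-1 times
    # to the estimate V, then take one greedy step.
    for _ in range(H - 1):
        V = np.max(r + gamma * np.einsum('ast,t->sa', P, V), axis=1)
    Q = r + gamma * np.einsum('ast,t->sa', P, V)
    return Q.argmax(axis=1)

# Toy usage: 3 states (state 2 absorbing), 2 actions, made-up dynamics and rewards.
rng = np.random.default_rng(1)
P = np.array([[[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.0, 0.0, 1.0]],
              [[0.2, 0.6, 0.2], [0.3, 0.2, 0.5], [0.0, 0.0, 1.0]]])  # P[a, s, s']
r = np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]])                   # r[s, a]
policy, gamma = np.zeros(3, dtype=int), 0.9
for _ in range(20):   # approximate policy iteration loop
    trajectory = run_episode(policy, P, r, terminal=2, rng=rng)
    V_hat = first_visit_mc(trajectory, n_states=3, gamma=gamma)
    policy = lookahead_improvement(V_hat, P, r, gamma=gamma, H=3)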

Bio: R. Srikant is a Grainger Chair in Engineering and Professor of Electrical and Computer Engineering and in the Coordinated Science Lab at the University of Illinois at Urbana-Champaign. His research interests span applied probability, machine learning, and communication networks.