Sep 19th, 2022

WiOpt ReStoq Workshop

Welcome to the Workshop on Reinforcement Learning and Stochastic Control in Queues and Networks (ReStoq) on Sept 19, 2022, held in coordination with WiOpt 2022. This workshop aims to bring together researchers with expertise in reinforcement learning theory and others with expertise in control of queueing systems and stochastic networks. The goal is to initiate discussions between these two communities and to address the challenge of incorporating model knowledge and insights from the control of stochastic networks into data-driven solution approaches.


The workshop will be fully virtual.


The Speakers

Dileep Kalathil, Texas A&M University

Shie Mannor, Technion

Niao He, ETH

Aditya Gopalan, Indian Institute of Science (IISc)

Qiaomin Xie, University of Wisconsin-Madison

Kuang Xu, Stanford

Adam Wierman, Caltech

Mengdi Wang, Princeton

Schedule

Session 1: Sep 19th, 8:00-10:00am Eastern Time (2:00-4:00pm Central European Time)


8:00 - 8:30am Eastern Time


Robust Reinforcement Learning using Offline Data

Dileep Kalathil (Texas A&M University)

Abstract:


The goal of robust reinforcement learning (RL) is to learn a policy that is robust against the uncertainty in model parameters. Parameter uncertainty commonly occurs in many real-world RL applications due to simulator modeling errors, changes in the real-world system dynamics over time, and adversarial disturbances. Robust RL is typically formulated as a max-min problem, where the objective is to learn the policy that maximizes the value against the worst possible models that lie in an uncertainty set. In this work, we propose a robust RL algorithm called Robust Fitted Q-Iteration (RFQI), which uses only an offline dataset to learn the optimal robust policy. Robust RL with offline data is significantly more challenging than its non-robust counterpart because of the minimization over all models present in the robust Bellman operator. This poses challenges in offline data collection, optimization over the models, and unbiased estimation. We propose a systematic approach to overcome these challenges, resulting in our RFQI algorithm. We prove that RFQI learns a near-optimal robust policy under standard assumptions and demonstrate its superior performance on standard benchmark problems.
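As a rough illustration of the ideas above, the sketch below runs a tabular robust fitted Q-iteration on an offline dataset, using an R-contamination uncertainty set so that the robust Bellman target mixes the nominal next-state value with the worst-case state value. The function name, the tabular setting, and the choice of uncertainty set are illustrative assumptions; the RFQI algorithm in the talk works with function approximation and is not limited to this simple form.

```python
import numpy as np

def robust_fqi(dataset, n_states, n_actions, gamma=0.95, rho=0.1, n_iters=200):
    """Tabular robust fitted Q-iteration (illustrative sketch).

    dataset : list of (s, a, r, s_next) transitions collected offline.
    rho     : radius of an R-contamination uncertainty set (hypothetical choice).
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        V = Q.max(axis=1)           # greedy value under the current Q
        worst_case_v = V.min()      # adversary may redirect rho mass to the worst state
        targets = np.zeros_like(Q)
        counts = np.zeros_like(Q)
        for s, a, r, s_next in dataset:
            # Robust Bellman target under R-contamination:
            # mix the nominal next-state value with the worst-case value.
            targets[s, a] += r + gamma * ((1 - rho) * V[s_next] + rho * worst_case_v)
            counts[s, a] += 1
        seen = counts > 0
        Q[seen] = targets[seen] / counts[seen]   # empirical regression (tabular average)
    return Q
```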





8:30 - 9:00am Eastern Time


Reinforcement Learning for Datacenter Congestion Control

Shie Mannor (Technion, Israel)

Abstract:


We approach the task of network congestion control in datacenters using Reinforcement Learning (RL). Successful congestion control algorithms can dramatically improve latency and overall network throughput. Until today, no learning-based algorithms have shown practical potential in this domain. Evidently, the most popular recent deployments rely on rule-based heuristics that are tested on a predetermined set of benchmarks. Consequently, these heuristics do not generalize well to newly seen scenarios. In contrast, we devise an RL-based algorithm with the aim of generalizing to different configurations of real-world datacenter networks. We overcome challenges such as partial observability, nonstationarity, and multi-objectiveness. We further propose a policy gradient algorithm that leverages the analytical structure of the reward function to approximate its derivative and improve stability. Finally, we explain how the proposed framework can be implemented in real firmware and demonstrate its superior performance in real-world datacenters.
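To make the setup concrete, here is a minimal sketch of a score-function (REINFORCE) policy-gradient update for a rate-adjustment policy with a linear Gaussian parameterization. The two observation features, the learning rate, and the Gaussian policy are illustrative assumptions; the algorithm in the talk additionally exploits the analytical structure of the reward to approximate its derivative, which this generic sketch does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-feature observation: normalized RTT inflation and link utilization.
theta = np.zeros(2)              # parameters of a linear Gaussian policy
SIGMA, LR, GAMMA = 0.1, 1e-2, 0.99

def act(obs):
    """Sample a (multiplicative) sending-rate adjustment from the Gaussian policy."""
    return rng.normal(theta @ obs, SIGMA)

def reinforce_update(trajectory):
    """Vanilla REINFORCE over one episode of (obs, action, reward) tuples."""
    global theta
    returns, g = [], 0.0
    for _, _, r in reversed(trajectory):
        g = r + GAMMA * g
        returns.append(g)
    returns.reverse()
    for (obs, a, _), g in zip(trajectory, returns):
        score = (a - theta @ obs) / SIGMA**2 * obs   # gradient of log N(a; theta@obs, SIGMA^2)
        theta += LR * g * score
```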



9:00 - 9:30am Eastern Time


The Complexities of Actor-Critic Methods for Regularized MDPs and POMDPs under Function Approximation


Niao He (ETH)

Abstract:


Actor-critic methods, combined with function approximation, have demonstrated impressive empirical success in solving Markov decision problems with large state spaces and even for partially observable systems. In this talk, we first present a finite-time analysis of natural actor-critic (NAC) with neural network approximation, and elaborate on the roles of neural networks, regularization and optimization techniques in achieving provably good performance in terms of sample complexity, iteration complexity and overparametrization bounds for the actor and the critic. Further, we consider the reinforcement learning problem for partially observed Markov decision processes (POMDPs) and introduce a natural actor-critic method that employs a finite internal memory for policy parameterization, and a multi-step temporal difference learning algorithm for policy evaluation. We establish, to the best of our knowledge, the first non-asymptotic global convergence of actor-critic methods for partially observed systems under function approximation.
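The following sketch illustrates the kind of finite-internal-memory policy parameterization described above: the policy acts on a sliding window of the last few observations, fed into a softmax over actions. The class name, the window length, and the linear-softmax form are illustrative assumptions; the talk's method pairs such a parameterization with natural actor-critic updates and multi-step TD evaluation, which are omitted here.

```python
import numpy as np
from collections import deque

class FiniteMemoryPolicy:
    """Softmax policy acting on the last `memory` observations (illustrative sketch)."""

    def __init__(self, obs_dim, n_actions, memory=4, seed=0):
        self.obs_dim, self.memory_len = obs_dim, memory
        self.window = deque(maxlen=memory)
        self.W = np.zeros((n_actions, obs_dim * memory))   # linear-softmax parameters
        self.rng = np.random.default_rng(seed)

    def features(self, obs):
        """Stack the sliding window of observations into one feature vector."""
        self.window.append(np.asarray(obs, dtype=float))
        pad = [np.zeros(self.obs_dim)] * (self.memory_len - len(self.window))
        return np.concatenate(pad + list(self.window))

    def act(self, obs):
        phi = self.features(obs)
        logits = self.W @ phi
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        action = self.rng.choice(len(probs), p=probs)
        return action, phi          # phi is what an actor-critic update would reuse
```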



9:30 - 10:00am Eastern Time


Improper Reinforcement Learning with a Bag of Controllers

Aditya Gopalan (IISc, Bangalore)

Abstract:


We consider an improper reinforcement learning setting where a learner is given $M$ base controllers for an unknown Markov decision process, and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. This can be useful in tuning across controllers that have themselves been learned, possibly in mismatched or simulated environments, to obtain a good 'meta' controller for a given target environment with relatively few trials. Applications of this paradigm include the Sim2Real problem: simulators are typically (crude) approximations to real-world scenarios, so optimal strategies devised for them may need further tweaking to perform well in reality. Towards this, we propose an algorithm that can switch between a simple Actor-Critic (AC) based scheme and a Natural Actor-Critic (NAC) scheme depending on the available information. Both algorithms operate over a class of improper mixtures of the given controllers, and enjoy convergence rate guarantees to either stationary points or global minima under mild assumptions. We benchmark our algorithms on two experimental setups: (1) the classic inverted pendulum balancing task, and (2) learning to schedule in constrained queueing networks. Our results show that our improper policy optimization algorithm can achieve stability even when the base policies at its disposal are unstable, and when the optimal policy is one of the base controllers, our algorithm finds it. Furthermore, our algorithm is able to track the optimal mixture of base controllers even when the packet arrival rates to the queueing network vary over time.
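A minimal sketch of the improper-learning idea: maintain softmax mixture weights over the $M$ base controllers, sample a controller at each step, and update the logits with a score-function (actor-only) step. The function names and the plain REINFORCE-style update are illustrative assumptions; the algorithms in the talk use actor-critic and natural actor-critic machinery with the convergence guarantees described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixture_act(logits, base_controllers, state):
    """Sample one of the M base controllers from a softmax mixture and act with it."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    k = rng.choice(len(base_controllers), p=probs)
    return base_controllers[k](state), k, probs

def mixture_update(logits, k, probs, advantage, lr=0.05):
    """Score-function update of the mixture logits (actor step only)."""
    grad = -probs.copy()
    grad[k] += 1.0                  # gradient of log probs[k] w.r.t. the logits
    return logits + lr * advantage * grad
```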


Break: 10:00-10:30am Eastern Time (4:00-4:30pm Central European Time)

Session 2: Sep 19th, 10:30am-12:30pm Eastern Time (4:30-6:30pm Central European Time)


10:30 - 11:00am Eastern Time


Markovian Linear Stochastic Approximation: Bias and Extrapolation

Qiaomin Xie (University of Wisconsin-Madison)

Abstract:


We consider linear stochastic approximation (LSA) with a constant stepsize and Markovian noise. We first establish weak convergence of the iterates to a limit and provide a non-asymptotic, geometric convergence rate. Furthermore, we provide a precise characterization of the asymptotic bias of this limit, establishing a Taylor-like, infinite series expansion thereof with respect to the stepsize. Consequently, the bias is proportional to the stepsize up to higher-order terms in the stepsize. This result stands in sharp contrast to LSA with i.i.d. data, for which there is no bias.

While Polyak-Ruppert (tail) averaging reduces the variance, it does not affect the bias. The bias, however, can be reduced using Richardson-Romberg extrapolation. In particular, our infinite series expansion implies that extrapolation with any $m\ge2$ stepsizes eliminates the $m-1$ leading terms in the bias expansion and hence results in an exponentially smaller bias. As a by-product, we derive results for temporal difference (TD) learning algorithms with Markovian data and a constant stepsize.
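The sketch below illustrates both ingredients on a toy two-state Markov chain: constant-stepsize LSA with Polyak-Ruppert (tail) averaging, followed by Richardson-Romberg extrapolation across two stepsizes $\alpha$ and $2\alpha$, which cancels the leading $O(\alpha)$ bias term. The chain, the matrices, and the stepsizes are illustrative choices, not examples from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1], [0.2, 0.8]])          # toy 2-state Markov chain
A = [np.array([[1.0]]), np.array([[2.0]])]       # state-dependent A(x)
b = [np.array([1.0]), np.array([3.0])]           # state-dependent b(x)

def lsa_tail_average(alpha, n_steps=100_000):
    """Constant-stepsize LSA, theta <- theta + alpha (b(X_t) - A(X_t) theta),
    driven by Markovian data; returns the tail (Polyak-Ruppert) average."""
    theta, x, iterates = np.zeros(1), 0, []
    for _ in range(n_steps):
        theta = theta + alpha * (b[x] - A[x] @ theta)
        iterates.append(theta.copy())
        x = rng.choice(2, p=P[x])                # next state of the Markov chain
    return np.mean(iterates[n_steps // 2:], axis=0)

# Richardson-Romberg extrapolation with stepsizes alpha and 2*alpha:
# averaging reduces the variance, extrapolation cancels the leading O(alpha) bias term.
alpha = 0.05
theta_rr = 2 * lsa_tail_average(alpha) - lsa_tail_average(2 * alpha)
print(theta_rr)
```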



11:00 - 11:30am Eastern Time


Nonstationary Bandit Learning via Predictive Sampling

Kuang Xu (Stanford)

Abstract:


In this talk I will discuss a recent project on designing algorithms for non-stationary multi-armed bandits. Taking the popular Thompson sampling algorithm as a starting point, I will illustrate how and why conventional thinking centered around estimating mean rewards omits a crucial aspect of active learning in a nonstationary environment, by failing to account for the future value of learned information. We propose the predictive sampling algorithm, which appropriately deprioritizes exploration for information that does not offer long-lasting benefits, and discuss both analytical and numerical results that show how the algorithm can outperform existing approaches in different settings. Joint work with Yueyang Liu and Benjamin Van Roy from Stanford University.
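For context, the sketch below shows a standard nonstationary baseline: Thompson sampling for Bernoulli arms with exponential forgetting of past evidence. It is meant only to fix ideas about the setting; it is not the predictive sampling algorithm of the talk, which instead reasons about the future value of the information being acquired.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_arm(successes, failures):
    """Thompson sampling step for Bernoulli arms with Beta(1, 1) priors."""
    samples = rng.beta(successes + 1.0, failures + 1.0)
    return int(np.argmax(samples))

def forgetful_update(successes, failures, arm, reward, gamma=0.99):
    """Exponentially discount old evidence so the posterior tracks a changing environment."""
    successes, failures = gamma * successes, gamma * failures
    successes[arm] += reward
    failures[arm] += 1.0 - reward
    return successes, failures
```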





11:30 - 12:00pm Eastern Time


Robust Online Decision Making using Untrusted Advice

Adam Wierman (Caltech)

Abstract:


Making use of modern black-box AI tools such as RL is potentially transformational for safety-critical systems such as data centers, the electricity grid, transportation, and beyond. However, these algorithms typically do not have formal guarantees on their worst-case performance, stability, or safety. So, while their performance may improve upon traditional approaches in “typical” cases, they may perform arbitrarily worse in scenarios where the training data is not representative due to, e.g., distribution shift, or in situations where global information is unavailable to local controllers. These represent significant drawbacks when considering the use of RL in safety-critical systems. Thus, a challenging open question emerges: Is it possible to provide guarantees that allow black-box RL tools to be used in safety-critical applications? In this talk, I will provide an overview of an emerging approach that treats RL policies as untrusted advice for use in online decision-making and then makes use of traditional algorithms and results from the online algorithms and control communities to "robustify" RL.
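A toy sketch of the "untrusted advice" viewpoint: follow the black-box (e.g., RL) action as long as it stays within a trust budget of a baseline controller that carries worst-case guarantees, and otherwise clip toward the baseline. The projection rule and the budget parameter are illustrative assumptions; the talk covers more refined robustification schemes from the online algorithms and control literature.

```python
import numpy as np

def robustify(advice_action, baseline_action, budget):
    """Follow the untrusted (e.g. RL) advice while it stays within a trust budget of a
    baseline controller with known guarantees; otherwise clip toward the baseline."""
    advice_action = np.asarray(advice_action, dtype=float)
    baseline_action = np.asarray(baseline_action, dtype=float)
    deviation = np.linalg.norm(advice_action - baseline_action)
    if deviation <= budget:
        return advice_action
    # move as far toward the advice as the budget allows
    return baseline_action + (budget / deviation) * (advice_action - baseline_action)
```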



12:00 - 12:30pm Eastern Time


Reinforcement Learning for Network Influence Maximization and Regret Bounds

Mengdi Wang (Princeton)

Abstract:


Online influence maximization aims to maximize the influence spread of a piece of content in a social network whose model is unknown, by selecting a few seed nodes. Recent studies have followed a non-adaptive setting, where the seed nodes are selected before the start of the diffusion process and network parameters are updated when the diffusion stops.

We consider an adaptive version of the content-dependent online influence maximization problem, where the seed nodes are sequentially activated based on real-time feedback. In this paper, we formulate the problem as an infinite-horizon discounted MDP under a linear diffusion process and present a model-based reinforcement learning solution. Our algorithm maintains a network model estimate and selects seed users adaptively, exploring the social network while optimistically improving its policy. We establish a $\widetilde{\mathcal{O}}(\sqrt{T})$ regret bound for our algorithm. Empirical evaluations on synthetic and real-world networks demonstrate the efficiency of our algorithm.
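As a schematic of the model-based, optimistic loop described above: maintain a regularized least-squares estimate of the linear diffusion parameters and, in each round, score candidate seed nodes by their estimated spread plus a confidence (optimism) bonus. The feature-based scoring, the ridge update, and the parameter names are illustrative assumptions; the algorithm in the talk plans over the full discounted MDP rather than scoring nodes myopically.

```python
import numpy as np

def select_seed(candidate_features, theta_hat, V_inv, beta):
    """Score each candidate node by its estimated spread under a linear diffusion
    model plus a confidence (optimism) bonus, and return the best candidate."""
    scores = [float(x @ theta_hat) + beta * float(np.sqrt(x @ V_inv @ x))
              for x in candidate_features]
    return int(np.argmax(scores))

def update_model(theta_hat, V, x, observed_spread):
    """Recursive ridge-regression update after observing the realized spread.
    V should be initialized to lambda * I for some regularizer lambda > 0."""
    V = V + np.outer(x, x)
    V_inv = np.linalg.inv(V)
    theta_hat = theta_hat + V_inv @ x * (observed_spread - x @ theta_hat)
    return theta_hat, V, V_inv
```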




Committee


  • Dileep Kalathil, Texas A&M University

  • Qiaomin Xie, UW-Madison