Graph Reinforcement Learning for Network Control via Bi-Level Optimization

ICML 2023

arxiv | code | video

TL;DR: a graph network-based bi-level approach that leverages the strengths of direct optimization and reinforcement learning to solve network optimization problems.


Daniele Gammelli1, James Harrison2, Kaidi Yang3, Marco Pavone1, Filipe Rodrigues4, Francisco C. Pereira4

1Stanford University, 2Google Brain, 3National University of Singapore, 4Technical University of Denmark

Abstract

Optimization problems over dynamic networks have been extensively studied and widely used in the past decades to formulate numerous real-world problems. However, (1) traditional optimization-based approaches do not scale to large networks, and (2) the design of good heuristics or approximation algorithms often requires significant manual trial-and-error. In this work, we argue that data-driven strategies can automate this process and learn efficient algorithms without compromising optimality. To do so, we present network control problems through the lens of reinforcement learning and propose a graph network-based framework to handle a broad class of problems. Instead of naively computing actions over high-dimensional graph elements, e.g., edges, we propose a bi-level formulation where we (1) specify a desired next state via RL, and (2) solve a convex program to best achieve it, leading to drastically improved scalability and performance. We further highlight a collection of desirable features to system designers, investigate design decisions, and present experiments on real-world control problems showing the utility, scalability, and flexibility of our framework.

Motivation

Our Framework

Instead of naively computing actions over graph elements, we first specify a desired next state through RL, and then solve a convex program to compute the graph actions that can best achieve it.

Specifically, we consider a bi-level formulation that replaces a single policy mapping from states to actions (i.e., s_t → a_t) with two nested policies mapping from states to desired next states to actions (i.e., s_t → s_t+1 → a_t): this leads to drastically improved sample efficiency and performance!
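To make the inner step concrete, here is a minimal sketch of how the convex program might recover edge actions from an RL-proposed desired state, assuming a single-commodity flow setting. The L1 distance, incidence matrix A, capacity vector, and function names are illustrative assumptions, not the paper's exact formulation.

# Sketch of the inner optimization: given the current node state and the
# desired next state proposed by the outer RL policy, find edge flows that
# best realize the transition. Illustrative only; names are not from the paper.
import cvxpy as cp
import numpy as np

def inner_step(x_t, x_desired, A, capacity):
    # x_t:       current node state (e.g., commodity available at each node)
    # x_desired: desired next node state produced by the outer RL policy
    # A:         node-edge incidence matrix, so A @ f is the net state change
    # capacity:  per-edge flow capacity
    f = cp.Variable(A.shape[1], nonneg=True)   # flow on each edge
    next_state = x_t + A @ f                   # state implied by the flows
    objective = cp.Minimize(cp.norm1(next_state - x_desired))
    constraints = [f <= capacity, next_state >= 0]
    cp.Problem(objective, constraints).solve()
    return f.value

# Tiny usage example on a 3-node line graph with edges (0->1) and (1->2).
A = np.array([[-1.0,  0.0],
              [ 1.0, -1.0],
              [ 0.0,  1.0]])
flows = inner_step(np.array([2.0, 0.0, 0.0]),   # everything starts at node 0
                   np.array([0.0, 1.0, 1.0]),   # policy wants it spread out
                   A, capacity=np.array([2.0, 2.0]))

Note that the RL policy only has to output a desired node state (dimension = number of nodes), while the convex program handles the much higher-dimensional edge actions, which is where the scalability gain comes from.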


Empirical Results

(1) On simulated environments, our formulation operates reliably across a broad set of situations, including dynamic travel times (Dyn. travel time), dynamic topologies, i.e., networks whose nodes and edges can be added or removed during an episode (Dyn. topology), capacitated networks (Capacity) of varying depth (2-hop, 3-hop, 4-hop), and multi-commodity problems (Multi-commodity).

All of this while maintaining scalability compared to optimization-based approaches.

(2) On real-world problems, we cast the Supply Chain Inventory Management and Dynamic Vehicle Routing problems as RL problems under our framework, and show how our method substantially outperforms both optimization-based and RL-based approaches.

(3) Distance metric as Value Function

We show how the distance metric (and the generated desired next state) can be interpreted as a way to capture the value of future reward in the greedy one-step inner optimization problem, ultimately allowing for implicit long-term planning.
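One way to read this claim, as a hedged sketch in our own notation (not taken verbatim from the paper): if the inner problem greedily selects the action whose resulting state is closest to the policy's proposed state $\hat{s}_{t+1} = \pi_\theta(s_t)$, then the negative distance plays the role of a value estimate in a one-step greedy Bellman step,

$$
a_t \in \arg\max_{a \in \mathcal{A}(s_t)} \Big[\, r(s_t, a) - D\big(T(s_t, a),\, \hat{s}_{t+1}\big) \,\Big],
$$

which mirrors $a_t \in \arg\max_a \big[ r(s_t, a) + \gamma V(T(s_t, a)) \big]$ with $-D(\cdot, \hat{s}_{t+1})$ acting as a surrogate for $\gamma V(\cdot)$. This is why a learned desired next state can induce long-term planning even though the inner optimization itself is a greedy one-step problem.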