Decentralized Reinforcement Learning

Global Decision Making via Local Economic Transactions


Michael Chang, Sidhant Kaushik, S. Matthew Weinberg, Thomas L. Griffiths, Sergey Levine

The paper was presented at ICML 2020. Check out the talk, code, and blog.

What if an agent were a society of more primitive agents?

From corporations to organisms, many large-scale systems in our world are composed of smaller working components whose collective function serves a larger objective. We can therefore view such a system as a group of entities, each with its own simpler objective, and regard the complex behavior of the larger system as emerging from the optimization of those individual objectives. While most current learning methods in artificial intelligence, and reinforcement learning in particular, solve tasks with a single monolithic learner parametrized by deep neural networks, in this paper we investigate characterizing the learning problem as a society of learners, which allows a hierarchy of task complexity to develop.

We develop the societal decision-making framework, in which a society of primitive agents buy and sell to each other, through a series of auctions, the right to operate on the environment state.

We prove that the Vickrey auction mechanism can be adapted to incentivize the society to collectively solve MDPs as an emergent consequence of the primitive agents optimizing their own auction utilities.

We propose a class of decentralized reinforcement learning algorithms for training the society, in which credit assignment is local in space and time.
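To make the auction mechanics concrete, the following is a minimal Python sketch of one auction round and of the winner's utility. The agent.bid / agent.act interface and the classic Gym-style env.step signature are illustrative assumptions rather than the released code, and the exact utility bookkeeping differs across the variants studied in the paper.

    import numpy as np

    def auction_round(state, agents, env):
        """One auction in the societal decision-making framework (sketch).

        Every primitive agent bids for the right to transform the current
        state; the highest bidder wins, acts on the environment, and pays
        the second-highest bid (Vickrey pricing).
        """
        bids = np.array([agent.bid(state) for agent in agents])  # hypothetical interface
        winner = int(np.argmax(bids))
        price = float(np.partition(bids, -2)[-2])                # second-highest bid
        next_state, reward, done, _ = env.step(agents[winner].act(state))
        return winner, price, next_state, reward, done

    def winner_utility(reward, price_paid, next_winning_bid):
        # The winner's utility is the environment reward, plus the revenue from
        # selling the next state at the next auction, minus the Vickrey price it
        # paid here; losing agents receive zero. Each agent can therefore be
        # trained from its own utility alone, keeping credit assignment local
        # in space (one agent) and time (adjacent auctions).
        return reward + next_winning_bid - price_paid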

The societal decision-making framework and decentralized reinforcement learning algorithms can be applied not only to standard reinforcement learning, but also to:

More efficient transfer learning

We test the transfer learning capability of our method against Monolithic HRL in the Gym-MiniGrid environment on the right. The pre-training task requires the agent to navigate to the green goal square, while in the transfer task the agent is rewarded for reaching the blue goal square. Both the Monolithic and our Decentralized HRL methods are provided with three primitives with the following objectives:

  • Go to and open red door

  • Go to Green Goal

  • Go to Blue Goal

The policies for these primitives have been pre-trained using Proximal Policy Optimization (Schulman et al., 2017).
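As an illustration of how such primitives might be pre-trained (this is not the authors' training script), here is a sketch using gym-minigrid and the PPO implementation from Stable-Baselines3; the environment IDs below are hypothetical stand-ins for the red-door, green-goal, and blue-goal objectives.

    import gym
    from gym_minigrid.wrappers import FlatObsWrapper  # importing also registers MiniGrid-* envs
    from stable_baselines3 import PPO

    # Hypothetical environment IDs standing in for the three primitive objectives.
    PRIMITIVE_TASKS = [
        "MiniGrid-OpenRedDoor-v0",    # go to and open the red door (assumed ID)
        "MiniGrid-GoToGreenGoal-v0",  # go to the green goal (assumed ID)
        "MiniGrid-GoToBlueGoal-v0",   # go to the blue goal (assumed ID)
    ]

    primitives = []
    for env_id in PRIMITIVE_TASKS:
        env = FlatObsWrapper(gym.make(env_id))    # flatten the grid observation
        model = PPO("MlpPolicy", env, verbose=0)  # PPO (Schulman et al., 2017)
        model.learn(total_timesteps=200_000)
        primitives.append(model)                  # frozen sub-policies handed to the society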

[Learning curves: Pre-Train (Green Goal) vs. Transfer (Blue Goal).]

Selecting options in semi-MDPs

We observe evidence suggesting that decentralized reinforcement learning can offer benefits when transferring to new tasks. Credit Conserving Vickrey Cloned, an instantiation of our method, learns faster in both the pre-training task and the transfer task than a monolithic baseline that directly optimizes the MDP objective.
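The name points at two modifications to the plain Vickrey society: "Credit Conserving" ties the winner's revenue to what the next winner actually pays, and "Cloned" instantiates redundant copies of each primitive so that an identical competitor anchors the second price, which helps keep bidding truthful. The sketch below contrasts the two utility computations; treat the exact bookkeeping as an assumption based on our reading of the mechanism.

    def vickrey_utility(reward, bids_t, bids_t_plus_1):
        # Plain Vickrey society (sketch): the winner receives the environment
        # reward plus the highest bid at the next auction, and pays the
        # second-highest bid at the current one.
        return reward + max(bids_t_plus_1) - sorted(bids_t)[-2]

    def credit_conserving_vickrey_utility(reward, bids_t, bids_t_plus_1):
        # Credit-conserving variant (sketch): the revenue is what the next winner
        # actually pays (the second-highest bid at the next auction), so credit
        # is neither created nor destroyed as it flows through the society.
        return reward + sorted(bids_t_plus_1)[-2] - sorted(bids_t)[-2]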

Dynamically composing computation graphs

The society can also learn to dynamically select computations in a computation graph. In the Mental Rotation task on the right (adapted from Chang et al., 2019), the society learns to classify transformed MNIST digits by composing a sequence of affine transformations that re-represent the input in a form that a pre-trained MNIST classifier can recognize correctly.
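Below is a minimal sketch of how such a dynamically composed graph might look: at each step the society auctions off the right to apply one affine transformation, and after a fixed number of steps the frozen classifier reads out the label. The specific transformation set, the bidder interface, and the fixed-step termination rule are illustrative assumptions.

    import torch
    import torchvision.transforms.functional as TF

    # Hypothetical affine primitives the society can bid to apply
    # (the paper's actual transformation set may differ).
    TRANSFORMS = [
        lambda img: TF.rotate(img, 90),
        lambda img: TF.rotate(img, -90),
        lambda img: TF.hflip(img),
        lambda img: TF.affine(img, angle=0, translate=[0, 0], scale=1.0, shear=[20.0]),
    ]

    def classify_by_composition(img, bidders, classifier, num_steps=4):
        """img: (1, 28, 28) tensor; bidders: one bidding network per primitive;
        classifier: frozen, pre-trained MNIST classifier."""
        for _ in range(num_steps):
            bids = [float(b(img)) for b in bidders]       # each bidder prices the current image
            winner = max(range(len(bids)), key=bids.__getitem__)
            img = TRANSFORMS[winner](img)                 # winning primitive re-represents the input
        return classifier(img.unsqueeze(0)).argmax(dim=-1)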

Many thanks to our sources of inspiration: Republic, Society of Mind, Reinforcement Learning Economies