From corporations to organisms, many large-scale systems in our world are composed of smaller working components whose collective function serves a larger objective. We can therefore view such a system as a group of entities, each with its own simpler objective, so that the complex behavior of the larger system emerges from the optimization of these individual objectives. While many current learning methods in artificial intelligence (more specifically, reinforcement learning) learn tasks or behaviors using a single large-scale learner parametrized by a deep neural network, in this paper we investigate framing the learning problem as a society of learners, which would allow for the development of a hierarchy of task complexity.
We test the transfer-learning capability of our method against monolithic hierarchical reinforcement learning (HRL) using the Gym-Minigrid environment on the right. The pre-training task requires the agent to navigate to the green goal square, while in the transfer task the agent is rewarded for reaching the blue goal square. Both the Monolithic and our Decentralized HRL methods are provided with three primitives with the following objectives:
Go to and open red door
Go to Green Goal
Go to Blue Goal
The policies for these primitives have been pre-trained using Proximal Policy Optimization (Schulman et al., 2017).
We observe evidence suggesting that decentralized reinforcement learning can offer a benefit when transferring to new tasks. Credit Conserving Vickrey Cloned, an instantiation of our method, learns faster in both the pre-training task and the transfer task than a monolithic baseline that directly optimizes for the MDP objective.
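The name of this instantiation refers to the Vickrey (sealed-bid, second-price) auction, in which the highest bidder wins but pays the second-highest bid. The sketch below illustrates only that generic auction rule applied to the three primitives listed above. It is not the method itself: the credit-conserving and cloning components are not shown, and the function name, primitive labels, and bid values are hypothetical.

```python
from typing import Dict, Tuple


def vickrey_auction(bids: Dict[str, float]) -> Tuple[str, float]:
    """Return (winner, price) for a sealed-bid second-price auction.

    The highest bidder wins and pays the second-highest bid.
    Ties are broken by the order in which bids were inserted.
    """
    if len(bids) < 2:
        raise ValueError("a second-price auction needs at least two bidders")
    # Rank bidders from highest to lowest bid. sorted() is stable even with
    # reverse=True, so tied bidders keep their insertion order.
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1]  # the second-highest bid
    return winner, price


# Hypothetical bids placed by the three primitives in some state.
example_bids = {
    "open_red_door": 0.2,
    "go_to_green_goal": 0.7,
    "go_to_blue_goal": 0.5,
}
winner, price = vickrey_auction(example_bids)
print(winner, price)
```

Under this rule the winning primitive would be the one that acts in the environment. Charging the second-highest bid rather than the winner's own bid is the property that makes truthful bidding a dominant strategy in a standard Vickrey auction.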
The society can also learn to dynamically select computations in a computation graph. In the Mental Rotation task on the right (adapted from Chang et al., 2019), the society learns to classify transformed MNIST digits correctly by composing a sequence of affine transformations to re-represent the input in a form that can be classified correctly by a pre-trained MNIST classifier.
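To make the idea of composing affine transformations concrete, here is a minimal sketch that operates on a single 2D point instead of an MNIST image and uses a hand-picked sequence instead of a learned one. All function names are hypothetical, and the choice of rotations and translations as the available transformations is an assumption made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def rotation(degrees: float) -> Matrix:
    """3x3 homogeneous matrix for a counter-clockwise rotation about the origin."""
    t = math.radians(degrees)
    return [[math.cos(t), -math.sin(t), 0.0],
            [math.sin(t), math.cos(t), 0.0],
            [0.0, 0.0, 1.0]]


def translation(dx: float, dy: float) -> Matrix:
    """3x3 homogeneous matrix for a translation by (dx, dy)."""
    return [[1.0, 0.0, dx],
            [0.0, 1.0, dy],
            [0.0, 0.0, 1.0]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def compose(transforms: Sequence[Matrix]) -> Matrix:
    """Combine transforms so that the first one in the list is applied first."""
    result: Matrix = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for m in transforms:
        result = matmul(m, result)  # later transforms multiply on the left
    return result


def apply(m: Matrix, point: Tuple[float, float]) -> Tuple[float, float]:
    """Apply a homogeneous affine matrix to a 2D point."""
    x, y = point
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


# A point is "corrupted" by a rotation followed by a translation.
original = (1.0, 0.0)
corrupted = apply(compose([rotation(90), translation(2.0, 0.0)]), original)

# A hand-picked sequence that undoes the corruption, standing in for
# the sequence that would be selected step by step in the task.
restored = apply(compose([translation(-2.0, 0.0), rotation(-90)]), corrupted)
print(corrupted, restored)
```

In the task described above, each step of such a sequence would be chosen dynamically, and the result would be passed to the pre-trained classifier rather than compared against a known original.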