Less is more

The Dispatcher / Executor Principle for Multi-Task RL

Martin Riedmiller, Tim Hertweck, Roland Hafner        

Google DeepMind

Abstract

In this work we introduce the dispatcher/executor principle for the design of multi-task Reinforcement Learning controllers. It aims at boosting generalisation by partitioning the controller into two entities based on the type of knowledge they require: one that requires general world knowledge to understand the task (the 'dispatcher') and one that requires device-specific knowledge to compute the controls (the 'executor') - and by connecting the two through a strongly regularizing communication channel. The core rationale behind this position paper is that changes in structure and design principles can improve generalisation and drastically increase data-efficiency. It is in some sense a 'yes, and ...' response to the current trend of training huge neural networks on vast amounts of data and betting on emerging generalisation properties. While we agree on the power of scaling - in the sense of Sutton's 'bitter lesson' - we give evidence that considering structure and adding design principles can be a valuable and critical component, in particular when device-dependent real-world experience is not abundant but a precious resource.


Approach

The dispatcher/executor (D/E) principle for multi-task reinforcement learning suggests a) the separation into a dispatcher module that contains general world knowledge and understands the task, b) an executor module that learns the specific interaction with a particular device, and c) a regularizing communication channel between these two modules that allows for abstract and compositional communication. As a proof of concept we suggest a concrete D/E architecture implementation for learning robot manipulation (see figure below). The dispatcher receives the raw observations, consisting of camera images, proprioception, other sensor values and the task specification. Depending on the task, it selects an executor and communicates restricted information about task and scene to it. To illustrate the principle, we suggest using a masking operation to identify target objects and an edge operator to transfer basic general information about the scene. The executor trains on that input to achieve the communicated goal. The drastic reduction of irrelevant information leads to faster learning and a massive boost in generalisation.

Figure 1: The dispatcher/executor architecture for learning to lift an object.
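As a rough illustration of this data flow (not the implementation used in the paper), the sketch below assumes a colour-threshold mask, a simple gradient-based edge operator and placeholder Dispatcher/Executor classes; the point is only that nothing but the restricted message ever reaches the executor.

```python
# Minimal, illustrative sketch of the D/E interface (assumptions, not the paper's code):
# colour-threshold masking, gradient-based edges, and a dummy executor policy.
import numpy as np

COLOURS = {"red": (255, 0, 0), "green": (0, 255, 0), "blue": (0, 0, 255)}

def colour_mask(image, colour, tol=60.0):
    """Binary mask marking pixels close to the requested object colour."""
    target = np.array(COLOURS[colour], dtype=np.float32)
    dist = np.linalg.norm(image.astype(np.float32) - target, axis=-1)
    return (dist < tol).astype(np.float32)

def edge_map(image):
    """Crude gradient-magnitude edges carrying only general scene geometry."""
    grey = image.mean(axis=-1)
    gx = np.abs(np.diff(grey, axis=1, prepend=grey[:, :1]))
    gy = np.abs(np.diff(grey, axis=0, prepend=grey[:1, :]))
    return np.clip(gx + gy, 0.0, 255.0) / 255.0

class Executor:
    """Device-specific low-level policy; it only ever sees the restricted message."""
    def act(self, message, proprioception):
        # A trained policy would map (masks, edges, proprioception) to motor commands.
        _features = np.stack([message["target_mask"],
                              message["support_mask"],
                              message["edge_map"]])
        return np.zeros(7)  # placeholder action for a 7-DoF arm

class Dispatcher:
    """Holds general task knowledge; never forwards raw pixels to the executor."""
    def __init__(self, executors):
        self.executors = executors

    def step(self, task, image, proprioception):
        top, bottom = task  # e.g. ("red", "blue") for 'stack red on blue'
        message = {
            "target_mask": colour_mask(image, top),
            "support_mask": colour_mask(image, bottom),
            "edge_map": edge_map(image),
        }
        return self.executors["stack"].act(message, proprioception)

dispatcher = Dispatcher({"stack": Executor()})
image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
action = dispatcher.step(("red", "blue"), image, proprioception=np.zeros(14))
```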

Results

The D/E architecture can be trained by various methods, e.g. RL from scratch (see paper). Here we demonstrate the benefit of the D/E principle for distilling a pre-trained single-task policy into a more general multi-task policy. As a teacher, we use a policy for stacking 'red on blue' that was trained by Reinforcement Learning from experience directly on the real robot. Although very performant (96% success rate, see diagram below), this policy uses a standard neural network controller and can only solve this single task. Next, we distilled this policy into a D/E architecture. As shown in the diagram below, only a little performance is lost for stacking 'red on blue' (89% success rate, leftmost pillar). The crucial point is that this policy now generalises, without any further learning, to all other potential color combinations, i.e. stack blue on green, green on red, ... (see video 2). Also, the dispatcher can now sequence a number of executor calls, leading to indefinite continuous play (see video 3). Novel tasks like building two towers, a three-stack or putting objects in a bowl can be achieved by changing the dispatcher accordingly (videos 4-6). No further training of the low-level executor is required - it only needs to be sent the 'right' message about what to do.
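The distillation step can be pictured as behaviour cloning on the dispatcher's restricted message. In the minimal sketch below, the teacher, the message format (simulated object poses rather than masks and edge maps) and the linear student are illustrative stand-ins, not the setup used on the real robot.

```python
# Hedged sketch of distilling a single-task teacher into a D/E student by
# behaviour cloning; all components are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in teacher for 'stack red on blue': acts on raw state, single task only.
W_teacher = rng.normal(size=(13, 7))
def teacher_policy(raw_obs):
    red, blue, proprio = raw_obs[0:3], raw_obs[6:9], raw_obs[9:16]
    return np.tanh(np.concatenate([red, blue, proprio]) @ W_teacher)

def dispatcher_message(raw_obs, task):
    """Restricted, colour-agnostic message: only the poses of the two task
    objects (in their roles 'top'/'bottom') plus proprioception get through."""
    objects = {"red": raw_obs[0:3], "green": raw_obs[3:6], "blue": raw_obs[6:9]}
    top, bottom = task
    return np.concatenate([objects[top], objects[bottom], raw_obs[9:16]])

# 1) Collect data from the teacher on the only task it knows ('red on blue').
X, Y = [], []
for _ in range(5000):
    raw_obs = rng.normal(size=16)
    X.append(dispatcher_message(raw_obs, ("red", "blue")))
    Y.append(teacher_policy(raw_obs))
X, Y = np.stack(X), np.stack(Y)

# 2) Fit the executor on the restricted message only (least squares as a
#    stand-in for supervised training of the student network).
W_student, *_ = np.linalg.lstsq(X, Y, rcond=None)

def d_e_student(raw_obs, task):
    return dispatcher_message(raw_obs, task) @ W_student

# 3) Because colour identity never reaches the executor, the same student can
#    be dispatched to any colour combination without further training.
action = d_e_student(rng.normal(size=16), ("green", "red"))
```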

This teacher/student scenario shows one way to go from a single-task policy trained in a restricted setup to a versatile multi-task controller. The single-task teacher policy was trained entirely by Reinforcement Learning on the real robot. Massive generalisation effects are achieved through the dispatcher, which understands the tasks, abstracts away irrelevant information ('less is more') and calls the executor accordingly.

Video 1: The controller learned the single task 'stack red on blue' by reinforcement learning from scratch on the real robot.

Video 2: After transferring the single-task teacher policy to the D/E executor, D/E is immediately able to stack arbitrary objects from the RGB sets.

Figure 2: Real robot stacking. The teacher policy (blue) was originally trained to stack red on blue objects. After distillation, the D/E controller transfers zero-shot to other tasks/color combinations with strong success rates (orange), whereas the original single-task controller (blue) fails completely. The variance in performance stems from the fact that the object shapes for the different colors vary quite significantly; e.g. stacking green on red objects is significantly more difficult due to the shapes of the respective objects.

Zero-shot generalisation to further novel tasks

Using the above general (stacking) executor, new behaviours can be generated simply by modifying the dispatcher. The D/E approach is even robust to new objects in the scene, thanks to the enforced restricted communication between dispatcher and executor (videos 5 and 6). A sketch of such dispatcher-side task sequences follows the videos below.

Video 3: Playful continuous stacking of cubes.

Video 4: Building a tower by dispatcher communication 'blue on green', 'red on blue'.

Video 5: Building two towers by dispatcher communication 'red on blue', 'green on yellow'.

Video 6: Putting two objects in a bowl by dispatcher communication 'green on blue', 'red on blue'.
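To illustrate how these behaviours arise purely on the dispatcher side, here is a minimal, hedged sketch of a dispatcher that sequences calls to the unchanged stacking executor. The task names and the run_task/execute_goal helpers are hypothetical; the goal pairs mirror the video captions above.

```python
# Hedged sketch: new behaviours from dispatcher-side task programs only;
# the low-level executor is reused unchanged for every goal.
from typing import Callable, Dict, List, Tuple

Goal = Tuple[str, str]  # (object to move, object/receptacle to place it on)

TASK_PROGRAMS: Dict[str, List[Goal]] = {
    "three_block_tower": [("blue", "green"), ("red", "blue")],    # video 4
    "two_towers":        [("red", "blue"), ("green", "yellow")],  # video 5
    "objects_in_bowl":   [("green", "blue"), ("red", "blue")],    # video 6
}

def run_task(task_name: str, execute_goal: Callable[[Goal], None]) -> None:
    """Dispatcher loop: hand one restricted goal at a time to the executor."""
    for goal in TASK_PROGRAMS[task_name]:
        execute_goal(goal)

# Example: the same executor used for stacking builds a three-block tower.
run_task("three_block_tower", execute_goal=lambda g: print("executor goal:", g))
```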

Multi-task Reinforcement Learning from scratch: 3x speedup

Figure 3: Multi-task Reinforcement Learning from scratch to learn to lift cubes of three different colors (in simulation). Using the same amount of training data (20k episodes), the D/E architecture has learned to lift the cube of any color perfectly (blue pillar), whereas the standard architecture still performs poorly (orange pillar). The standard architecture only catches up after three times as much training (60k episodes, green pillar). Applying the D/E principle to the design of the learning architecture can thus be significantly more data-efficient.