Learning Few-Shot Imitation as Cultural Transmission:
Robust Real-Time Cultural Transmission without Human Data

Supplementary Material

DeepMind Cultural General Intelligence Team

Avishkar Bhoopchand, Bethanie Brownfield, Adrian Collister, Agustin Dal Lago, Ashley Edwards, Richard Everett, Alexandre Fréchette, Edward Hughes, Kory W. Mathewson, Piermaria Mendolicchio, Yanko Oliveira, Julia Pawar, Miruna Pîslar, Alex Platonov, Evan Senter, Sukhdeep Singh, Alexander Zacherl, Lei M. Zhang

The following footage is provided as supplementary material for the paper. Please see the paper for detailed information on the agent, environment and goals of the project.

MEDAL-ADR - Agent Behaviour

Our cultural transmission agent (MEDAL-ADR, in blue) finds an expert (in red) in a held-out task, follows it along a path through goals while navigating terrain and obstacles, and continues to display the demonstrated trajectory within the same episode after the expert has dropped out. Note that the trails left behind by both avatars are not visible to the agent; they are shown only to make these videos easier for human observers to follow.

Expert Bot

Expert Human

Human Trajectory Probe Tasks

One example of each type of probe task. The probes aim to cover a wide, representative range of crossings and colour combinations (as far as possible with only ten probes). In addition, the complex-world probes all aim to give clean demonstrations of jumping and/or crouching behaviours and of navigation around vertical obstacles. The human movement pattern in every probe is goal-directed and near-optimal (it incurs no score penalties) but clearly distinct from that of a scripted bot: the human takes some time to get situated in the first few seconds and does not always take exactly the same path twice. In all videos our agent is marked in blue and the expert in red.

Empty world, 4-goal game
Full expert demonstration

Empty world, 4-goal game
Expert drops out half-way

Empty world, 4-goal game
No expert demonstration

Empty world, 5-goal game
Full expert demonstration

Empty world, 5-goal game
Expert drops out half-way

Empty world, 5-goal game
No expert demonstration

Complex world, 5-goal game
Full expert demonstration

Complex world, 5-goal game
Expert drops out half-way

Complex world, 5-goal game
No expert demonstration

Behaviour Over Time During Training

Progression of emergent social-learning behaviours in a fixed environment at selected points during training (measured in billions of steps). In all videos our agent is marked in blue and the expert in red.

Training steps: 8.6 bn
Initial Exploration

Training steps: 15.9 bn
Following

Training steps: 18.2 bn
Memorization

Training steps: 26.7 bn
Independence

Effects of ADR During an Experiment

Accumulation of navigation skills by the social-learning agent under ADR training. From left to right: horizontal obstacles, horizontal obstacles in a larger world, horizontal and vertical obstacles, and horizontal and vertical obstacles over bumpy terrain. In all videos our agent is marked in blue and the expert in red. A minimal sketch of the ADR mechanism appears after the video captions below.

Horizontal obstacles

+ Larger world

+ Vertical obstacles

+ Bumpy terrain
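For concreteness, here is a minimal sketch of the ADR mechanism that drives this progression, under the assumption that each environment parameter is sampled from a range whose boundary widens when the agent performs well at that boundary. The parameter names, threshold, and step size are illustrative, not the paper's exact procedure.

```python
# Hedged sketch of automatic domain randomisation (ADR): sampling ranges
# start narrow, and the upper bound of a parameter's range is widened when
# the agent scores well with that parameter pinned at the boundary.
EXPAND_THRESHOLD = 0.75  # assumed score required to widen a range
EXPAND_STEP = 0.1        # assumed increment per expansion

# [low, high] sampling ranges; training begins with trivial worlds.
ranges = {
    "world_size": [0.0, 0.0],
    "obstacle_density": [0.0, 0.0],
    "terrain_bumpiness": [0.0, 0.0],
}

def maybe_expand(param: str, boundary_score: float, cap: float = 1.0) -> None:
    """Widen the sampling range for `param` if the agent performs well
    when the parameter is fixed at its current upper boundary."""
    if boundary_score >= EXPAND_THRESHOLD:
        ranges[param][1] = min(ranges[param][1] + EXPAND_STEP, cap)
```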

Solitary Agent Behaviour (M-----)

Removing expert demonstrations, and consequently all components that depend on them, the dropout (D) and the attention loss (AL), leaves the agent alone to determine the correct goal-sphere ordering. Under the default exploration strategy built into MPO, the solitary agent quickly becomes too risk-averse to solve the task: it learns to avoid all goal spheres and does not make use of expert demonstrations at test time either.

Generalisation: World Space

MEDAL-ADR generalises across world-space parameters, demonstrating both following and recall across much of the space. The space of worlds is parameterised by the size and bumpiness of the terrain and by the density of obstacles. To quantify generalisation over this space, we generate tasks with worlds drawn from the Cartesian product of obstacle complexity and terrain complexity, each paired with a perfect expert bot and a game sampled uniformly from those with 2 crossings and 5 spheres. In all videos our agent is marked in blue and the expert in red. A sketch of this task-generation protocol appears after the video captions below.

Obstacle Complexity: 0.0
Terrain Complexity: 0.0

Obstacle Complexity: 1.0
Terrain Complexity: 0.0

Obstacle Complexity: 0.0
Terrain Complexity: 1.0

Obstacle Complexity: 1.0
Terrain Complexity: 1.0
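As an illustration of this protocol, the following self-contained sketch generates one evaluation task per grid point; the EvalTask container, field names, and grid resolution are assumptions made for illustration, not the actual task-generation API.

```python
# Sketch of the world-space evaluation grid: one task per point in the
# Cartesian product of obstacle complexity and terrain complexity, each
# paired with a perfect expert bot and a 5-sphere, 2-crossing game.
import itertools
from dataclasses import dataclass

@dataclass
class EvalTask:  # illustrative container, not the real task type
    obstacle_complexity: float
    terrain_complexity: float
    num_spheres: int = 5      # fixed game size, as described above
    num_crossings: int = 2    # fixed number of crossings
    expert: str = "perfect_bot"

def world_space_tasks(grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """One evaluation task per (obstacle, terrain) complexity pair."""
    return [EvalTask(obstacle_complexity=o, terrain_complexity=t)
            for o, t in itertools.product(grid, grid)]
```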

Generalisation: Game Space

The space of games is defined by the number of goals in the world and by the number of crossings contained in the correct navigation path between them. To quantify generalisation over this space, we generate tasks across the range of feasible “N-goal M-crossing” games, with a perfect expert bot and a flat, empty world of size 20×20. In all videos our agent is marked in blue and the expert in red. A sketch of how this game space can be enumerated appears after the video captions below.

Spheres: 4
Crossings: 1

Spheres: 5
Crossings: 2

Spheres: 6
Crossings: 4

Generalisation: Expert Space

The space of experts is defined by the movement speed and the action distribution of the expert in the world. Experts can be either scripted bots, allowing us to precisely control their movement speed and action noise, or human players with more realistic and diverse movement patterns. To quantify generalisation over this space, we generate tasks with expert bots drawn from the Cartesian product of movement speed and action noise. In all videos our agent is marked in blue and the expert in red. A sketch of this grid appears after the video captions below.

Bot Noise: 0.0
Bot Max Speed: 3.0

Bot Noise: 0.5
Bot Max Speed: 13.0

Bot Noise: 0.0
Bot Max Speed: 17.0
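The grid below sketches how such expert-bot configurations might be enumerated. The specific speed and noise values are illustrative (the captions above show speeds 3.0, 13.0, and 17.0 and noise levels 0.0 and 0.5), and interpreting a noise level as the probability of a random action is our assumption.

```python
# Illustrative sketch of the expert-space grid: scripted-bot configurations
# drawn from the Cartesian product of movement speed and action noise.
import itertools

BOT_MAX_SPEEDS = (3.0, 13.0, 17.0)  # values seen in the captions above
BOT_NOISE_LEVELS = (0.0, 0.5)       # probability of a random action (assumed)

def expert_space():
    """One scripted-bot configuration per (speed, noise) pair."""
    return [{"max_speed": speed, "action_noise": noise}
            for speed, noise in itertools.product(BOT_MAX_SPEEDS,
                                                  BOT_NOISE_LEVELS)]
```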

Recall Analysis

We observe that MEDAL-ADR sustains some performance even in episodes much longer than those in the training distribution (1800 steps). Here is an example of a 3600-step episode in which the expert bot drops out after 900 steps.


Trajectory Plot Examples

Trajectory plots for the MEDAL-ADR agent over a single episode. The coloured segments of the lines correspond to the colour of the goal sphere that the agent or expert has entered, and the ×s mark the points at which the agent entered an incorrect goal. Here, position refers to the agent’s position along the z-axis. In all plots our agent is drawn in blue and the expert in red. A minimal plotting sketch appears after the panel descriptions below.

a) The bot is absent for the whole episode.

b) The bot shows a correct trajectory for 900 steps followed by an incorrect trajectory.

c) The bot shows a correct trajectory in the first half of the episode and then drops out.

d) The bot shows an incorrect trajectory in the first half of the episode and then drops out.
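The sketch below shows one way such a plot can be produced with matplotlib; it is not the authors' plotting code, and all variable names are hypothetical.

```python
# Minimal trajectory-plot sketch in the style described above: the line is
# coloured by the goal sphere currently occupied (grey outside spheres) and
# incorrect goal entries are marked with an 'x'.
import numpy as np
import matplotlib.pyplot as plt

def plot_trajectory(times, z_position, goal_colour, wrong_goal_times):
    """times: array of steps; z_position: agent z at each step;
    goal_colour: per-step sphere colour or None; wrong_goal_times:
    steps at which an incorrect goal was entered."""
    fig, ax = plt.subplots()
    for i in range(len(times) - 1):
        colour = goal_colour[i] if goal_colour[i] is not None else "lightgrey"
        ax.plot(times[i:i + 2], z_position[i:i + 2], color=colour)
    for t in wrong_goal_times:
        idx = min(np.searchsorted(times, t), len(times) - 1)
        ax.plot(times[idx], z_position[idx], "x", color="black")
    ax.set_xlabel("Step")
    ax.set_ylabel("Position (z-axis)")
    return fig
```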

The LIDAR Sensor

The LIDAR sensor rays emanating from an avatar, showing only rays that collide with an object in the world.
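As a rough illustration of how such a sensor can be implemented, the sketch below casts rays over a grid of angles and keeps only those that hit an object; `raycast` stands in for the engine's physics query and is a hypothetical callback, as are the ray counts and range.

```python
# Hedged sketch of a LIDAR-style sensor: rays are cast from the avatar over
# a grid of yaw and pitch angles, and only rays that collide with an object
# within range return a reading (matching what the figure shows).
import math

def lidar_readings(origin, raycast, n_yaw=16, n_pitch=8, max_range=20.0):
    """`raycast(origin, direction, max_range)` is assumed to return a hit
    record, or None if the ray reaches max_range without colliding."""
    hits = []
    for i in range(n_yaw):
        yaw = 2.0 * math.pi * i / n_yaw
        for j in range(n_pitch):
            pitch = math.pi * (j / (n_pitch - 1) - 0.5)  # -90 to +90 deg
            direction = (math.cos(pitch) * math.cos(yaw),
                         math.sin(pitch),
                         math.cos(pitch) * math.sin(yaw))
            hit = raycast(origin, direction, max_range)
            if hit is not None:
                hits.append(hit)  # keep only colliding rays
    return hits
```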