Generalization of Reinforcement Learners with Working and Episodic Memory

Meire Fortunato*, Melissa Tan*, Ryan Faulkner*, Steven Hansen*, Adrià Puigdomènech Badia, Gavin Buttimore, Charlie Deck, Joel Z Leibo, Charles Blundell

PsychLab

The tasks in this family are built on the open-sourced PsychLab environment, which simulates a psychology laboratory in first person. The agent is presented with a series of one or more consecutive images; each such set is called a `trial'. Each episode consists of multiple trials.
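To make the trial/episode structure concrete, here is a minimal Python sketch, assuming a hypothetical per-trial `agent.respond` interface and per-trial rewards; it is not the actual PsychLab API.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    stimulus: str          # e.g. an image identifier
    correct_response: str  # e.g. a look direction

def run_episode(trials, agent):
    """An episode is a sequence of trials; reward is given per trial."""
    total_reward = 0.0
    for trial in trials:
        response = agent.respond(trial.stimulus)
        if response == trial.correct_response:
            total_reward += 1.0
    return total_reward
```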

Please refer to the YouTube video descriptions for additional per task information (e.g. details on scale and stimuli variants).

Arbitrary Visuomotor Mapping (AVM): In this task, a series of images is presented, each with an associated look direction (e.g. up, left). The agent is rewarded if it recalls and performs the associated movement pattern the next time it sees a given image in the episode. Train and holdout test levels use a different set of images (stimuli) and a different number of trials per episode (scale).
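A hedged sketch of the AVM reward rule, assuming the look direction is demonstrated on an image's first presentation and tested on repeats; the dictionary bookkeeping is illustrative, not the task's implementation:

```python
def avm_step(image, agent_direction, associations, cued_direction):
    """One AVM trial over an episode-long `associations` dict."""
    if image not in associations:
        # First presentation: the look direction is demonstrated.
        associations[image] = cued_direction
        return 0.0
    # Repeat presentation: reward for recalling the associated direction.
    return 1.0 if agent_direction == associations[image] else 0.0
```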

Continuous Recognition: This task presents a series of images, with rewards given for correctly indicating whether an image has been shown previously in the episode. Train and holdout test levels use a different set of images (stimuli) and a different number of trials per episode (scale).
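The reward rule can be sketched as a running `seen` set, with "seen"/"new" answers as illustrative placeholders for the agent's two responses:

```python
def continuous_recognition_rewards(images, answers):
    """Reward 1.0 whenever the agent's seen/new answer is correct."""
    seen, rewards = set(), []
    for image, answer in zip(images, answers):
        correct = "seen" if image in seen else "new"
        rewards.append(1.0 if answer == correct else 0.0)
        seen.add(image)
    return rewards
```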

Change Detection: The agent sees two consecutive patterns separated by a variable delay and has to correctly indicate whether the two patterns differ. Train and holdout test levels use a different color set for the objects in the pattern (stimuli) and a different delay duration separating the patterns (scale).
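A minimal sketch of how such a trial could be generated, with the color list, square count, and change probability as illustrative assumptions:

```python
import random

def change_detection_trial(colors, num_squares=4, p_change=0.5):
    """Return the two patterns and whether they differ."""
    pattern1 = [random.choice(colors) for _ in range(num_squares)]
    pattern2 = list(pattern1)
    changed = random.random() < p_change
    if changed:
        i = random.randrange(num_squares)
        pattern2[i] = random.choice([c for c in colors if c != pattern1[i]])
    # The environment shows pattern1, waits a variable delay, then shows
    # pattern2; the agent is rewarded for answering "different" iff changed.
    return pattern1, pattern2, changed
```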

What then Where: The agent is shown a single `challenge' MNIST digit, then an image of that digit together with three other digits, each placed along an edge of the rectangular screen. It then has to correctly indicate the location of the `challenge' digit. Train and holdout test levels use different MNIST digits (stimuli) and a different delay duration separating the `what' and `where' phases (scale).
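A sketch of the two-phase structure, with the edge names and agent interface as illustrative assumptions (the variable delay between phases is enforced by the environment):

```python
import random

def what_then_where_trial(digits, agent):
    """One trial: a 'what' phase, a delay, then a 'where' phase."""
    challenge = random.choice(digits)
    agent.observe_challenge(challenge)            # 'what' phase
    # ... variable delay enforced here (the 'scale' parameter) ...
    others = random.sample([d for d in digits if d != challenge], 3)
    placements = [challenge] + others
    random.shuffle(placements)
    layout = dict(zip(["top", "bottom", "left", "right"], placements))
    chosen = agent.choose_edge(layout)            # 'where' phase
    return 1.0 if layout[chosen] == challenge else 0.0
```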


Arbitrary Visuomotor Mapping: human agent play

Continuous Recognition: human agent play

Change Detection: human agent play

What then Where: human agent play

Spot the Difference

This family tests whether the agent can correctly identify the difference between two nearly identical scenes. All the tasks in this family are variants of a basic setup in which the agent has to move from the first room to the second, with a ‘delay’ corridor in between.

Please refer to the YouTube video descriptions for additional per task information (e.g. details on scale and stimuli variants).

Spot the Difference Basic: The basic setup. The agent has to move from the first room to the second, with a ‘delay’ corridor in between where the agent is held for a number of frames. In the second room, one object will have changed its color compared to the first room, and the agent has to correctly identify it. Train and holdout test levels use a different object color set (stimuli) and a different number of frames in the delay period (scale).
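Level generation for the basic variant can be sketched as follows, with the color list and two-object rooms as illustrative assumptions (the color list is assumed to be larger than the number of objects):

```python
import random

def make_spot_the_difference_rooms(colors, num_objects=2):
    """Room 2 copies Room 1 with exactly one object's color changed."""
    room1 = random.sample(colors, num_objects)   # distinct colors per object
    room2 = list(room1)
    changed = random.randrange(num_objects)
    room2[changed] = random.choice([c for c in colors if c not in room1])
    return room1, room2, changed  # the agent must identify `changed`
```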


Spot the Difference Passive: A simplified version of Spot the Difference Basic. By placing Room 1’s blocks right next to the corridor entrance, we guarantee that the agent will always see them. Train and holdout test levels use a different object color set (stimuli) and a different number of frames in the delay period (scale).


Spot the Difference Multi-Object: In this version, there are more than two objects in each room. Train and holdout test levels use a different object color set (stimuli) and a different number of objects in each room (scale).


Spot the Difference Motion: Instead of differing in color between rooms, the altered block follows a different motion pattern; all objects in the rooms look the same. Train and holdout test levels use different motion patterns (stimuli) and a different number of frames in the delay period (scale).

Spot the Difference Basic: task layout

Spot the Difference Passive: task layout

Spot the Difference Multi-Object: task layout

Spot the Difference Motion: task layout

Spot the Difference Basic: human agent play

Spot the Difference Passive: human agent play

Spot the Difference Multi-Object: human agent play

Spot the Difference Motion: human agent play

Goal Navigation

This task family was inspired by the Morris Watermaze (Miyake and Shah, 1999) setup used with rodents in behavioral neuroscience. The agent is rewarded every time it successfully reaches the goal; once it gets there, it is respawned at a random location in the arena and has to find its way back to the goal. The goal location is re-randomized at the start of each episode.
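The episode loop can be sketched on an illustrative grid; the grid dynamics, `policy` interface, and step cap are assumptions, not the actual environment:

```python
import random

def goal_navigation_episode(grid_size, policy, max_steps=200):
    """Goal fixed per episode; agent respawns randomly on every goal visit."""
    goal = (random.randrange(grid_size), random.randrange(grid_size))
    pos = (random.randrange(grid_size), random.randrange(grid_size))
    reward = 0.0
    for _ in range(max_steps):
        dx, dy = policy(pos)  # the policy returns a move such as (0, 1)
        pos = (min(max(pos[0] + dx, 0), grid_size - 1),
               min(max(pos[1] + dy, 0), grid_size - 1))
        if pos == goal:
            reward += 1.0  # rewarded on every visit to the goal ...
            pos = (random.randrange(grid_size),
                   random.randrange(grid_size))  # ... then respawned
    return reward
```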

Please refer to the YouTube video descriptions for additional per task information (e.g. details on scale and stimuli variants).

Invisible Goal Empty Arena: The arena has no buildings, so the agent must navigate by the skybox; the goal is not visible to the agent. Train and holdout test levels use a different spawn region for the goal (stimuli) and a different map grid size (scale).

Invisible Goal With Buildings: There are rectangular buildings at fixed, non-randomized locations in the arena that the agent can use as reference points. The goal is not visible to the agent. Train and holdout test levels use a different spawn region for the goal (stimuli) and a different map grid size (scale).

Visible Goal With Buildings: Similar to "Invisible Goal With Buildings", except that the goal is visible as an oval object.

Visible Goal Procedural Maze: A visible goal in a procedurally generated maze. Train and holdout test levels use a different spawn region for the goal (stimuli) and a different map grid size (scale).

Invisible Goal Empty Arena: task layout

Invisible Goal With Buildings: task layout

Visible Goal With Buildings: task layout

Visible Goal Procedural Maze: task layout

Invisible Goal Empty Arena: human agent play

Invisible Goal With Buildings: human agent play

Visible Goal With Buildings: human agent play

Visible Goal Procedural Maze: human agent play

Transitive Inference

This task tests whether an agent can learn an overall transitive ordering over a chain of objects when presented only with ordered pairs of adjacent objects.

Please refer to the YouTube video descriptions for additional per task information (e.g. details on scale and stimuli variants).


Transitive inference is a form of reasoning in which one infers a relation between items that have not been explicitly compared to each other. In humans, performance on probe pairs, i.e. pairs with a symbolic distance greater than one that exclude the anchor (end) objects, tends to correlate with awareness of the implied hierarchy. For example, in the chain A < B < C < D < E, the pair <B, D> has a symbolic distance of two and contains neither anchor.


As an illustrative example (see image below): Given a `transitive chain' of five objects {A, B, C, D, E}, where we assume A is the lowest-valued object and E the highest, we begin with a demonstration phase in which we present the agent with the pairs of adjacent objects <A, B>, <B, C>, <C, D>, <D, E>. In this demo phase we scramble the order in which the pairs are presented and also scramble the objects within each pair, so that an agent may see <D, C> followed by <A, B>, etc. The pairs are presented one at a time, and the agent needs to correctly identify the higher-valued object in the current pair in order to proceed to the next pair. Once the demo phase is completed, we show the agent a single, possibly scrambled challenge pair. This challenge pair always consists of the object second from the left and the object second from the right in the transitive chain, in this case <B, D>. The agent's task is again to go to the higher-valued object. The value order is shown here solely for illustration; the agent cannot see it. Train and holdout test levels use a different object color set (stimuli) and a different chain length for the transitive ordering (scale).
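The demo/challenge structure described above can be sketched directly; the list-based representation is illustrative (the real task renders objects in a 3D room):

```python
import random

def make_transitive_trials(chain):
    """`chain` is ordered lowest to highest value, e.g. list("ABCDE")."""
    demo = [[chain[i], chain[i + 1]] for i in range(len(chain) - 1)]
    random.shuffle(demo)          # scramble the order of the pairs
    for pair in demo:
        random.shuffle(pair)      # scramble the objects within each pair
    challenge = [chain[1], chain[-2]]  # e.g. <B, D> for a 5-object chain
    random.shuffle(challenge)
    return demo, challenge

def higher_valued(pair, chain):
    """The correct choice is the object later in the underlying order."""
    return max(pair, key=chain.index)
```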


Transitive Inference task diagram

Transitive Inference: task layout

Transitive Inference: human agent play