Generalization to New Actions in Reinforcement Learning

Ayush Jain* Andrew Szot* Joseph J. Lim

University of Southern California

International Conference on Machine Learning (ICML), 2020


How can agents solve decision-making tasks when the available actions (tools or skills) have not been seen before?

A fundamental trait of intelligence is the ability to achieve goals in the face of novel circumstances, such as making decisions from new action choices. However, standard reinforcement learning assumes a fixed set of actions and requires expensive retraining when given a new action set. To make learning agents more adaptable, we introduce the problem of zero-shot generalization to new actions. We propose a two-stage framework where the agent first infers action representations from action information acquired separately from the task. A policy flexible to varying action sets is then trained with generalization objectives. We benchmark generalization on sequential tasks, such as selecting from an unseen tool-set to solve physical reasoning puzzles and stacking towers with novel 3D shapes.
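One way to picture a policy that is "flexible to varying action sets" is to score each available action from its representation, so the output size follows the size of the action set. The sketch below is a minimal illustration under assumptions, not the architecture from the paper: the dot-product scoring, the three-dimensional vectors, and the names `action_distribution` and `select_action` are all hypothetical.

```python
import math
import random


def action_distribution(state_features, action_embeddings):
    """Softmax over however many actions are currently available.

    Hypothetical scoring rule: dot product of state features and each
    action's embedding. The paper's actual policy network is not shown here.
    """
    scores = [
        sum(s * a for s, a in zip(state_features, emb))
        for emb in action_embeddings
    ]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(state_features, action_embeddings, rng):
    """Sample an index into the current action set."""
    probs = action_distribution(state_features, action_embeddings)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Made-up state and embeddings, for illustration only.
state = [0.5, -1.0, 2.0]
train_actions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
new_actions = [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0], [-1.0, 0.0, 0.5]]

train_probs = action_distribution(state, train_actions)  # 2 actions
new_probs = action_distribution(state, new_actions)      # 3 unseen actions
chosen = select_action(state, new_actions, random.Random(0))
```

Because the same function handles two or three actions without any change, a set of unseen actions can be evaluated as long as each one comes with a representation.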



Generalization Results

We propose four benchmark environments for evaluating generalization to new actions.

  • Chain Reaction Tool Environment (CREATE): Select which tool to place and where to place it to get the red ball to the goal location (green). Evaluates ability to select new tools.

  • Shape Stacking: Select which shape to place and where to place it above the table to stack the highest possible tower. Evaluates stacking with new shapes.

  • Grid World: Select 5-step skills to avoid lava and reach the goal. Evaluates utilizing a new skillset.

  • Recommender: Recommend items to users. Evaluated on new items. (No Videos)
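To make the Grid World action space concrete, a skill can be pictured as a fixed sequence of five primitive moves that the agent commits to as a single action. The sketch below is an assumption-laden illustration, not the benchmark's code: the grid layout, the cell symbols, the move letters, and the rules that a blocked move leaves the agent in place and that a skill stops early on lava or the goal are all hypothetical.

```python
# Hypothetical primitive moves: (row delta, column delta).
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

# Hypothetical layout: "." is free, "X" is lava, "G" is the goal.
GRID = [
    "....",
    ".X..",
    "...G",
]


def run_skill(grid, start, skill):
    """Apply one 5-step skill and return (final position, status)."""
    if len(skill) != 5:
        raise ValueError("a skill is a sequence of exactly 5 moves")
    row, col = start
    for step in skill:
        d_row, d_col = MOVES[step]
        new_row, new_col = row + d_row, col + d_col
        # Assumption: moves that would leave the grid are ignored.
        if 0 <= new_row < len(grid) and 0 <= new_col < len(grid[0]):
            row, col = new_row, new_col
        cell = grid[row][col]
        if cell == "X":
            return (row, col), "lava"
        if cell == "G":
            return (row, col), "goal"
    return (row, col), "running"


reached = run_skill(GRID, (0, 0), "RRDDR")
burned = run_skill(GRID, (0, 0), "DRRRR")
stuck = run_skill(GRID, (0, 0), "UUUUU")
```

Under this framing, a new skillset is simply a set of 5-move sequences the policy did not see during training.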

The following videos show a learned policy evaluated on randomly sampled action sets; they were not hand-picked.

CREATE Obstacle

Training Examples

Testing Success

Testing Failures


Training Examples

Testing Success

Testing Failures


Training Examples

Testing Success

Testing Failures

Shape Stacking



Grid World



Testing on Out-of-distribution Actions

We test the performance of a learned policy on unseen tool classes in the CREATE environment and unseen shape classes in the Shape Stacking environment.

  • CREATE Training Tools: Variations of Trampoline, Ramp, Ball, See-saw, Cannon, Bucket.

  • CREATE Testing Tools: Variations of Fan, Funnel, Conveyer Belt, Triangle, Lever.

  • Stacking Training Shapes: Variations of Domes, Rectangles, Capsules, Triangles, Arches, Spheres.

  • Stacking Testing Shapes: Variations of Cylinders, Tetrahedrons, Cubes, Cones, Angled-Rectangles, Angled-Triangles.

For more details about these tools and shapes, please refer to CREATE Environment Details and Shape Stacking Environment Details.
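The out-of-distribution split listed above can be sketched as two class-disjoint pools from which action sets are sampled. The class names below are taken from the lists above; everything else is an assumption for illustration: the number of variations per class, the `(class, index)` encoding of a variation, and the helper names `make_variations` and `sample_action_set` are hypothetical.

```python
import random

# Class names as listed on this page.
CREATE_TRAIN_CLASSES = ["Trampoline", "Ramp", "Ball", "See-saw", "Cannon", "Bucket"]
CREATE_TEST_CLASSES = ["Fan", "Funnel", "Conveyer Belt", "Triangle", "Lever"]


def make_variations(classes, per_class):
    """Build a pool of (class name, variation index) pairs.

    The per-class count is a placeholder, not the benchmark's real number.
    """
    return [(name, idx) for name in classes for idx in range(per_class)]


def sample_action_set(pool, size, rng):
    """Draw a random action set without replacement from one pool."""
    return rng.sample(pool, size)


rng = random.Random(0)
train_pool = make_variations(CREATE_TRAIN_CLASSES, per_class=3)
test_pool = make_variations(CREATE_TEST_CLASSES, per_class=3)

train_set = sample_action_set(train_pool, 5, rng)
test_set = sample_action_set(test_pool, 5, rng)
```

Keeping the pools disjoint at the class level means that every tool in a test action set belongs to a class the policy never encountered during training.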

Obstacle Training

Obstacle Testing

Seesaw Training

Seesaw Testing

Push Training

Push Testing

Shape Stack Training

Shape Stack Testing

More CREATE Tasks

The CREATE benchmark consists of 12 tasks in total. The videos below show evaluations on new actions with the same train-test split as the original 3 tasks above.










t-SNE Visualization of Learned Action Representations

We test whether the action encoder extracts semantic information from high-dimensional action observations. In the following visualizations, the action representations inferred for unseen actions are plotted and labeled with semantic information, such as the tool, shape, or skill class they belong to.

Environment Details