Zero-shot Policy Learning with Spatial Temporal Reward Decomposition on Contingency-aware Observation

This site provides supplemental material to the paper, Zero-shot Policy Learning with Spatial Temporal Reward Decomposition on Contingency-aware Observation.

Abstract

It is a long-standing challenge to allow an intelligent agent to learn in one environment and generalize to an unseen environment without further data collection and fine-tuning. In this paper, we consider a zero-shot generalization problem setup that complies with biological intelligent agents' learning and generalization processes. The agent is first presented with previous experiences in the training environment, along with a task description in the form of trajectory-level sparse rewards. Later, when it is placed in the new testing environment, it is asked to perform the task without any interaction with the testing environment. We find this setting natural for biological creatures and, at the same time, challenging for previous methods. Behavior cloning, state-of-the-art RL, and other zero-shot learning methods perform poorly on this benchmark. Given a set of experiences in the training environment, our method learns a neural function that decomposes the sparse reward into particular regions of a contingency-aware observation as a per-step reward. Based on such decomposed rewards, we further learn a dynamics model and use Model Predictive Control (MPC) to obtain a policy. Since the rewards are decomposed onto finer-granularity observations, they naturally generalize to new environments that are composed of similar basic elements. We demonstrate our method on a wide range of environments, including a classical video game, Super Mario Bros, as well as a robotic continuous control task. Please refer to the project page for more visualized results.
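To make the decomposition step above concrete, here is a minimal sketch written with PyTorch under our own assumptions about names and shapes (PatchScorer, patch_dim, and the 9-region grid are hypothetical, not the released code): a small network scores each local region of the contingency-aware observation at every step, and the sum of those scores over a trajectory is regressed against the trajectory-level sparse reward.

```python
# Minimal sketch of the reward-decomposition step (our assumptions, not the
# released code). A small scoring network maps each local region of a
# contingency-aware observation (a crop around the agent) to a per-step,
# per-region score; the scores summed over a trajectory are regressed
# against the trajectory-level sparse reward.
import torch
import torch.nn as nn

class PatchScorer(nn.Module):
    """Scores local regions around the agent (hypothetical shapes)."""
    def __init__(self, patch_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(patch_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (T, R, patch_dim) -> per-step, per-region scores (T, R)
        return self.net(patches).squeeze(-1)

def trajectory_loss(scorer, patches, sparse_return):
    """Regress the sum of decomposed scores onto the sparse trajectory reward."""
    per_step_scores = scorer(patches)          # (T, R)
    predicted_return = per_step_scores.sum()   # sum over time and regions
    return (predicted_return - sparse_return) ** 2

# Toy usage: one trajectory of T=50 steps with R=9 regions around the agent.
scorer = PatchScorer(patch_dim=16)
optimizer = torch.optim.Adam(scorer.parameters(), lr=1e-3)
patches = torch.randn(50, 9, 16)
sparse_return = torch.tensor(12.0)
trajectory_loss(scorer, patches, sparse_return).backward()
optimizer.step()
```

Because each score is tied to a local region rather than to the whole frame, the learned scores can be reused wherever the same local elements reappear in a new level.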

Below are visualizations of greedy actions selected using the learned contingency-aware observation scores (a minimal sketch of this selection follows the environment list). We find that our decomposition generalizes well to unseen environments.

1-1 (Train)

2-1 (Test)

5-1 (Test)

5-1 (Test)
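The greedy actions visualized above can be read off from the decomposed scores. A rough sketch under assumed interfaces (the one-step dynamics callable and the scorer signature are ours, not the authors'):

```python
# Sketch of greedy action selection from the decomposed scores (assumed
# interfaces; `dynamics` is a hypothetical one-step model, `scorer` is a
# per-region scoring function like the one sketched above).
import torch

def greedy_action(scorer, dynamics, obs_patches, actions):
    """Return the action whose predicted next observation scores highest."""
    best_action, best_score = None, float("-inf")
    for a in actions:
        next_patches = dynamics(obs_patches, a)            # (R, patch_dim)
        score = scorer(next_patches.unsqueeze(0)).sum()     # sum over regions
        if score.item() > best_score:
            best_action, best_score = a, score.item()
    return best_action

# Toy usage with stand-in callables.
scorer = lambda patches: patches.sum(dim=-1)        # fake (1, R) scores
dynamics = lambda patches, a: patches + 0.1 * a     # fake one-step model
obs = torch.randn(9, 16)
print(greedy_action(scorer, dynamics, obs, actions=[0, 1, 2]))
```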

A visualization of the predicted motion.

The multi-step prediction of the agent's location.
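The multi-step location predictions shown here come from rolling out a learned one-step dynamics model; combined with the decomposed per-step scores, such rollouts drive the MPC planner mentioned in the abstract. Below is a minimal random-shooting MPC sketch under our own assumed interfaces (mpc_plan and the dynamics and score_fn callables are hypothetical, not the released implementation).

```python
# Minimal random-shooting MPC sketch (our assumptions): roll out a learned
# one-step dynamics model for H steps under sampled action sequences,
# accumulate decomposed per-step scores, and execute the first action of
# the best sequence.
import numpy as np

def mpc_plan(dynamics, score_fn, state, n_actions, horizon=8, n_samples=128,
             rng=np.random.default_rng(0)):
    """Random-shooting MPC over discrete actions (hypothetical interface)."""
    best_seq, best_return = None, -np.inf
    for _ in range(n_samples):
        seq = rng.integers(0, n_actions, size=horizon)
        s, total = state, 0.0
        for a in seq:
            s = dynamics(s, a)        # predicted next state / agent location
            total += score_fn(s)      # decomposed per-step reward
        if total > best_return:
            best_seq, best_return = seq, total
    return int(best_seq[0])           # execute only the first planned action

# Toy usage with stand-in models.
dynamics = lambda s, a: s + np.array([1.0, a - 1.0])   # fake motion model
score_fn = lambda s: -abs(s[1])                        # prefer staying level
print(mpc_plan(dynamics, score_fn, state=np.zeros(2), n_actions=3))
```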

Code will be made available upon approval.


Execution of SAP Policy on the training task


Execution of NHP Policy on the training task


Execution of Behavioral Cloning on the training task


Execution of SAP Policy on the test task


Execution of NHP Policy on the test task


Execution of Behavioral Cloning on the test task