SynPo: Synthesized Policies for Transfer and Adaptation across Environments and Tasks

The ability to transfer in reinforcement learning is key to building generally capable artificial agents. In this post, we consider the problem of learning to transfer simultaneously across both environments and tasks and, perhaps more importantly, of doing so from only a sparse subset of all possible (ENV, TASK) combinations. We highlight the main idea of our paper and present the major experimental results.

Transfer Learning across Environments and Tasks

Figure 1. We consider a transfer learning scenario in reinforcement learning that involves transfer across both tasks and environments. Three settings are shown: (a) Transfer Setting 1, (b) Transfer Setting 2, and (c) Transfer Setting 3 (see text for details). Red dots denote seen combinations, gray dots denote unseen combinations, and arrows denote the transfer direction.

Composing Environment and Task Specific Policies

An overview of the SynPo pipeline. Given a task and an environment, the corresponding embeddings are retrieved to compute the policy coefficients and reward coefficients. These coefficients then linearly combine the shared bases to synthesize a policy (and a reward predictor) for the agent.
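
To make the composition step concrete, below is a minimal sketch of this synthesis in PyTorch. The class name, dimensions, and the exact form of the coefficient networks are illustrative assumptions; the precise architecture (e.g., how state and action features enter the bilinear form) is specified in the paper.

```python
import torch
import torch.nn as nn

class SynPoSketch(nn.Module):
    """Minimal sketch of SynPo-style policy/reward synthesis (names are illustrative)."""

    def __init__(self, n_envs, n_tasks, emb_dim, n_basis, state_dim, n_actions):
        super().__init__()
        # Learnable embeddings for environments and tasks.
        self.env_emb = nn.Embedding(n_envs, emb_dim)
        self.task_emb = nn.Embedding(n_tasks, emb_dim)
        # Shared bases: each basis maps state features to per-action scores;
        # a parallel set of bases is kept for the reward predictor.
        self.policy_basis = nn.Parameter(0.01 * torch.randn(n_basis, state_dim, n_actions))
        self.reward_basis = nn.Parameter(0.01 * torch.randn(n_basis, state_dim, n_actions))
        # Map the (env, task) embedding pair to coefficients over the bases.
        self.policy_coef = nn.Linear(2 * emb_dim, n_basis)
        self.reward_coef = nn.Linear(2 * emb_dim, n_basis)

    def forward(self, state_feat, env_id, task_id):
        # Retrieve the pair's embeddings and turn them into mixing coefficients.
        z = torch.cat([self.env_emb(env_id), self.task_emb(task_id)], dim=-1)
        # The coefficients linearly combine the shared bases into pair-specific parameters.
        theta = torch.einsum('bk,kda->bda', self.policy_coef(z), self.policy_basis)
        rho = torch.einsum('bk,kda->bda', self.reward_coef(z), self.reward_basis)
        action_logits = torch.einsum('bd,bda->ba', state_feat, theta)  # synthesized policy
        reward_pred = torch.einsum('bd,bda->ba', state_feat, rho)      # synthesized reward predictor
        return action_logits, reward_pred
```

Only the small embedding tables are specific to an (ENV, TASK) pair; the bases and coefficient maps are shared, which is what lets a policy be synthesized for a combination never seen during training.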

Experiment Setups and Results

GridWorld and Tasks. We design twenty 16x16 grid-aligned mazes (e.g., left image). The mazes are similar in layout but differ from each other in topology. Each maze contains five colored blocks serving as "treasures", and the agent's goal is to collect the treasures in a pre-specified order (e.g., pick up the red treasure and then the green one). At each time step, the "egocentric" view observed by the agent (e.g., right figure) consists of the agent's surroundings within a 3x3 window and the treasures' locations. The locations of the agent and the treasures are randomized at the start of each episode. We consider twenty tasks in each environment, resulting in 400 (ENV, TASK) pairs in total.
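
As a rough illustration of this observation, the sketch below assembles a 3x3 egocentric crop around the agent together with per-treasure location maps. The function name, padding convention, and feature layout are assumptions for exposition, not the paper's actual encoding.

```python
import numpy as np

def egocentric_observation(maze, agent_pos, treasure_pos, window=3):
    """Illustrative encoding of the agent's view: a 3x3 crop of the maze around the
    agent plus one location map per colored treasure. The paper's exact feature
    layout may differ; this only mirrors the textual description above."""
    h, w = maze.shape
    r = window // 2
    # Pad with walls (1) so the crop is well defined at the maze border.
    padded = np.pad(maze, r, mode='constant', constant_values=1)
    ar, ac = agent_pos
    local = padded[ar:ar + window, ac:ac + window]    # 3x3 surroundings of the agent
    treasures = np.zeros((len(treasure_pos), h, w))   # one channel per treasure location
    for i, (tr, tc) in enumerate(treasure_pos):
        treasures[i, tr, tc] = 1.0
    return local, treasures
```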

Transfer setting 1. In this setting (cf. Figure 1(a)), we randomly choose 144 (seen) pairs as the training set, under the constraint that every environment and every task appears at least once. The remaining 256 (unseen) pairs are used for testing. We report the average success rate (AvgSR) over 100 tests, as well as the variance across three random training seeds.
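
One simple way to realize such a split is sketched below, under the assumption that a random matching is used to satisfy the coverage constraint; the actual split procedure may differ in detail.

```python
import random

def sample_seen_pairs(n_envs=20, n_tasks=20, n_seen=144, seed=0):
    """Illustrative construction of the seen/unseen split: cover every environment
    and every task at least once via a random matching, then fill the rest of the
    seen set at random."""
    rng = random.Random(seed)
    envs, tasks = list(range(n_envs)), list(range(n_tasks))
    rng.shuffle(envs)
    rng.shuffle(tasks)
    seen = set(zip(envs, tasks))            # one pair per environment and per task
    all_pairs = [(e, t) for e in range(n_envs) for t in range(n_tasks)]
    rng.shuffle(all_pairs)
    for pair in all_pairs:
        if len(seen) >= n_seen:
            break
        seen.add(pair)
    unseen = [p for p in all_pairs if p not in seen]
    return sorted(seen), sorted(unseen)     # e.g. 144 seen, 256 unseen
```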

Main results. We compare SynPo against other approaches and visualize the results in the leftmost figure below, where SynPo proves the most effective. Next, we study how many seen (ENV, TASK) pairs are needed for effective transfer to UNSEEN combinations. Finally, we fine-tune our approach with reinforcement learning to see whether learning the policy with RL benefits transfer.

(1) Comparison against different methods. SynPo achieves the best results (on UNSEEN).

(2) How many seen (ENV, TASK) pairs do we need to transfer well? 40% is the magic number.

(3) Does reinforcement learning help transfer? Yes, it helps, especially for SynPo.

A qualitative example of a synthesized policy that succeeds at the task only after fine-tuning with reinforcement learning. Interestingly, although the overall trajectory of the fine-tuned policy stays largely the same as that of the imitation-learned one, the agent understands the target task better and avoids picking up the "wrong treasure" along the way.

(Note: the target task is to "first pick up blue and then pick up purple".)

Imitation

Imitation + RL

Optimal

Transfer settings 2 & 3. In these settings (cf. Figure 1(b) & (c)), the seen pairs (denoted P) are constructed from ten environments and ten tasks; the remaining ten environments and ten tasks form the unseen set (denoted Q).

One transfer learning option corresponds to setting 3: the agent learns policies (by learning embeddings and then composing) directly on the target Q pairs. Then, using the embeddings from both P and Q, we can synthesize policies for two sets of 10 x 10 (ENV, TASK) pairs in which the environment is taken from P and the task from Q, or vice versa. We call these "cross pairs". This mimics the style of "learning in giant jumps and connecting dots".

The other option is to learn the "cross pairs" first, which corresponds to setting 2: the embeddings for either the unseen tasks or the unseen environments need to be learned, but not both. Once these embeddings are learned, we use them to synthesize policies for the pairs in Q. This mimics the style of "incremental learning of small pieces and integrating knowledge later".
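
For concreteness, the sketch below writes out the P/Q split and the train/test sets implied by the two strategies; the index ranges and variable names are assumptions for illustration only.

```python
# Ten environments/tasks form P (ids 0-9) and the remaining ten form Q (ids 10-19).
P_ENVS, Q_ENVS = range(0, 10), range(10, 20)
P_TASKS, Q_TASKS = range(0, 10), range(10, 20)

p_pairs = [(e, t) for e in P_ENVS for t in P_TASKS]
q_pairs = [(e, t) for e in Q_ENVS for t in Q_TASKS]
cross_pairs = [(e, t) for e in P_ENVS for t in Q_TASKS] + \
              [(e, t) for e in Q_ENVS for t in P_TASKS]

# Setting 3 ("giant jumps"): learn embeddings on P and Q, synthesize for the cross pairs.
setting3 = {'train': p_pairs + q_pairs, 'test': cross_pairs}
# Setting 2 ("small pieces"): learn embeddings on P and the cross pairs, synthesize for Q.
setting2 = {'train': p_pairs + cross_pairs, 'test': q_pairs}
```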

Main results. Table 2 contrasts the two transfer strategies. Setting 2 clearly attains stronger performance: "incremental learning" requires fewer training samples, since it learns the embeddings for either the tasks or the environments but not both, whereas setting 3 requires learning both simultaneously. Interestingly, this result seems to align with how humans learn effectively. We also compare SynPo to an MLP baseline in this setting and find that SynPo generalizes better in all cases.

Transfer Setting 2

Transfer Setting 3

Finally, we visualize the AvgSR for each (ENV, TASK) pair under settings 2 and 3 in the two images on the left. The AvgSRs are obtained by averaging the success rate over 10 random runs and are marked in the grids.

Purple cells show results on the Q set and red cells show the rest (the darker the color, the better the performance). The visualization suggests that cross-task transfer is easier than cross-environment transfer.
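
For reference, here is a minimal matplotlib sketch of this kind of per-pair grid, assuming a 20x20 array of AvgSR values with the Q set occupying the last ten environments and tasks; it is purely illustrative and not the script used to produce the figures.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_avg_sr(avg_sr, q_envs=range(10, 20), q_tasks=range(10, 20)):
    """Draws a per-pair AvgSR grid: avg_sr is a 20x20 array of success rates
    averaged over 10 runs. Q-set cells use a purple colormap and the remaining
    cells a red one; darker shades mark higher success rates."""
    in_q = np.zeros_like(avg_sr, dtype=bool)
    in_q[np.ix_(list(q_envs), list(q_tasks))] = True
    # Mask each block so it is rendered by only one of the two colormaps.
    plt.imshow(np.ma.masked_where(~in_q, avg_sr), cmap='Purples', vmin=0, vmax=1)
    plt.imshow(np.ma.masked_where(in_q, avg_sr), cmap='Reds', vmin=0, vmax=1)
    plt.xlabel('Task')
    plt.ylabel('Environment')
    plt.colorbar(label='AvgSR over 10 runs')
    plt.show()
```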

If you are interested in further implementation and experimental details, please refer to the main paper and supplementary material. Please consider citing the following entry if you use any related resource in your research.