Stage Conscious Attention Network (SCAN):

A Demonstration-Conditioned Policy for Few-Shot Imitation

AAAI 2022

Jia-Fong Yeh*, Chi-Ming Chung*, Hung-Ting Su, Yi-Ting Chen, Winston H. Hsu

Project Description

Few-shot imitation learning (FSIL) aims to enable agents to behave in unseen environments by learning from a few expert demonstrations. Using behavioral cloning (BC) to solve this problem has become a popular research direction. The following capabilities are essential in robotics applications:

  • Behaving in compound tasks that contain multiple stages.

  • Retrieving knowledge from a few length-variant and stage-misaligned demonstrations.

  • Learning from a different type of expert.

We found that no previous work achieves all of these capabilities at the same time. Therefore, we study the FSIL problem under the union of the above capability settings and introduce a novel stage conscious attention network (SCAN) that retrieves knowledge from a few demonstrations simultaneously.


Our contributions can be summarized as follows:

  • SCAN is the first work that solves the FSIL problem under the union of the three capability settings.

  • SCAN can detect important frames of misaligned stages and is robust to length-variant demonstrations.

  • Experimental results demonstrate that SCAN performs complicated compound tasks without fine-tuning and provides explainable visualizations.

You can refer to the official paper (coming soon), the arXiv paper (with supplementary materials), and the GitHub repository (coming soon) using the buttons above.

Method

A demonstration-conditioned policy [1] generates actions conditioned on the current observation and the given demonstrations. Note that different knowledge-retrieval methods can be used in this process. We claim that such a policy implicitly learns the relationship between expert motion and agent motion.
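As a rough illustration, the sketch below shows what the interface of such a policy can look like in PyTorch. This is a minimal sketch, not the paper's actual code: the names (DemoConditionedPolicy, obs_encoder, demo_encoder, actor) are hypothetical, and the average-pooling retrieval merely stands in for whichever knowledge-retrieval method a concrete model uses.

```python
import torch
import torch.nn as nn

class DemoConditionedPolicy(nn.Module):
    """Illustrative interface: the action depends on both the current
    observation and the expert demonstrations (hypothetical names)."""

    def __init__(self, obs_encoder: nn.Module, demo_encoder: nn.Module, actor: nn.Module):
        super().__init__()
        self.obs_encoder = obs_encoder    # embeds the current observation
        self.demo_encoder = demo_encoder  # embeds each demonstration frame
        self.actor = actor                # maps (obs, task context) to an action

    def forward(self, obs, demos):
        # obs:   (B, C, H, W) current observation
        # demos: (B, K, T, C, H, W) K demonstrations of T frames each
        obs_emb = self.obs_encoder(obs)                    # (B, D)
        B, K, T = demos.shape[:3]
        demo_emb = self.demo_encoder(demos.flatten(0, 2))  # (B*K*T, D)
        demo_emb = demo_emb.view(B, K, T, -1)              # (B, K, T, D)
        # Naive retrieval for illustration: average-pool all demo frames.
        context = demo_emb.mean(dim=(1, 2))                # (B, D)
        return self.actor(torch.cat([obs_emb, context], dim=-1))
```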


The stage conscious attention network (SCAN) contains three main components: the visual head, the stage conscious attention (SCA), and the actornet. For the detailed architecture of these components, please refer to the (arXiv) paper. The stage conscious attention aims to find soft (relaxed) stage information for each observation received during the playout. To this end, it computes attention scores between the embedding of each playout observation and the demonstration-frame embeddings, and uses the scores as weights to generate the task embeddings. Intuitively, frames in the same stage should contain similar knowledge, so the task embeddings can guide the actornet to a better understanding of the task.
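The sketch below illustrates the attention step just described, assuming the embeddings are already produced by the visual head. It is a simplified single-demonstration version with hypothetical names (StageConsciousAttention, query, key), not the official implementation; please refer to the paper and repository for the real architecture.

```python
import torch.nn as nn
import torch.nn.functional as F

class StageConsciousAttention(nn.Module):
    """Simplified sketch of SCA: each playout observation attends over all
    demonstration frames, and the attention-weighted sum of the frame
    embeddings forms the task embedding handed to the actornet."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)  # projects the observation embedding
        self.key = nn.Linear(dim, dim)    # projects demonstration-frame embeddings

    def forward(self, obs_emb, demo_emb):
        # obs_emb:  (B, D)    embedding of the current playout observation
        # demo_emb: (B, T, D) embeddings of T demonstration frames
        q = self.query(obs_emb).unsqueeze(1)             # (B, 1, D)
        k = self.key(demo_emb)                           # (B, T, D)
        scores = (q * k).sum(-1) / k.shape[-1] ** 0.5    # (B, T) scaled dot products
        weights = F.softmax(scores, dim=-1)              # same-stage frames get high weight
        task_emb = (weights.unsqueeze(-1) * demo_emb).sum(1)  # (B, D) task embedding
        return task_emb, weights                         # weights double as the attention map
```

At each playout step, the returned weights give one row of the attention maps visualized in the experiments below.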

Experiment Results

The compared baselines are BC, meta-BC (MIL [2], DAML [3]), TaskEmb [4], and TANet. These baselines cover the mainstream families in the FSIL problem. All baselines use our visual head and actornet for a fair comparison.

The compound-task experiment verifies the performance of the proposed SCAN and the baselines in the FSIL problem under the aforementioned capability settings. In this experiment, the PP task and the PPP task are 2-stage and 3-stage compound tasks, respectively. The result figures for all methods are shown below; the proposed SCAN achieves the best performance in most cases.

  • Visualization of attention map

The stage conscious attention (SCA) clearly focuses on the important frames of the demonstrations given the playout observation. Demonstration frames in the same stage as the current observation receive higher attention scores, indicating that these frames have greater influence when the actornet computes the actions. A minimal rendering sketch follows the attention maps below.

Attention map of the PP task (2 stages)

Attention map of the PPP task (3 stages)
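As a rough how-to, the following sketch shows one way to render such an attention map from the per-step SCA weights. It assumes the weights have been stacked into a (playout steps × demonstration frames) array; the function name plot_attention_map is hypothetical.

```python
import matplotlib.pyplot as plt

def plot_attention_map(weights, path="attention_map.png"):
    """Render an attention map like the figures above (illustrative only).
    weights: (num_playout_steps, num_demo_frames) array of SCA weights,
    e.g. the per-step `weights` returned by the SCA sketch above."""
    fig, ax = plt.subplots()
    im = ax.imshow(weights, aspect="auto", cmap="viridis")
    ax.set_xlabel("demonstration frame")
    ax.set_ylabel("playout step")
    fig.colorbar(im, ax=ax, label="attention score")
    fig.savefig(path, bbox_inches="tight")
```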

The noisy-demonstration experiment further validates the robustness of the proposed stage conscious attention (SCA). A noisy demonstration contains trivial moves but still completes the task; moreover, none of the models saw noisy demonstrations during training. Our SCA attends to each demonstration first and then fuses their contexts. As the left figure shows, SCA still locates the crucial frames in the noisy demonstration and provides informative embeddings to the actornet. Hence, as the right figure shows, the proposed SCAN is only slightly affected by noisy demonstrations and its performance does not drop dramatically. A sketch of the attend-then-fuse step is given below.
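The following sketch illustrates the attend-then-fuse step under the same assumptions as the SCA sketch above; averaging the per-demonstration contexts is an illustrative choice, not necessarily the paper's fusion rule.

```python
import torch

def fuse_demo_contexts(sca, obs_emb, demo_embs):
    """Attend within each demonstration separately, then fuse.
    Because attention is computed per demonstration, trivial moves in one
    noisy demonstration cannot distort the scores of the others.
    demo_embs: (B, K, T, D) embeddings of K demonstrations."""
    contexts = []
    for k in range(demo_embs.shape[1]):
        task_emb, _ = sca(obs_emb, demo_embs[:, k])  # attend within demo k
        contexts.append(task_emb)
    # Fuse per-demonstration contexts (here: a simple average, as an assumption).
    return torch.stack(contexts, dim=1).mean(dim=1)  # (B, D) fused context
```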

The demo video of SCAN presents the motivation, a comparison of the attention mechanisms, and a playout execution. It also illustrates SCAN's strong performance together with its explainable visualizations.

References

  1. Dance et al., "Demonstration-Conditioned Reinforcement Learning for Few-Shot Imitation", ICML 2021.

  2. Finn et al., "One-Shot Visual Imitation Learning via Meta-Learning", CoRL 2017.

  3. Yu et al., "One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning", RSS 2018.

  4. James et al., "Task-Embedded Control Networks for Few-Shot Imitation Learning", CoRL 2018.

Team

Jia-Fong Yeh*

NTU, Taiwan

Chi-Ming Chung*

NTU, Taiwan

Hung-Ting Su

NTU, Taiwan

Yi-Ting Chen

NYCU, Taiwan

Winston H. Hsu

NTU, Taiwan