Hao Zhang, Hao Wang, Xiucai Huang, Wenrui Chen, and Zhen Kan*
* indicates Corresponding Author
Abstract
Reinforcement Learning (RL) based methods have been increasingly explored for robot learning. However, RL-based methods often suffer from low sampling efficiency in the exploration phase, especially for long-horizon manipulation tasks, and generally neglect the semantic information at the task level, resulting in delayed convergence or even task failure. To address these issues, we develop a Temporal-Logic-guided Hybrid policy framework (HyTL) that exploits three decision levels to facilitate robot learning. Specifically, the task specifications are encoded via linear temporal logic (LTL) to improve performance and offer interpretability, and a waypoint planning module, informed by feedback from the LTL-encoded task level, serves as the high-level policy to improve exploration efficiency. The middle-level policy chooses which behavior primitives to implement, and the low-level policy determines how to interact with the environment. We evaluate HyTL on four challenging manipulation tasks, demonstrating its effectiveness and interpretability.
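To make the three decision levels concrete, below is a minimal, hypothetical sketch of how they could be composed at each environment step; the class names, interfaces, and placeholder outputs are illustrative assumptions rather than the authors' implementation.

```python
# A minimal, hypothetical sketch of HyTL's three decision levels as described
# in the abstract; class names, interfaces, and placeholder outputs are
# illustrative assumptions, not the authors' implementation.
import numpy as np

class WaypointPlanner:
    """High level: proposes a waypoint, informed by the LTL-encoded task feedback."""
    def plan(self, obs, task_embedding):
        return np.zeros(3)                      # placeholder 3-D waypoint

class PrimitiveSelector:
    """Middle level: chooses which behavior primitive to execute."""
    PRIMITIVES = ("reach", "grasp", "push", "release")   # hypothetical primitive set
    def select(self, obs, waypoint, task_embedding):
        return "reach"                          # placeholder choice

class PrimitiveController:
    """Low level: determines how to interact, i.e., the chosen primitive's parameters."""
    def act(self, obs, primitive, waypoint):
        return np.zeros(4)                      # placeholder primitive parameters

def hytl_step(obs, task_embedding, high, mid, low):
    waypoint = high.plan(obs, task_embedding)               # where to go
    primitive = mid.select(obs, waypoint, task_embedding)   # which skill
    params = low.act(obs, primitive, waypoint)              # how to execute it
    return primitive, params
```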
Simulation Results
We recorded the following videos to visualize our simulation results for the four long-horizon manipulation tasks. The full training and evaluation videos are published in the Dataset.
Stack
Nut Assembly
Cleanup
Peg Insertion
Framework
Figure: The framework of HyTL. (a) The architecture of the Transformer encoder for representing LTL instructions. (b) The LTL progression for progressing LTL formulas. (d) The challenging manipulation situations for the simulation study.
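To illustrate the LTL progression in panel (b), here is a minimal sketch over a small LTL fragment (propositions, conjunction, disjunction, and "eventually"); the tuple-based formula syntax is an assumption chosen for illustration, not HyTL's actual encoding.

```python
# A minimal sketch of LTL progression on a small fragment. The tuple-based
# formula representation is an illustrative assumption, not HyTL's encoding.

def progress(formula, true_props):
    """Progress an LTL formula by one step, given the set of currently true propositions."""
    if formula in (True, False):
        return formula
    if isinstance(formula, str):                    # atomic proposition
        return formula in true_props
    op = formula[0]
    if op == "and":
        left, right = (progress(f, true_props) for f in formula[1:])
        if left is False or right is False:
            return False
        if left is True:
            return right
        if right is True:
            return left
        return ("and", left, right)
    if op == "or":
        left, right = (progress(f, true_props) for f in formula[1:])
        if left is True or right is True:
            return True
        if left is False:
            return right
        if right is False:
            return left
        return ("or", left, right)
    if op == "eventually":                          # F phi == phi or X F phi
        inner = progress(formula[1], true_props)
        if inner is True:
            return True
        return formula if inner is False else ("or", inner, formula)
    raise ValueError(f"unsupported operator: {op}")

# Example with the Cleanup-style instruction "eventually jello_pushed"
phi = ("eventually", "jello_pushed")
print(progress(phi, set()))                 # -> ('eventually', 'jello_pushed'): still pending
print(progress(phi, {"jello_pushed"}))      # -> True: the instruction is satisfied
```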
Skills, Skill Descriptions, and Corresponding LTL Instructions
Figure: Plots of normalized reward curves. These learning curves show the mean and standard deviation of the episodic task reward throughout training. All experiments are averaged over 6 seeds.
As shown in the figure above, we observe that:
1) algorithms guided by waypoints (Maple-way and HyTL) are more sample-efficient and converge faster than those without waypoints (Maple, Maple-LTL2Action, and TARPs-TF-LTL);
2) HyTL performs better than TARPs-TF-LTL and Maple-LTL2Action, which incorporate only the task semantics;
3) on the most challenging task, Peg Insertion, HyTL converges in the fewest episodes among all methods, with over a 30% reduction compared to the best prior work, TARPs-TF-LTL.
From the above observations, we conjecture that the unstable representation of the task module at the beginning of training hurts the performance of agents that rely heavily on the task representation to improve sampling efficiency.
In contrast, HyTL, with its hybrid decision architecture, is not only affected by the task module's representation but also relies on the waypoints generated by the planning module for guidance. Once either the planning module or the task module has been effectively updated, gradient descent proceeds rapidly in the direction that contributes to task completion. It is this complementary design in the hybrid decision architecture that allows HyTL to achieve higher learning efficiency.
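For reference, below is a minimal sketch of how the normalized reward curves in the figure above could be aggregated into a mean and standard deviation over 6 seeds; the per-task normalization constant and the smoothing window are illustrative assumptions, not the paper's exact post-processing.

```python
# A minimal sketch of aggregating learning curves over seeds, assuming each
# seed's episodic task rewards are stored as a 1-D array; normalizing by a
# task's maximum attainable reward and the smoothing window are assumptions.
import numpy as np

def aggregate_curves(per_seed_rewards, max_task_reward=1.0, window=10):
    """Return the mean and std of smoothed, normalized reward curves over seeds."""
    curves = []
    for rewards in per_seed_rewards:                       # one array per seed
        normalized = np.asarray(rewards, dtype=float) / max_task_reward
        kernel = np.ones(window) / window                  # simple moving average
        curves.append(np.convolve(normalized, kernel, mode="valid"))
    curves = np.stack(curves)                              # (num_seeds, num_points)
    return curves.mean(axis=0), curves.std(axis=0)

# Toy usage with 6 synthetic runs
rng = np.random.default_rng(0)
runs = [np.clip(np.linspace(0, 1, 200) + 0.1 * rng.standard_normal(200), 0, 1)
        for _ in range(6)]
mean_curve, std_curve = aggregate_curves(runs)
```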
Table: The compositionality scores of different methods on the four manipulation tasks.
The compositionality scores are shown in the table, where higher scores reflect better compositionality and more stable performance. As shown, Maple-way and HyTL can select and combine more appropriate behavior primitives under the guidance of waypoints, resulting in higher compositionality scores than the other methods.
Figure: The visualization of the action sketches of HyTL over 6 seeds, which shows the different action primitives that HyTL selects and combines to accomplish the above four manipulation tasks.
Interpretability via AttCAT
Figure: Heatmap of the normalized impact scores from different Transformer layers on the instruction. The top and bottom rows reflect how the agent's comprehension of the LTL instruction evolves.
As shown in the top row of the figure, the cumulative scores for all tokens except eventually (-0.85) are almost zero, reflecting that the agent does not yet have a clear concept of the LTL instruction at the beginning of training.
When the Transformer converges, higher impact scores from all layers concentrate on the token jello_pushed (+0.50), as shown in the bottom row of the figure, which implies that the agent is more likely to go directly to the position whose corresponding proposition is jello_pushed.
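To illustrate how such per-token, per-layer impact scores can be produced, the sketch below computes attention-weighted gradient-times-activation scores in the spirit of AttCAT on a tiny two-layer encoder; the toy encoder, the scalar "value head" target, and the averaging and normalization choices are assumptions for illustration and may differ from both HyTL's encoder and the exact AttCAT formulation.

```python
# A toy sketch of AttCAT-style per-token impact scores: for each layer,
# attention-received x (gradient . activation) per token. The tiny encoder,
# the scalar target, and the normalization are illustrative assumptions.
import torch
import torch.nn as nn

class TinyEncoderLayer(nn.Module):
    def __init__(self, d_model=32, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                nn.Linear(d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, attn_w = self.attn(x, x, x, need_weights=True)  # attn_w: (B, T, T)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ff(x))
        return x, attn_w

def impact_scores(layers, token_embeddings, target_fn):
    """Per-layer, per-token scores: attention received x (grad . activation)."""
    hidden, attention = [], []
    h = token_embeddings
    for layer in layers:
        h, w = layer(h)
        h.retain_grad()                     # keep gradients of intermediate token states
        hidden.append(h)
        attention.append(w)
    target_fn(h).backward()                 # scalar output of interest
    scores = []
    for h, w in zip(hidden, attention):
        grad_x_act = (h.grad * h).sum(-1)   # (B, T): gradient . activation per token
        received = w.mean(dim=1)            # (B, T): avg attention each token receives
        layer_score = received * grad_x_act
        scores.append(layer_score / layer_score.abs().sum())  # normalize per layer
    return torch.stack(scores).squeeze(1)   # (num_layers, T)

# Toy usage on the instruction tokens "eventually ( jello_pushed )"
vocab = {"eventually": 0, "(": 1, "jello_pushed": 2, ")": 3}
embed = nn.Embedding(len(vocab), 32)
layers = nn.ModuleList([TinyEncoderLayer() for _ in range(2)])
value_head = nn.Linear(32, 1)
ids = torch.tensor([[vocab[t] for t in ("eventually", "(", "jello_pushed", ")")]])
scores = impact_scores(layers, embed(ids), lambda h: value_head(h.mean(dim=1)).sum())
print(scores)    # one row of per-token impact scores per layer
```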