RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents

Zeren Chen1 2*, Zhelun Shi1 2*, Xiaoya Lu1 5*, Lehan He1 6*, Sucheng Qian1 3, Hao-Shu Fang3, Zhenfei Yin1 4, Wanli Ouyang1 4, Jing Shao1✉, Yu Qiao1, Cewu Lu3, Lu Sheng2

1 Shanghai AI Laboratory, 2 School of Software, Beihang University, 3 Shanghai Jiao Tong University, 4 University of Sydney, 5 University of Electronic Science and Technology of China, 6 Nanjing University of Posts and Telecommunications

*Equal Contribution   ✉Corresponding author   Project Leader

arXiv | Code & Dataset (Coming Soon ...)

In a world filled with a multitude of complex and varied tasks, how can we empower an agent to accomplish tasks it has never encountered during training? Recent research addresses this by employing a high-level planner to orchestrate a novel task as a composition of trained primitive skills, which are executed step by step by low-level controllers. We refer to this class of methods as Composable Generalization Agents (CGAs). Despite their promise, the community is not yet adequately prepared for CGAs, particularly due to the lack of primitive-level datasets. In this paper, we propose a primitive-level real-world robotic dataset, RH20T-P, which contains approximately 33,000 video clips covering 44 diverse and complicated robotic tasks. Each clip is manually annotated according to a set of meticulously designed primitive skills, facilitating the future development of CGAs. To validate the effectiveness of RH20T-P, we also build a promising and scalable agent on top of RH20T-P, called RA-P. Equipped with two planners specialized in task decomposition and motion planning, RA-P exhibits strong spatial perception and can adapt to novel physical skills through composable generalization.

Primitive-level Skill Definition

The primitive skills are designed to be transferable, scalable, and composable.
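As a rough illustration (not the exact schema used in RH20T-P), a primitive skill can be thought of as a small parameterized unit that a high-level planner emits and a low-level controller consumes. The skill names and fields below are hypothetical:

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical primitive-skill representation; the actual skill set and
# fields in RH20T-P may differ.
@dataclass
class PrimitiveSkill:
    name: str                                    # e.g. "move_to", "grasp" (illustrative names)
    target_object: Optional[str] = None          # placeholder filled for object-related skills
    destination: Optional[Tuple[float, float, float]] = None   # spatial info for motion-based skills
    trajectory: List[Tuple[float, float, float]] = field(default_factory=list)
    direction: Optional[Tuple[float, float, float]] = None

# A novel task is planned as an ordered composition of such primitives,
# each executed step by step by a low-level controller.
plan = [
    PrimitiveSkill(name="move_to", target_object="sponge"),
    PrimitiveSkill(name="grasp", target_object="sponge"),
]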

Annotation

Each video episode showcasing a task is divided into multiple clips according to the transitions between primitive skills. Each clip is manually annotated with its start and end frames, as well as the corresponding primitive skill. Additionally, the placeholders in object-related primitive skills are annotated based on the video. We then use the control information in RH20T to generate the spatial information for each motion-based skill, including destination, trajectory, and direction.
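For intuition only, a single annotated clip could be serialized roughly as follows; the field names and values are our own illustration, not the released annotation format:

# Illustrative annotation record for one clip (field names are hypothetical,
# not the official RH20T-P format).
clip_annotation = {
    "task": "Wipe the tabletop with a sponge",
    "start_frame": 120,                    # manually annotated clip boundaries
    "end_frame": 245,
    "primitive_skill": "move_to",          # skill label for this clip (illustrative name)
    "placeholders": {"object": "sponge"},  # filled for object-related skills
    "spatial_info": {                      # derived from RH20T control information
        "destination": [0.42, -0.10, 0.18],
        "trajectory": [[0.30, 0.05, 0.25], [0.36, -0.02, 0.21], [0.42, -0.10, 0.18]],
        "direction": [1.0, 0.0, 0.0],
    },
}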

Annotation demo of the task "Wipe the tabletop with a sponge".


RH20T-P Characteristics

Complexity

I) Special trajectories: getting around an obstacle.

II) Using tools: cutting vegetables.

III) Complex visual reasoning: placing a piece on the chessboard to complete the setup.

IV) Long-horizon planning: lighting up the lamp by plugging in the power cord and then turning on the socket.

Diversity

The distribution of physical skills (the y-axis indicates the number of clips).

Magnitude

The total number of annotated clips in each task.

Composable Generalization Agent (CGA)

To validate the effectiveness of RH20T-P, we propose a promising and scalable CGA, called RA-P (RobotAgent-Primitive). We split the functionality of the high-level planner into two components: the task planner, which decomposes tasks linguistically, and the motion planner, which predicts the spatial motion trends of the robot arm for motion-based skills. The Plan-Execute paradigm of RA-P and the detailed architecture of the high-level planner in RA-P are illustrated in (a) and (b), respectively.
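To make the Plan-Execute paradigm concrete, the sketch below shows one possible control loop under our reading of the architecture; the function names and interfaces are illustrative assumptions, not the released implementation:

# Illustrative Plan-Execute loop for a CGA such as RA-P.
# task_planner, motion_planner, and controller are hypothetical interfaces.

def run_task(instruction, observation, task_planner, motion_planner, controller):
    # High-level planning: decompose the task linguistically into primitive skills.
    primitive_plan = task_planner.decompose(instruction, observation)

    for skill in primitive_plan:
        if skill.is_motion_based:
            # Predict the spatial motion trend (e.g. destination, trajectory, direction)
            # for motion-based skills from the current observation.
            spatial_target = motion_planner.predict(skill, observation)
            controller.execute(skill, spatial_target)
        else:
            # Non-motion skills (e.g. gripper open/close) go straight to the controller.
            controller.execute(skill)
        observation = controller.get_observation()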

We evaluate RA-P on 8 novel tasks involving novel environments, novel objects, and novel physical skills, comparing against ACT and GPT-4V as baselines.

The following real-world demo video presents a qualitative comparison between our model, RA-P, and the ACT baseline, highlighting the superior capabilities of our model in spatial perception, scene adaptation, and object control, alongside its enhanced robustness to distractions. The first row shows the demos of RA-P, while the second row shows the demos of ACT.

Spatial Perception

Scene Adaptation

Object Diversity

Distractions

Citation

@article{chen2024rh20tp,
      title={RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents},
      author={Zeren Chen and Zhelun Shi and Xiaoya Lu and Lehan He and Sucheng Qian and Hao-Shu Fang and Zhenfei Yin and Wanli Ouyang and Jing Shao and Yu Qiao and Cewu Lu and Lu Sheng},
      journal={arXiv preprint arXiv:2403.19622},
      year={2024}
}