Sparse rewards present a significant challenge in reinforcement learning. Expert actions can mitigate this issue, but acquiring them explicitly is costly, whereas expert observations are often much easier to obtain. This paper presents a new approach that uses expert observations as intermediate visual goals for learning manipulation tasks with sparse rewards from pixel observations. The expert observations serve as visual goals for a goal-conditioned RL agent, which completes a task by successively reaching a series of these goals. We demonstrate the efficacy of our method on five challenging block construction tasks in simulation and show that, when combined with a DQN agent and an imitation learning agent, it significantly improves their performance while requiring 4-20 times fewer expert actions during training.
We consider five challenging block structure construction tasks from the BulletArm benchmark, in which a robot arm must build a desired structure from unstructured blocks using depth images of the scene captured by a top-down camera.
Our proposed agent is hierarchical, comprising a fixed top level and a learned bottom level. At the top level, we use the indices of expert observations within an expert episode as abstract goals for the bottom level to accomplish. The bottom level implements a goal-conditioned policy that realizes these goals one after another until the task is completed. Goal achievement is determined by comparing the abstract state produced by a state abstractor with the corresponding abstract goal; the state abstractor is a multi-class classifier pre-trained with supervised learning on expert transitions.
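To make the goal check concrete, below is a minimal PyTorch-style sketch of a state abstractor and the comparison against an abstract goal. The network architecture, the default class count of seven, and the helper names are illustrative assumptions rather than the authors' exact implementation.

import torch
import torch.nn as nn

class StateAbstractor(nn.Module):
    """Multi-class classifier mapping a depth image to an abstract state index (assumed architecture)."""
    def __init__(self, num_abstract_states: int = 7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_abstract_states)

    def forward(self, depth_image: torch.Tensor) -> torch.Tensor:
        # Returns logits over the abstract states.
        return self.head(self.encoder(depth_image))

def goal_achieved(abstractor: StateAbstractor,
                  observation: torch.Tensor,
                  abstract_goal: int) -> bool:
    """A goal counts as achieved when the predicted abstract state equals the goal index."""
    with torch.no_grad():
        predicted = abstractor(observation).argmax(dim=-1).item()
    return predicted == abstract_goal

In practice, the abstractor would be trained with a standard cross-entropy loss on expert transitions labeled with their episode indices, as described above.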
0 → 6: A deconstruction planner decomposes a fully built structure by repeatedly picking the highest block and placing it on the ground until the structure is fully decomposed. Reversing this sequence (6 → 0) yields a demonstration episode. The numbers serve both as indices into the expert episode and as abstract states.
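A hedged sketch of this deconstruction procedure is given below. The environment interface (reset_with_built_structure, is_fully_decomposed, highest_block_pose, random_ground_pose, step) is hypothetical; only the control flow mirrors the description above.

def collect_demonstration(env):
    """Deconstruct a fully built structure, then reverse the transitions
    to obtain a construction demonstration episode."""
    transitions = []
    obs = env.reset_with_built_structure()        # abstract state 0: structure fully built
    while not env.is_fully_decomposed():          # stop once abstract state 6 is reached
        # Pick the highest block and place it at a random pose on the ground.
        action = {"pick": env.highest_block_pose(),
                  "place": env.random_ground_pose()}
        next_obs, _, _, _ = env.step(action)
        transitions.append((obs, action, next_obs))
        obs = next_obs
    # Reversing the deconstruction sequence (6 -> 0) yields a construction demonstration.
    return list(reversed(transitions))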
We present an example of our agent performing a block construction task, with the target structure shown in the center. At the start, the agent is in state s, where all the blocks are on the ground, corresponding to abstract state (6). To build the desired structure, the agent must reach a state in which it has picked up a red block, resembling the scene labeled (5); the abstract goal is therefore the current abstract state minus one. Since only one action is needed to reach the desired state s′ from s, each abstract goal should be achievable within a single time step. If the goal is achieved, the agent is presented with the next one, and the process continues until the task is completed; otherwise, we reset the episode.
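The following is an illustrative episode loop for this walkthrough. The goal-conditioned policy, the environment, and the expert-observation lookup are placeholders; only the "goal equals abstract state minus one, otherwise reset" logic follows the description above.

def run_episode(env, policy, abstractor, expert_observations):
    """Run one episode; returns True if the task is completed, False if the episode is reset."""
    obs = env.reset()
    abstract_state = abstractor.predict(obs)           # e.g. 6 when all blocks are on the ground
    while abstract_state > 0:
        abstract_goal = abstract_state - 1             # index of the next expert observation
        goal_image = expert_observations[abstract_goal]
        action = policy.act(obs, goal_image)           # goal-conditioned action selection
        obs, reward, done, _ = env.step(action)
        if abstractor.predict(obs) != abstract_goal:   # goal missed after one step: reset episode
            return False
        abstract_state = abstract_goal                 # goal reached: move on to the next goal
    return True                                        # abstract state 0: structure completed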
Evaluation success rates averaged over four seeds, with shading indicating one standard error. The top row (grey background) shows DQN-based agents, and the bottom row (white background) shows SDQfD-based agents.
Performance in House-Building-1 and House-Building-2 when no expert actions are used.
@article{hoang2023learning,
  title={Learning from Pixels with Expert Observations},
  author={Hoang, Minh-Huy and Dinh, Long and Nguyen, Hai},
  journal={arXiv preprint arXiv:2306.13872},
  year={2023}
}