An example of two mugs for which a human would make different affordance-effect judgments about their bodies or handles. Each mug is shown with a depth image and an attention map that represents an affordance cue.
Conventional works that learn grasping affordances from demonstrations need to explicitly predict grasping configurations, such as gripper approach angles or grasp preshapes. Classic motion planners can then sample trajectories using these predicted configurations. In this work, our goal is instead to bridge the gap between affordance discovery and affordance-based policy learning by integrating the two objectives in an end-to-end imitation learning framework based on deep neural networks. From a psychological perspective, there is a close association between attention and affordance. Therefore, rather than explicitly modeling affordances, we propose to learn affordance cues as visual attention with an end-to-end neural network, where the attention serves as a useful signal indicating how a demonstrator accomplishes tasks. To achieve this, we propose a contrastive learning framework that consists of a Siamese encoder and a trajectory decoder. We further introduce a coupled triplet loss to encourage the discovered affordance cues to be more affordance-relevant. Our experimental results demonstrate that our model with the coupled triplet loss achieves the highest grasping success rate in a simulated robot environment.
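To make the coupled objective concrete, the following is a minimal PyTorch sketch of how a triplet loss over Siamese-encoded observations might be combined with a behavior-cloning loss on decoded trajectories. The architecture, the names (`SiameseEncoder`, `coupled_loss`), the margin, and the weighting `lam` are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """Shared-weight encoder mapping a depth image to an embedding
    interpreted as an affordance cue (illustrative architecture)."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x):
        # L2-normalize so distances in the triplet loss are well scaled
        return F.normalize(self.backbone(x), dim=-1)

def coupled_loss(encoder, decoder, anchor, positive, negative,
                 demo_traj, margin=0.5, lam=1.0):
    """Behavior-cloning loss coupled with a triplet loss on affordance
    embeddings; the weighting `lam` is a hypothetical choice."""
    z_a = encoder(anchor)    # demonstration the policy should imitate
    z_p = encoder(positive)  # demonstration sharing the same affordance
    z_n = encoder(negative)  # demonstration with a different affordance
    triplet = F.triplet_margin_loss(z_a, z_p, z_n, margin=margin)
    bc = F.mse_loss(decoder(z_a), demo_traj)  # imitate demonstrated trajectory
    return bc + lam * triplet
```

Coupling the two terms means gradients from the contrastive objective shape the same embedding the decoder consumes, which is one way the triplet loss can push the discovered cues toward affordance relevance.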
A robot could grasp a mug by reaching its gripper horizontally toward the mug body, by grasping the left and right sides of its handle, or by grasping the front and back sides of its handle. We therefore consider three candidate affordances, each a pair of an object part and an applicable grasp: body-grasp, handle-left-right-grasp, and handle-front-back-grasp.
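For illustration, this discrete set of (object part, applicable grasp) pairs could be encoded as labels; the enum below is only a hypothetical representation of that choice, not part of the model.

```python
from enum import Enum

class MugAffordance(Enum):
    """Candidate affordances: an object part paired with an applicable grasp."""
    BODY_GRASP = ("body", "horizontal reach to the mug body")
    HANDLE_LEFT_RIGHT_GRASP = ("handle", "grasp left/right sides of handle")
    HANDLE_FRONT_BACK_GRASP = ("handle", "grasp front/back sides of handle")
```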
Our framework learns from human demonstrations to reproduce human behavior using visual cues that hint at different affordances. These visual cues are learned by training a Siamese encoder, while policy imitation is performed by a behavior-cloning-based trajectory decoder.
The Siamese encoder and trajectory decoder are trained simultaneously in a contrastive learning framework.
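A rough sketch of this simultaneous training, reusing the hypothetical `SiameseEncoder` and `coupled_loss` from above, is shown below; the decoder architecture, trajectory shape, optimizer settings, and the dummy batch standing in for a real triplet dataloader are all assumptions.

```python
# Hypothetical joint training step: gradients from both the behavior-cloning
# term and the triplet term flow into the shared Siamese encoder, so the
# learned embedding stays affordance-relevant while the decoder learns
# to imitate demonstrated trajectories.
horizon, action_dim = 20, 7        # assumed trajectory length and action size
encoder = SiameseEncoder()
decoder = nn.Sequential(           # stand-in for the trajectory decoder
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, horizon * action_dim),
)
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

# Dummy batch in place of a real dataloader of
# (anchor, positive, negative, demonstrated trajectory) tuples.
batch = (torch.randn(8, 1, 96, 96), torch.randn(8, 1, 96, 96),
         torch.randn(8, 1, 96, 96), torch.randn(8, horizon * action_dim))

for anchor, positive, negative, demo_traj in [batch]:
    loss = coupled_loss(encoder, decoder, anchor, positive, negative, demo_traj)
    opt.zero_grad()
    loss.backward()
    opt.step()
```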
The detailed architecture of the trajectory encoder module. “×#” denotes that a neural network component is replicated # times.
The detailed architecture of the trajectory decoder. “×#” denotes that a neural network component is replicated # times.