Jianfeng Gao, Zhi Tao, Noémie Jaquier and Tamim Asfour
Visual imitation learning provides efficient and intuitive solutions for robotic systems to acquire novel manipulation skills. However, simultaneously learning geometric task constraints and control policies from visual inputs alone remains a challenging problem. In this paper, we propose the keypoint-based visual imitation learning (K-VIL) approach that automatically extracts sparse, object-centric, and embodiment-independent task representations from a small number of human demonstration videos. The task representation is composed of keypoint-based geometric constraints on principal manifolds, their associated local frames, and the movement primitives required for task execution. Our approach is capable of extracting such task representations from a single demonstration video, and of incrementally updating them when new demonstrations are available. To reproduce manipulation skills using the learned set of prioritized geometric constraints in novel scenes, we introduce a novel keypoint-based admittance controller. We evaluate our approach in several real-world applications, showcasing its ability to deal with cluttered scenes, viewpoint mismatch, new instances of categorical objects, and large object pose and shape variations. Our evaluation demonstrates the efficiency and robustness of our approach in both one-shot and few-shot imitation learning settings.
(a) Human demonstration videos of manipulation actions involving categorical objects with shape, pose, and trajectory variations. (b) Sampling of dense candidate points from the object surface. (c) Extraction of sparse keypoints k1, k2 subject to certain types of geometric constraints (point-to-point and point-to-curve), their associated local frames F, and the movement primitives that represent the demonstrated keypoint motions. (d) Adaptation of the learned generalizable geometric task representation to a new scene, and execution by the robot.
We propose a principal constraint estimation (PCE) algorithm that covers a variety of geometric constraints, including five basic types: point-to-point (p2p), point-to-line (p2l), point-to-plane (p2P), point-to-curve (p2c), and point-to-surface (p2S). By combining them, it is also possible to represent collinear, coplanar, parallel, and perpendicular constraints. The linear constraints, i.e., the p2p, p2l, and p2P constraints, are estimated using linear Principal Component Analysis (PCA), whereas the nonlinear constraints, i.e., the p2c and p2S constraints, are estimated using the Principal Manifold Estimation algorithm.
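To make the PCA-based estimation step concrete, the minimal sketch below classifies the linear constraint types from a single keypoint's positions across demonstrations. The function name, the variance threshold, and the return convention are illustrative assumptions rather than the original implementation; the nonlinear p2c and p2S constraints would instead require the principal manifold estimation step.

```python
# Minimal sketch of the linear part of principal constraint estimation (PCE).
# The threshold and names are assumptions for illustration, not the paper's code.
import numpy as np

def estimate_linear_constraint(keypoints, var_threshold=1e-4):
    """Classify a keypoint's constraint type from its positions across demos.

    keypoints: (N, 3) positions of the same keypoint, expressed in its local
               frame, across N demonstrations.
    Returns (constraint_type, mean, axes), where axes spans the principal
    subspace of the constraint manifold.
    """
    keypoints = np.asarray(keypoints, dtype=float)
    mean = keypoints.mean(axis=0)
    centered = keypoints - mean
    cov = centered.T @ centered / max(len(keypoints) - 1, 1)
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues
    n_active = int(np.sum(eigvals > var_threshold))  # directions with spread
    if n_active == 0:
        return "p2p", mean, eigvecs[:, :0]   # keypoint converges to a point
    if n_active == 1:
        return "p2l", mean, eigvecs[:, -1:]  # spread along a principal line
    if n_active == 2:
        return "p2P", mean, eigvecs[:, -2:]  # spread within a principal plane
    return "unconstrained", mean, eigvecs    # no linear constraint detected
```

Intuitively, the number of covariance eigenvalues above the threshold equals the dimensionality of the principal manifold. With a single demonstration, all eigenvalues vanish, so every keypoint starts as a p2p constraint and is relaxed as more demonstrations arrive, which is consistent with the incremental updates described above.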
Keypoint Admittance Control (KAC) handles a set of prioritized keypoint-based geometric constraints of the types defined above. Each keypoint is driven by an approaching force toward the principal manifold (along the orthogonal direction of the manifold) and a density force toward regions of higher probability density on the principal manifold (along its tangential direction). The demonstrated motion style used to fulfill the constraints is encoded in the movement primitive, while the similarity of the object's final pose to the demonstrated targets is controlled by the density force. K-VIL's task representation also allows extrapolation of the keypoints' target positions on the principal manifolds. This decomposition, applied in both the learning and control phases, balances imitation and extrapolation.
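As a rough illustration of this force decomposition, the sketch below computes the virtual force on a single keypoint under a p2l constraint, with a 1D Gaussian modeling the keypoint density along the principal line. All gains, names, and the density model are assumptions for illustration; the actual controller additionally handles constraint priorities and the learned movement primitives.

```python
# Minimal sketch of the KAC force decomposition for one p2l constraint.
# Gains and the Gaussian density model are illustrative assumptions.
import numpy as np

def kac_force(x, line_point, line_dir, mu_t, sigma_t,
              k_approach=50.0, k_density=10.0):
    """Virtual force on keypoint x under a point-to-line constraint.

    line_point, line_dir: a point on the principal line and its unit direction.
    mu_t, sigma_t: mean and std of the demonstrated keypoint distribution
                   along the line (1D Gaussian density model).
    """
    r = x - line_point
    t = float(line_dir @ r)                  # coordinate along the line
    foot = line_point + t * line_dir         # orthogonal projection onto line
    # Approaching force: pulls the keypoint onto the manifold (orthogonal).
    f_approach = k_approach * (foot - x)
    # Density force: pushes the keypoint toward high-density regions
    # (tangential); here the gradient of the 1D Gaussian log-density.
    f_density = k_density * ((mu_t - t) / sigma_t**2) * line_dir
    return f_approach + f_density
```

The two terms are orthogonal by construction, so satisfying the constraint (approaching force) and matching the demonstrated target distribution (density force) can be traded off independently via their gains.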
The following MuJoCo simulations visualize two examples of KAC, where the task representation is learned from human demonstrations and adapted to a real scene. The corresponding real-world execution videos are shown in the next section.
Approaching the insertion position with a p2p and a p2l constraint.
Pouring from a kettle into a teacup with a p2p and a p2P constraint.
A p2p (k0) and a p2l (k1) constraint are extracted from 3 demonstrations.
A p2p (k0) and a p2P (k1) constraint are extracted from 4 demonstrations.
In addition to the evaluation results in our paper and the accompanying video, we present additional qualitative evaluations of five real-world applications.
K-VIL handles the viewpoint mismatch between the three demonstrations (a)-(c) by aligning the corresponding local frames on the master object, the dustpan, in (d), which results in an aligned viewpoint in (e). Two p2l constraints and their probability density functions on the principal lines are visualized. The robot reproduces the table-cleaning task from a new viewpoint with a novel brush and dustpan (f), with the keypoints detected on the brush hair (g). The local frame on the dustpan is determined by the Q = 50 neighboring points, as shown in (h). (f) and (h) depict the keypoints and their movement primitives in 2D and 3D, respectively.
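For illustration, a local frame of this kind can be anchored at a point on the master object and oriented along the principal axes of its Q nearest candidate points. The use of SVD for the orientation and all names below are assumptions, not the authors' implementation.

```python
# Minimal sketch: build a local frame from the Q nearest candidate points.
import numpy as np

def local_frame(candidates, origin, Q=50):
    """Build a right-handed local frame anchored at `origin`.

    candidates: (M, 3) dense candidate points sampled on the object surface.
    origin:     (3,) location at which to anchor the frame.
    Returns a (4, 4) homogeneous transform of the local frame.
    """
    candidates = np.asarray(candidates, dtype=float)
    dists = np.linalg.norm(candidates - origin, axis=1)
    neighbors = candidates[np.argsort(dists)[:Q]]
    centered = neighbors - neighbors.mean(axis=0)
    # Principal axes of the local neighborhood define the frame orientation.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    R = vt.T
    if np.linalg.det(R) < 0:                 # enforce right-handedness
        R[:, -1] *= -1
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, origin
    return T
```

Anchoring frames to local surface geometry in this way is what lets the constraints transfer to novel object instances: the frame follows the corresponding surface region rather than a fixed object pose.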
Execution of the learned task from different viewpoints
Columns: human demonstration, geometric constraint, task representation, task execution.
From 1 demonstration: three p2p constraints. The target position learned from a third-person view is not reachable.
From 3 demonstrations: p2p + p2l constraints. The target position learned from a third-person view is not reachable.
From 4 demonstrations: one p2p constraint.
The motion of reaching the tissue is learned from human demonstrations, while the grasping and retrieving motions are predefined.
From 1 demonstration: three p2p constraints.
From 3 demonstrations: p2p + p2l constraints.
From 4 demonstrations: p2p + p2P constraints.
From 11 demonstrations: p2p + p2c constraints.
K-VIL generalizes to different kettles with large shape variations.
From 1 demonstration: three p2p constraints.
From 3 demonstrations: one p2p constraint.
From 5 demonstrations: one p2p constraint.
The constraints remain the same when more than three demonstrations are available.
From 1 demonstration: three p2p constraints.
From 3 demonstrations: p2p + p2l constraints.
From 5 demonstrations: p2p + p2l constraints.
The constraints remain the same even when more demonstrations are available.
Same viewing angle
Different viewing angle