Jianfeng Gao, Zhi Tao, Noémie Jaquier and Tamim Asfour
Visual imitation learning provides efficient and intuitive solutions for robotic systems to acquire novel manipulation skills. However, simultaneously learning geometric task constraints and control policies from visual inputs alone remains a challenging problem. In this paper, we propose the keypoint-based visual imitation learning (K-VIL) approach that automatically extracts sparse, object-centric, and embodiment-independent task representations from a small number of human demonstration videos. The task representation is composed of keypoint-based geometric constraints on principal manifolds, their associated local frames, and the movement primitives required for task execution. Our approach is capable of extracting such task representations from a single demonstration video, and of incrementally updating them when new demonstrations are available. To reproduce manipulation skills using the learned set of prioritized geometric constraints in novel scenes, we introduce a novel keypoint-based admittance controller. We evaluate our approach in several real-world applications, showcasing its ability to deal with cluttered scenes, viewpoint mismatch, new instances of categorical objects, and large object pose and shape variations. Our evaluation demonstrates the efficiency and robustness of our approach in both one-shot and few-shot imitation learning settings.
(a) Human demonstration videos of manipulation actions involving categorical objects with shape, pose, and trajectory variations. (b) Sampling of dense candidate points from the object surface. (c) Extraction of sparse keypoints k1, k2 subject to certain types of geometric constraints (point-to-point and point-to-curve), their associated local frames F, and the movement primitives that represent the demonstrated keypoint motions. (d) Adaptation of the learned generalizable geometric task representation to a new scene, and execution by the robot.
We propose a principal constraint estimation (PCE) algorithm that covers a variety of geometric constraints, including five basic types: point-to-point (p2p), point-to-line (p2l), point-to-plane (p2P), point-to-curve (p2c), and point-to-surface (p2S). By combining them, it is also possible to represent collinear, coplanar, parallel, and perpendicular constraints. The linear constraints, i.e., the p2p, p2l, and p2P constraints, are estimated using linear Principal Component Analysis (PCA), whereas the nonlinear constraints, i.e., the p2c and p2S constraints, are estimated using the Principal Manifold Estimation algorithm.
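To make the PCA-based estimation step concrete, the minimal sketch below classifies the linear constraint types from a single keypoint's positions across demonstrations. The function name, the variance threshold, and the return convention are illustrative assumptions rather than the original implementation; the nonlinear p2c and p2S constraints would instead require the principal manifold estimation step.

```python
# Minimal sketch of the linear part of principal constraint estimation (PCE).
# The threshold and names are assumptions for illustration, not the paper's code.
import numpy as np

def estimate_linear_constraint(keypoints, var_threshold=1e-4):
    """Classify a keypoint's constraint type from its positions across demos.

    keypoints: (N, 3) positions of the same keypoint, expressed in its local
               frame, across N demonstrations.
    Returns (constraint_type, mean, axes), where axes spans the principal
    subspace of the constraint manifold.
    """
    keypoints = np.asarray(keypoints, dtype=float)
    mean = keypoints.mean(axis=0)
    centered = keypoints - mean
    cov = centered.T @ centered / max(len(keypoints) - 1, 1)
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues
    n_active = int(np.sum(eigvals > var_threshold))  # directions with spread
    if n_active == 0:
        return "p2p", mean, eigvecs[:, :0]   # keypoint converges to a point
    if n_active == 1:
        return "p2l", mean, eigvecs[:, -1:]  # spread along a principal line
    if n_active == 2:
        return "p2P", mean, eigvecs[:, -2:]  # spread within a principal plane
    return "unconstrained", mean, eigvecs    # no linear constraint detected
```

Intuitively, the number of covariance eigenvalues above the threshold equals the dimensionality of the principal manifold. With a single demonstration, all eigenvalues vanish, so every keypoint starts as a p2p constraint and is relaxed as more demonstrations arrive, which is consistent with the incremental updates described above.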
Keypoint Admittance Control (KAC) handles a set of prioritized keypoint-based geometric constraints of the types defined above. Each keypoint is driven by an approaching force toward the principal manifold (along the orthogonal direction of the manifold) and a density force toward regions of higher probability density on the principal manifold (along its tangential direction). The demonstrated motion style used to fulfill the constraints is encoded in the movement primitive, while the similarity of the object's final pose to the demonstrated targets is controlled by the density force. K-VIL's task representation also allows extrapolation of the keypoints' target positions on the principal manifolds. This decomposition, applied in both the learning and control phases, balances imitation and extrapolation.
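As a rough illustration of this force decomposition, the sketch below computes the virtual force on a single keypoint under a p2l constraint, with a 1D Gaussian modeling the keypoint density along the principal line. All gains, names, and the density model are assumptions for illustration; the actual controller additionally handles constraint priorities and the learned movement primitives.

```python
# Minimal sketch of the KAC force decomposition for one p2l constraint.
# Gains and the Gaussian density model are illustrative assumptions.
import numpy as np

def kac_force(x, line_point, line_dir, mu_t, sigma_t,
              k_approach=50.0, k_density=10.0):
    """Virtual force on keypoint x under a point-to-line constraint.

    line_point, line_dir: a point on the principal line and its unit direction.
    mu_t, sigma_t: mean and std of the demonstrated keypoint distribution
                   along the line (1D Gaussian density model).
    """
    r = x - line_point
    t = float(line_dir @ r)                  # coordinate along the line
    foot = line_point + t * line_dir         # orthogonal projection onto line
    # Approaching force: pulls the keypoint onto the manifold (orthogonal).
    f_approach = k_approach * (foot - x)
    # Density force: pushes the keypoint toward high-density regions
    # (tangential); here the gradient of the 1D Gaussian log-density.
    f_density = k_density * ((mu_t - t) / sigma_t**2) * line_dir
    return f_approach + f_density
```

The two terms are orthogonal by construction, so satisfying the constraint (approaching force) and matching the demonstrated target distribution (density force) can be traded off independently via their gains.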
The following MuJoCo simulations visualize two examples of KAC, where the task representation is learned from human demonstrations and adapted to a real scene. The corresponding real-world execution videos are shown in the next section.
Approaching the insertion position with a p2p and a p2l constraint.
Pouring from a kettle into a teacup with a p2p and a p2P constraint.
A p2p (k0) and a p2l (k1) constraint are extracted from 3 demonstrations.
A p2p (k0) and a p2P (k1) constraint are extracted from 4 demonstrations.
In addition to the evaluation results in our paper and the accompanying video, we present additional qualitative evaluations of five real-world applications.
K-VIL handles the viewpoint mismatch between the three demonstrations (a)-(c) by aligning the corresponding local frames on the master object, the dustpan, in (d), which results in an aligned viewpoint in (e). Two p2l constraints and their probability density functions on the principal lines are visualized. The robot reproduces the table-cleaning task from a new viewpoint with a novel brush and dustpan (f), with the keypoints detected on the brush hair (g). The local frame on the dustpan is determined by the Q = 50 neighboring points, as shown in (h). (f) and (h) depict the keypoints and their movement primitives in 2D and 3D, respectively.
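For illustration, a local frame of this kind can be anchored at a point on the master object and oriented along the principal axes of its Q nearest candidate points. The use of SVD for the orientation and all names below are assumptions, not the authors' implementation.

```python
# Minimal sketch: build a local frame from the Q nearest candidate points.
import numpy as np

def local_frame(candidates, origin, Q=50):
    """Build a right-handed local frame anchored at `origin`.

    candidates: (M, 3) dense candidate points sampled on the object surface.
    origin:     (3,) location at which to anchor the frame.
    Returns a (4, 4) homogeneous transform of the local frame.
    """
    candidates = np.asarray(candidates, dtype=float)
    dists = np.linalg.norm(candidates - origin, axis=1)
    neighbors = candidates[np.argsort(dists)[:Q]]
    centered = neighbors - neighbors.mean(axis=0)
    # Principal axes of the local neighborhood define the frame orientation.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    R = vt.T
    if np.linalg.det(R) < 0:                 # enforce right-handedness
        R[:, -1] *= -1
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, origin
    return T
```

Anchoring frames to local surface geometry in this way is what lets the constraints transfer to novel object instances: the frame follows the corresponding surface region rather than a fixed object pose.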
Execution of the learned task from different viewpoints
Columns: human demonstration, geometric constraint, task representation, task execution.
From 1 demonstration: three p2p constraints. The target position learned from a third-person view is not reachable.
From 3 demonstrations: p2p + p2l constraints. The target position learned from a third-person view is not reachable.
From 4 demonstrations: one p2p constraint.
The motion of reaching the tissue is learned from human demonstrations, while the grasping and retrieving motions are predefined.
From 1 demonstration: three p2p constraints.
From 3 demonstrations: p2p + p2l constraints.
From 4 demonstrations: p2p + p2P constraints.
From 11 demonstrations: p2p + p2c constraints.
K-VIL generalizes to different kettles with large shape variations.
From 1 demonstration: three p2p constraints.
From 3 demonstrations: one p2p constraint.
From 5 demonstrations: one p2p constraint.
The constraints remain the same when more than three demonstrations are available.
From 1 demonstration: three p2p constraints.
From 3 demonstrations: p2p + p2l constraints.
From 5 demonstrations: p2p + p2l constraints.
The constraints remain the same even when more demonstrations are available.
Same viewing angle
Different viewing angle