Bi-KVIL

Keypoints-based Visual Imitation Learning of Bimanual Manipulation Tasks

Abstract

Visual imitation learning has achieved impressive progress in learning unimanual manipulation tasks from a small set of visual observations, thanks to the latest advances in computer vision. However, learning bimanual coordination strategies and complex object relations from bimanual visual demonstrations, as well as generalizing them to categorical objects in novel cluttered scenes, remain unsolved challenges. In this paper, we extend our previous work on keypoints-based visual imitation learning (K-VIL) to bimanual manipulation tasks. The proposed Bi-KVIL jointly extracts so-called Hybrid Master-Slave Relationships (HMSR) among objects and hands, bimanual coordination strategies, and sub-symbolic task representations. Our bimanual task representation is object-centric, embodiment-independent, and viewpoint-invariant, and therefore generalizes well to categorical objects in novel scenes. We evaluate our approach in various real-world applications, showcasing its ability to learn fine-grained bimanual manipulation tasks from a small number of human demonstration videos.

Symbolic: Master-Slave Relationship (MSR)

Sub-symbolic: Keypoints-based Task Representation

Geometric Constraints

Comparison to K-VIL

K-VIL

human demonstration videos of unimanual pouring

extracted task representation, including local frame, p2p, and p2c constraints

Single MSR

Reproduction with KAC

Bi-KVIL

human demonstration videos of bimanual pouring

extracted task representation, including local frame, p2p, p2c, and pose constraints

Hybrid MSR

Reproduction with Bi-KAC

Evaluation

Place Spoon on plate (Plsp)

Plsp_1 (6): the plate moves to the initial position of the spoon head while keeping the spoon head right on top of the center of the plate.


The first column shows the demonstration videos with the candidate points, masks, and hand meshes/skeletons overlaid. The middle column displays the hybrid master-slave relationship graph, with the concrete types of geometric constraints annotating each master-slave pair. The last column shows examples of these geometric constraints in the aligned view centered on the extracted local frame of the master object. The legend is discussed in the paper.
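The relationship graph described above can be pictured as a small directed graph whose edges run from a master to a slave and carry the constraint types for that pair. The following is a minimal sketch of such a structure, not the authors' implementation: the class names `MasterSlaveEdge` and `HMSRGraph`, the object names, and the constraint labels on each edge are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MasterSlaveEdge:
    """One master-slave pair annotated with constraint type labels (hypothetical)."""
    master: str
    slave: str
    constraints: tuple


@dataclass
class HMSRGraph:
    """Illustrative container for master-slave edges; not the paper's data structure."""
    edges: list = field(default_factory=list)

    def add(self, master, slave, *constraints):
        self.edges.append(MasterSlaveEdge(master, slave, tuple(constraints)))

    def masters_of(self, slave):
        # All objects that the given slave is constrained relative to.
        return [e.master for e in self.edges if e.slave == slave]

    def roots(self):
        # Objects that act as a master but are never a slave themselves.
        slaves = {e.slave for e in self.edges}
        masters = {e.master for e in self.edges}
        return sorted(masters - slaves)


# Assumed example: a plate constrained relative to a tablemat,
# and a spoon constrained relative to the plate.
graph = HMSRGraph()
graph.add("tablemat", "plate", "p2p")
graph.add("plate", "spoon", "p2p")

print(graph.roots())
print(graph.masters_of("spoon"))
```

In this toy chain the plate is a slave of the tablemat and at the same time the master of the spoon, which is the kind of nesting a graph representation makes easy to query.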

Loosely-coupled/Asymmetric, left-dominant

Human demonstration videos

HMSR

Plsp_2 (6): Similar to Plsp_1 (6), except that the plates may start from different positions above the table.

Loosely-coupled/Asymmetric, left-dominant

Task reproduction with categorical objects in novel cluttered scenes on the ARMAR-6 robot.


The first row shows the scene status before execution. Bi-KVIL uses these images to adapt the extracted sub-symbolic constraints to the corresponding categorical objects in the scene. The third row shows the sub-symbolic task representations, including the local frame, the keypoints, and the corresponding constraints, as well as, for clarity, the Movement Primitives (MPs) of the p2p constraints.

Plsp_3 (6): Similar to Plsp_1 (6), except that the spoon may be placed anywhere on the plate. Note that the plate still moves to the initial position of the spoon head in this task.

Loosely-coupled/Asymmetric, left-dominant

The task representation corresponds to the first execution video

Plsp_4 (6): the plate moves to anywhere on the table while keeping the spoon head right on top of the center of the plate.

Loosely-coupled/Asymmetric, left-dominant

The task representation corresponds to the first execution video
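One way to read a task like Plsp_4 is that the constraint is expressed relative to the plate rather than to the table, so it can hold wherever the plate ends up. The sketch below illustrates this idea in 2D under several assumptions: p2p is taken to mean a point-to-point constraint, the local frame is reduced to an origin and a heading angle, and the function names `to_local`, `p2p_residual`, and `move_scene` as well as all coordinates are hypothetical. It is not the constraint formulation used in the paper.

```python
import math


def to_local(point, origin, theta):
    """Express a 2D world point in a frame given by its origin and heading angle."""
    dx, dy = point[0] - origin[0], point[1] - origin[1]
    c, s = math.cos(theta), math.sin(theta)
    # Multiply by the transpose of the frame's rotation matrix.
    return (c * dx + s * dy, -s * dx + c * dy)


def p2p_residual(keypoint, origin, theta, target_local=(0.0, 0.0)):
    """Distance between the keypoint and a target point given in the local frame."""
    lx, ly = to_local(keypoint, origin, theta)
    return math.hypot(lx - target_local[0], ly - target_local[1])


def move_scene(point, angle, shift):
    """Apply one rigid motion (rotation, then translation) to a world point."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * point[0] - s * point[1] + shift[0],
            s * point[0] + c * point[1] + shift[1])


# Assumed scene: plate center and heading, plus a spoon-head keypoint 2 cm away.
plate_center, plate_heading = (0.30, 0.10), 0.2
spoon_head = (0.32, 0.10)
r_before = p2p_residual(spoon_head, plate_center, plate_heading)

# Move the whole scene rigidly, e.g. a different table pose or camera viewpoint.
angle, shift = 1.0, (0.5, -0.2)
r_after = p2p_residual(move_scene(spoon_head, angle, shift),
                       move_scene(plate_center, angle, shift),
                       plate_heading + angle)

print(r_before, r_after)
```

Because the residual is computed in the plate's own frame, both values agree up to floating-point error, which is the property that lets a constraint of this kind transfer to a plate placed anywhere on the table.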

Plsp_5 (6): place the spoon head on the center of the plate with only one arm.

Uncoordinated unimanual

The task representation corresponds to the first execution video

Pour Water (Pow) and Pour Beer (Pob)

Pow (8) with an upright cup

Loosely-coupled/Asymmetric, right-dominant

Human Demonstration Videos

HMSR

Task reproduction and details

The RGB image and correspondence detection of DON

perceived point cloud and TCP poses

The p2p constraint

The p2c constraint

The execution status of both hands

Pow (8) with a tilted cup

Loosely-coupled/Asymmetric, right-dominant

Pow (8) with a tilted cup

Loosely-coupled/Asymmetric, right-dominant

Pow (8) with the cup taken from a far position

Loosely-coupled/Asymmetric, right-dominant

Pow (8) with multiple cups

Loosely-coupled/Asymmetric, right-dominant

Place Spoon and Plate (Plsp,pt)

place the plate right above the center of the tablemat, while placing the spoon right above the center of the plate

Plsp,pt (6), with the plates and spoons starting from arbitrary positions

Loosely-coupled/Asymmetric, left-dominant

The task representation corresponds to the first execution video

Place Cutting board and Pan (Plcb,pa)

Plcb,pa (6): Transport the cutting board to the center of the pan while placing the pan at the center of a potmat.

Loosely-coupled/Asymmetric, left-dominant

The task representation corresponds to the first execution video

Plcb,pa (8)

Loosely-coupled/Asymmetric, left-dominant

Place serving Tray (Plst)

Plst (6): the serving trays start from an arbitrary position above the table and end at the center of the tablemat.

Tightly-coupled/Symmetric

The task representation corresponds to the first execution video

Plst (6): the serving trays start from an arbitrary position above the table and end at an arbitrary position on the tablemat.

Tightly-coupled/Symmetric

Place Spoon and Banana (Plsp,ba)

Plsp,ba (6): the left hand places the spoon on the plate while the right hand places the banana on the tablemat. The two arms are not coordinated.

Uncoordinated bimanual

The task representation corresponds to the first execution video

Clean Table (CT)

CT (6): The right arm moves the brush, which is constrained by two p2l constraints. 

Loosely-coupled/Asymmetric, right-dominant

Demonstrations

HMSR

The 1st p2l constraint (green)

The 2nd p2l constraint (blue)
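For readers unfamiliar with the notation, the sketch below shows one way a p2l constraint could be evaluated, assuming that p2l stands for point-to-line and that the constraint is satisfied when a keypoint lies on a given line. The function name `point_to_line_distance` and all numbers are hypothetical; how Bi-KVIL actually defines and enforces this constraint is described in the paper, not here.

```python
import math


def point_to_line_distance(point, line_point, line_dir):
    """Shortest distance from a 3D point to an infinite line.

    The line is given by one point on it and a direction vector
    (an illustrative parameterization, not taken from the paper).
    """
    norm = math.sqrt(sum(d * d for d in line_dir))
    if norm == 0.0:
        raise ValueError("line_dir must be a non-zero vector")
    unit = [d / norm for d in line_dir]
    # Vector from the reference point on the line to the query point.
    v = [p - a for p, a in zip(point, line_point)]
    # Component of v along the line, then the remaining perpendicular part.
    along = sum(vi * ui for vi, ui in zip(v, unit))
    perp = [vi - along * ui for vi, ui in zip(v, unit)]
    return math.sqrt(sum(x * x for x in perp))


# Assumed example: a line along the x-axis and a keypoint 2 units away from it.
d_off = point_to_line_distance((1.0, 2.0, 0.0), (0.0, 0.0, 0.0), (2.0, 0.0, 0.0))
d_on = point_to_line_distance((5.0, 0.0, 0.0), (0.0, 0.0, 0.0), (2.0, 0.0, 0.0))
print(d_off, d_on)
```

Under this reading, two such constraints on different keypoints of the same object would each leave that keypoint free to slide along its line, which is consistent with a sweeping motion, although this interpretation of the CT task is an assumption.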