Scene 2

A: Glass jar

B: Metal mug

C: Paper container

D: Plastic cup

E: Tape

F: Wrench

Here we show a scene from our real robot evaluation, along with all of its tasks. Object detections come from OWL-ViT and are shown with color-coded bounding boxes. The more descriptive labels listed above for each detection are for the reader only; our planner does not have access to them. For each task, we provide videos of the robot executing plans generated using InstructBLIP and PG-InstructBLIP.

Task 1: Put all containers that can hold water to the side.

Task1_Original_nosound.mp4

InstructBLIP

Fail. It moves the paper container, which has holes and so cannot hold water, and the tape, which is not a container.

Task1_Ours_nosound.mp4

PG-InstructBLIP (ours)

Success!

Task 2: Put all objects that are not plastic to the side.

Task2_Original_nosound.mp4

InstructBLIP

Fail. It classifies the paper container as plastic and does not move it to the side.

Task2_Ours_nosound.mp4

PG-InstructBLIP (ours)

Success!

Task 3: Put all translucent objects to the side.

Task3_Both_nosound.mp4

InstructBLIP

Fail. It moved the glass jar and the tape even though they are transparent, not translucent.

Task3_Both_nosound.mp4

PG-InstructBLIP (ours)

Fail. It moved the glass jar and the tape even though they are transparent, not translucent.

Task 4: Put the three heaviest objects to the side.

Task4_Both_nosound.mp4

InstructBLIP

Success!

Task4_Both_nosound.mp4

PG-InstructBLIP (ours)

Success!

Task 5: Put a plastic object that is not a container into a plastic container.

InstructBLIP

Fail. It claimed the task was impossible since all plastic objects in the scene are containers.

Task5_Ours_nosound.mp4

PG-InstructBLIP (ours)

Success!