Scene 8: Shelf

A: box [binder]

B: bottle [large plastic tub]

C: bottle [plastic bottle]

D: box [algorithms textbook]

E: pitcher (container) [blue metal cup]

F: water glass [small glass cup]

G: headphones [phone cable]

H: dumbbell [power brick]

I: adhesive tape [ruler]

Here we show a scene from our real-scene planning evaluation, together with one example task for it. We list the object detections and labels from OWL-ViT, which are what our planner has access to, and give more precise labels in square brackets. Color-coded bounding boxes mark the object detections that are relevant to the example task. For each version of the LLM planner, we show its output, including any interaction with a VLM, and the generated plan. Outputs generated by the VLMs are shown in orange, and correct and incorrect plans are marked in green and red, respectively.


Task 48: Bring me the most bendable object.

No VLM

Thought: The most bendable object among the given objects is the adhesive tape.


Plan:

1. Go to object I

2. Pick up object I

3. Bring to human object I

4. Done

InstructBLIP

Thought: I can ask which object is bendable out of all the objects, and I can find the one with the largest Yes value.

Question about [A, B, C, D, E, F, G, H, I]: Is this object bendable?

Answer:

A: Yes (0.53), No (0.42), Unknown (0.03)

B: No (0.65), Yes (0.29), Unknown (0.03)

C: No (0.72), Yes (0.23), Unknown (0.03)

D: No (0.6), Yes (0.36), Unknown (0.02)

E: No (0.85), Yes (0.11), Unknown (0.02)

F: No (0.62), Yes (0.29), Unknown (0.05)

G: Yes (0.63), No (0.33), Unknown (0.02)

H: No (0.51), Yes (0.42), Unknown (0.04)

I: No (0.53), Yes (0.44), Unknown (0.01)

Thought: The most bendable object is the one with the highest Yes value or the lowest No value (if none are Yes). The most bendable object is G (headphones) with the highest Yes value.

Plan:

1. Go to object G

2. Pick up object G

3. Bring to human object G

4. Done
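
The selection rule stated in the planner's thought above (take the highest Yes value, or the lowest No value if no object is answered Yes) can be sketched in a few lines. This is only an illustrative sketch: in the evaluation the LLM planner performs this reasoning in text, and the function name `select_object` and the dictionary layout are assumptions introduced here, not part of the planner.

```python
# Likelihoods copied from the InstructBLIP answer above.
scores = {
    "A": {"Yes": 0.53, "No": 0.42, "Unknown": 0.03},
    "B": {"Yes": 0.29, "No": 0.65, "Unknown": 0.03},
    "C": {"Yes": 0.23, "No": 0.72, "Unknown": 0.03},
    "D": {"Yes": 0.36, "No": 0.60, "Unknown": 0.02},
    "E": {"Yes": 0.11, "No": 0.85, "Unknown": 0.02},
    "F": {"Yes": 0.29, "No": 0.62, "Unknown": 0.05},
    "G": {"Yes": 0.63, "No": 0.33, "Unknown": 0.02},
    "H": {"Yes": 0.42, "No": 0.51, "Unknown": 0.04},
    "I": {"Yes": 0.44, "No": 0.53, "Unknown": 0.01},
}


def select_object(scores):
    """Hypothetical helper mirroring the rule in the planner's thought."""
    # Objects whose most likely answer is Yes.
    yes_first = [k for k, s in scores.items() if max(s, key=s.get) == "Yes"]
    if yes_first:
        # Among those, take the highest Yes value.
        return max(yes_first, key=lambda k: scores[k]["Yes"])
    # Fallback when no object is answered Yes: take the lowest No value.
    return min(scores, key=lambda k: scores[k]["No"])


print(select_object(scores))  # prints G
```

With these values only A and G have Yes as the top answer, and G has the larger Yes value, which matches the plan above.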

PG-InstructBLIP (ours)

Thought: I can ask which object is bendable out of all the objects, and I can find the one with the largest Yes value.

Question about [A, B, C, D, E, F, G, H, I]: Is this object bendable?

Answer:

A: Yes (0.6), No (0.38), Unknown (0.01)

B: No (0.7), Yes (0.26), Unknown (0.02)

C: No (0.62), Yes (0.35), Unknown (0.02)

D: No (0.52), Yes (0.45), Unknown (0.02)

E: No (0.95), Yes (0.03), Unknown (0.0)

F: No (0.86), Yes (0.11), Unknown (0.01)

G: Yes (0.98), No (0.02)

H: No (0.73), Yes (0.24), Unknown (0.02)

I: Yes (0.56), No (0.42), Unknown (0.01)

Thought: The most bendable object is the one with the highest Yes value. The highest Yes value is for object G (headphones).

Plan:

1. Go to object G

2. Pick up object G

3. Bring to human object G

4. Done
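
For readers who want to work with these transcripts programmatically, the following sketch parses the PG-InstructBLIP answer lines above into likelihoods and rebuilds the four-step plan for the object with the highest Yes value. The helper names `parse_answers` and `build_plan` are hypothetical and are not part of our planner, which produces the plan as LLM text. The parser tolerates a missing category, since the line for G lists no Unknown value.

```python
import re

# Answer lines copied from the PG-InstructBLIP output above.
ANSWER = """\
A: Yes (0.6), No (0.38), Unknown (0.01)
B: No (0.7), Yes (0.26), Unknown (0.02)
C: No (0.62), Yes (0.35), Unknown (0.02)
D: No (0.52), Yes (0.45), Unknown (0.02)
E: No (0.95), Yes (0.03), Unknown (0.0)
F: No (0.86), Yes (0.11), Unknown (0.01)
G: Yes (0.98), No (0.02)
H: No (0.73), Yes (0.24), Unknown (0.02)
I: Yes (0.56), No (0.42), Unknown (0.01)
"""


def parse_answers(text):
    """Map each object label to its {answer: likelihood} dictionary."""
    scores = {}
    for line in text.strip().splitlines():
        label, rest = line.split(":", 1)
        pairs = re.findall(r"(Yes|No|Unknown) \(([\d.]+)\)", rest)
        scores[label.strip()] = {name: float(value) for name, value in pairs}
    return scores


def build_plan(label):
    """Reproduce the plan template used in the transcripts."""
    return [
        f"Go to object {label}",
        f"Pick up object {label}",
        f"Bring to human object {label}",
        "Done",
    ]


scores = parse_answers(ANSWER)
# A category absent from a line is treated as 0.0 (an assumption of this sketch).
best = max(scores, key=lambda k: scores[k].get("Yes", 0.0))
plan = build_plan(best)
for step_number, step in enumerate(plan, start=1):
    print(f"{step_number}. {step}")
```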