Scene 4: Kitchen A

A: facial tissue holder [paper towel dispenser]

B: light switch [left electric outlet]

C: light switch [right electric outlet]

D: mixer

E: toaster

F: kettle

G: paper towel

H: water glass [plastic cup]

I: salt and pepper shakers [salt]

J: bottle [jam container]

K: frying pan [baking pan]

L: container [salmon-colored container]

M: salt and pepper shakers [pepper]

N: countertop

Here we show a scene from our real scene planning evaluation, along with one example task for it. The object detections and labels come from OWL-ViT and are what our planner has access to; more precise labels are given in square brackets. Color-coded bounding boxes mark the object detections that are relevant to the example task. For each version of the LLM planner, we give its output, including any interaction with a VLM, and the generated plan. Outputs generated by the VLMs are shown in orange, and correct and incorrect plans are marked in green and red, respectively.

Task 25: Bring me the most deformable container with a lid.

No VLM

Thought: Since there is no clear information about the deformability of the containers, I will choose the container L as it has a lid.


Plan:

1. Go to object L

2. Pick up object L

3. Bring to human object L

4. Done

InstructBLIP

Thought: I can first ask which objects are containers with lids out of all the objects I can carry. Of these containers, I can ask which is deformable.

Question about [A, B, C, D, E, F, G, H, I, J, K, L, M, N]: Is this object a container with a lid?

Answer:

A: Yes (0.82), No (0.16), Unknown (0.0)

B: No (0.86), Yes (0.11), Unknown (0.01)

C: No (0.83), Yes (0.14), Unknown (0.01)

D: Yes (0.64), No (0.3), Unknown (0.02)

E: Yes (0.6), No (0.36), Unknown (0.02)

F: Yes (0.79), No (0.19), Unknown (0.0)

G: No (0.59), Yes (0.35), Unknown (0.03)

H: Yes (0.97), No (0.02)

I: Yes (0.97), No (0.02), Jar (0.0)

J: Yes (0.85), No (0.14), Unknown (0.0)

K: No (0.5), Yes (0.47), Unknown (0.01)

L: Yes (0.93), No (0.06), Unknown (0.0)

M: Yes (0.98), No (0.02)

N: Yes (0.82), No (0.15), Unknown (0.01)

Thought: The objects that are confidently Yes are A, D, E, F, H, I, J, L, and M. I should ask which is deformable next.

Question about [A, D, E, F, H, I, J, L, M]: Is this object deformable?

Answer:

A: No (0.49), Yes (0.46), Unknown (0.03)

D: Yes (0.83), No (0.12), Unknown (0.02)

E: Yes (0.55), No (0.38), Unknown (0.03)

F: No (0.55), Yes (0.42), Unknown (0.02)

H: No (0.52), Yes (0.44), Unknown (0.02)

I: No (0.59), Yes (0.36), Unknown (0.02)

J: Yes (0.83), No (0.15), Unknown (0.01)

L: Yes (0.85), No (0.13), Unknown (0.01)

M: No (0.63), Yes (0.32), Unknown (0.02)

Thought: The most deformable object is the one with the highest Yes value or the lowest No value (if none are Yes). Since some of the answers are Yes, the answer is L (the highest yes value).

Plan:

1. Go to object L

2. Pick up object L

3. Bring to human object L

4. Done
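The selection rule stated in the planner's last thought (take the highest Yes value, or the lowest No value if no object is answered Yes) can be sketched in a few lines. This is only an illustration, not the planner's actual mechanism, since the planner is an LLM reasoning in text. The helper name `pick_most` is hypothetical, and reading "answered Yes" as "Yes is the top-scoring answer" is an assumption. The scores are the InstructBLIP deformability answers listed above.

```python
# InstructBLIP deformability answers from the transcript above,
# stored as {object: {answer: score}}.
deformable = {
    "A": {"No": 0.49, "Yes": 0.46, "Unknown": 0.03},
    "D": {"Yes": 0.83, "No": 0.12, "Unknown": 0.02},
    "E": {"Yes": 0.55, "No": 0.38, "Unknown": 0.03},
    "F": {"No": 0.55, "Yes": 0.42, "Unknown": 0.02},
    "H": {"No": 0.52, "Yes": 0.44, "Unknown": 0.02},
    "I": {"No": 0.59, "Yes": 0.36, "Unknown": 0.02},
    "J": {"Yes": 0.83, "No": 0.15, "Unknown": 0.01},
    "L": {"Yes": 0.85, "No": 0.13, "Unknown": 0.01},
    "M": {"No": 0.63, "Yes": 0.32, "Unknown": 0.02},
}


def pick_most(scores):
    """Hypothetical helper: return the object that best satisfies the property.

    Assumption: an object counts as "Yes" when Yes is its top-scoring answer.
    """
    yes_objects = [obj for obj, s in scores.items() if max(s, key=s.get) == "Yes"]
    if yes_objects:
        # Some objects are answered Yes: take the highest Yes score.
        return max(yes_objects, key=lambda obj: scores[obj]["Yes"])
    # No object is answered Yes: fall back to the lowest No score.
    return min(scores, key=lambda obj: scores[obj]["No"])


choice = pick_most(deformable)
print(choice)
```

With these scores the rule returns L, matching the plan above: L (0.85) narrowly beats D and J (both 0.83).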

PG-InstructBLIP (ours)

Thought: I can first ask which objects are containers with a lid out of all the objects I can carry. Of these containers, I can ask which is deformable.

Question about [A, B, C, D, E, F, G, H, I, J, K, L, M, N]: Is this object a container with a lid?

Answer:

A: Yes (0.51), No (0.46), Unknown (0.02)

B: No (0.95), Yes (0.04), Unknown (0.0)

C: No (0.93), Yes (0.06), Unknown (0.0)

D: No (0.79), Yes (0.2), Unknown (0.0)

E: No (0.87), Yes (0.12), Unknown (0.0)

F: No (0.6), Yes (0.39), Unknown (0.0)

G: No (0.94), Yes (0.05), Unknown (0.01)

H: Yes (0.68), No (0.3), Unknown (0.0)

I: Yes (0.96), No (0.04)

J: Yes (0.88), No (0.11), Unknown (0.0)

K: No (0.86), Yes (0.13), Unknown (0.0)

L: No (0.7), Yes (0.28), Unknown (0.01)

M: Yes (0.97), No (0.03)

N: No (0.64), Yes (0.31), Unknown (0.02)

Thought: The objects that are confidently Yes are A, H, I, J, and M. I should ask which is deformable next.

Question about [A, H, I, J, M]: Is this object deformable?

Answer:

A: No (0.75), Yes (0.22), Unknown (0.01)

H: No (0.7), Yes (0.28), Unknown (0.01)

I: No (0.85), Yes (0.13), Unknown (0.01)

J: Yes (0.56), No (0.42), Unknown (0.01)

M: No (0.88), Yes (0.1), Unknown (0.01)

Thought: The most deformable object is the one with the highest Yes value or the lowest No value (if none are Yes). Since some of the answers are Yes, the answer is J (the highest yes value).

Plan:

1. Go to object J

2. Pick up object J

3. Bring to human object J

4. Done
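
The VLM answer lines in these transcripts share a fixed format (object label, then answers with scores in parentheses), so the first-stage filter can be reproduced mechanically. Below is a minimal sketch, assuming that "confidently Yes" means Yes is the top-scoring answer; the planner's actual criterion is not stated, and the function name `parse_answers` and the regular expressions are illustrative choices. The input is the PG-InstructBLIP answer to "Is this object a container with a lid?" copied from above.

```python
import re

# One answer line looks like: "A: Yes (0.51), No (0.46), Unknown (0.02)"
ANSWER_LINE = re.compile(r"^([A-Z]):\s*(.*)$")
SCORE = re.compile(r"(\w+) \(([\d.]+)\)")


def parse_answers(text):
    """Parse answer lines into {object: {answer: score}}; skip other lines."""
    parsed = {}
    for line in text.splitlines():
        match = ANSWER_LINE.match(line.strip())
        if not match:
            continue
        label, rest = match.groups()
        parsed[label] = {ans: float(p) for ans, p in SCORE.findall(rest)}
    return parsed


transcript = """
A: Yes (0.51), No (0.46), Unknown (0.02)
B: No (0.95), Yes (0.04), Unknown (0.0)
C: No (0.93), Yes (0.06), Unknown (0.0)
D: No (0.79), Yes (0.2), Unknown (0.0)
E: No (0.87), Yes (0.12), Unknown (0.0)
F: No (0.6), Yes (0.39), Unknown (0.0)
G: No (0.94), Yes (0.05), Unknown (0.01)
H: Yes (0.68), No (0.3), Unknown (0.0)
I: Yes (0.96), No (0.04)
J: Yes (0.88), No (0.11), Unknown (0.0)
K: No (0.86), Yes (0.13), Unknown (0.0)
L: No (0.7), Yes (0.28), Unknown (0.01)
M: Yes (0.97), No (0.03)
N: No (0.64), Yes (0.31), Unknown (0.02)
"""

scores = parse_answers(transcript)

# Assumption: keep an object when Yes is its top-scoring answer.
confident_yes = [obj for obj, s in scores.items() if max(s, key=s.get) == "Yes"]
print(confident_yes)  # ['A', 'H', 'I', 'J', 'M']
```

Under this reading the filter keeps A, H, I, J, and M, the same set the planner carries into the deformability question.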