Scene 2: Art Table

A: suitcase [blue crate]

B: stool

C: hair dryer [mirror]

D: chair [chair that the mirror is on]

E: dishwasher [metal cabinet in top right]

F: chair [blue chair]

G: bottle [Elmer glue container]

H: bottle [Mod Podge container]

I: container [paint thinner container]

J: desk

K: mug [mug with paintbrushes]

L: facial tissue holder [container with glitter]

M: pencil

Here we show one scene from our real scene planning evaluation, along with one example task for it. We list the object detections and labels from OWL-ViT, which are what our planner has access to, and give more precise labels in square brackets. Color-coded bounding boxes mark the object detections that are relevant to the example task. For each version of the LLM planner, we show its output, including any interaction with a VLM, and the generated plan. Outputs generated by the VLMs are in orange, and correct and incorrect plans are marked in green and red, respectively.

Task 10: Bring me the clear container with art supplies.

No VLM

Thought: There is no clear indication that the container (I) has art supplies, but it is the only container mentioned. I will assume it has the art supplies and bring it to the human.

Plan:

1. Go to object I

2. Pick up object I

3. Bring to human object I

4. Done
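
Every plan in this example follows the same four-step fetch pattern, differing only in the chosen object. The sketch below reproduces that pattern for illustration; the function names `fetch_plan` and `format_plan` are hypothetical and are not part of the planner described here.

```python
def fetch_plan(object_id):
    """Return the four fetch steps seen in this example for one object label."""
    return [
        f"Go to object {object_id}",
        f"Pick up object {object_id}",
        f"Bring to human object {object_id}",
        "Done",
    ]


def format_plan(steps):
    """Number the steps the way the plans are printed in this example."""
    return "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))


# Plan for object I, the object chosen by the No VLM planner.
print(format_plan(fetch_plan("I")))
```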

InstructBLIP

Thought: I can first ask which objects are clear containers out of all the objects I can carry.

Question about [A, B, C, D, E, F, G, H, I, J, K, L, M]: Is this object a clear container?

Answer:

A: No (0.94), Yes (0.05), Unknown (0.0)

B: No (0.73), Yes (0.22), Unknown (0.03)

C: No (0.74), Yes (0.22), Unknown (0.02)

D: No (0.64), Yes (0.28), Unknown (0.05)

E: No (0.73), Yes (0.23), Unknown (0.01)

F: No (0.54), Yes (0.42), Unknown (0.02)

G: Yes (0.58), No (0.39), Unknown (0.01)

H: Yes (0.5), No (0.47), Unknown (0.01)

I: No (0.82), Yes (0.16), Unknown (0.01)

J: No (0.58), Yes (0.35), Unknown (0.03)

K: No (0.83), Yes (0.15), Unknown (0.01)

L: Yes (0.95), No (0.04), Clear (0.0)

M: No (0.8), Yes (0.15), Unknown (0.03)

Thought: The clear containers are G, H, and L. I should ask which of these contains art supplies next.

Question about [G, H, L]: Does this object contain art supplies?

Answer:

G: Yes (0.69), No (0.29), Unknown (0.01)

H: Yes (0.95), No (0.04), Unknown (0.0)

L: Yes (0.97), No (0.02), Unknown (0.0)

Thought: The clear container with art supplies is the one with the highest Yes value. The answer is L.

Plan:

1. Go to object L

2. Pick up object L

3. Bring to human object L

4. Done
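
The two-step selection in the InstructBLIP transcript above can be reproduced from the listed probabilities. The sketch below is only an illustration of that reasoning: it assumes the first step keeps objects whose Yes probability exceeds their No probability, and the second step picks the highest Yes value, as the Thought lines state. In the transcript this reasoning is carried out by the LLM planner in text, not by code, and the helper names are hypothetical.

```python
# (Yes, No) probabilities copied from the first InstructBLIP answer above.
is_clear_container = {
    "A": (0.05, 0.94), "B": (0.22, 0.73), "C": (0.22, 0.74), "D": (0.28, 0.64),
    "E": (0.23, 0.73), "F": (0.42, 0.54), "G": (0.58, 0.39), "H": (0.50, 0.47),
    "I": (0.16, 0.82), "J": (0.35, 0.58), "K": (0.15, 0.83), "L": (0.95, 0.04),
    "M": (0.15, 0.80),
}

# (Yes, No) probabilities copied from the second InstructBLIP answer above.
has_art_supplies = {"G": (0.69, 0.29), "H": (0.95, 0.04), "L": (0.97, 0.02)}


def yes_majority(answers):
    """Keep the objects for which Yes is more probable than No."""
    return [obj for obj, (yes, no) in answers.items() if yes > no]


def highest_yes(answers, candidates):
    """Pick the candidate with the largest Yes probability."""
    return max(candidates, key=lambda obj: answers[obj][0])


candidates = yes_majority(is_clear_container)
choice = highest_yes(has_art_supplies, candidates)
print(candidates, choice)  # ['G', 'H', 'L'] L
```

Note that H passes the first step only narrowly (0.5 against 0.47), so this majority rule is sensitive to small changes in the VLM's probabilities.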

PG-InstructBLIP (ours)

Thought: I can first ask which objects are containers out of all the objects I can carry. Of these containers, I can ask which is clear and has art supplies.

Question about [A, B, C, D, E, F, G, H, I, J, K, L, M]: Is this object a container?

Answer:

A: Yes (0.97), No (0.02)

B: No (0.78), Yes (0.19), Unknown (0.01)

C: No (0.51), Yes (0.47), Unknown (0.01)

D: No (0.77), Yes (0.21), Unknown (0.0)

E: Yes (0.95), No (0.05), Container (0.0)

F: No (0.67), Yes (0.31), Unknown (0.01)

G: Yes (0.98), No (0.02), Container (0.0)

H: Yes (0.98), No (0.02), Container (0.0)

I: Yes (0.98), No (0.02)

J: Yes (0.61), No (0.34), Unknown (0.02)

K: Yes (0.98), No (0.02), Container (0.0)

L: Yes (0.97), No (0.03), Container (0.0)

M: No (0.95), Yes (0.04), Nothing (0.0)

Thought: The objects that are confidently containers are A, E, G, H, I, K, and L. I should ask which of these is clear and has art supplies.

Question about [A, E, G, H, I, K, L]: Is this object clear and contains art supplies?

Answer:

A: No (0.87), Yes (0.09), Unknown (0.03)

E: No (0.89), Yes (0.07), Unknown (0.03)

G: No (0.7), Yes (0.27), Unknown (0.01)

H: Yes (0.64), No (0.35), Unknown (0.0)

I: No (0.82), Yes (0.16), Unknown (0.01)

K: No (0.53), Yes (0.46), Unknown (0.0)

L: Yes (0.94), No (0.05), Unknown (0.0)

Thought: The clear container with art supplies is the one with the highest Yes value. The answer is L.

Plan:

1. Go to object L

2. Pick up object L

3. Bring to human object L

4. Done
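
In the PG-InstructBLIP transcript above, the planner keeps only the objects that are "confidently" containers, which excludes J even though its Yes probability (0.61) exceeds its No probability. The transcript does not state a numeric cutoff. The sketch below assumes a hypothetical cutoff of 0.9, which is one of many values that reproduce the listed set, and then picks the highest Yes value for the second question.

```python
# Hypothetical cutoff: not stated in the transcript. With the values below,
# any cutoff above 0.61 and at most 0.95 yields the same candidate set.
CONFIDENCE_CUTOFF = 0.9

# Yes probabilities copied from the "Is this object a container?" answer above.
container_yes = {
    "A": 0.97, "B": 0.19, "C": 0.47, "D": 0.21, "E": 0.95, "F": 0.31,
    "G": 0.98, "H": 0.98, "I": 0.98, "J": 0.61, "K": 0.98, "L": 0.97,
    "M": 0.04,
}

# Yes probabilities copied from the "clear and contains art supplies" answer.
clear_with_art_yes = {
    "A": 0.09, "E": 0.07, "G": 0.27, "H": 0.64, "I": 0.16, "K": 0.46,
    "L": 0.94,
}

# Step 1: keep objects whose Yes probability reaches the assumed cutoff.
confident = [obj for obj, p in container_yes.items() if p >= CONFIDENCE_CUTOFF]

# Step 2: among those, choose the object with the highest Yes value.
choice = max(confident, key=lambda obj: clear_with_art_yes[obj])

print(confident, choice)  # ['A', 'E', 'G', 'H', 'I', 'K', 'L'] L
```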