Scene 2: Art Table
A: suitcase [blue crate]
B: stool
C: hair dryer [mirror]
D: chair [chair that the mirror is on]
E: dishwasher [metal cabinet in top right]
F: chair [blue chair]
G: bottle [Elmer glue container]
H: bottle [Mod Podge container]
I: container [paint thinner container]
J: desk
K: mug [mug with paintbrushes]
L: facial tissue holder [container with glitter]
M: pencil
Here we provide a scene from our real scene planning evaluation and one example task for it. We list the object detections and labels from OWL-ViT, which are what our planner has access to, and add more precise labels in square brackets. We provide color-coded bounding boxes for the object detections that are relevant to the example task. For each version of the LLM planner, we provide its output, including any interaction with a VLM, and the generated plan. Outputs generated by the VLMs are in orange, and we mark correct and incorrect plans in green and red, respectively.
Task 10: Bring me the clear container with art supplies.
No VLM
Thought: There is no clear indication that the container (I) has art supplies, but it is the only container mentioned. I will assume it has the art supplies and bring it to the human.
Plan:
1. Go to object I
2. Pick up object I
3. Bring to human object I
4. Done
InstructBLIP
Thought: I can first ask which objects are clear containers out of all the objects I can carry.
Question about [A, B, C, D, E, F, G, H, I, J, K, L, M]: Is this object a clear container?
Answer:
A: No (0.94), Yes (0.05), Unknown (0.0)
B: No (0.73), Yes (0.22), Unknown (0.03)
C: No (0.74), Yes (0.22), Unknown (0.02)
D: No (0.64), Yes (0.28), Unknown (0.05)
E: No (0.73), Yes (0.23), Unknown (0.01)
F: No (0.54), Yes (0.42), Unknown (0.02)
G: Yes (0.58), No (0.39), Unknown (0.01)
H: Yes (0.5), No (0.47), Unknown (0.01)
I: No (0.82), Yes (0.16), Unknown (0.01)
J: No (0.58), Yes (0.35), Unknown (0.03)
K: No (0.83), Yes (0.15), Unknown (0.01)
L: Yes (0.95), No (0.04), Clear (0.0)
M: No (0.8), Yes (0.15), Unknown (0.03)
Thought: The clear containers are G, H, and L. I should ask which of these contains art supplies next.
Question about [G, H, L]: Does this object contain art supplies?
Answer:
G: Yes (0.69), No (0.29), Unknown (0.01)
H: Yes (0.95), No (0.04), Unknown (0.0)
L: Yes (0.97), No (0.02), Unknown (0.0)
Thought: The clear container with art supplies is the one with the highest Yes value. The answer is L.
Plan:
1. Go to object L
2. Pick up object L
3. Bring to human object L
4. Done
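The two-stage selection in the InstructBLIP transcript above can be reproduced with a short script. This is only an illustrative sketch: in the evaluation the LLM planner reasons over the VLM answers in natural language, so the helper names (`parse_answers`, `top_answer`) and the explicit rule "keep objects whose most likely answer is Yes, then take the highest Yes score" are assumptions made here to mirror the stated thoughts, not the planner's implementation.

```python
import re

# Answer lines copied verbatim from the InstructBLIP transcript above.
CLEAR_CONTAINER = """\
A: No (0.94), Yes (0.05), Unknown (0.0)
B: No (0.73), Yes (0.22), Unknown (0.03)
C: No (0.74), Yes (0.22), Unknown (0.02)
D: No (0.64), Yes (0.28), Unknown (0.05)
E: No (0.73), Yes (0.23), Unknown (0.01)
F: No (0.54), Yes (0.42), Unknown (0.02)
G: Yes (0.58), No (0.39), Unknown (0.01)
H: Yes (0.5), No (0.47), Unknown (0.01)
I: No (0.82), Yes (0.16), Unknown (0.01)
J: No (0.58), Yes (0.35), Unknown (0.03)
K: No (0.83), Yes (0.15), Unknown (0.01)
L: Yes (0.95), No (0.04), Clear (0.0)
M: No (0.8), Yes (0.15), Unknown (0.03)
"""

HAS_ART_SUPPLIES = """\
G: Yes (0.69), No (0.29), Unknown (0.01)
H: Yes (0.95), No (0.04), Unknown (0.0)
L: Yes (0.97), No (0.02), Unknown (0.0)
"""

# Matches one "Answer (probability)" pair, e.g. "Yes (0.58)".
ANSWER = re.compile(r"(\w+) \(([0-9.]+)\)")


def parse_answers(block):
    """Map each object label to a dict of answer -> probability."""
    scores = {}
    for line in block.strip().splitlines():
        label, rest = line.split(":", 1)
        scores[label.strip()] = {a: float(p) for a, p in ANSWER.findall(rest)}
    return scores


def top_answer(probs):
    """Return the answer with the highest probability."""
    return max(probs, key=probs.get)


# Stage 1: keep objects whose most likely answer is "Yes".
clear = parse_answers(CLEAR_CONTAINER)
clear_ids = [obj for obj, probs in clear.items() if top_answer(probs) == "Yes"]

# Stage 2: among those, pick the object with the highest "Yes" score.
art = parse_answers(HAS_ART_SUPPLIES)
chosen = max(clear_ids, key=lambda obj: art[obj]["Yes"])

print(clear_ids, chosen)  # ['G', 'H', 'L'] L
```

Note that H passes the first stage only narrowly (Yes 0.5 against No 0.47), so a rule based on the top answer alone is sensitive to small changes in the VLM scores.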
PG-InstructBLIP (ours)
Thought: I can first ask which objects are containers out of all the objects I can carry. Of these containers, I can ask which is clear and has art supplies.
Question about [A, B, C, D, E, F, G, H, I, J, K, L, M]: Is this object a container?
Answer:
A: Yes (0.97), No (0.02)
B: No (0.78), Yes (0.19), Unknown (0.01)
C: No (0.51), Yes (0.47), Unknown (0.01)
D: No (0.77), Yes (0.21), Unknown (0.0)
E: Yes (0.95), No (0.05), Container (0.0)
F: No (0.67), Yes (0.31), Unknown (0.01)
G: Yes (0.98), No (0.02), Container (0.0)
H: Yes (0.98), No (0.02), Container (0.0)
I: Yes (0.98), No (0.02)
J: Yes (0.61), No (0.34), Unknown (0.02)
K: Yes (0.98), No (0.02), Container (0.0)
L: Yes (0.97), No (0.03), Container (0.0)
M: No (0.95), Yes (0.04), Nothing (0.0)
Thought: The objects that are confidently containers are A, E, G, H, I, K, and L. I should ask which of these is clear and has art supplies.
Question about [A, E, G, H, I, K, L]: Is this object clear and contains art supplies?
Answer:
A: No (0.87), Yes (0.09), Unknown (0.03)
E: No (0.89), Yes (0.07), Unknown (0.03)
G: No (0.7), Yes (0.27), Unknown (0.01)
H: Yes (0.64), No (0.35), Unknown (0.0)
I: No (0.82), Yes (0.16), Unknown (0.01)
K: No (0.53), Yes (0.46), Unknown (0.0)
L: Yes (0.94), No (0.05), Unknown (0.0)
Thought: The clear container with art supplies is the one with the highest Yes value. The answer is L.
Plan:
1. Go to object L
2. Pick up object L
3. Bring to human object L
4. Done
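In the PG-InstructBLIP transcript above, the planner keeps only objects that are "confidently" containers, which drops J (Yes 0.61) even though Yes is its most likely answer. A minimal sketch of that behavior follows, assuming a fixed cutoff on the Yes score. The value `YES_THRESHOLD = 0.9` and the helpers `confident` and `make_plan` are hypothetical: the transcript does not state a threshold, and any cutoff above 0.61 and at most 0.95 reproduces the same list.

```python
# Assumed cutoff for "confidently Yes"; not stated in the transcript.
YES_THRESHOLD = 0.9

# "Yes" scores copied from the two PG-InstructBLIP answers above.
container_yes = {
    "A": 0.97, "B": 0.19, "C": 0.47, "D": 0.21, "E": 0.95,
    "F": 0.31, "G": 0.98, "H": 0.98, "I": 0.98, "J": 0.61,
    "K": 0.98, "L": 0.97, "M": 0.04,
}
clear_art_yes = {
    "A": 0.09, "E": 0.07, "G": 0.27, "H": 0.64,
    "I": 0.16, "K": 0.46, "L": 0.94,
}


def confident(yes_scores, threshold):
    """Return objects whose Yes score meets the threshold, in input order."""
    return [obj for obj, p in yes_scores.items() if p >= threshold]


def make_plan(obj):
    """Render the four-step plan format used in the transcripts."""
    steps = [
        f"Go to object {obj}",
        f"Pick up object {obj}",
        f"Bring to human object {obj}",
        "Done",
    ]
    return [f"{i}. {step}" for i, step in enumerate(steps, start=1)]


# Step 1: confident containers; step 2: highest Yes on the combined question.
containers = confident(container_yes, YES_THRESHOLD)
target = max(containers, key=clear_art_yes.get)
plan = make_plan(target)

print(containers)
print("\n".join(plan))
```

With these scores the sketch selects L and emits the same four plan steps as the transcript.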