In robotics, there are several ways to convey a task goal, including language instructions, goal images, and goal videos. However, natural language can be ambiguous, while goal images and videos can be over-specified. To address this issue, we propose a straightforward and practical representation: crayon visual prompts, which explicitly indicate both low-level actions and high-level planning. Specifically, for each atomic step, our method allows drawing simple yet colorful 2D visual prompts on RGB images to represent the required actions, i.e., the end-effector pose and moving direction. We devise a training strategy that enables the model to comprehend each color prompt and predict the contact pose along with the movement direction in SE(3) space. Furthermore, we design an interaction strategy that leverages the predicted movement direction to form a trajectory connecting the sequence of atomic steps, thereby completing the long-horizon task. Through simple human drawing or automatically generated alternatives, the model explicitly understands its task objective, and the model-understandable crayon visual prompts boost its generalization to unseen tasks. We evaluate our method in both simulated and real-world environments, demonstrating its promising performance.
We use a sequence of images drawn with crayon visual prompts to express the planning steps, with each step illustrating the required low-level atomic action, i.e., t1-pick, t2-place, t3-pick, t4-place. For simple steps, such as t2-place, there is no need to draw the moving direction. Based on the image sequence, the model determines the 6DoF contact pose, enabling it to contact the object as required. When a yellow prompt is present in the image, the model also predicts a 3D movement direction that guides the motion after contact, e.g., picking upward in t1. By sequentially executing each step in the sequence, the overall task is completed.
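To make the color convention concrete, the following is a minimal sketch (not the authors' code) of how a single atomic step could be annotated with crayon visual prompts using OpenCV: a blue circle for the contact point, red and green lines for the projected gripper z- and y-axis directions, and an optional yellow line for the post-contact moving direction. All function and parameter names here are illustrative assumptions.

```python
import cv2
import numpy as np


def draw_crayon_prompts(image, center, z_dir_2d, y_dir_2d, move_dir_2d=None,
                        length=40, thickness=3):
    """Overlay crayon visual prompts for one atomic step on an image.

    center      : (u, v) pixel coordinates of the intended contact point
    z_dir_2d    : unit 2D vector, projected gripper z-axis direction
    y_dir_2d    : unit 2D vector, projected gripper y-axis direction
    move_dir_2d : optional unit 2D vector for the post-contact motion
    """
    canvas = image.copy()
    u, v = int(center[0]), int(center[1])

    def endpoint(direction):
        return (int(u + length * direction[0]), int(v + length * direction[1]))

    # OpenCV expects BGR color tuples.
    cv2.circle(canvas, (u, v), 8, (255, 0, 0), thickness)                     # blue: contact point
    cv2.line(canvas, (u, v), endpoint(z_dir_2d), (0, 0, 255), thickness)      # red: z-axis
    cv2.line(canvas, (u, v), endpoint(y_dir_2d), (0, 255, 0), thickness)      # green: y-axis
    if move_dir_2d is not None:
        cv2.arrowedLine(canvas, (u, v), endpoint(move_dir_2d),
                        (0, 255, 255), thickness)                             # yellow: moving direction
    return canvas


# Example: annotate a "t1-pick" step whose post-contact motion is upward in the image.
if __name__ == "__main__":
    img = np.zeros((480, 640, 3), dtype=np.uint8)
    annotated = draw_crayon_prompts(img, center=(320, 240),
                                    z_dir_2d=(0.0, 1.0),
                                    y_dir_2d=(1.0, 0.0),
                                    move_dir_2d=(0.0, -1.0))  # picking upward
    cv2.imwrite("t1_pick_prompted.png", annotated)
```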
We design training pairs that convey varying levels of information to enable the model to comprehend each visual prompt, and introduce losses that guide it to predict accurate 3D directions.
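As a minimal sketch of one way to supervise the 3D direction prediction (the paper's exact losses are not reproduced here), a cosine-distance term between the predicted and ground-truth unit direction vectors is a common choice; all names below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def direction_loss(pred_dir, gt_dir, eps=1e-8):
    """Cosine-distance loss between predicted and ground-truth 3D directions.

    pred_dir : (B, 3) raw network output (unnormalized)
    gt_dir   : (B, 3) unit ground-truth movement directions
    """
    pred_unit = F.normalize(pred_dir, dim=-1, eps=eps)
    gt_unit = F.normalize(gt_dir, dim=-1, eps=eps)
    cosine = (pred_unit * gt_unit).sum(dim=-1)   # in [-1, 1]
    return (1.0 - cosine).mean()


# Example: two predictions supervised toward "move upward" (+z);
# the first is nearly correct, the second is orthogonal to the target.
pred = torch.tensor([[0.1, 0.2, 0.9], [0.0, 1.0, 0.0]])
gt = torch.tensor([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
print(direction_loss(pred, gt))
```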
During inference, users can draw crayon visual prompts on the image, which then serves as the visual input. The blue, red, and green prompts indicate the pose that the end effector should reach, while the yellow line represents the moving direction after contact. We also provide an automated method to extract these visual prompts. First, we use Grounded-DINO to detect the object's bounding box and select its center, forming the blue circle. Then, we automatically generate 20 2D directional lines uniformly sampled over the full 360 degrees, centered at the blue circle. GPT-4 is then prompted to select, from all candidates, the lines representing the gripper's z-axis direction, y-axis direction, and moving direction, yielding the red, green, and yellow lines, respectively.
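The candidate-line generation step of this pipeline could look like the sketch below. It assumes the object center has already been obtained from the detector and that a separate GPT-4 call (not shown) picks the indices of the z-axis, y-axis, and moving-direction lines; only the uniform 360-degree sampling of the 20 candidates is illustrated, and all names are assumptions.

```python
import math


def sample_candidate_lines(center, num_lines=20, length=40):
    """Return `num_lines` 2D line segments radiating from `center`,
    uniformly spaced over the full 360 degrees (18-degree spacing for 20)."""
    cu, cv = center
    candidates = []
    for i in range(num_lines):
        angle = 2.0 * math.pi * i / num_lines
        end = (cu + length * math.cos(angle), cv + length * math.sin(angle))
        candidates.append({"index": i,
                           "start": (cu, cv),
                           "end": end,
                           "angle_deg": math.degrees(angle)})
    return candidates


# Example: enumerate candidates around a detected object center, then hand the
# indexed list to the language model to select three lines (selection not shown).
lines = sample_candidate_lines(center=(320, 240))
for line in lines[:3]:
    print(line["index"], round(line["angle_deg"], 1), line["end"])
```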
Example task instructions: 'wipe the table', 'fold the cloth', 'heat the toaster'.