Multi-Resolution Sensing for Real-Time Control with Vision-Language Models
Leveraging sensing modalities across diverse spatial and temporal resolutions can improve the performance of robotic manipulation tasks. Multi-spatial resolution sensing provides hierarchical information captured at different spatial scales and enables both coarse and precise motions. Simultaneously, multi-temporal resolution sensing enables the agent to exhibit high reactivity and real-time control. In this work, we propose a framework for learning generalizable language-conditioned multi-task policies that utilize sensing at different spatial and temporal resolutions, using networks of varying capacities to effectively perform real-time control of precise and reactive tasks. We leverage off-the-shelf pretrained vision-language models to learn generalizable multi-task policies, along with small non-pretrained models to adapt to high-resolution feedback. Through extensive experiments in three domains (coarse, precise and dynamic manipulation tasks), we show that our approach significantly improves (2x on average) over recent multi-task baselines. Further, our approach generalizes well to visual and geometric variations in target objects and to varying interaction forces.
Example success rollouts: Correct block-hand alignment and precise grasping are important for task success.
Example failure rollouts: Caused by incorrect alignment and the inability to react fast to contact events.
These videos (on the right) show the dynamic nature of our ballbot-based pickup task. The ballbot base consists of a single spherical wheel, so the robot must balance dynamically at all times. This dynamic balancing produces upper-body motion as the robot moves. Furthermore, when the arms are stretched out, the body leans back to compensate for the shift in center of mass. Such complex whole-body motions dynamically affect the end-effector motion. Overall, the dynamic pickup task requires fast reaction to contact events; otherwise the object will tip over because of the robot body’s velocity.
As the following qualitative results show, a low temporal-resolution policy (i.e. one where all input modalities are processed at 5 Hz) is not very reactive, so it cannot respond quickly to contacts made with the wooden peg. Since completing this task requires the policy to find (approximately localize) the insertion location, a slow-to-react policy is often unable to find the insertion location within the time horizon.
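The contrast between a single-rate and a multi-rate policy can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names, the 50 Hz fast rate, the force threshold and the toy policy are all hypothetical, and only the 5 Hz slow rate comes from the text above. The idea shown is that a slow feature (standing in for the output of a large vision-language model) is refreshed at the low rate and cached, while a small policy consumes a fast signal (standing in for contact feedback) at every control step.

```python
from typing import Callable, List, Sequence


def run_multirate_loop(
    slow_encoder: Callable[[int], float],
    fast_policy: Callable[[float, float], float],
    fast_readings: Sequence[float],
    slow_hz: int = 5,    # slow rate mentioned in the text
    fast_hz: int = 50,   # assumed fast rate, chosen only for illustration
) -> List[float]:
    """Return one action per fast reading, refreshing the slow feature at slow_hz."""
    if fast_hz % slow_hz != 0:
        raise ValueError("fast_hz must be an integer multiple of slow_hz")
    steps_per_slow_update = fast_hz // slow_hz
    cached_feature = 0.0
    actions: List[float] = []
    for step, reading in enumerate(fast_readings):
        # Recompute the expensive feature only on slow ticks; reuse it otherwise.
        if step % steps_per_slow_update == 0:
            cached_feature = slow_encoder(step)
        # The small policy still runs on every fast tick.
        actions.append(fast_policy(cached_feature, reading))
    return actions


# Hypothetical stand-ins for the slow and fast components.
slow_calls: List[int] = []


def demo_slow_encoder(step: int) -> float:
    slow_calls.append(step)  # record when the slow model is invoked
    return float(step)


def demo_fast_policy(feature: float, force: float) -> float:
    # Toy rule: retreat (-1.0) as soon as the force exceeds a made-up threshold.
    return -1.0 if force > 0.5 else 1.0


# Contact begins at fast step 12, between two slow updates (steps 10 and 20).
readings = [0.0] * 12 + [1.0] * 8
actions = run_multirate_loop(demo_slow_encoder, demo_fast_policy, readings)
```

In this toy run the slow encoder is invoked only at steps 0 and 10, yet the action changes at step 12, the same step at which contact appears. A policy that processed every input at the slow rate would not see that contact until its next update at step 20.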
Successful executions
Failed executions
Task 1: Initial Instruction: Lift the blue shoe. New Instruction: Lift the green shoe.
Task 2: Initial Instruction: Reach above red block. New Instruction: Reach above green block.
Task 1: Success Examples
Task 1: Failure Examples
Task 2