RT-Grasp: Reasoning Tuning Robotic Grasping via Multi-modal Large Language Model


Jinxuan Xu1,2, Shiyu Jin1, Yutian Lei1, Yuqian Zhang2, Liangjun Zhang1

1Baidu RAL, 2Rutgers University



Interactive Refinement

Abstract

Recent advances in Large Language Models (LLMs) have showcased their remarkable reasoning capabilities, making them influential across various fields. However, in robotics, their use has primarily been limited to manipulation planning tasks due to their inherent textual output. This paper addresses this limitation by investigating the potential of LLMs for numerical predictions in robotics, specifically for robotic grasping. We propose Reasoning Tuning, a novel method that integrates a reasoning phase before prediction during training, leveraging the extensive prior knowledge and advanced reasoning abilities of LLMs. This approach enables LLMs, notably those with multi-modal capabilities, to generate accurate numerical outputs such as grasp poses. Additionally, we present the Reasoning Tuning VLM Grasp dataset, meticulously designed to facilitate the adaptation of LLMs to robotic grasping, which will be released soon. Extensive validation on both grasping benchmarks and real-world experiments underscores the adaptability of multi-modal LLMs for numerical prediction tasks in robotics. This not only expands their applicability but also bridges the gap between text-based planning and direct robot control, thereby maximizing the potential of LLMs in robotics.

Overview

The proposed method processes RGB images and user instructions to yield text outputs, which comprise both a reasoning phase and a numerical grasp pose prediction p = {x, y, θ}. The reasoning phase analyzes the object's shape and structure based on its category and generates corresponding grasping strategies. The figure below illustrates examples of reasoning templates in the training set for three distinct categories.
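Because the model's prediction arrives as free-form text (a reasoning phase followed by the numerical pose), a downstream controller has to extract the numbers before execution. Below is a minimal parsing sketch; the response phrasing and the regex-based extraction are illustrative assumptions, not the released pipeline.

```python
import re
from typing import Optional, Tuple


def parse_grasp_pose(llm_output: str) -> Optional[Tuple[float, float, float]]:
    """Extract a grasp pose p = {x, y, theta} from the model's text output.

    Assumes the numerical prediction appears after the reasoning phase as three
    numbers, e.g. "... Predicted grasp pose: (152.0, 87.5, -0.35)". The exact
    phrasing is an assumption for illustration.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", llm_output)
    if len(numbers) < 3:
        return None  # reasoning text without a complete numeric prediction
    # Take the last three numbers so digits inside the reasoning phase are ignored.
    x, y, theta = (float(v) for v in numbers[-3:])
    return x, y, theta


# Example with a hypothetical model response:
response = (
    "The mug has a cylindrical body with a handle; grasping across the rim "
    "is stable. Predicted grasp pose: (152.0, 87.5, -0.35)."
)
print(parse_grasp_pose(response))  # -> (152.0, 87.5, -0.35)
```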


Reasoning Tuning VLM Grasp dataset

An example data sample from our dataset is shown below. The entire dataset will be released soon!


Some reasoning templates are shown below, and the entire collection can be found here.

Some instruction templates are shown below, and the entire collection can be found here.
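To give a concrete picture ahead of the release, the sketch below shows one plausible way a single training sample could be organized, with the reasoning phase placed before the numerical grasp pose in the supervision target. The field names, paths, and wording are assumptions for illustration, not the released schema.

```python
# A rough sketch of how one Reasoning Tuning VLM Grasp sample might be laid out.
# Field names, paths, and wording are illustrative assumptions, not the released schema.
sample = {
    "image": "images/000123.png",          # RGB observation of the scene
    "instruction": "How should the robot grasp the object in the image?",
    "category": "mug",                     # category used to select a reasoning template
    "reasoning": (
        "A mug has a cylindrical body and a handle; a stable grasp closes "
        "across the rim or the handle."
    ),
    "grasp_pose": {"x": 152.0, "y": 87.5, "theta": -0.35},  # p = {x, y, theta}
}

# The supervision target concatenates the reasoning phase with the numeric prediction,
# so the model learns to reason about the object before emitting the pose.
target_text = (
    f"{sample['reasoning']} Grasp pose: "
    f"({sample['grasp_pose']['x']}, {sample['grasp_pose']['y']}, {sample['grasp_pose']['theta']})"
)
print(target_text)
```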

Training

Real-world Experiment Results
