LLMs

   Overview of the LLM-based robot control system in the HuBotVerse framework

LLMs are integrated into our system to interpret and respond to human input. 

LLMs enable our robots to understand a wide range of human inputs, enhancing the naturalness of interaction.

This includes processing text and audio from users to understand their intentions, so that the robot can provide appropriate responses or actions.


LLM-Aided Robot Control in Simulated Environment

Robotic virtual environments

A virtual simulation environment was built on PyBullet with a Universal Robot UR5 robotic arm. The operating space measures 0.5 m x 1 m, and the environment uses the VIMA-Bench simulation suite, which provides an expandable collection of 3D objects and textures.

In the simulation environment, the 3D objects and textures include bowls and pots sourced from Google Scanned Objects, while other items are from Ravens.


In the virtual simulation environment, two observational perspectives are provided: a front view and a top-down view. This study primarily utilizes the top-down perspective. The end effector of the robotic arm is a suction cup. Additionally, the simulation environment integrates some basic operational actions, such as "pick and place," "rotate," and "push."
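As a rough illustration of how such an environment is assembled, the sketch below loads a UR5 arm in PyBullet and renders a top-down camera image. The URDF path and camera placement are assumptions; VIMA-Bench ships its own assets and wrappers, and the manipulation primitives ("pick and place," "rotate," "push") are provided by the suite rather than implemented here.

```python
import pybullet as p
import pybullet_data

# Connect to the physics server and load a ground plane.
p.connect(p.GUI)                      # use p.DIRECT for headless runs
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.8)
p.loadURDF("plane.urdf")

# Load the UR5 arm; the URDF path is an assumption, VIMA-Bench uses its own assets.
ur5_id = p.loadURDF("ur5/ur5.urdf", basePosition=[0, 0, 0], useFixedBase=True)

# Top-down camera looking straight down at the 0.5 m x 1 m operating space.
view = p.computeViewMatrix(cameraEyePosition=[0.25, 0.5, 1.0],
                           cameraTargetPosition=[0.25, 0.5, 0.0],
                           cameraUpVector=[0, 1, 0])
proj = p.computeProjectionMatrixFOV(fov=60, aspect=1.0, nearVal=0.01, farVal=2.0)
width, height, rgb, depth, seg = p.getCameraImage(480, 480,
                                                  viewMatrix=view,
                                                  projectionMatrix=proj)

# Step the simulation; manipulation primitives run on top of this loop.
for _ in range(240):
    p.stepSimulation()
```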

This study constructed a virtual simulation scene. On the left is the robotic arm operating scene, which includes a Universal Robot UR5 robotic arm, an operation table, and various uniquely shaped objects. The robotic arm has six rotational joints, allowing it to perform diverse operations in the simulation environment. On the right is the camera view of the virtual environment. The camera uses a top-down view, clearly displaying the shape, texture, and relative position of each object and providing the system with a comprehensive understanding of the scene layout.

The system uses the GPT-4 API to interpret each natural-language command, inferring which objects need to be located and which action the robot should perform.
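A minimal sketch of this step is shown below, using the OpenAI Python client to turn a command into a structured task description. The prompt, the JSON schema, and the model name are illustrative choices, not the exact ones used by the system.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def parse_instruction(instruction: str) -> dict:
    """Ask GPT-4 to turn a natural-language command into a structured task.

    The prompt and the JSON schema are illustrative placeholders.
    """
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Extract the action, the object to pick, and the target "
                        "location from the user's command. Reply with JSON: "
                        '{"action": ..., "object": ..., "target": ...}.'},
            {"role": "user", "content": instruction},
        ],
    )
    return json.loads(response.choices[0].message.content)

# Expected to return something like:
# parse_instruction("put the apples in the fruit basket")
# -> {"action": "pick_and_place", "object": "apples", "target": "fruit basket"}
```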

For example, when the input command is "put the apples in the fruit basket", the system infers that the task requires retrieving apples from the environment and placing them in a fruit basket. It therefore calls the Segment Anything Model (SAM) to segment the scene and uses the CLIP model to extract visual features of the apples and the fruit basket, then computes the text-image similarity to precisely locate the objects. Finally, the system invokes the Pick and Place action to manipulate the objects at the located positions.
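The following sketch illustrates this pipeline with the public SAM and CLIP packages. The checkpoint path, scene image path, text prompts, and the pick_and_place helper at the end are assumptions for illustration, not the system's actual code.

```python
import numpy as np
import torch
import clip
from PIL import Image
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Segment every candidate object in the scene image with SAM.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # checkpoint path is an assumption
mask_generator = SamAutomaticMaskGenerator(sam.to(device))
scene = np.array(Image.open("scene.png").convert("RGB"))              # scene image path is an assumption
masks = mask_generator.generate(scene)

# 2. Crop each mask's bounding box and encode crops and query texts with CLIP.
model, preprocess = clip.load("ViT-B/32", device=device)
crops = []
for m in masks:
    x, y, w, h = map(int, m["bbox"])
    crops.append(Image.fromarray(scene[y:y + h, x:x + w]))

with torch.no_grad():
    image_feats = model.encode_image(
        torch.stack([preprocess(c) for c in crops]).to(device))
    text_feats = model.encode_text(
        clip.tokenize(["an apple", "a fruit basket"]).to(device))

# 3. Cosine similarity between text and crop features picks the best match per query.
image_feats /= image_feats.norm(dim=-1, keepdim=True)
text_feats /= text_feats.norm(dim=-1, keepdim=True)
best = (text_feats @ image_feats.T).argmax(dim=-1)
apple_region = masks[best[0].item()]
basket_region = masks[best[1].item()]

# 4. Hand the matched regions to the pick-and-place primitive (hypothetical helper).
# pick_and_place(center_of(apple_region), center_of(basket_region))
```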

Video Demonstration: LLM for Simulated Robot Control

Virtual Environment Testing

The study conducted practical tests in the VIMA-Bench simulation scene. As shown in the upper part of the figure, the scene includes a robotic arm, alphabet blocks, and colored squares. The user instruction was "Put the T in the blue box." The system successfully recognized this command and accurately located the "T" and the "blue box" in the scene. Combining the verb "Put" and the preposition "in," the system determined that it needed to perform the operation of "picking up the T and placing it into the blue box." As shown in the lower part of the figure, the robotic arm successfully completed the actions of grasping, moving, and placing.

The experiment verified that the system can not only interpret instructions accurately but also translate them into actual, complex physical actions, allowing it to adapt to different unstructured environments.


LLM-Aided Robot Control in Physical Environment

Physical robot environment

The physical setup primarily consists of three parts: a camera, an operation board, and a robot. The camera is an Intel RealSense D435i depth camera. It outputs RGB and depth video in real time at up to 1280x720 resolution and 30 frames per second, with a depth range of 0.1 m to 10 m, providing the system with both RGB images and depth information for the objects in the scene. The robot operates on the operation board, which measures 525 mm x 415 mm, larger than the robot's 320 mm workspace, ensuring that operations remain safe and controllable. The robot is a Dobot Magician robotic arm with four degrees of freedom, a maximum payload of 500 g, and a control accuracy of 0.2 mm; it serves as the system's executive component for carrying out instructions.
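A minimal sketch of the image-acquisition step is given below, using the pyrealsense2 SDK to stream aligned RGB and depth frames at 1280x720 and 30 fps. The stream settings and variable names are illustrative, not the system's exact configuration.

```python
import numpy as np
import pyrealsense2 as rs

# Configure the D435i for RGB and depth streams at 1280x720, 30 fps.
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 1280, 720, rs.format.bgr8, 30)
config.enable_stream(rs.stream.depth, 1280, 720, rs.format.z16, 30)
pipeline.start(config)

# Align depth to the color frame so each RGB pixel has a depth value.
align = rs.align(rs.stream.color)

try:
    frames = align.process(pipeline.wait_for_frames())
    color = np.asanyarray(frames.get_color_frame().get_data())  # HxWx3 BGR image
    depth = np.asanyarray(frames.get_depth_frame().get_data())  # HxW raw z16 values;
                                                                # multiply by the depth scale for metres
finally:
    pipeline.stop()
```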


Real-world environment testing

The test environment contains a robotic arm, plates, and green vegetables. The system captures the scene image using the top-mounted RealSense camera.

The experiment used "Put the greens on the plate" as the language instruction. As shown, the system successfully parsed this command and accurately located the "greens" and the "plate" in the scene. Based on the words "Put" and "on" in the instruction, the system determined that it needed to execute the action of "picking up the greens and placing them on the plate." This validates the system's ability to correctly understand and execute language instructions in a real environment.


Video Demonstration: LLM for Physical Robot Control

In the future, LLMs will contribute to high-level reasoning, helping the robot comprehend complex tasks, identify goals and constraints, and then generate the steps or actions needed to achieve the desired outcomes.


Overview of the Technical Framework (Case Study)

Task Instruction Layer. The standard input is the task's natural language description, and the output is the task text features T extracted by the large language model from the instructions.

Visual Segmentation Layer. The standard input is the environmental image I, and the output is the environmental image features Ii. Once the robot captures the scene image, the visual segmentation model outputs masks for all potential objects based on the input image and crops the corresponding parts of the image to obtain a series of environmental image features Ii.

Cross-Modal Matching Layer. The standard inputs are the task text features T and the environmental image features Ii, and the output is the task target image Ii. The text features T and the environmental image features Ii are fed into the cross-modal matching model, which returns the image region that best matches the task description.

Robot Action Layer. The standard input is the task target image Ii, and the output is the robot action. Using the robot's hand-eye coordination module, the center point of the task target image is converted into the corresponding position in the robot's coordinate system. This information is then sent to the relevant action module to drive the robot to execute the task.
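As a rough sketch of this conversion, the snippet below estimates a pixel-to-robot affine map from a few calibration correspondences and applies it to the center of the matched target region. The calibration points and the action_module helper are placeholders, not the system's actual hand-eye module.

```python
import numpy as np
import cv2

# Hand-eye calibration: estimate an affine map from image pixels to the robot's
# XY plane using three known correspondences (the values below are placeholders).
pixel_pts = np.float32([[110, 80], [530, 90], [320, 400]])               # observed in the camera image
robot_pts = np.float32([[200.0, -120.0], [200.0, 120.0], [310.0, 0.0]])  # measured in mm, robot frame
pixel_to_robot = cv2.getAffineTransform(pixel_pts, robot_pts)

def to_robot_xy(pixel_center):
    """Convert the center pixel of the matched target image to robot XY (mm)."""
    u, v = pixel_center
    x, y = pixel_to_robot @ np.array([u, v, 1.0])
    return float(x), float(y)

# The action module (hypothetical helper) then drives the arm, e.g.:
# x, y = to_robot_xy(target_center)
# action_module.pick_and_place((x, y, GRASP_HEIGHT), place_xy)
```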