The “Inverse Chopsticks” project addresses a critical bottleneck in contemporary robotic manipulation: the “Generalization Problem,” where traditional imitation learning methods excel at specific, pre-programmed tasks but fail to adapt to novel scenarios. To bridge this gap, this research proposes a framework for real-time interaction that leverages the advanced reasoning and adaptability of Large Language Models (LLMs) and Vision Language Models (VLMs). Driven by the motivation to realize a “Human Say; Robot (GPT) Do” paradigm, the system utilizes a multi-agent architecture (comprising Planner, Coder, and Supervisor agents) connected via a Model Context Protocol (MCP) Server to translate semantic intents into safe, granular robotic operations. Implemented on an ABB 120 robotic arm with a custom “reversed gripper,” this framework enables the autonomous assembly of complex voxel structures from natural language and visual prompts, effectively transforming the robot from a static tool into an adaptable, collaborative partner.
Prior to the "Inverse Chopsticks" initiative, the team conducted "At a Stretch," a preliminary series of robotic 3D painting experiments designed to validate workflows for converting digital geometry into physical robotic motion using an ABB industrial arm. Leveraging Rhino and Grasshopper for collision-aware path planning, trials progressed from continuous single-line drawings on flat panels to intricate Kolam patterns on non-planar surfaces, and finally to expressive multi-stroke compositions utilizing variable brush tilt and pressure. This foundational work established critical protocols for custom toolpath generation, kinematic control, and variable motion profiling, serving as a vital technical precursor to the advanced manipulation and digital-physical integration developed for the Inverse Chopsticks project.
To facilitate the precise handling of hollow voxel modules required for continuous, gap-free stacking, the project team engineered a specialized “Customized Reversed Gripper,” known as the “Inverse Chopsticks” mechanism. Unlike standard parallel-jaw grippers that grasp from the exterior, this end-effector operates by inserting custom-fabricated fingers into the central void of a standardized cube and applying outward pressure to secure it via friction fit. Adapted from an existing dFab laboratory framework with modified actuation logic, the gripper transitions between a retracted “Release State” for insertion and an expanded “Hold State” for lifting, a design choice that crucially leaves the module’s outer surfaces free for placing blocks directly adjacent to one another. This hardware is seamlessly integrated into the robotic control loop through the ROS/COMPAS environment, where Digital I/O signals trigger the expansion (Logic “1”) or retraction (Logic “0”) within a strictly defined sequence to ensure successful attachment before manipulation.
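The hold/release sequencing described above can be expressed as a short I/O routine. The following is a minimal sketch assuming the compas_rrc interface commonly used in the ROS/COMPAS environment; the digital output name, controller namespace, and wait times are hypothetical placeholders rather than the project’s actual configuration.

```python
import compas_rrc as rrc

# Connect to the ROS bridge and the ABB controller (namespace is an assumption).
ros = rrc.RosClient()
ros.run()
abb = rrc.AbbClient(ros, '/rob1')

GRIPPER_IO = 'do_gripper'  # hypothetical digital output wired to the gripper actuation

def gripper_hold():
    """Expand the fingers inside the voxel cavity: Logic '1' = Hold State."""
    abb.send(rrc.SetDigital(GRIPPER_IO, 1))
    abb.send_and_wait(rrc.WaitTime(0.5))  # let the friction fit seat before lifting

def gripper_release():
    """Retract the fingers for insertion or withdrawal: Logic '0' = Release State."""
    abb.send(rrc.SetDigital(GRIPPER_IO, 0))
    abb.send_and_wait(rrc.WaitTime(0.5))

# Per pick: approach with the gripper released, insert the fingers into the void,
# call gripper_hold(), and only then lift the module.
```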
To streamline the manipulation workflow and simplify coordinate planning, the robot’s physical environment is spatially segmented into two primary functional zones: the Home Desk and the Task Desk.
Home Desk (Perception & Retrieval): This area serves as the staging ground for raw materials. Here, the robot utilizes its vision system to perform object detection (“See objects”) to identify available voxel modules and executes the “Pick” operation using the custom gripper.
Task Desk (Assembly & Verification): This zone is dedicated to the actual construction of the target model. The robot transports the gripped module to this location (“Move it”), verifies the spatial alignment (“Check the position”), and executes the “Place” command to stack the module onto the growing structure.
This structured division enables a cyclical and predictable workflow—Home -> Pick -> Task -> Place -> Home—reducing the complexity of the planner agent’s collision avoidance calculations.
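A high-level sketch of this cycle is given below; the robot and camera wrappers and their methods (move_home, detect_modules, pick, move_to, check_position, place) are hypothetical names standing in for the project’s predefined operations.

```python
def build_model(target_layout, robot, camera):
    """Assemble a voxel model one module at a time using the two-desk cycle.

    target_layout: ordered placement poses on the Task Desk.
    robot, camera: hypothetical wrappers around the predefined operations.
    """
    for placement_pose in target_layout:
        robot.move_home()                               # start each cycle at the Home Desk
        module = camera.detect_modules(zone='home')[0]  # "See objects": find an available voxel
        robot.pick(module.pose)                         # "Pick" with the reversed gripper
        robot.move_to(zone='task')                      # "Move it" to the Task Desk
        if not camera.check_position(placement_pose):   # "Check the position" before placing
            raise RuntimeError('Alignment check failed; replanning required')
        robot.place(placement_pose)                     # "Place" the module onto the structure
    robot.move_home()                                   # end the build back at the Home Desk
```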
The initial framework was designed around a modular abstraction divided into four key domains: VLM (Intelligence), Connector (Translation), Robot Environment (Execution), and Modules (Physical Objects). This stage proposed two distinct setups:
Baseline Setup: This setup focused on structured predictability. It utilized prompts for voxel models to drive LLM actors, which communicated with predefined functions via a server. The robot arm, equipped with the custom gripper, interacted with basic cubes placed on a specific grid, simplifying the perception task (a grid-indexing sketch follows the two setups).
Advanced Setup: This setup introduced unstructured complexity. It aimed to handle varied modules of different shapes placed at random positions. This required a more robust feedback loop where the camera’s visual input played a critical role in updating the LLM actors about the dynamic environment.
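To illustrate how a fixed grid simplifies perception in the Baseline Setup, the sketch below maps integer grid cells to world-frame coordinates; the origin, cell spacing, and module height are hypothetical values, not measured parameters of the actual rig.

```python
# Hypothetical Home Desk grid parameters (metres, expressed in the robot base frame).
GRID_ORIGIN = (0.40, -0.20, 0.05)  # world position of cell (0, 0) at layer 0
CELL_SIZE = 0.06                   # spacing between adjacent voxel modules
MODULE_HEIGHT = 0.05               # stacking increment per layer

def cell_to_world(i, j, layer=0):
    """Convert a (column, row, layer) grid index into an (x, y, z) pick/place point."""
    x0, y0, z0 = GRID_ORIGIN
    return (x0 + i * CELL_SIZE,
            y0 + j * CELL_SIZE,
            z0 + layer * MODULE_HEIGHT)

# Example: the module in column 2, row 1 of the Home Desk grid.
print(cell_to_world(2, 1))  # approximately (0.52, -0.14, 0.05)
```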
The final pipeline crystallizes these concepts into a specific technology stack, explicitly defining the hardware-software interfaces.
AI & Reasoning Layer (ChatGPT): The process begins with Prompts for voxel models fed into LLM actors (powered by ChatGPT). These actors act as the system’s brain, determining the sequence of actions required to build the target model.
The Connector Bridge (MCP / COMPAS): A critical refinement in the final workflow is the use of the MCP Server and COMPAS as the bridge. The LLM sends intents (likely in JSON format) to the MCP Server, which translates them into Predefined functions that are safe for robotic execution (a sketch of this intent-to-function translation follows the pipeline overview).
Execution Layer (ROS & ABB 120): The translated commands are passed to the ROS (Robot Operating System) layer, which manages the ABB 120 robotic arm. The Robotic arm with custom gripper physically executes the task.
Hardware Integration: A specific Arduino signal is integrated into the loop to control the pneumatic/servo actuation of the custom gripper, ensuring precise attachment and detachment.
Closed-Loop Feedback: The pipeline maintains a closed loop where the Camera observes the Home Desk (source) and Task Desk (assembly). This visual data is fed back to the LLM actors, allowing the system to verify that the “Human say” command has resulted in the correct “Robot do” action.
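The connector behaviour outlined above can be illustrated with a small dispatch sketch: a JSON intent from the LLM is validated against a whitelist and routed to a predefined function. The intent schema, action names, and handler stubs are hypothetical illustrations, not the project’s actual MCP tool definitions.

```python
import json

# Hypothetical predefined functions exposed to the LLM actors through the MCP Server.
def see_objects(zone): ...
def pick(module_id): ...
def move_to(zone): ...
def check_position(pose): ...
def place(pose): ...

# Whitelist mapping intent names to callables and their expected arguments.
PREDEFINED_FUNCTIONS = {
    'see_objects':    (see_objects,    ['zone']),
    'pick':           (pick,           ['module_id']),
    'move_to':        (move_to,        ['zone']),
    'check_position': (check_position, ['pose']),
    'place':          (place,          ['pose']),
}

def dispatch_intent(raw_intent):
    """Translate one LLM intent (a JSON string) into a safe predefined call."""
    intent = json.loads(raw_intent)
    name = intent.get('action')
    if name not in PREDEFINED_FUNCTIONS:
        raise ValueError(f'Rejected unknown action: {name}')
    func, expected_args = PREDEFINED_FUNCTIONS[name]
    args = {key: intent['args'][key] for key in expected_args}  # only whitelisted arguments pass
    return func(**args)

# Example intent as the LLM actors might emit it:
# dispatch_intent('{"action": "pick", "args": {"module_id": "red_cube_01"}}')
```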
To validate the "Inverse Chopsticks" framework, the project employed a robust computer vision pipeline using a RealSense camera to capture RGB and depth data, enabling the creation of a precise digital twin through pre-calibration of the camera and robot coordinate systems. Experimental validation proceeded in two distinct phases: "Demo 1" focused on manual inputs to verify mechanical kinematics, while "Demo 2" successfully realized the "Human Say; Robot Do" paradigm, where ChatGPT autonomously generated execution plans from visual data to manipulate colored voxel modules. Despite this success, the team identified significant challenges in the integration of independent modules, noting that minor vision misalignments could cascade into physical failures and that necessary safety checks introduced latency that reduced interaction fluidity. To address the limitations of the current static workspace, future work will focus on implementing real-time object detection via on-robot cameras to allow for dynamic adjustments during assembly.