Methodology

TidyBot++'s task execution is orchestrated by a ROS2 finite state machine (FSM) that sequences speech understanding, navigation, perception, and bimanual manipulation into a complete pick-and-place pipeline.

The robot listens for a voice command, which is transcribed via Google Speech-to-Text and parsed by a Gemini LLM to extract a pick target (e.g., "banana") and a place destination (e.g., "basket" or "hand").

In parallel, a YOLOv8n detection node runs on the RealSense RGB stream to localize pick targets by class, while a MediaPipe Hands node detects the palm center of a human hand as an alternative place destination. Both nodes use aligned depth imagery and pinhole back-projection to produce 3D poses in the camera frame.

Once targets are identified, the FSM navigates the robot using Nav2 with SLAM-based localization, then triggers a grasp sequence that calls a numerical IK service with collision checking to plan and execute the arm motions, and finally navigates to the destination to place the object.

Each subsystem operates independently and communicates through ROS2 topics and services, so perception, navigation, and manipulation can be tested and swapped in isolation.
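The speech-parsing step can be sketched as follows. The exact prompt and reply format of the Gemini call are not specified above, so this sketch assumes the LLM is prompted to answer with a JSON object containing `pick` and `place` keys; the `VALID_PLACE_TYPES` vocabulary is likewise an assumption for illustration:

```python
import json

VALID_PLACE_TYPES = {"basket", "hand"}  # hypothetical destination vocabulary

def parse_llm_reply(reply: str) -> tuple[str, str]:
    """Extract (pick_target, place_destination) from an LLM JSON reply.

    Raises ValueError on malformed replies so the FSM can re-prompt the
    user instead of acting on an unparseable command.
    """
    try:
        data = json.loads(reply)
        pick = str(data["pick"]).strip().lower()
        place = str(data["place"]).strip().lower()
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise ValueError(f"unparseable LLM reply: {reply!r}") from exc
    if not pick or place not in VALID_PLACE_TYPES:
        raise ValueError(f"unexpected targets: pick={pick!r}, place={place!r}")
    return pick, place
```

Validating the reply at this boundary keeps downstream states (navigation, grasping) free of any assumptions about the LLM's output.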
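The pinhole back-projection used by both perception nodes maps a pixel plus its aligned depth reading into a 3D point in the camera's optical frame. A minimal sketch, where the intrinsics (fx, fy, cx, cy) would come from the RealSense camera-info message and the numbers in the test are illustrative only:

```python
def backproject(u: float, v: float, depth_m: float,
                fx: float, fy: float, cx: float, cy: float):
    """Back-project pixel (u, v) with aligned depth (meters) into a
    3D point (x, y, z) in the camera optical frame via the pinhole model."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)
```

The resulting camera-frame pose would then be transformed into the robot's base or map frame (e.g. via tf2) before being handed to navigation or the IK service.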
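The overall sequencing can be sketched as a table-driven state machine. The state names and the happy-path-only transition table below are illustrative assumptions, not the actual node's implementation (a real FSM would also carry failure edges, e.g. re-listening when parsing fails or re-detecting a lost target):

```python
from enum import Enum, auto

class State(Enum):
    LISTEN = auto()        # speech-to-text + LLM parsing
    PERCEIVE = auto()      # YOLOv8n / MediaPipe target localization
    NAV_TO_PICK = auto()   # Nav2 goal toward the pick target
    GRASP = auto()         # IK service call + arm execution
    NAV_TO_PLACE = auto()  # Nav2 goal toward the place destination
    PLACE = auto()         # release the object
    DONE = auto()

# Linear happy-path transitions between pipeline stages.
NEXT = {
    State.LISTEN: State.PERCEIVE,
    State.PERCEIVE: State.NAV_TO_PICK,
    State.NAV_TO_PICK: State.GRASP,
    State.GRASP: State.NAV_TO_PLACE,
    State.NAV_TO_PLACE: State.PLACE,
    State.PLACE: State.DONE,
}

def run(handlers, state=State.LISTEN):
    """Drive the pipeline: each handler performs its subsystem's work
    (speech, perception, Nav2, grasping) and returns when it succeeds."""
    visited = []
    while state is not State.DONE:
        handlers[state]()
        visited.append(state)
        state = NEXT[state]
    return visited
```

Keeping the transition table separate from the handlers mirrors the modularity noted above: each handler maps onto one ROS2 subsystem and can be stubbed out for isolated testing.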