Methodology

TidyBot++'s task execution is orchestrated by a ROS2 finite state machine (FSM) that sequences speech understanding, navigation, perception, and bimanual manipulation into a complete pick-and-place pipeline.

The robot listens for a voice command, which is transcribed via Google Speech-to-Text and parsed by a Gemini LLM to extract a pick target (e.g., "banana") and a place destination (e.g., "basket" or "hand").

In parallel, a YOLOv8n detection node runs on the RealSense RGB stream to localize pick targets by class, while a MediaPipe Hands node detects the palm center of a human hand as an alternative place destination. Both nodes use aligned depth imagery and pinhole back-projection to produce 3D poses in the camera frame.

Once targets are identified, the FSM navigates the robot using Nav2 with SLAM-based localization, then triggers a grasp sequence that calls a numerical IK service with collision checking to plan and execute the arm motions, and finally navigates to the destination to place the object.

Each subsystem operates independently and communicates through ROS2 topics and services, so perception, navigation, and manipulation can be tested and swapped in isolation.
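The speech-parsing step can be sketched as follows. The exact prompt and reply format of the Gemini call are not specified above, so this sketch assumes the LLM is prompted to answer with a JSON object containing `pick` and `place` keys; the `VALID_PLACE_TYPES` vocabulary is likewise an assumption for illustration:

```python
import json

VALID_PLACE_TYPES = {"basket", "hand"}  # hypothetical destination vocabulary

def parse_llm_reply(reply: str) -> tuple[str, str]:
    """Extract (pick_target, place_destination) from an LLM JSON reply.

    Raises ValueError on malformed replies so the FSM can re-prompt the
    user instead of acting on an unparseable command.
    """
    try:
        data = json.loads(reply)
        pick = str(data["pick"]).strip().lower()
        place = str(data["place"]).strip().lower()
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise ValueError(f"unparseable LLM reply: {reply!r}") from exc
    if not pick or place not in VALID_PLACE_TYPES:
        raise ValueError(f"unexpected targets: pick={pick!r}, place={place!r}")
    return pick, place
```

Validating the reply at this boundary keeps downstream states (navigation, grasping) free of any assumptions about the LLM's output.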
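The pinhole back-projection used by both perception nodes maps a pixel plus its aligned depth reading into a 3D point in the camera's optical frame. A minimal sketch, where the intrinsics (fx, fy, cx, cy) would come from the RealSense camera-info message and the numbers in the test are illustrative only:

```python
def backproject(u: float, v: float, depth_m: float,
                fx: float, fy: float, cx: float, cy: float):
    """Back-project pixel (u, v) with aligned depth (meters) into a
    3D point (x, y, z) in the camera optical frame via the pinhole model."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)
```

The resulting camera-frame pose would then be transformed into the robot's base or map frame (e.g. via tf2) before being handed to navigation or the IK service.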
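The overall sequencing can be sketched as a table-driven state machine. The state names and the happy-path-only transition table below are illustrative assumptions, not the actual node's implementation (a real FSM would also carry failure edges, e.g. re-listening when parsing fails or re-detecting a lost target):

```python
from enum import Enum, auto

class State(Enum):
    LISTEN = auto()        # speech-to-text + LLM parsing
    PERCEIVE = auto()      # YOLOv8n / MediaPipe target localization
    NAV_TO_PICK = auto()   # Nav2 goal toward the pick target
    GRASP = auto()         # IK service call + arm execution
    NAV_TO_PLACE = auto()  # Nav2 goal toward the place destination
    PLACE = auto()         # release the object
    DONE = auto()

# Linear happy-path transitions between pipeline stages.
NEXT = {
    State.LISTEN: State.PERCEIVE,
    State.PERCEIVE: State.NAV_TO_PICK,
    State.NAV_TO_PICK: State.GRASP,
    State.GRASP: State.NAV_TO_PLACE,
    State.NAV_TO_PLACE: State.PLACE,
    State.PLACE: State.DONE,
}

def run(handlers, state=State.LISTEN):
    """Drive the pipeline: each handler performs its subsystem's work
    (speech, perception, Nav2, grasping) and returns when it succeeds."""
    visited = []
    while state is not State.DONE:
        handlers[state]()
        visited.append(state)
        state = NEXT[state]
    return visited
```

Keeping the transition table separate from the handlers mirrors the modularity noted above: each handler maps onto one ROS2 subsystem and can be stubbed out for isolated testing.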