Collaborative Robotics Final Group Project:
Given a natural-language prompt from the user, the planner uses the RealSense RGB-D data stream to locate the requested object relative to the robot's pose, then plans the base and arm motion needed to navigate to the object and grasp it.
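As a rough sketch of how a detected pixel can be turned into a 3D target for the planner, the snippet below deprojects a pixel center and its depth reading into a camera-frame point with a standard pinhole model. The intrinsic values shown are placeholders for the ones published on the RealSense camera_info topic, and the resulting point would still need to be transformed into the robot's base frame (e.g., with tf2) to obtain the object pose relative to the robot.

```python
import numpy as np

# Pinhole-model deprojection. The intrinsics below are placeholders; the real
# values come from the RealSense camera_info topic.
FX, FY = 615.0, 615.0   # focal lengths in pixels (assumed)
CX, CY = 320.0, 240.0   # principal point in pixels (assumed)

def deproject(u, v, depth_m):
    """Convert a pixel (u, v) and its depth in meters to a point in the camera frame."""
    x = (u - CX) * depth_m / FX
    y = (v - CY) * depth_m / FY
    return np.array([x, y, depth_m])
```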
The robotic pipeline used in this project consists of four ROS2 nodes (a minimal launch sketch follows the list):
audio transcription node
object detection node
base navigation node
gripper control node
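The four nodes above could be brought up together with a ROS2 launch file along the lines of the sketch below; the package and executable names are placeholders rather than the project's actual ones.

```python
# Hypothetical launch description that starts the four pipeline nodes; package and
# executable names are illustrative placeholders.
from launch import LaunchDescription
from launch_ros.actions import Node


def generate_launch_description():
    return LaunchDescription([
        Node(package='locobot_pipeline', executable='audio_transcription_node'),
        Node(package='locobot_pipeline', executable='object_detection_node'),
        Node(package='locobot_pipeline', executable='base_navigation_node'),
        Node(package='locobot_pipeline', executable='gripper_control_node'),
    ])
```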
The audio from the Locobot's microphone is passed through Google's Speech-to-Text API to acquire the target object's name. The target name, along with the RealSense camera's real-time RGB-D image stream, is fed to a YOLOv11 object detection model. If the target object is in the camera frame, the planner publishes the object's center coordinates.
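A hedged sketch of the detection step is shown below: it runs an Ultralytics YOLOv11 model on an RGB frame, keeps the detection whose class label matches the transcribed target name, and publishes the box center. The topic name, message type, frame id, and weights file are assumptions for illustration, not necessarily the project's choices.

```python
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import PointStamped
from ultralytics import YOLO


class ObjectDetectionNode(Node):
    """Publishes the pixel center of the detection matching the spoken target name."""

    def __init__(self):
        super().__init__('object_detection_node')
        self.model = YOLO('yolo11n.pt')  # pretrained YOLOv11 weights (assumed)
        self.pub = self.create_publisher(PointStamped, '/target_center', 10)

    def detect_and_publish(self, rgb_frame, target_name):
        result = self.model(rgb_frame)[0]
        for box in result.boxes:
            if result.names[int(box.cls)] == target_name:
                x1, y1, x2, y2 = box.xyxy[0].tolist()
                msg = PointStamped()
                msg.header.frame_id = 'camera_color_optical_frame'  # assumed frame
                msg.point.x = (x1 + x2) / 2.0  # box center in pixels
                msg.point.y = (y1 + y2) / 2.0
                self.pub.publish(msg)
                return
```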
The target pose derived from the detected center coordinates is used to check whether the robot's position and orientation errors are below the gripper's reachability thresholds. If they are, the robot is close enough to grasp the object and publishes a flag directing the arm to begin gripping; if not, a P (proportional) controller drives the base closer to the target object.
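The base controller can be sketched as follows, assuming the target offset is expressed in the robot frame; the gains and reachability thresholds shown are illustrative values rather than the project's tuned ones.

```python
import math
from geometry_msgs.msg import Twist

KP_LIN, KP_ANG = 0.5, 1.0             # proportional gains (assumed)
DIST_THRESH, YAW_THRESH = 0.05, 0.1   # reachability thresholds in m / rad (assumed)


def p_control(dx, dy, yaw_error):
    """Given the target offset (dx, dy) in the robot frame and a heading error,
    return (grasp_ready, cmd_vel)."""
    dist = math.hypot(dx, dy)
    if dist < DIST_THRESH and abs(yaw_error) < YAW_THRESH:
        return True, Twist()              # close enough: stop and signal the arm
    cmd = Twist()
    cmd.linear.x = KP_LIN * dist          # drive toward the target
    cmd.angular.z = KP_ANG * yaw_error    # turn to face it
    return False, cmd
```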
Using the center coordinates obtained from object detection, the arm moves the gripper to a hover height above the object, opens its jaws, descends to the grasp pose on the object, closes its jaws tightly around it, and lifts the object to a predesignated lift height.
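That grasp sequence could be sketched as below. Here `bot` is a hypothetical wrapper whose `arm` and `gripper` helpers mirror the style of the Interbotix Python API, and the hover and lift heights are placeholder values, not the project's actual parameters.

```python
HOVER_OFFSET = 0.10   # hover height above the object, in meters (assumed)
LIFT_HEIGHT = 0.25    # predesignated lift height, in meters (assumed)


def grasp_object(bot, x, y, z):
    """Hover over the object, grip it, and lift it to the lift height."""
    bot.arm.set_ee_pose_components(x=x, y=y, z=z + HOVER_OFFSET)  # move to hover pose
    bot.gripper.open()                                            # open the jaws
    bot.arm.set_ee_pose_components(x=x, y=y, z=z)                 # descend to grasp pose
    bot.gripper.close()                                           # close on the object
    bot.arm.set_ee_pose_components(x=x, y=y, z=LIFT_HEIGHT)       # lift the object
```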