LISTEN — Listen for a human instruction; record the robot's home position
SEARCH — Spin and explore the environment using frontier-based navigation until the target object is found via YOLO
PLAN_GRASP — Localize the object in 3D using depth back-projection and compute an optimal top-down grasp pose
PRE_GRASP — Move the arm to a position 10 cm above the object
GRASP — Descend, close the gripper, and lift
VERIFY_GRASP — Confirm the object is held using a second YOLO detection pass
RETURN — Navigate back to the recorded home position
DONE — Mission complete
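The state sequence above can be sketched as a simple finite-state machine. The state names below mirror the list, but the transition table itself is an illustrative sketch (failure transitions, e.g. VERIFY_GRASP back to SEARCH on a dropped object, are omitted):

```python
from enum import Enum, auto

class MissionState(Enum):
    LISTEN = auto()
    SEARCH = auto()
    PLAN_GRASP = auto()
    PRE_GRASP = auto()
    GRASP = auto()
    VERIFY_GRASP = auto()
    RETURN = auto()
    DONE = auto()

# Each state maps to its successor on success; error/retry edges omitted.
NEXT_STATE = {
    MissionState.LISTEN: MissionState.SEARCH,
    MissionState.SEARCH: MissionState.PLAN_GRASP,
    MissionState.PLAN_GRASP: MissionState.PRE_GRASP,
    MissionState.PRE_GRASP: MissionState.GRASP,
    MissionState.GRASP: MissionState.VERIFY_GRASP,
    MissionState.VERIFY_GRASP: MissionState.RETURN,
    MissionState.RETURN: MissionState.DONE,
}
```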
The goal of this module was to let a user control the robot with spoken commands
The first step is a 5-second recording of the user speaking a command
That recording is fed into Google's Speech API, which transcribes it to text
The transcript is then parsed by Google's Gemini API into a structured JSON command
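A minimal sketch of the parsing half of this pipeline, assuming a hypothetical prompt template and that Gemini replies with (possibly code-fenced) JSON; the actual Google Speech and Gemini API calls are omitted here:

```python
import json

# Hypothetical prompt asking the LLM to turn free-form speech into a
# fixed JSON schema. The real google-generativeai call is omitted.
COMMAND_PROMPT = (
    "Extract the robot command from this transcript as JSON with keys "
    '"action" and "object". Transcript: "{transcript}"'
)

def build_prompt(transcript: str) -> str:
    return COMMAND_PROMPT.format(transcript=transcript)

def parse_command(gemini_output: str) -> dict:
    """Parse the model's JSON reply, stripping optional ``` fences."""
    text = gemini_output.strip()
    if text.startswith("```"):
        text = text.strip("`").removeprefix("json").strip()
    cmd = json.loads(text)
    if "action" not in cmd or "object" not in cmd:
        raise ValueError(f"incomplete command: {cmd}")
    return cmd
```

For example, a transcript like "please fetch the banana" would ideally come back as `{"action": "fetch", "object": "banana"}`, which `parse_command` turns into a plain dict for the state machine to act on.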
Why YOLOv8?
“Lightweight” model
Real-Time Speed
Efficiency
Why the COCO Dataset?
COCO (Common Objects in Context) already knows the essentials
Zero Training Required
Filtering + GPD grasp candidates
GraspNet grasp candidates
GraspNet integrated with ROS + Gazebo
Raw depth data from RealSense
YOLO bounding box + expansion
Crop of depth image
Point cloud from depth image + coordinate transform to base_link coordinates
RANSAC filtering for floor removal
PC centroid min-width sweep
Selected grasp candidate (maximum clearance from sides)
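The back-projection and floor-removal steps above can be sketched with NumPy. The function names, the bbox padding, and the flat-floor height cutoff (a simplified stand-in for the RANSAC plane fit) are illustrative assumptions:

```python
import numpy as np

def depth_crop_to_points(depth, bbox, fx, fy, cx, cy, pad=10):
    """Back-project a padded YOLO bbox crop of a depth image (meters)
    into camera-frame 3D points using the pinhole camera model."""
    x0, y0, x1, y1 = bbox
    x0, y0 = max(0, x0 - pad), max(0, y0 - pad)
    x1 = min(depth.shape[1], x1 + pad)
    y1 = min(depth.shape[0], y1 + pad)
    us, vs = np.meshgrid(np.arange(x0, x1), np.arange(y0, y1))
    z = depth[y0:y1, x0:x1]
    valid = z > 0                      # drop holes in the depth image
    x = (us - cx) * z / fx
    y = (vs - cy) * z / fy
    return np.stack([x[valid], y[valid], z[valid]], axis=1)

def remove_floor(points_base, floor_z=0.02):
    """Drop points near the floor plane; a stand-in for the RANSAC fit,
    assuming points are already in base_link coordinates with z up."""
    return points_base[points_base[:, 2] > floor_z]
```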
Since the LiDAR was not fully set up, our team attempted to use the robot's RGBD camera to fill out a 2D occupancy map. In sim, this map was a discretized 300x300 cell grid, with each cell 5 cm on a side. To fill out the map, we had the robot rotate 360°, sending out a set of ray casts at equally spaced angular intervals. The returned depths are converted to world-space obstacles. To mitigate misreadings and misalignment, each cell stores a probability of being occupied. We require the robot to cast at least two rays from significantly different starting positions before it can mark a cell as occupied with 100% certainty (colored magenta in the visualizer).
Building the occupancy map with depth rays necessitates some additional considerations. First, only obstacles with heights between 10 cm and 80 cm in world space are stored in the map; this prevents the floor (and potentially the ceiling or the underside of tables) from generating false positives in the map. Additionally, limiting the range of each ray to between 0.4 m and 5 m helped filter out self-hits and noisy long-distance readings. All cells along a ray between the camera and the obstacle are marked as free (using Bresenham's line algorithm).
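A minimal sketch of this ray-marking update, with hypothetical cell constants and a simplified probability bump standing in for the two-ray confirmation logic:

```python
import numpy as np

FREE, UNKNOWN = 0.0, 0.5  # illustrative cell occupancy probabilities
GRID, RES = 300, 0.05     # 300x300 cells, 5 cm per cell

def bresenham(x0, y0, x1, y1):
    """Integer grid cells along a line (Bresenham's line algorithm)."""
    cells, dx, dy = [], abs(x1 - x0), -abs(y1 - y0)
    sx, sy = (1 if x0 < x1 else -1), (1 if y0 < y1 else -1)
    err = dx + dy
    while True:
        cells.append((x0, y0))
        if (x0, y0) == (x1, y1):
            return cells
        e2 = 2 * err
        if e2 >= dy:
            err += dy; x0 += sx
        if e2 <= dx:
            err += dx; y0 += sy

def mark_ray(grid, robot_cell, hit_cell):
    """Mark cells between the robot and an obstacle hit as free, and
    bump the hit cell's occupancy probability. (The real node also
    tracks ray origins before committing a cell to 100%.)"""
    for (cx, cy) in bresenham(*robot_cell, *hit_cell)[:-1]:
        grid[cy, cx] = FREE
    hx, hy = hit_cell
    grid[hy, hx] = min(1.0, grid[hy, hx] + 0.5)
```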
Frontier-based exploration:
After each 360° scan, if the object was not yet seen or not yet reachable (no A* path), the robot created frontiers, which were free cells that bordered unknown space. The system grouped these frontier cells into clusters using flood fill, then filtered them. It discarded clusters with fewer than 8 cells, skipped clusters within 1 m of the robot to avoid revisiting nearby space, and rejected clusters that were not reachable via connected free space or were too close to obstacles (within a 35 cm clearance buffer). It ranked the remaining frontiers by the amount of surrounding unknown space divided by the distance from the robot, so the robot preferred to explore areas where it could learn the most per meter of travel. If the robot had already glimpsed the target object through YOLO but couldn't make an A* path to it yet, the scoring shifted to heavily favor frontiers near where it last saw the object.
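The frontier extraction and "information per meter" scoring might look like the following sketch, using simplified discrete cell labels and an assumed local-window unknown count; the flood-fill clustering and the size/clearance/reachability filters are omitted:

```python
import numpy as np

FREE, UNKNOWN, OCCUPIED = 0, -1, 1  # simplified labels for this sketch

def frontier_cells(grid):
    """Free cells with at least one unknown 4-neighbor."""
    cells = []
    h, w = grid.shape
    for y in range(h):
        for x in range(w):
            if grid[y, x] != FREE:
                continue
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and grid[ny, nx] == UNKNOWN:
                    cells.append((x, y))
                    break
    return cells

def score(cluster, robot_xy, grid, radius=5):
    """Unknown cells near the cluster centroid divided by travel
    distance — the 'learn the most per meter' ranking."""
    cx = sum(x for x, _ in cluster) / len(cluster)
    cy = sum(y for _, y in cluster) / len(cluster)
    ys, xs = np.mgrid[0:grid.shape[0], 0:grid.shape[1]]
    near = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
    unknown = int(np.sum((grid == UNKNOWN) & near))
    dist = max(1e-3, np.hypot(cx - robot_xy[0], cy - robot_xy[1]))
    return unknown / dist
```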
To navigate to a frontier, the system ran A* search on the occupancy grid with 8-directional movement (cost 1.0 for cardinal, √2 for diagonal), using Euclidean distance as the heuristic and capping the search at 50,000 node expansions. It inflated obstacles by the robot's clearance radius during planning to ensure paths remained a safe distance from obstacles. It then simplified the resulting path to one waypoint per meter, and the robot followed these waypoints using proportional heading control with reactive depth-based obstacle checks.
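A self-contained sketch of the planner described above — 8-connected A* with a Euclidean heuristic and an expansion cap; obstacle inflation and the one-waypoint-per-meter simplification are omitted:

```python
import heapq, itertools, math

def astar(free, start, goal, max_expansions=50_000):
    """8-connected A* on a boolean free-space grid (free[y][x]) with
    cost 1 for cardinal moves and sqrt(2) for diagonals; gives up
    after max_expansions node expansions."""
    moves = [(dx, dy, math.hypot(dx, dy))
             for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
    h = lambda p: math.hypot(goal[0] - p[0], goal[1] - p[1])
    tie = itertools.count()              # deterministic heap tie-breaking
    open_set = [(h(start), next(tie), 0.0, start, None)]
    came_from, expansions = {}, 0
    while open_set and expansions < max_expansions:
        _, _, g, cur, parent = heapq.heappop(open_set)
        if cur in came_from:             # already expanded via a cheaper route
            continue
        came_from[cur] = parent
        expansions += 1
        if cur == goal:                  # walk parent pointers back to start
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        x, y = cur
        for dx, dy, step in moves:
            nx, ny = x + dx, y + dy
            if (0 <= ny < len(free) and 0 <= nx < len(free[0])
                    and free[ny][nx] and (nx, ny) not in came_from):
                heapq.heappush(open_set, (g + step + h((nx, ny)),
                                          next(tie), g + step, (nx, ny), cur))
    return None                          # no path within the budget
```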
Object approach:
When YOLO detected the target object in at least two frames within a 2-second window (guarding against false positives), the system confirmed the detection and estimated the object's world position by back-projecting its pixel location through the depth image. If possible, it then planned an A* path to that position and entered an approach phase. Otherwise, it kept exploring until the map was complete enough to plan a path to the object.
During the object approach phase, we continuously checked whether the object stayed centered in the camera frame. If it drifted more than 60 pixels off-center, the robot stopped completely and rotated in place until it brought the object back within 40 pixels of center. The gap between these two thresholds prevented the robot from flickering between driving and correcting. After re-centering, the system re-planned the path from the robot's current position using the refreshed position estimate.
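The two-threshold centering logic can be captured in a few lines; the function and mode names here are illustrative:

```python
def approach_command(offset_px, correcting, drive_thresh=60, center_thresh=40):
    """Hysteresis controller for keeping the detection centered:
    switch to rotating when the object drifts past drive_thresh pixels,
    and resume driving only once it is back within center_thresh.
    Returns (mode, new_correcting_flag)."""
    if correcting:
        if abs(offset_px) <= center_thresh:
            return "drive", False        # re-centered; resume driving
        return "rotate", True            # keep rotating toward center
    if abs(offset_px) > drive_thresh:
        return "rotate", True            # drifted too far; stop and correct
    return "drive", False
```

The 20-pixel gap between the two thresholds is exactly what prevents chattering: an offset of 50 px keeps the robot in whichever mode it was already in.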
Pre-grasp positioning:
Once the robot was within 35 cm of the object, it swept through camera tilt positions until the object was centered vertically in the frame. It then executed a top-down grasp using depth-based point cloud segmentation and inverse kinematics (detailed in the Grasping section). The full pipeline from voice command to exploration to detection to grasp ran end-to-end in simulation (yay!).
Navigating the Real World:
On the real robot, we unfortunately ran into several issues that made the map-based approach unworkable in our time frame (aw man..). The lab's dense clutter of tables and chairs added noise to the occupancy grid, which sometimes led to unreliable paths. In addition, the robot's motors drifted slightly, and the camera feed lagged behind the robot's actual position. We slowed the robot down to compensate, but that made each test run much longer and slowed our debugging cycle. There was also a misalignment between the depth and RGB camera streams that appeared only on the real hardware; because the system built the occupancy grid from depth but pulled object positions from YOLO running on RGB, the detected object landed at the wrong place in the world map.
With limited time left to fix the depth-RGB registration, we simplified the navigation to a purely vision-based method. Instead of building a map, the robot simply rotated in place to scan for the unoccluded target using YOLO, AprilTag, or face detection (depending on the target object), then drove toward it while continuously steering to keep the detection centered in the frame. This gave up obstacle avoidance and occlusion-robustness, but it proved far more reliable on the real hardware because it depended only on what the camera saw in the moment rather than on an accumulated map that had drifted out of alignment with reality.
I integrated the face_recognition library for lightweight face detection and encoding/matching. I could set up a directory of different "face sets" and publish the resulting detections through the perception node.
Reference: face images pre-loaded + encoded
Encoding: 128-dim feature vectors using HOG face detection + dlib's ResNet model
Matching: detected faces are compared against known encodings
Output: Each detection publishes a bounding box, person name (or "unknown"), and confidence score (1 - distance) at up to 10 Hz
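The matching and confidence computation reduce to a nearest-neighbor search over encodings. This NumPy sketch assumes a 0.6 distance threshold (the documented default tolerance in face_recognition, though our exact setting may have differed):

```python
import numpy as np

def match_face(encoding, known_encodings, known_names, threshold=0.6):
    """Compare a 128-dim face encoding against known encodings by
    Euclidean distance (as face_recognition.face_distance does) and
    return (name_or_unknown, confidence) with confidence = 1 - distance."""
    if not known_names:
        return "unknown", 0.0
    dists = np.linalg.norm(np.asarray(known_encodings) - encoding, axis=1)
    best = int(np.argmin(dists))
    confidence = float(1.0 - dists[best])
    if dists[best] > threshold:
        return "unknown", confidence
    return known_names[best], confidence
```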
We can use people or AprilTags as “targets” for where to bring the object.
We had the pupil-apriltags detector running on our perception node, looking for AprilTags in the current camera output. When seen, they were published to a ROS topic.
Each AprilTag mapped a set of objects to its bin. In our test case, we used tag 2 with the YOLO class 'banana', so the "banana bin" was identified by tag 2.
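The tag-to-bin mapping reduces to a small lookup. The tag 2 ↔ 'banana' pairing comes from our test case; the other entry is purely illustrative:

```python
# Mapping from AprilTag IDs to the YOLO classes whose objects belong
# in the bin that tag marks. Tag 2 <-> 'banana' is our test case;
# tag 3 is an illustrative extra bin.
TAG_TO_OBJECTS = {
    2: {"banana"},
    3: {"cup", "bottle"},
}

def bin_for_object(obj_class):
    """Return the tag ID of the bin that accepts obj_class, or None."""
    for tag_id, objects in TAG_TO_OBJECTS.items():
        if obj_class in objects:
            return tag_id
    return None
```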
For the full codebase, please refer to the GitHub Repository below in the Footer.