ME 326 (CS 339R) Collaborative Robotics
Our current system relies on YOLO detection over RGB-D input to identify a small fixed set of objects, such as bananas and books, along with MediaPipe Hands for handoff placement. In future iterations, we aim to replace this fixed-class detector with an open-vocabulary perception model such as Grounding DINO or OWLv2 running on an onboard GPU. This would allow the robot to search for a wide range of previously unseen objects directly from speech queries, eliminating the need to retrain detectors for each new class or to maintain multiple classifiers.

Additionally, we plan to introduce persistent semantic memory using semantic 3D SLAM, enabling the robot to build a long-term map of objects and environments. By storing observations such as object labels, locations, timestamps, and past actions in a relational database, the robot will gain a searchable memory of its environment, allowing it to reason over prior observations, revisit known object locations, and plan future actions with greater context.
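A minimal sketch of what such a relational semantic memory could look like, using Python's built-in sqlite3. The schema, table name, and helper functions here are illustrative assumptions, not part of the current system: each row records one observation (open-vocabulary label, 3D position in the map frame, timestamp, optional action), which supports queries like "where was this object last seen?".

```python
import sqlite3
import time

# Hypothetical semantic-memory schema; names are illustrative assumptions.
conn = sqlite3.connect(":memory:")  # use a file path for persistence across runs
conn.execute(
    """CREATE TABLE IF NOT EXISTS observations (
           id     INTEGER PRIMARY KEY,
           label  TEXT NOT NULL,        -- open-vocabulary class name
           x REAL, y REAL, z REAL,      -- position in the SLAM map frame
           stamp  REAL NOT NULL,        -- unix timestamp of the detection
           action TEXT                  -- e.g. 'picked', 'handed_off'
       )"""
)

def record(label, xyz, action=None):
    """Store one detection in the semantic memory."""
    conn.execute(
        "INSERT INTO observations (label, x, y, z, stamp, action)"
        " VALUES (?, ?, ?, ?, ?, ?)",
        (label, *xyz, time.time(), action),
    )
    conn.commit()

def last_known_location(label):
    """Return the most recent (x, y, z) where `label` was seen, or None."""
    return conn.execute(
        "SELECT x, y, z FROM observations WHERE label = ?"
        " ORDER BY stamp DESC LIMIT 1",
        (label,),
    ).fetchone()

record("banana", (1.2, 0.4, 0.8))
record("book", (0.3, -1.1, 0.9), action="handed_off")
record("banana", (2.0, 0.5, 0.8))
print(last_known_location("banana"))  # most recent banana position
```

In a full system the insert would be driven by the detector and SLAM pose estimate, and a speech query ("find the banana") would first check this table before starting an exploratory search.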