The main struggle on the computer vision side was deciding how to implement the vision model. Initially, we took a very simplified approach to towel detection that just involved training a model to determine the colors in an image: if there was an overwhelming amount of red, we assumed it was the towel. This isn't very usable in real-world scenarios because it forces our users to have a specific type of towel for the robot to work, and it leaves a lot of room for wrong detections (imagine another red object on screen). Above is the code to utilize this vision model.
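For readers skimming, here is a condensed sketch of that color-threshold idea, assuming an OpenCV BGR frame; the hue ranges and the red-fraction threshold are illustrative, not our exact values.

```python
import cv2
import numpy as np

def detect_red_towel(frame_bgr, red_fraction_threshold=0.15):
    """Sketch of our first approach: flag the frame as containing the towel
    if an overwhelming share of pixels falls in a red hue range."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # Red wraps around the hue axis, so combine two ranges (values are illustrative).
    lower_red = cv2.inRange(hsv, (0, 120, 70), (10, 255, 255))
    upper_red = cv2.inRange(hsv, (170, 120, 70), (180, 255, 255))
    mask = cv2.bitwise_or(lower_red, upper_red)
    red_fraction = np.count_nonzero(mask) / mask.size
    return red_fraction > red_fraction_threshold, mask
```

This is exactly why the approach breaks down: any sufficiently red object passes the same threshold.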
We switched to using pre-trained models. Initially, we hoped to use one complete model that could identify everyday items and segment the human body at the same time. The most complete option we found was actually a combination of two models: we wrote some custom code to integrate Meta's Segment-Anything model with Grounding DINO, pairing text prompting with incredibly accurate object segmentation. However, the combination was extremely large (multiple GBs) and slow, taking almost 30 seconds per frame on Google Colab with a T4 GPU. While it let us use a single vision model to search for almost anything without fail, its computational intensity was too much of a burden to overlook; it would have been challenging to deploy on the Stretch robot and perform real-time object detection without significant lag. Instead, we compromised by training a customized YOLOv8 model (based on a YOLO-World model). It doesn't offer the same level of accuracy as the Grounding DINO + Segment-Anything pipeline, but it proved to be a better fit because it is lightweight and far less compute-intensive. In the end, we got a model that provides adequate accuracy for our task while being extremely fast.
Accuracy of the Grounding DINO and Segment-Anything model
Accuracy of the pre-trained YOLO-World model
Our custom YOLOv8 model proved to be very responsive at detecting and tracking objects in live frames.
Static Object Detection (Moving Camera)
Moving Object Detection (Static Camera)
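As a rough illustration of the speed difference, this is how a single frame runs through the lightweight Ultralytics YOLOv8 model; the checkpoint filename is a placeholder standing in for our custom weights.

```python
import time

from ultralytics import YOLO

# Placeholder filename standing in for our custom YOLOv8 checkpoint.
model = YOLO("custom_towel_yolov8.pt")

start = time.time()
results = model("frame.jpg")  # run a single image frame through the model
elapsed = time.time() - start

print(f"Inference took {elapsed:.2f}s")  # compare with the ~30 s per frame we saw for Grounding DINO + SAM on Colab
print(results[0].boxes)                  # detected boxes for this frame
```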
Towel Detector Code
We created a ROS detector object specifically for towels, loading a YOLOv8 model that determines whether an object is a towel or not. The code's format is based on object_detect_pytorch.py from the Stretch_ROS2 directory. We take the results returned by the model and break them down into their components (class labels, box IDs, and segmentation boxes), and we also calculate the center point of each box.
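Since the full file isn't reproduced here, below is a condensed sketch of the result-unpacking step described above, assuming Ultralytics YOLOv8 result objects; the class name and checkpoint filename are illustrative.

```python
from ultralytics import YOLO

class TowelDetector:
    """Condensed sketch of the towel detector's result handling (illustrative)."""

    def __init__(self, weights="towel_yolov8_seg.pt"):  # placeholder checkpoint name
        self.model = YOLO(weights)

    def detect(self, frame):
        results = self.model(frame)[0]
        detections = []
        for box in results.boxes:
            class_label = results.names[int(box.cls)]
            box_id = int(box.id) if box.id is not None else None  # track ID, only set when tracking is enabled
            x1, y1, x2, y2 = box.xyxy[0].tolist()                 # segmentation/bounding box corners
            center = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)           # center point of the box
            detections.append({"label": class_label, "id": box_id,
                               "box": (x1, y1, x2, y2), "center": center})
        return detections
```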
Body Segmentation Vision Model
Similarly, we created a detector object for our body-part segmentation model using the same format as the aforementioned object_detect_pytorch.py. Here, we load a pre-trained YOLOv8 model that is trained to segment body parts. Because both detectors use YOLOv8 models, the vision team could stay on the same page and help one another even though we were assigned different tasks (towel vs. body-part segmentation).
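A similarly condensed sketch of the body-part detector, this time pulling segmentation masks out of the results; the checkpoint name and class labels are placeholders that depend on the pre-trained weights used.

```python
from ultralytics import YOLO

class BodyPartDetector:
    """Condensed sketch of the body-part segmentation detector (illustrative)."""

    def __init__(self, weights="body_parts_yolov8_seg.pt"):  # placeholder pre-trained checkpoint
        self.model = YOLO(weights)

    def segment(self, frame):
        results = self.model(frame)[0]
        parts = []
        if results.masks is None:       # no segmentation output for this frame
            return parts
        for box, mask in zip(results.boxes, results.masks):
            label = results.names[int(box.cls)]  # e.g. an "arm" or "torso" class, depending on the checkpoint
            polygon = mask.xy[0]                 # mask outline in pixel coordinates
            parts.append({"label": label, "polygon": polygon})
        return parts
```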
Raghav and Alex are close to being done with inverse kinematics. Alex had a brilliant idea that moving the robot forward or right might amount to sending a negative x or y direction to the robot; this actually worked, even though the documentation doesn't clearly indicate everywhere that the traditional forward and right directions require negative inputs. There is a lot of boilerplate code that should start seeing use once IK is completely finished. Raghav also fine-tuned the custom manual cleaning motions for future use.
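To pin that convention down, here is a purely hypothetical helper illustrating the sign flip Alex found; the function and argument names are ours for illustration, not part of the Stretch API.

```python
def to_ik_target(forward_m: float, right_m: float) -> tuple[float, float]:
    """Hypothetical helper: what we intuitively call "forward" and "right"
    appear to map to negative x and y in the IK target frame."""
    return (-forward_m, -right_m)
```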
Nelson worked on a callback system to avoid some extra threading and to use the native capabilities of ROS2's joint trajectory action. Most of it works, but the callback function is not actually being invoked, so there's likely a tedious bug to weed out. Nelson also rigged up a simple UI and integrated the main driver with the web interface so that the web UI can start the robot through a websocket, and he hashed out several details of the contracts between the web interface, the driver, and vision.
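For context, here is a minimal sketch of the callback chaining we're aiming for with rclpy's action client; the action name is an assumption based on the standard Stretch controller, and forgetting to chain the result-future callback (or to keep the node spinning) is exactly the kind of bug that would explain a callback never firing.

```python
import rclpy
from rclpy.action import ActionClient
from rclpy.node import Node
from control_msgs.action import FollowJointTrajectory


class TrajectoryRunner(Node):
    """Minimal sketch of chaining goal and result callbacks on the joint trajectory action."""

    def __init__(self):
        super().__init__("trajectory_runner")
        # Assumed action name for the Stretch driver; adjust to the actual controller.
        self._client = ActionClient(self, FollowJointTrajectory,
                                    "/stretch_controller/follow_joint_trajectory")

    def send(self, goal: FollowJointTrajectory.Goal):
        self._client.wait_for_server()
        send_future = self._client.send_goal_async(goal)
        # First callback: fires when the server accepts or rejects the goal.
        send_future.add_done_callback(self._on_goal_response)

    def _on_goal_response(self, future):
        goal_handle = future.result()
        if not goal_handle.accepted:
            self.get_logger().warn("Trajectory goal rejected")
            return
        # Second callback must be chained here, or the "done" callback never fires.
        goal_handle.get_result_async().add_done_callback(self._on_result)

    def _on_result(self, future):
        self.get_logger().info(
            f"Trajectory finished with error code {future.result().result.error_code}")


def main():
    rclpy.init()
    node = TrajectoryRunner()
    # The node has to keep spinning for any of these callbacks to run.
    rclpy.spin(node)


if __name__ == "__main__":
    main()
```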