Figure 1 above illustrates the grasp detection pipeline.
The RGB image captured by the RealSense camera is processed by a YOLO11n model to detect objects in the scene. Since YOLO11n was not originally trained to recognize the objects relevant to this project, we fine-tuned the model on a custom annotated dataset of 127 images. Figure 2 shows an example of the output generated by the fine-tuned YOLO model.
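The exact training setup is not described in the text; as an illustration, a fine-tuning run with the Ultralytics tooling is typically driven by a small dataset config of roughly this shape (all paths and class names below are assumptions, not the project's actual dataset):

```yaml
# dataset.yaml -- hypothetical layout for the 127-image custom dataset
path: datasets/custom_grasp   # assumed dataset root
train: images/train           # assumed training split
val: images/val               # assumed validation split
names:
  0: cube                     # assumed class names, for illustration only
  1: ball
```

With such a config, the model is fine-tuned on the custom classes while reusing the weights YOLO11n learned from its original training data.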
The bounding-box labels and the user input are encoded into a shared feature space using OpenCLIP. The bounding box whose label embedding has the highest cosine similarity to the user-input embedding is selected, and the image is cropped to that box. The cropped image is passed to a CNN that generates an affordance map, from which a pixel is selected as the grasping point and sent to the grasping pipeline. Figure 3 shows an example of a generated affordance map. The CNN was trained on a dataset of 521 images of various cube-like objects.
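The selection and grasp-point steps can be sketched in a few lines. This is a minimal illustration, assuming the OpenCLIP embeddings for the labels and the user input have already been computed (the toy 2-D vectors below stand in for real embeddings), boxes are given as pixel coordinates, and the grasp pixel is taken as the affordance-map argmax; the actual pipeline may differ in these details:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_box(label_embeddings, query_embedding):
    # Index of the bounding box whose label embedding is most
    # similar to the user-input embedding.
    sims = [cosine_similarity(e, query_embedding) for e in label_embeddings]
    return int(np.argmax(sims))

def crop(image, box):
    # box = (x1, y1, x2, y2) in pixel coordinates.
    x1, y1, x2, y2 = box
    return image[y1:y2, x1:x2]

def grasp_pixel(affordance_map):
    # Pixel with the highest affordance score, as (row, col).
    return np.unravel_index(np.argmax(affordance_map), affordance_map.shape)

# Toy 2-D "embeddings" standing in for OpenCLIP features.
labels = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # e.g. two box labels
query = np.array([0.9, 0.1])                           # user-input embedding
best = select_box(labels, query)                       # -> 0

image = np.zeros((480, 640, 3), dtype=np.uint8)
patch = crop(image, (100, 50, 200, 150))               # 100x100 crop

amap = np.zeros((100, 100))
amap[40, 60] = 1.0                                     # mock affordance peak
point = grasp_pixel(amap)                              # -> (40, 60)
```

Selecting the argmax is the simplest policy; a real system might instead sample from high-affordance regions or reject peaks too close to the crop border.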