We trained a YOLO model for object detection on the key classes needed to make the pipeline function. For the training data we experimented with building our own dataset, but found it too time consuming and instead used labeled data available online. The model was trained on Google Colab before the weights were transferred to the lab machine and tested on Sawyer.
Training the model took over 8 hours on a GPU, using datasets ranging from hundreds to thousands of labeled images. With a smaller number of training samples, the initial model struggled to generalize and often overfit, making detection dependent on viewing angle, dish type (material), color, lighting, the proportion of the dish in view, and more. Retraining on a larger and more varied dataset, guarding against overfitting on uneven datasets with cross-validation, tuning the network through feature extraction, and similar steps gave a large improvement that made all dishes detectable in almost all situations. A minimal training sketch is shown below.
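As a rough illustration of the Colab training step, here is a minimal sketch assuming the ultralytics YOLO package and a hypothetical `dishes.yaml` dataset config; the report does not name the exact YOLO version or training framework, so treat the specifics as placeholders.

```python
# Hypothetical Colab training sketch (ultralytics package assumed; the
# actual YOLO variant and dataset config used in the project may differ).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")       # start from pretrained weights to limit overfitting
model.train(
    data="dishes.yaml",          # labeled dish classes gathered from online datasets
    epochs=100,
    imgsz=640,
)
# The resulting weights file is then copied to the lab machine for use on Sawyer.
```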
The RealSense camera can extract depth and generate 3D point clouds through stereo depth estimation. Combined with the object detection, this lets us locate objects in 3D space without having to mark them with AR tags, which is important in a wet workspace like ours.
We ran YOLO object detection and logged the 2D bounding box coordinates. Using the Intel RealSense D435i camera's depth estimation and intrinsics modules, the depth at those pixels was acquired. Applying the classical computer vision relations shown below, we recover X and Y in spatial coordinates, given x, y (the image coordinates) and f (the focal length, an intrinsic property of the camera).
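For reference, the standard pinhole back-projection we rely on is written out below; the principal point (c_x, c_y) and per-axis focal lengths f_x, f_y come from the RealSense intrinsics (the text above refers to a single focal length f).

```latex
X = \frac{(x - c_x)\,Z}{f_x}, \qquad
Y = \frac{(y - c_y)\,Z}{f_y}, \qquad
Z = \text{depth at pixel } (x, y)
```

In practice, pyrealsense2's `rs2_deproject_pixel_to_point` performs this same deprojection using the stored intrinsics.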
The detection can be somewhat spotty, as the holes in the generated point clouds show. The issue is made worse by vibrations that shake the camera and by changes in lighting. We mitigated this by recalibrating the setup regularly and by accounting for the noise elsewhere in the system.
We fine-tuned the system by taking the most common depths over a smaller sub-mask of the bounding box (a central region covering roughly 25% of the cropped box, which removes the background). After aligning the color and depth frames, we average the depths of those points to improve the accuracy of the 3D spatial coordinates. This removes holes in the point cloud and accounts for varying geometry by sampling depth across a subsection of the central surface. In practice these steps greatly improved consistency.
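A sketch of this refinement step is below, assuming pyrealsense2 frames and a YOLO box given as (x1, y1, x2, y2); variable names are illustrative, and a median is used here as a robust stand-in for the "most common depth then average" described above.

```python
# Hedged sketch: refine the depth estimate over the central sub-mask of a box.
import numpy as np
import pyrealsense2 as rs

align = rs.align(rs.stream.color)   # align the depth frame to the color frame

def refined_depth(depth_image, box, depth_scale):
    """Return a robust depth (meters) over the central ~25% of the bounding box."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    # Shrink the box to half its width and height -> central 25%-area sub-mask.
    cx1, cy1 = x1 + w // 4, y1 + h // 4
    cx2, cy2 = x2 - w // 4, y2 - h // 4
    patch = depth_image[cy1:cy2, cx1:cx2].astype(np.float32) * depth_scale
    patch = patch[patch > 0]        # drop zero-depth pixels (holes in the cloud)
    return float(np.median(patch)) if patch.size else None
```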
To accomplish this we use an AR tag on the robot, and we compute and save the transformation from the camera frame to the robot frame. All detected points are then expressed relative to this camera-to-robot transformation.
An AR tag was positioned at the base of the robot, and using a transform lookup we extracted the rotation matrix and translation vector and stored the transform data in a CSV file for later use. Applying the quaternion transformation formula, the transformed coordinates were obtained and published to the subscriber node for path planning.
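The sketch below shows how this capture-and-reuse step might look, assuming a ROS 1 setup with tf2_ros and an AR-tag frame already in the tf tree; the frame names, topic, and file path are illustrative, not taken from the report.

```python
# Hedged sketch of saving and applying the camera-to-robot transform.
import csv
import rospy
import tf2_ros
import tf2_geometry_msgs                      # registers PointStamped transforms
from geometry_msgs.msg import PointStamped

rospy.init_node("camera_to_base_capture")
buf = tf2_ros.Buffer()
listener = tf2_ros.TransformListener(buf)
rospy.sleep(1.0)                              # let the tf buffer fill

# The AR tag at the robot base lets tf resolve camera -> base; store it for reuse.
tf_cam_to_base = buf.lookup_transform("base", "camera_color_optical_frame",
                                      rospy.Time(0), rospy.Duration(4.0))
t, q = tf_cam_to_base.transform.translation, tf_cam_to_base.transform.rotation
with open("camera_to_base.csv", "w") as f:
    csv.writer(f).writerow([t.x, t.y, t.z, q.x, q.y, q.z, q.w])

# Apply the transform (quaternion rotation + translation) to a detected point.
pt = PointStamped()
pt.header.frame_id = "camera_color_optical_frame"
pt.point.x, pt.point.y, pt.point.z = 0.1, 0.0, 0.6   # example camera-frame point
pt_base = tf2_geometry_msgs.do_transform_point(pt, tf_cam_to_base)
```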
After a point is found and transformed into the robot frame, paths are generated and pruned into a safe set of motions that take the robot arm from the home position, down to a safe pick position, back to a rest-and-reorient position, and finally to an appropriate drop position. After a drop is completed on either level of the dish rack, the robot safely backs out and returns to the home position to perform another pick.
The planner is sampling-based, so the number of intermediate steps it takes to reach its goal can differ from run to run. We leveraged this by replanning several times and keeping the shortest plan, which converges toward a predictable, near-straight-line path as the number of generated plans increases. A sketch of this selection step is shown below.
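The following is a minimal sketch of the "plan several times, keep the shortest" heuristic, assuming MoveIt's Python interface (moveit_commander); the group name is illustrative, and the number of trajectory points is used as a stand-in for path length.

```python
# Hedged sketch: replan repeatedly and keep the shortest candidate plan.
import sys
import rospy
import moveit_commander

moveit_commander.roscpp_initialize(sys.argv)
rospy.init_node("dish_path_planner")
group = moveit_commander.MoveGroupCommander("right_arm")   # group name assumed

def shortest_plan(pose_target, attempts=10):
    """Plan `attempts` times and keep the plan with the fewest trajectory points."""
    group.set_pose_target(pose_target)
    best = None
    for _ in range(attempts):
        result = group.plan()
        # Newer MoveIt releases return (success, plan, time, error_code);
        # older releases return the RobotTrajectory directly.
        plan = result[1] if isinstance(result, tuple) else result
        if plan and plan.joint_trajectory.points:
            if best is None or (len(plan.joint_trajectory.points)
                                < len(best.joint_trajectory.points)):
                best = plan
    return best
```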
The PID controller gains were tuned as well as possible to track the desired trajectory, which minimized random offsets in where the end-effector picked and placed the dishes.
If a pickable object is found in the scene, the detected points are transformed and sent to the path planner via ROS topic communication: the detection code publishes to a topic and the path planner subscribes to it.
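A sketch of this hand-off is below; the topic name and message type are illustrative, and in the real system the publisher and subscriber live in separate nodes, each calling rospy.init_node().

```python
# Hedged sketch of the detection -> planner hand-off over a ROS topic.
import rospy
from geometry_msgs.msg import PointStamped

rospy.init_node("dish_detection")            # separate nodes in practice

# Detection side: publish each transformed pick point in the robot frame.
pick_pub = rospy.Publisher("/dish_pick_point", PointStamped, queue_size=1)

# Planner side: subscribe and trigger path planning on each new point.
def on_pick_point(msg):
    rospy.loginfo("Planning pick at (%.3f, %.3f, %.3f)",
                  msg.point.x, msg.point.y, msg.point.z)
    # ... hand the point to the path planner here ...

rospy.Subscriber("/dish_pick_point", PointStamped, on_pick_point)
```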