We trained a YOLO model for object detection on the key classes needed to make the pipeline function. For the training data we experimented with building our own dataset, but found it too time consuming and instead used labeled data available online. The model was trained on Google Colab before the weights were transferred to the lab machine and tested on Sawyer.
Training the model took over 8 hours on a GPU, using datasets ranging from hundreds to thousands of labeled images. With a smaller number of training samples, the initial model struggled to generalize and often overfit, making detection dependent on viewing angle, dish type (material), color, lighting, the proportion of the dish in view, and more. Retraining on a larger and more varied dataset, guarding against overfitting on uneven datasets with cross-validation, tuning the network through feature extraction, and similar steps gave a large improvement that made all dishes detectable in almost all situations. A minimal training sketch is shown below.
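As a rough illustration of the Colab training step, here is a minimal sketch assuming the ultralytics YOLO package and a hypothetical `dishes.yaml` dataset config; the report does not name the exact YOLO version or training framework, so treat the specifics as placeholders.

```python
# Hypothetical Colab training sketch (ultralytics package assumed; the
# actual YOLO variant and dataset config used in the project may differ).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")       # start from pretrained weights to limit overfitting
model.train(
    data="dishes.yaml",          # labeled dish classes gathered from online datasets
    epochs=100,
    imgsz=640,
)
# The resulting weights file is then copied to the lab machine for use on Sawyer.
```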
The RealSense camera can extract depth and generate 3D point clouds through stereo depth estimation. Combined with the object detection, this lets us locate objects in 3D space without having to mark them with AR tags, which is important in a wet workspace like ours.
We ran YOLO object detection and logged the 2D bounding box coordinates. Using the Intel RealSense D435i camera's depth estimation and intrinsics modules, the depth at those pixels was acquired. Applying the classical computer vision relations shown below, we recover X and Y in spatial coordinates, given x, y (the image coordinates) and f (the focal length, an intrinsic property of the camera).
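For reference, the standard pinhole back-projection we rely on is written out below; the principal point (c_x, c_y) and per-axis focal lengths f_x, f_y come from the RealSense intrinsics (the text above refers to a single focal length f).

```latex
X = \frac{(x - c_x)\,Z}{f_x}, \qquad
Y = \frac{(y - c_y)\,Z}{f_y}, \qquad
Z = \text{depth at pixel } (x, y)
```

In practice, pyrealsense2's `rs2_deproject_pixel_to_point` performs this same deprojection using the stored intrinsics.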
The detection can be somewhat spotty, as the holes in the generated point clouds show. The issue is made worse by vibrations that shake the camera and by changes in lighting. We mitigated this by recalibrating the setup regularly and by accounting for the noise elsewhere in the system.
We fine-tuned the system by taking the most common depths over a smaller sub-mask of the bounding box (a central region covering roughly 25% of the cropped box, which removes the background). After aligning the color and depth frames, we average the depths of those points to improve the accuracy of the 3D spatial coordinates. This removes holes in the point cloud and accounts for varying geometry by sampling depth across a subsection of the central surface. In practice these steps greatly improved consistency.
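A sketch of this refinement step is below, assuming pyrealsense2 frames and a YOLO box given as (x1, y1, x2, y2); variable names are illustrative, and a median is used here as a robust stand-in for the "most common depth then average" described above.

```python
# Hedged sketch: refine the depth estimate over the central sub-mask of a box.
import numpy as np
import pyrealsense2 as rs

align = rs.align(rs.stream.color)   # align the depth frame to the color frame

def refined_depth(depth_image, box, depth_scale):
    """Return a robust depth (meters) over the central ~25% of the bounding box."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    # Shrink the box to half its width and height -> central 25%-area sub-mask.
    cx1, cy1 = x1 + w // 4, y1 + h // 4
    cx2, cy2 = x2 - w // 4, y2 - h // 4
    patch = depth_image[cy1:cy2, cx1:cx2].astype(np.float32) * depth_scale
    patch = patch[patch > 0]        # drop zero-depth pixels (holes in the cloud)
    return float(np.median(patch)) if patch.size else None
```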
To accomplish this we use an AR tag on the robot, and we compute and save the transformation from the camera frame to the robot frame. All detected points are then expressed relative to this camera-to-robot transformation.
An AR tag was positioned at the base of the robot, and using a transform lookup we extracted the rotation matrix and translation vector and stored the transform data in a CSV file for later use. Applying the quaternion transformation formula, the transformed coordinates were obtained and published to the subscriber node for path planning.
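The sketch below shows how this capture-and-reuse step might look, assuming a ROS 1 setup with tf2_ros and an AR-tag frame already in the tf tree; the frame names, topic, and file path are illustrative, not taken from the report.

```python
# Hedged sketch of saving and applying the camera-to-robot transform.
import csv
import rospy
import tf2_ros
import tf2_geometry_msgs                      # registers PointStamped transforms
from geometry_msgs.msg import PointStamped

rospy.init_node("camera_to_base_capture")
buf = tf2_ros.Buffer()
listener = tf2_ros.TransformListener(buf)
rospy.sleep(1.0)                              # let the tf buffer fill

# The AR tag at the robot base lets tf resolve camera -> base; store it for reuse.
tf_cam_to_base = buf.lookup_transform("base", "camera_color_optical_frame",
                                      rospy.Time(0), rospy.Duration(4.0))
t, q = tf_cam_to_base.transform.translation, tf_cam_to_base.transform.rotation
with open("camera_to_base.csv", "w") as f:
    csv.writer(f).writerow([t.x, t.y, t.z, q.x, q.y, q.z, q.w])

# Apply the transform (quaternion rotation + translation) to a detected point.
pt = PointStamped()
pt.header.frame_id = "camera_color_optical_frame"
pt.point.x, pt.point.y, pt.point.z = 0.1, 0.0, 0.6   # example camera-frame point
pt_base = tf2_geometry_msgs.do_transform_point(pt, tf_cam_to_base)
```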
After a point is found and transformed into the robot frame, paths are generated and pruned into a safe set of motions that take the robot arm from the home position, down to a safe pick position, back to a rest-and-reorient position, and finally to an appropriate drop position. After a drop is completed on either level of the dish rack, the robot safely backs out and returns to the home position to perform another pick.
The planner is sampling-based, so the number of intermediate steps it takes to reach its goal can differ from run to run. We leveraged this by replanning several times and keeping the shortest plan, which converges toward a predictable, near-straight-line path as the number of generated plans increases. A sketch of this selection step is shown below.
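The following is a minimal sketch of the "plan several times, keep the shortest" heuristic, assuming MoveIt's Python interface (moveit_commander); the group name is illustrative, and the number of trajectory points is used as a stand-in for path length.

```python
# Hedged sketch: replan repeatedly and keep the shortest candidate plan.
import sys
import rospy
import moveit_commander

moveit_commander.roscpp_initialize(sys.argv)
rospy.init_node("dish_path_planner")
group = moveit_commander.MoveGroupCommander("right_arm")   # group name assumed

def shortest_plan(pose_target, attempts=10):
    """Plan `attempts` times and keep the plan with the fewest trajectory points."""
    group.set_pose_target(pose_target)
    best = None
    for _ in range(attempts):
        result = group.plan()
        # Newer MoveIt releases return (success, plan, time, error_code);
        # older releases return the RobotTrajectory directly.
        plan = result[1] if isinstance(result, tuple) else result
        if plan and plan.joint_trajectory.points:
            if best is None or (len(plan.joint_trajectory.points)
                                < len(best.joint_trajectory.points)):
                best = plan
    return best
```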
The PID controller gains were tuned as well as possible to track the desired trajectory, which minimized random offsets in where the end-effector picked and placed the dishes.
If a pickable object is found in the scene, the detected points are transformed and sent to the path planner via ROS topic communication: the detection code publishes to a topic and the path planner subscribes to it.
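A sketch of this hand-off is below; the topic name and message type are illustrative, and in the real system the publisher and subscriber live in separate nodes, each calling rospy.init_node().

```python
# Hedged sketch of the detection -> planner hand-off over a ROS topic.
import rospy
from geometry_msgs.msg import PointStamped

rospy.init_node("dish_detection")            # separate nodes in practice

# Detection side: publish each transformed pick point in the robot frame.
pick_pub = rospy.Publisher("/dish_pick_point", PointStamped, queue_size=1)

# Planner side: subscribe and trigger path planning on each new point.
def on_pick_point(msg):
    rospy.loginfo("Planning pick at (%.3f, %.3f, %.3f)",
                  msg.point.x, msg.point.y, msg.point.z)
    # ... hand the point to the path planner here ...

rospy.Subscriber("/dish_pick_point", PointStamped, on_pick_point)
```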