The project had to include sensing, planning, and actuation. Beyond these three required elements, we were free to design and implement any project idea.
At a high level, we wanted to design a robot that could recreate any given configuration of the 7 Tetrimino pieces that fits within a 36" x 30" table. For pick and place, we aimed for a tolerance of roughly 5 mm between each piece's desired and actual placement. We also aimed for a solution time of roughly 2 minutes for a simple four-piece configuration.
We were limited to the cameras on board the Baxter robot we had access to. We opted to use the wrist cameras for all vision tasks because the head camera is of significantly worse quality. The camera feed was processed with OpenCV functions, allowing us to find the contours of the puzzle pieces and the table. The ar_track_alvar package was also very helpful in this phase of the workflow, since it allowed us to track the AR tags placed on our puzzle table.
Another option we explored was an Intel RealSense camera, which has significantly better image quality and more features than the Baxter wrist cameras. Despite the improved performance, this type of camera required further external calibration for use with ar_track_alvar, as well as some type of hardware fixture to keep the camera placement stable and consistent during puzzle solves. This left us with a tradeoff between performance and convenience. The Baxter hand cameras we chose proved highly susceptible to glare and oversaturated images, which led us to improvise ways to block out external light for the sake of a better image. We also made a collective decision not to use specific Tetrimino pieces that were not compatible with the lighting conditions in the lab. In the end, an Intel RealSense camera or some other third-party camera would have required extra time to calibrate and set up, but could have produced much higher-quality images and thus been more robust to the lighting conditions.
For our planning, we relied heavily on the MoveIt commander as well as on AR tags to physically manipulate pieces across our table. Through user input processing (specified in Implementation), we knew the 2D locations of our pieces in the desired configuration, but still needed to know where our scrambled pieces were in space, as well as the z position of our desired configuration. Through our segmentation, we were able to single out different pieces based on their color and then perform a midpoint calculation to mark the optimal pickup position for each piece. At this stage, we applied a homography to take the table as seen from our Baxter wrist camera and map it onto a flat 2D plane. Then, we could convert the 2D pixel values (x, y) of each piece's marked midpoint into a 3D pose (x', y', z) relative to the Baxter robot itself, using the pose of the AR tag on the table. Once we had the initial and desired poses for each piece, this information was passed into MoveIt to find a motion plan that would allow Baxter's hand to reach the positions and orientations necessary for pick and place.
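The pixel-to-pose conversion described above can be sketched as follows. This is a minimal illustration in plain Python, not our actual implementation: the function names, the pixel coordinates of the table corners, and the AR tag pose are all hypothetical, and it assumes the table is level, so only the tag's yaw is needed to rotate table coordinates into the robot frame.

```python
import math

def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate point correspondences")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]

def find_homography(src, dst):
    """Return the 3x3 matrix H (with H[2][2] = 1) mapping four src
    points (u, v) onto four dst points (x, y)."""
    A, b = [], []
    for (u, v), (x, y) in zip(src, dst):
        A.append([u, v, 1, 0, 0, 0, -u * x, -v * x])
        b.append(x)
        A.append([0, 0, 0, u, v, 1, -u * y, -v * y])
        b.append(y)
    h = solve_linear(A, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]

def apply_homography(H, u, v):
    """Map one pixel (u, v) through H, dividing out the projective scale."""
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    x = (H[0][0] * u + H[0][1] * v + H[0][2]) / w
    y = (H[1][0] * u + H[1][1] * v + H[1][2]) / w
    return x, y

def table_to_robot(x, y, tag_xyz, tag_yaw):
    """Rotate and translate a table-plane point into the robot frame.
    Assumes a level table, so z is simply the tag's height."""
    c, s = math.cos(tag_yaw), math.sin(tag_yaw)
    return (tag_xyz[0] + c * x - s * y,
            tag_xyz[1] + s * x + c * y,
            tag_xyz[2])

# Hypothetical pixel positions of the four table corners in the wrist image.
pixel_corners = [(100, 80), (540, 90), (600, 400), (60, 390)]
# The 36" x 30" table in meters, with the origin at the AR tag corner (assumed).
table_corners = [(0.0, 0.0), (0.9144, 0.0), (0.9144, 0.762), (0.0, 0.762)]

H = find_homography(pixel_corners, table_corners)
x, y = apply_homography(H, 320, 240)       # a piece midpoint, in pixels
pose = table_to_robot(x, y, tag_xyz=(0.6, -0.3, -0.15), tag_yaw=0.0)
```

With OpenCV available, `cv2.findHomography` and `cv2.perspectiveTransform` would replace the two hand-written homography helpers; they are spelled out here only to keep the sketch dependency-free.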
MoveIt was a very powerful tool that was fairly accessible to the group thanks to our prior experience in lab assignments. It was generally quick to find a motion plan when one existed, or to notify us when a certain pose was not reachable. It was also fairly precise, in that it would consistently reach the same arm configuration when given the same commands. However, we began to see some of its drawbacks once we started to implement orientation constraints on the suction gripper. Right away, we noticed that our likelihood of finding suitable motion plans decreased significantly, causing us to abandon constraints altogether. With MoveIt, we had a quick and easy way to calculate motion plans, but we had to forgo orientation constraints, which made it possible for our robot to inadvertently come into contact with obstacles such as tables and other puzzle pieces.
We decided to input the final desired puzzle configuration by passing a string to our program. This was chosen over two alternatives: a different data structure, or placing the pieces on the table and using computer vision. A different data structure would have produced similar results, since our algorithm for determining final destinations would fundamentally be the same. We used a string for simplicity, reducing development and debugging time so that we could focus on more critical aspects of the project. In the same vein, using computer vision to determine the desired locations would have added a layer of complexity: we would have struggled to accurately identify and locate pieces from a poor-quality image.
Baxter executed the motion plan provided by MoveIt, lowered its hand, picked up the piece by turning on the suction gripper, and then waited for user input. This wait period was crucial because the image provided by the wrist camera was not clear enough to give accurate AR tag information for determining the height of the puzzle, or to find the true center of a puzzle piece. As a result, the gripper would often be half on and half off the puzzle piece, or a few millimeters too high above it; in both situations the piece was not picked up. The user input would therefore tell Baxter either to continue and move the piece to its final destination or to attempt the pickup procedure again. After making sure the piece was picked up, Baxter executed the next plan from MoveIt to the final destination, where it would attempt to drop off the piece. Again, it would wait for user input before continuing to pick up the next piece.
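The confirm-and-retry procedure above can be sketched as a small loop. This is an illustrative outline under stated assumptions, not our actual code: `move_to`, `set_suction`, and `confirm` are hypothetical callables standing in for the MoveIt execution, the gripper interface, and the operator prompt, and the hover height and attempt limit are made-up parameters. Stopping the run when the operator declines after a drop-off is also an assumption.

```python
def above(pose, hover):
    """Return the same (x, y, z) pose raised by `hover` meters."""
    x, y, z = pose
    return (x, y, z + hover)

def pick_and_place(pieces, move_to, set_suction, confirm,
                   max_attempts=3, hover=0.10):
    """Move each (name, pick_pose, place_pose) piece, asking the operator
    to confirm every pickup. Returns the names of the pieces placed."""
    placed = []
    for name, pick_pose, place_pose in pieces:
        picked = False
        for attempt in range(1, max_attempts + 1):
            set_suction(False)                  # start from a released gripper
            move_to(above(pick_pose, hover))    # approach from above
            move_to(pick_pose)                  # lower onto the piece
            set_suction(True)
            move_to(above(pick_pose, hover))    # lift so the operator can see
            if confirm("Picked up %s (attempt %d)? [y/n] " % (name, attempt)):
                picked = True
                break
        if not picked:
            set_suction(False)                  # give up on this piece
            continue
        move_to(above(place_pose, hover))
        move_to(place_pose)
        set_suction(False)                      # drop the piece
        move_to(above(place_pose, hover))
        placed.append(name)
        if not confirm("Placed %s. Continue? [y/n] " % name):
            break                               # assumption: "no" ends the run
    return placed

# Dry run with scripted operator answers instead of a robot and keyboard.
moves, suction = [], []
answers = iter([False, True, True, True, True])
result = pick_and_place(
    [("T", (0.6, 0.1, -0.15), (0.7, -0.2, -0.15)),
     ("L", (0.5, 0.3, -0.15), (0.7, -0.1, -0.15))],
    move_to=moves.append,
    set_suction=suction.append,
    confirm=lambda prompt: next(answers),
)
```

On the robot, `confirm` could be as simple as `lambda prompt: input(prompt).strip().lower() == "y"`, with `move_to` wrapping the planning and execution calls.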
An ideal system in industry would not need to wait for user input, because doing so largely defeats the purpose of automation. An industrial application would reduce the variability in its sensing that arises from poor camera quality, changing lighting conditions that shift color threshold values, and the dependency on an AR tag, so that the process executes reliably. This could come from external cameras (similar to a motion capture system) that provide more accurate 3D positions and pickup validation.