We consider our overall progress with the robot a success. It was able to perform tasks that satisfy the base criteria we had defined at the start of the project:
Localize and recognize cards with the camera:
The computer vision service was highly reliable in detecting the number and suit of each card, achieving >99% mean Average Precision (mAP) on a held-out validation set of synthetic images it was not trained on.
The twod_to_3d node successfully transformed image coordinates detected with the Baxter camera into 3D coordinates. Qualitatively, we verified the accuracy of the conversion by manually moving cards and checking their reported locations with the Baxter arm and tf_echo. Using two AR tags together allows Baxter to reliably translate the 2D pixel coordinates from the neural network into 3D coordinates, which it stores as the locations of detected cards and accesses as needed.
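To make the conversion concrete, here is a minimal sketch of the underlying idea: back-project a pixel through the pinhole model at the table depth estimated from the AR tags, then transform the result into Baxter's base frame with tf. The intrinsics, frame names, and function names are illustrative assumptions, not our exact node code.

```python
# Hypothetical sketch of the pixel -> 3D conversion (not our exact node code).
import numpy as np
import rospy
import tf
from geometry_msgs.msg import PointStamped

def pixel_to_camera_frame(u, v, table_depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) assuming it lies on the table plane at
    `table_depth` metres from the camera (depth estimated from the AR tags)."""
    x = (u - cx) * table_depth / fx
    y = (v - cy) * table_depth / fy
    return np.array([x, y, table_depth])

def camera_point_to_base(listener, point_xyz, camera_frame='left_hand_camera'):
    """Transform a camera-frame point into Baxter's base frame via tf."""
    ps = PointStamped()
    ps.header.frame_id = camera_frame
    ps.header.stamp = rospy.Time(0)  # use the latest available transform
    ps.point.x, ps.point.y, ps.point.z = point_xyz
    return listener.transformPoint('base', ps)

# usage (after rospy.init_node): listener = tf.TransformListener()
```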
Consistently pick and place cards on a table:
Once the vacuum gripper had finally been installed correctly, we were able to interface with it through gripper commands and actuate the suck-and-release action for cards successfully. There were a few failure modes: planning inaccuracies sometimes caused the gripper to contact a card on an edge or corner, breaking the air seal required for suction, and when the deck of cards was slanted, the gripper occasionally picked up multiple cards at once. Aside from the limitations detailed below, the overall system still achieved a reasonable level of consistency in operation.
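For reference, the suck-and-release cycle can be sketched with the Baxter SDK's gripper interface, where close() engages suction on a vacuum cup and open() releases it. The arm side and sleep durations below are assumptions for illustration.

```python
# Minimal sketch of the suck-and-release cycle (arm side and timings
# are illustrative assumptions, not our exact values).
import rospy
import baxter_interface

rospy.init_node('card_pick_sketch')
gripper = baxter_interface.Gripper('right')  # assumes the vacuum cup is on the right arm

def pick_card():
    gripper.close()   # engage suction once the cup is pressed onto the card
    rospy.sleep(1.0)  # give the vacuum time to form a seal

def place_card():
    gripper.open()    # release suction to drop the card
    rospy.sleep(0.5)
```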
Monitor and respond to the current state of the game, making legal moves on its turn:
The game design itself is relatively simple: our gameplay node allows the robot to make the optimal decision on its turn, staying in an idle state until the human players interact with the game state. We also went beyond the original game design for the Baxter and included a 'cheating-detection' mode, giving us the ability to keep the human players on track.
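The control flow of that node can be summarized as a small state machine; the state names and injected callbacks below are hypothetical stand-ins for our actual implementation.

```python
# Illustrative skeleton of the gameplay node's control loop. State names
# and the injected callbacks are hypothetical stand-ins for our real code.
IDLE, ROBOT_TURN = 'idle', 'robot_turn'

def gameplay_loop(game, wait_for_human_move, detect_cheating, best_move, play):
    """`game` tracks the game state; the callbacks encapsulate vision,
    cheat detection, strategy, and arm motion respectively."""
    state = IDLE
    while not game.is_over():
        if state == IDLE:
            move = wait_for_human_move(game)   # blocks until vision sees a change
            if detect_cheating(game, move):    # the 'cheating-detection' mode
                print('Cheat detected!')       # e.g. announce and pause the game
            state = ROBOT_TURN
        else:
            play(best_move(game))              # optimal move for our simple game
            state = IDLE
```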
Difficulties Encountered
For software, our main difficulty was correctly calculating the 3D transforms of the cards. Our computer vision node returns the 2D pixel coordinates of each card's corners, all of which we needed to transform into 3D coordinates in Baxter's base frame. We ended up modifying the C++ source code of ar_track_alvar to output the pixel coordinates of the AR tags, but various inaccuracies remained in the math converting between pixels and 3D distance.
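One illustration of where such inaccuracies creep in: estimating a single metres-per-pixel scale from the two AR tags implicitly assumes a flat table viewed head-on. A hypothetical version of that calculation:

```python
# Hypothetical scale estimate from two AR tags with a known physical
# separation; this bakes in a flat, fronto-parallel table assumption,
# which is one source of pixel -> 3D error.
import numpy as np

def metres_per_pixel(tag1_px, tag2_px, tag_separation_m):
    """Scale factor from the pixel distance between two tags whose real
    separation on the table (`tag_separation_m`) is known."""
    pixel_dist = np.linalg.norm(np.asarray(tag1_px) - np.asarray(tag2_px))
    return tag_separation_m / pixel_dist
```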
However, the most time-consuming obstacle we had to overcome was hardware-based: installing and interfacing with the Baxter vacuum gripper. Here's an itemized list of our roughly 30 days of disagreements with the hardware:
The vacuum gripper was not actually installed on the Archytas Baxter robot we were assigned to
We switched robots, but the gripper was incorrectly installed on Ayrton
We switched again, and the gripper was reinstalled correctly on Asimov, but Asimov's cameras did not work and URDF issues abounded
Another switch, gripper transplanted back to Ayrton
Ayrton broke the morning of the showcase :(
While filming our video, the vacuum gripper developed an air leak and blew cards away more than usual
Flaws & Improvements
If we had more time to work on this project, we would most likely focus on path planning first; some attempts to play card games with Baxter were thwarted by MoveIt producing very inefficient paths that took a long time to execute or had us worried for the robot's safety. These issues could be mitigated by:
Mapping out Baxter's comfortable reachable workspace and making sure we only play the game in that area,
Adding an option in the code to replan a faulty motion path (see the sketch after this list),
Implementing or integrating a better motion planner than MoveIt's default.
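As a sketch of the replanning option, one could reject plans whose trajectories exceed a duration budget and retry a few times. The thresholds and the tuple handling below are assumptions (the return type of plan() differs across MoveIt versions).

```python
# Sketch of the "replan a faulty path" option (duration budget and retry
# count are illustrative; plan() returns a tuple on newer MoveIt versions).
import moveit_commander

def plan_with_retries(group, pose, max_attempts=5, max_duration_s=10.0):
    for _ in range(max_attempts):
        group.set_pose_target(pose)
        result = group.plan()
        traj = result[1] if isinstance(result, tuple) else result
        points = traj.joint_trajectory.points
        if points and points[-1].time_from_start.to_sec() < max_duration_s:
            return traj  # accept the first plan within the time budget
    return None  # all attempts too slow; caller can reposition or abort

# usage: group = moveit_commander.MoveGroupCommander('right_arm')
```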
We would also want to improve our computer vision pipeline to be more accurate at predicting bounding boxes and card identities:
Our localization (2d_to_3d) service could be tuned to be more accurate by:
Using more than two AR tags to estimate the depth of the table from the camera, increasing accuracy in case one tag is at an odd angle or position.
Making use of data from multiple frames of the camera (while moving the camera around slightly) to automatically detect the depth of the cards based on the pointcloud generation techniques from our computer vision lab.
Using either of the two approaches above to account for the slant of the table, giving a more accurate depth estimate of each card (see the plane-fitting sketch after this list).
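As a sketch of the first approach, assuming three or more visible AR tags, a table plane z = ax + by + c can be fit by least squares and then queried for a per-card depth instead of using one shared depth:

```python
# Sketch of the plane-fit idea: with three or more AR tags, fit the table
# plane z = ax + by + c by least squares to get slant-aware card depths.
import numpy as np

def fit_table_plane(tag_points):
    """tag_points: (N, 3) array of AR tag positions in the camera frame, N >= 3."""
    pts = np.asarray(tag_points)
    A = np.column_stack([pts[:, 0], pts[:, 1], np.ones(len(pts))])
    (a, b, c), *_ = np.linalg.lstsq(A, pts[:, 2], rcond=None)
    return a, b, c

def depth_at(x, y, plane):
    a, b, c = plane
    return a * x + b * y + c  # depth estimate for a card at (x, y)
```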
We could fine-tune the neural network training further for our task by:
Training on actual images from Baxter's camera to include more realistic lighting conditions in the training data,
Using YOLOv5 instead of YOLOv4 for the convolutional neural network,
Using card predictions from multiple views (e.g., only including a detected card if it was seen from two out of three vantage points, to exclude false positives; see the sketch after this list),
Training the network on higher-resolution images (and determining whether the slower processing time would be worth it).
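A minimal sketch of the multi-view filtering idea: keep a detection only if it appears in at least a minimum number of vantage points. Matching detections by card label alone is a simplification; a real version might also gate on position.

```python
# Multi-view consensus filter: keep a card only if it was detected from
# at least `min_views` vantage points (label-only matching for brevity).
from collections import Counter

def filter_detections(views, min_views=2):
    """views: list of per-view sets of detected card labels."""
    counts = Counter(label for view in views for label in view)
    return {label for label, n in counts.items() if n >= min_views}

# usage: filter_detections([{'KH', '3S'}, {'KH'}, {'KH', '3S', 'QD'}])
# returns {'KH', '3S'} and drops the one-off 'QD' false positive
```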
With a model trained on higher-resolution images, our vision setup could also benefit from increasing the camera's field of view to include more of the playing field. This could be achieved by:
Placing the camera and gripper on different hands
Using an external webcam together with additional AR-tag-based localization
We also had one small hardware issue with the vacuum gripper that we could try to fix in software: when releasing suction, the gripper sometimes emits a gust of air that blows away light objects such as cards or AR tags. The ROS API exposes a parameter for the suction strength, so we could experiment with it to minimize the risk of accidentally blowing away gameplay pieces while still being able to reliably pick up objects. We unfortunately did not have time to tune this parameter because, due to the difficulties listed above, we only got the vacuum gripper working very late in the project.
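A tuning loop for that experiment might look like the following. As we understand it, the Baxter SDK exposes suction gripper settings through set_parameters(); the parameter name ('blow_off_seconds', the release blow-off duration) and the candidate values are assumptions to verify against the SDK.

```python
# Hedged sketch of tuning the gripper's release behaviour. The parameter
# name and values are assumptions to check against the Baxter SDK docs.
import rospy
import baxter_interface

rospy.init_node('gripper_tuning')
gripper = baxter_interface.Gripper('right')

for blow_off in (0.0, 0.1, 0.3):  # candidate release blow-off durations (s)
    gripper.set_parameters({'blow_off_seconds': blow_off})
    gripper.close()                # pick up a test card
    rospy.sleep(1.0)
    gripper.open()                 # release; observe whether the card is blown away
    rospy.sleep(2.0)
```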
Finally, we could be more ambitious in our goals and attempt other kinds of object interaction that do not rely on AR tags, such as flipping cards or manipulating poker chips and other objects.