We used Baxter, its right arm camera, and the vacuum gripper attachment.
We built three ROS nodes: vision_node and twod_to_3d, which handle the computer vision aspects of our project, and gameplay_node, which handles the gameplay. We also created a launch file, vision.launch, to launch all the vision- and MoveIt-related dependencies at once: vision_node, twod_to_3d, ar_track_alvar, MoveIt, and the joint_trajectory_action_server.
The vision node was built on top of a tiny-yolov4 convolutional neural network (CNN) that was trained as follows:
1. Pictures of each card were taken with a group member's phone camera.
2. OpenCV was used to extract each playing card from its photo, along with the convex hull of the card's corner (where the number and suit of the card are printed).
3. 50,000 synthetic images were generated with multiple playing cards overlaid on top of a variety of textures taken from the Describable Textures Dataset (https://www.robots.ox.ac.uk/~vgg/data/dtd/index.html). The dataset generation process was heavily inspired by https://github.com/geaxgx/playing-card-detection, to which we added some more augmentation techniques.
Image augmentation techniques used on the cards themselves include (see the sketch after this list):
- Brightening/Dimming
- Color Distortion
- Rotation
- Translation
- Shear
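For reference, here is a minimal sketch of this kind of per-card augmentation using OpenCV and NumPy. The parameter ranges are illustrative only and are not the ones used to build our dataset.

```python
import cv2
import numpy as np

def augment_card(card_bgr):
    """Apply random brightness, color, rotation, translation, and shear to a card image."""
    h, w = card_bgr.shape[:2]

    # Brightening/dimming plus a slight per-channel color distortion.
    gain = np.random.uniform(0.6, 1.4)
    tint = np.random.uniform(0.9, 1.1, size=3)
    img = np.clip(card_bgr.astype(np.float32) * gain * tint, 0, 255).astype(np.uint8)

    # Rotation about the card center, with translation folded into the same matrix.
    rot = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), np.random.uniform(-90, 90), 1.0)
    rot[0, 2] += np.random.uniform(-0.1, 0.1) * w
    rot[1, 2] += np.random.uniform(-0.1, 0.1) * h

    # Compose a small shear with the rotation/translation into one 2x3 affine matrix.
    shear = np.array([[1.0, np.random.uniform(-0.1, 0.1), 0.0],
                      [np.random.uniform(-0.1, 0.1), 1.0, 0.0]])
    affine = shear.dot(np.vstack([rot, [0.0, 0.0, 1.0]]))

    return cv2.warpAffine(img, affine, (w, h), borderMode=cv2.BORDER_REPLICATE)
```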
With this dataset of 50,000 images, we trained a tiny-yolov4 model to classify the 52 different cards using the publicly available Darknet library (https://githubmemory.com/repo/AlexeyAB/darknet), starting from a checkpoint pretrained on MS-COCO. The resulting model achieved a >99% mean Average Precision (mAP) score on a 10,000-image validation set generated the same way as in step 3. We also hand-tested it on a few pictures of different cards taken with Baxter's camera and our phone cameras, and it classified the cards fairly reliably in these informal tests.
Once the neural network was trained, we subscribed vision_node to the /cameras/right_hand_camera topic. We used ROS's cv_bridge to convert the ROS Image message to an OpenCV image, cropped the center (e.g. the middle 800-pixel-wide rectangle from a 1280x800 image), and resized it to a resolution closer to what the network was trained on (600x600) before using Darknet's Python API to run tiny-yolov4 on the image. The network output bounding boxes for the corners of the cards (in pixel coordinates) as well as a classification of what type of card each was. Each detected card was translated into a Card message (string card_type, int corner1, int corner2, int width, int height) describing its bounding box and card type, and the Cards from each image were compiled into a CardList message listing every card detected in that image. The CardList was published by vision_node to the /pokerbot/cards topic.
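For concreteness, here is a condensed sketch of vision_node under a few assumptions: the custom Card/CardList messages live in a pokerbot package with a cards field on CardList, corner1/corner2 hold the top-left pixel coordinates of the bounding box, the darknet.py wrapper from AlexeyAB's repository provides load_network/detect_image, and the cfg/data/weights paths are placeholders.

```python
#!/usr/bin/env python
import rospy
import cv2
import darknet                                   # darknet.py wrapper from AlexeyAB/darknet
from cv_bridge import CvBridge
from sensor_msgs.msg import Image
from pokerbot.msg import Card, CardList          # assumed package name for our custom messages

bridge = CvBridge()
# Placeholder paths; the real files come from the training step described above.
network, class_names, _ = darknet.load_network("cards.cfg", "cards.data", "cards.weights")
net_w, net_h = darknet.network_width(network), darknet.network_height(network)  # 600x600 in our cfg
pub = None

def image_callback(msg):
    img = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
    h, w = img.shape[:2]
    crop = img[:, (w - h) // 2:(w + h) // 2]      # center square (800x800 from a 1280x800 frame)
    crop = cv2.resize(crop, (net_w, net_h))       # resize toward the training resolution

    # Hand the frame to darknet and collect (label, confidence, bbox) tuples.
    dn_img = darknet.make_image(net_w, net_h, 3)
    darknet.copy_image_from_bytes(dn_img, cv2.cvtColor(crop, cv2.COLOR_BGR2RGB).tobytes())
    detections = darknet.detect_image(network, class_names, dn_img, thresh=0.5)
    darknet.free_image(dn_img)

    cards = CardList()
    for label, conf, (cx, cy, bw, bh) in detections:
        card = Card()
        card.card_type = label
        card.corner1 = int(cx - bw / 2)           # bbox top-left, in resized-crop pixel coords;
        card.corner2 = int(cy - bh / 2)           # the real node scales these back to the frame
        card.width = int(bw)
        card.height = int(bh)
        cards.cards.append(card)
    pub.publish(cards)

if __name__ == "__main__":
    rospy.init_node("vision_node")
    pub = rospy.Publisher("/pokerbot/cards", CardList, queue_size=1)
    rospy.Subscriber("/cameras/right_hand_camera/image", Image, image_callback, queue_size=1)
    rospy.spin()
```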
Our twod_to_3d service (named that way because ROS complained that a service name cannot start with a number) subscribes to the /pokerbot/cards, /ar_corners, and /ar_pose_markers topics. Note that we added the /ar_corners topic to ar_track_alvar ourselves; it publishes the pixel coordinates of the AR tags. We implemented two different ways for the service to take the card bounding boxes in the latest image from /pokerbot/cards (in pixel coordinates) and turn them into 3D coordinates relative to the robot's base frame.
Our first approach was to place two AR tags on the table and use their known 3D poses (from /ar_pose_markers) and known 2D pixel coordinates (from /ar_corners) to compute scaling ratios from pixels to real-world centimeters in the x and y directions. We used the 3D poses of the markers to estimate the z coordinate of all the cards.
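A minimal sketch of this first approach, assuming the camera looks roughly straight down so that image u/v align with base-frame x/y; variable names are illustrative.

```python
import numpy as np

def pixel_to_base_scaled(card_px, tag1_px, tag2_px, tag1_xyz, tag2_xyz):
    """Map a card's pixel coordinates to base-frame coordinates using two AR tags.

    *_px are (u, v) pixel coordinates, *_xyz are (x, y, z) base-frame positions.
    """
    # Pixel-to-meter scale factors in each image direction.
    scale_x = (tag2_xyz[0] - tag1_xyz[0]) / float(tag2_px[0] - tag1_px[0])
    scale_y = (tag2_xyz[1] - tag1_xyz[1]) / float(tag2_px[1] - tag1_px[1])

    x = tag1_xyz[0] + (card_px[0] - tag1_px[0]) * scale_x
    y = tag1_xyz[1] + (card_px[1] - tag1_px[1]) * scale_y
    z = (tag1_xyz[2] + tag2_xyz[2]) / 2.0        # table height estimated from the tags
    return np.array([x, y, z])
```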
Our second approach was to use the two AR tags' 3D poses to compute the depth of the table plane with respect to the camera frame. We then multiplied the cards' pixel coordinates (in homogeneous form) by the inverse camera intrinsic matrix (K^-1) and scaled the result by the computed depth to get the cards' coordinates with respect to the camera. Finally, we used a TF listener to get the transformation from the camera frame to the base frame and applied it to the cards' coordinates to express them in the base frame.
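A sketch of the second approach with NumPy and tf, assuming K is the 3x3 intrinsic matrix of the hand camera and that the table is roughly parallel to the image plane so a single depth applies to all cards; the frame names are assumptions.

```python
import numpy as np
import rospy
import tf.transformations as tfs

def pixel_to_base(u, v, depth, K, tf_listener,
                  camera_frame="right_hand_camera", base_frame="base"):
    """Back-project pixel (u, v) at the table depth and express it in the base frame."""
    # Ray through the pixel in normalized camera coordinates, scaled by the table depth.
    p_cam = depth * np.linalg.inv(K).dot(np.array([u, v, 1.0]))

    # Transform from the camera frame into the base frame via tf.
    trans, rot = tf_listener.lookupTransform(base_frame, camera_frame, rospy.Time(0))
    T = tfs.quaternion_matrix(rot)     # 4x4 homogeneous rotation built from the quaternion
    T[0:3, 3] = trans                  # fill in the translation
    return T.dot(np.append(p_cam, 1.0))[:3]
```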
Finally, the gameplay_node has an initial step of going to a set "bird's eye view" (BEV), a vantage point where the camera can see all the AR tags and cards on the field. Once there, it identifies the location of the deck and what card is at the top of the deck by calling the service /twod_to_3d.
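For illustration, this is the kind of call gameplay_node makes from the BEV; the service type and response field names here are hypothetical, since our actual .srv definition is not reproduced above.

```python
import rospy
from pokerbot.srv import GetCardPoses       # hypothetical .srv definition

rospy.wait_for_service("/twod_to_3d")
twod_to_3d = rospy.ServiceProxy("/twod_to_3d", GetCardPoses)

resp = twod_to_3d()                         # latest detections as base-frame poses
for card_type, pose in zip(resp.card_types, resp.poses):
    rospy.loginfo("%s at x=%.2f, y=%.2f, z=%.2f", card_type,
                  pose.position.x, pose.position.y, pose.position.z)
```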
Let us first define the "draw" motion as follows: moving Baxter's gripper over the deck, lowering, applying suction, lifting, moving to a part of the reachable workspace away from the zone of play (the hand), lowering, stopping suction, and returning to BEV. The "play" motion is defined similarly as moving Baxter's gripper over a card in its hand, lowering, applying suction, lifting, moving to the saved location of the zone of play (hardcoded as a location to the right of the deck), lowering, stopping suction, and returning to BEV. Each movement of Baxter's arm is planned by MoveIt, confirmed by a human, and executed by the PID controller from Lab 7.
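A sketch of the "draw" motion using moveit_commander and baxter_interface; the confirmation prompt follows the "confirmed by a human" step above, while the Lab 7 PID controller call is a placeholder since that interface is not spelled out here.

```python
import copy
import rospy
import moveit_commander
import baxter_interface

def raised(pose, dz=0.10):
    """Return a copy of a geometry_msgs/Pose lifted by dz meters."""
    p = copy.deepcopy(pose)
    p.position.z += dz
    return p

def move_to(group, pose, controller):
    """Plan with MoveIt, ask a human to confirm, then execute with our own controller."""
    group.set_pose_target(pose)
    plan = group.plan()                      # planned trajectory (a tuple on newer MoveIt versions)
    if raw_input("Execute this plan? [y/N] ").strip().lower() != "y":   # input() on Python 3
        return False
    controller.execute_plan(plan)            # placeholder for the Lab 7 PID controller
    return True

def draw(group, controller, gripper, deck_pose, hand_pose, bev_pose):
    """The "draw" motion: pick the top card off the deck and drop it in the hand."""
    move_to(group, raised(deck_pose), controller)   # hover over the deck
    move_to(group, deck_pose, controller)           # lower onto the top card
    gripper.close()                                 # apply suction
    move_to(group, raised(deck_pose), controller)   # lift
    move_to(group, raised(hand_pose), controller)   # carry toward the hand area
    move_to(group, hand_pose, controller)           # lower
    gripper.open()                                  # release suction
    move_to(group, bev_pose, controller)            # return to the bird's-eye view

# Typical setup:
#   group = moveit_commander.MoveGroupCommander("right_arm")
#   gripper = baxter_interface.Gripper("right")
```

The "play" motion is the same sequence with the hand-card pose and the zone-of-play pose swapped in.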
The gameplay_node, after initialization, "draws" several cards, then waits for a human to flip over the first card into the zone of play. From there, gameplay proceeds in the following loop:
1. The gameplay_node calls /twod_to_3d to identify the card in the zone of play.
2. It goes through the list of cards in its hand and "plays" a card with the same number or suit as the card in the zone of play if possible. Otherwise it "draws" a card.
3. After its move, Baxter waits for operator input telling it that its opponent has finished playing.
4. It then observes the state of the game with /twod_to_3d; if it sees that the new card in the zone of play is illegal given the previous card played (meaning its opponent cheated), gameplay_node renders a big frowny face on Baxter's head.
Then we repeat the gameplay loop (steps 1 through 4) until either Baxter or its opponent is out of cards to play (and wins).
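The legality check itself reduces to comparing rank and suit. A minimal version is sketched below, assuming card labels of the form rank + suit (e.g. "7h", "10s", "Kd"); the exact label format produced by our classifier is an assumption here.

```python
def is_legal(card, top_card):
    """A card is legal if it shares a rank or a suit with the card in the zone of play."""
    rank, suit = card[:-1], card[-1]
    top_rank, top_suit = top_card[:-1], top_card[-1]
    return rank == top_rank or suit == top_suit

def choose_move(hand, top_card):
    """Return a card from the hand to play, or None to signal that Baxter should draw."""
    for card in hand:
        if is_legal(card, top_card):
            return card
    return None
```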