The following figure displays the workflow of our robot starting from the initial sensing phase.
We first needed to identify the location of the table we were putting our pieces on in the image seen by Baxter's camera. We found color segmentation to be unnecessary on the wooden planks that we used as our tabletop, as the quality of the camera made the planks very bright and essentially white in color. We instead turned the image to grayscale and used the OpenCV function cv2.threshold to find the brightest pixels in order to identify the table.
We relied on the colors of the pieces to distinguish between the different shapes. We took a screenshot of the image received from the camera on Baxter's hand and used an HSV trackbar to identify the appropriate HSV range for each color. We then passed these ranges to the OpenCV function cv2.inRange to isolate the matching pixels, and used the resulting "mask" image (right) to find the dimensions of our pieces.
The masks generated by thresholding and color segmentation allowed us to use cv2.findContours to identify the set of pixels surrounding the object we were trying to isolate. Using the function parameter CV_CHAIN_APPROX_SIMPLE, we compressed the points along each straight edge, leaving only the endpoints of each line segment. These points were then passed into cv2.approxPolyDP, which approximated each contour as a polygon.
We took the extreme points of the approximated shape to be the corners of our object and extracted their coordinates. The corners of the table allowed us to find the homography matrix for the image. For each Tetris piece, we subtracted the minimum x and y coordinates from the maximum x and y coordinates, divided by two, and added the result to the minimum coordinates to find the piece's midpoint, which was the ideal location for Baxter to pick it up.
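As a small worked example of the midpoint calculation, the sketch below takes a list of corner coordinates and returns the midpoint of their extremes. The function name and sample corners are hypothetical.

```python
def bounding_midpoint(corners):
    """Midpoint of the extreme x and y coordinates of a set of corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    # Half of the extent in each direction, measured from the minimum.
    half_width = (max(xs) - min(xs)) / 2.0
    half_height = (max(ys) - min(ys)) / 2.0
    return (min(xs) + half_width, min(ys) + half_height)

# Hypothetical corner coordinates in pixels.
midpoint = bounding_midpoint([(20, 10), (69, 10), (69, 49), (20, 49)])
```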
To properly translate into 3D space, we first had to transform the pixel coordinates from the angled image seen by the camera to a straightened image that reflected the real-world dimensions of the table and Tetris pieces. We created a 3600x3000 pixel image (to reflect the proportions of the actual table, which was 36x30 inches, at 100 pixels per inch) and took the coordinates of its corners. Using these coordinates, the coordinates of the corners from the approximated polygon, and the cv2.findHomography function, we generated a homography matrix that mapped the originally skewed corners onto the straightened ones. Applying this matrix to the rest of the camera image allowed us to do the same for all other points.
To convert the pixel coordinates found from homography into the base frame coordinates for Baxter, we used an AR marker to tag the table. Since the table is a flat plane, this allowed us to dynamically find the height of the table (position.z). The AR tag was also placed near the bottom left of the table with respect to Baxter to make it easier to convert from pixel coordinates to the base frame. To find the x and y positions of the individual pieces, we converted the displacement of the centers in pixel coordinates into distance in meters from the center of the AR tag through dimensional analysis, and added those to the x, y position of the marker.
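The dimensional analysis can be illustrated with the sketch below. It assumes the straightened image uses 100 pixels per inch (3600 pixels across 36 inches) and, as a simplification that may not match the real setup, that the image x and y axes line up with the base frame x and y axes. The tag position and function names are hypothetical.

```python
PIXELS_PER_INCH = 100.0   # 3600 px across a 36 inch table
METERS_PER_INCH = 0.0254

def pixels_to_meters(pixels):
    """Convert a length in straightened-image pixels to meters."""
    return pixels / PIXELS_PER_INCH * METERS_PER_INCH

def piece_position_in_base(piece_px, tag_px, tag_xyz):
    """Offset the AR tag's base-frame position by the piece's displacement.

    Assumes image axes are aligned with the base frame axes; a real setup
    may need axes swapped or signs flipped.
    """
    dx = pixels_to_meters(piece_px[0] - tag_px[0])
    dy = pixels_to_meters(piece_px[1] - tag_px[1])
    # The table is flat, so the piece shares the tag's height.
    return (tag_xyz[0] + dx, tag_xyz[1] + dy, tag_xyz[2])

# Hypothetical values: tag at the image origin, piece 10 and 5 inches away.
position = piece_position_in_base((1000, 500), (0, 0), (0.6, -0.3, -0.15))
```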
The specific Baxter we used had faulty hardware: the output of what was supposed to be the left hand camera actually came from the right hand, and vice versa. We had been using the left hand camera (by subscribing to the right hand camera topic) and using the right arm to move pieces. However, the robot itself assumed that the cameras were wired correctly, so the reported position of the AR tag changed whenever the right hand moved, even though the actual camera image was stationary. As a result, we couldn't simply use the output position of the AR marker; we had to compose the transform from the base frame to the correct camera frame with the position of the AR tag in the camera frame to find the true position of the tag with respect to the base.
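Composing the two transforms can be sketched with homogeneous matrices as below. This is only an illustration of the idea: the numeric values are invented, the rotation is restricted to yaw for brevity, and make_transform is a hypothetical helper rather than part of any ROS API.

```python
import math
import numpy as np

def make_transform(yaw_rad, translation):
    """4x4 homogeneous transform with a rotation about z and a translation."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

# Hypothetical pose of the (correct) camera in the base frame.
base_T_camera = make_transform(math.pi / 2, (0.5, 0.2, 0.3))
# Hypothetical pose of the AR tag as seen from that camera.
camera_T_tag = make_transform(0.0, (0.1, 0.0, 0.4))

# Chain the transforms: base <- camera <- tag.
base_T_tag = base_T_camera.dot(camera_T_tag)
tag_in_base = base_T_tag[:3, 3]
```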
The computer vision portion relies on the pieces being different colors to distinguish between the different piece types. In the Kinematics portion, however, we identify each piece type by the letter that its shape resembles. To get from color to letter, we map each color to a letter: for example, the computer vision portion sees a "purple" piece, and the processing before Kinematics maps that "purple" piece to the corresponding "J" piece. A list of the mapping is as follows:
In order to know which pieces we wanted to pick up and where they were, the Kinematics portion needed to receive the output of the CV portion as its input. The CV portion sent this as a PoseArray. The PoseArray's frame_id was set to a string of piece colors, which we parsed into a list of all the piece colors and mapped each color to a piece type (e.g. a "purple" piece corresponded to the "J" piece). The respective poses for each piece were stored in the array itself.
We process the PoseArray to output a dictionary that holds the actual pieces and their respective PoseStamped messages. We call this dictionary our "actual" dictionary.
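A rough sketch of this processing is shown below, using plain tuples in place of ROS messages so that it runs without a ROS installation. It assumes the colors in frame_id are separated by whitespace and that each piece type appears once; the "yellow" to "O" entry is purely illustrative, and only the "purple" to "J" pairing is taken from the description above.

```python
def build_actual(frame_id, poses, color_to_piece):
    """Pair each color named in frame_id with the pose at the same index.

    frame_id: string of colors, assumed whitespace-separated.
    poses: sequence of poses in the same order as the colors.
    color_to_piece: mapping from color name to piece letter.
    """
    colors = frame_id.split()
    if len(colors) != len(poses):
        raise ValueError("expected one pose per color")
    return {color_to_piece[color]: pose
            for color, pose in zip(colors, poses)}

# "purple" -> "J" is from the text; "yellow" -> "O" is a made-up entry.
color_to_piece = {"purple": "J", "yellow": "O"}
# Stand-ins for Pose messages: (x, y, z) positions in meters.
poses = [(0.70, 0.10, -0.15), (0.60, -0.20, -0.15)]
actual = build_actual("purple yellow", poses, color_to_piece)
```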
Once we have the dictionary, we can pass that to the actuation functions. For each piece, the PoseStamped contains a Pose, with the position being the position of the piece's center. Our code translates to just above the piece (position.z + 0.2 m), before dipping down to suction up the piece, and returning to just above the piece.
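The hover, dip, and return motion amounts to three waypoints, as in the sketch below. Positions are plain (x, y, z) tuples rather than ROS messages, and pick_waypoints is a hypothetical name; only the 0.2 m offset comes from the description above.

```python
HOVER_OFFSET_M = 0.2  # height above the piece before and after the dip

def pick_waypoints(center):
    """Return the hover, contact, and hover positions for one piece."""
    x, y, z = center
    hover = (x, y, z + HOVER_OFFSET_M)
    contact = (x, y, z)
    # Approach from above, dip down to the piece, then return above it.
    return [hover, contact, hover]

# Hypothetical piece center in the base frame, in meters.
waypoints = pick_waypoints((0.7, 0.1, -0.15))
```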
In order to know where we wanted to place each piece, we expect an input string denoting the layout of the ideal configuration. We interpret the string as a 2D array and work out the ideal positions of the pieces relative to a chosen starting point. In the string, periods indicate a new row, and empty spaces are indicated by the letter "G", which doesn't correspond to any piece. We fill out the 2D array from top to bottom, left to right.
As an example, if our input string is as follows: "J J Z Z G. J O O Z Z.J O O L G.G L L L G", we interpret it as a 2D array (shown below), which gives a visual representation of where we want the piece and what orientation to put it in.
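Parsing the example string into rows can be sketched in a couple of lines. The function name parse_layout is hypothetical, and the sketch assumes letters within a row are separated by spaces, as in the example.

```python
def parse_layout(layout):
    """Split a layout string into a 2D list: periods separate rows,
    whitespace separates cells."""
    return [row.split() for row in layout.split(".")]

grid = parse_layout("J J Z Z G. J O O Z Z.J O O L G.G L L L G")
for row in grid:
    print(" ".join(row))
```

For the example string this yields four rows of five cells: J J Z Z G, then J O O Z Z, then J O O L G, then G L L L G.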
This 2D array is then processed to find the midpoints of the pieces and their respective desired orientation as follows:
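One way the midpoints could be computed from the 2D array is sketched below. It assumes each letter appears as a single piece and takes the midpoint of the piece's extreme rows and columns, in units of grid cells. Orientation is left out because the text does not say how it is derived, and converting cells to meters would need the cell size, which is also not given here.

```python
def piece_midpoints(grid, empty="G"):
    """Map each piece letter to the (row, column) midpoint of its cells."""
    cells = {}
    for r, row in enumerate(grid):
        for c, letter in enumerate(row):
            if letter != empty:
                cells.setdefault(letter, []).append((r, c))
    midpoints = {}
    for letter, occupied in cells.items():
        rows = [r for r, _ in occupied]
        cols = [c for _, c in occupied]
        # Midpoint of the extreme rows and columns, in cell units.
        midpoints[letter] = ((min(rows) + max(rows)) / 2.0,
                             (min(cols) + max(cols)) / 2.0)
    return midpoints

grid = [row.split()
        for row in "J J Z Z G. J O O Z Z.J O O L G.G L L L G".split(".")]
midpoints = piece_midpoints(grid)
```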
As in the picking portion, we output a dictionary that holds the pieces and their respective desired PoseStamped messages. We call this dictionary our "goals" dictionary.
We also pass "goals" to the actuation functions. For each piece that was just picked up, the PoseStamped contains a desired Pose, whose position is the desired position of the piece's center. Our code translates to just above that position (position.z + 0.2 m), dips down to release the suction and "place" the piece, and then returns to just above it. Once the piece is placed, we can pick up the next piece in "actual".