[Figures 1 and 2: diagrams of the integrated system for Task 1 and Task 2]
Our system consisted of several modules, each packaged into its own node and integrated together. A diagram of all the components working together is shown above in Figures 1 and 2.
The audio recorder node accesses the LoCoBot's microphone, records 5 seconds of audio, and saves it as a WAV file. The audio transcriber node receives the WAV file and uses Google Cloud's Speech-to-Text V1p1beta1 model to convert the speech into text, which is then passed to the object detection module.
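As an illustration of this audio pipeline, the sketch below records 5 seconds of microphone audio and sends it to the Speech V1p1beta1 API; the use of the sounddevice library, the 16 kHz sample rate, and the file name are assumptions for the sketch rather than the project's exact implementation.

```python
import sounddevice as sd
from scipy.io import wavfile
from google.cloud import speech_v1p1beta1 as speech

SAMPLE_RATE = 16000  # assumed sample rate
DURATION_S = 5       # matches the 5-second recording window

def record_wav(path="command.wav"):
    # Record 5 seconds of mono audio from the default microphone.
    audio = sd.rec(int(DURATION_S * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="int16")
    sd.wait()
    wavfile.write(path, SAMPLE_RATE, audio)
    return path

def transcribe(path):
    # Send the WAV file to Google Cloud Speech (v1p1beta1) and return the transcript.
    client = speech.SpeechClient()
    with open(path, "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=SAMPLE_RATE,
        language_code="en-US",
    )
    response = client.recognize(config=config, audio=audio)
    return " ".join(r.alternatives[0].transcript for r in response.results)

if __name__ == "__main__":
    print(transcribe(record_wav()))
```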
The object detection node takes in an image from the LoCoBot and the transcribed text from the audio transcriber node. It then uses CLIP and YOLO to locate the requested object and outputs the object's center in pixel coordinates.
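The sketch below shows one plausible way to combine the two models, in which YOLO proposes bounding boxes and CLIP ranks the cropped detections against the transcribed command; the specific model weights (yolov8n.pt, ViT-B/32) and this particular division of labor are assumptions, not a description of our exact code.

```python
import torch
import clip
from PIL import Image
from ultralytics import YOLO

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)
yolo = YOLO("yolov8n.pt")  # assumed detector weights

def find_object_center(image_path, query_text):
    """Return the pixel center (u, v) of the detection that best matches the text."""
    image = Image.open(image_path).convert("RGB")
    boxes = yolo(image)[0].boxes.xyxy.cpu().numpy()  # (N, 4) boxes in pixel coords
    if len(boxes) == 0:
        return None

    # Score each cropped detection against the spoken query with CLIP.
    crops = []
    for x1, y1, x2, y2 in boxes:
        crops.append(clip_preprocess(image.crop((int(x1), int(y1), int(x2), int(y2)))))
    crops = torch.stack(crops).to(device)
    text = clip.tokenize([query_text]).to(device)
    with torch.no_grad():
        img_feat = clip_model.encode_image(crops)
        txt_feat = clip_model.encode_text(text)
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
        scores = (img_feat @ txt_feat.T).squeeze(1)

    x1, y1, x2, y2 = boxes[int(scores.argmax())]
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0
```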
The coordinate transformation function takes the pixel coordinates and applies a transformation matrix to express them in the base and gripper frames of reference. A pinhole camera model was used to convert pixel coordinates into camera-frame coordinates. Because the depth camera has different intrinsic and extrinsic parameters than the RGB camera, additional preprocessing was performed on the depth image. The output reference frame is chosen based on the state of the LoCoBot, such as goal reached or grasp succeeded. The transformed coordinates are then published to either the manipulation node or the base movement node, depending on the requirements of the current stage of the task.
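A minimal sketch of this back-projection step is shown below, assuming placeholder intrinsics and a known 4x4 homogeneous transform from the camera frame to the base frame; the actual values come from the LoCoBot's camera calibration.

```python
import numpy as np

# Assumed RGB camera intrinsics (placeholder values, not calibrated numbers).
FX, FY = 600.0, 600.0
CX, CY = 320.0, 240.0

def pixel_to_camera_frame(u, v, depth_m):
    """Back-project a pixel (u, v) with depth into the camera frame (pinhole model)."""
    x = (u - CX) * depth_m / FX
    y = (v - CY) * depth_m / FY
    return np.array([x, y, depth_m, 1.0])  # homogeneous coordinates

def camera_to_base_frame(point_cam_h, T_base_camera):
    """Apply a 4x4 homogeneous transform from the camera frame to the base frame."""
    return (T_base_camera @ point_cam_h)[:3]
```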
The base movement node receives real-world coordinates in the base reference frame from the perception node. It then moves the LoCoBot to the specified coordinates, stopping incrementally to verify the path with the perception node. Using a predetermined step size of 0.10 meters in both the x- and y-directions helps reduce accumulated error as the LoCoBot moves. Because the workable space of the LoCoBot arm is limited, the base movement node stops 0.40 meters away from the object in the x-direction so the object remains within grasping range.
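The waypoint generation could look like the sketch below, which steps toward the goal in 0.10 m increments and stops 0.40 m short of the object along x; the exact stepping logic is an assumption, and the perception check performed at each stop is omitted.

```python
STEP_SIZE = 0.10       # meters per increment in x and y
GRASP_STANDOFF = 0.40  # stop this far from the object along x

def plan_waypoints(robot_xy, object_xy):
    """Break the path into 0.10 m increments, stopping 0.40 m short of the object in x."""
    goal_x = object_xy[0] - GRASP_STANDOFF
    goal_y = object_xy[1]
    waypoints = []
    x, y = robot_xy
    while abs(goal_x - x) > 1e-6 or abs(goal_y - y) > 1e-6:
        # Clamp each axis step to the predetermined step size.
        x += max(-STEP_SIZE, min(STEP_SIZE, goal_x - x))
        y += max(-STEP_SIZE, min(STEP_SIZE, goal_y - y))
        waypoints.append((x, y))
    return waypoints
```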
To elaborate on the base movement pipeline, the team's approach relies on a PoseStamped message published by the perception node. This message type contains the robot's 3D position and its orientation as a quaternion. The code converts this pose into linear and angular velocity commands sent to the robot's built-in base controller node. The robot first navigates to the desired position using a proportional-integral (PI) controller; once at the desired position, it turns to the desired orientation using a proportional controller. The algorithm enforces maximum and minimum speed limits to avoid overshoot and mitigate error.
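A minimal sketch of this two-stage controller is shown below; the gains, speed limits, and the way heading is handled while driving are assumed values for illustration, not the tuned parameters we used.

```python
import math

MAX_LIN, MIN_LIN = 0.3, 0.05   # assumed linear speed limits (m/s)
MAX_ANG = 1.0                  # assumed angular speed limit (rad/s)
KP_LIN, KI_LIN = 0.8, 0.05     # assumed PI gains for position
KP_ANG = 1.5                   # assumed P gain for heading

def wrap_to_pi(angle):
    """Wrap an angle to the range [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

class BaseController:
    def __init__(self):
        self.integral = 0.0

    def drive_to_position(self, x, y, yaw, goal_x, goal_y, dt):
        """PI control toward the goal position; returns (linear, angular) velocity."""
        dist = math.hypot(goal_x - x, goal_y - y)
        self.integral += dist * dt
        linear = KP_LIN * dist + KI_LIN * self.integral
        linear = max(MIN_LIN, min(MAX_LIN, linear))  # clamp to the speed limits
        heading = math.atan2(goal_y - y, goal_x - x)
        angular = KP_ANG * wrap_to_pi(heading - yaw)
        return linear, max(-MAX_ANG, min(MAX_ANG, angular))

    def turn_to_heading(self, yaw, goal_yaw):
        """P control on the final orientation once the position is reached."""
        angular = KP_ANG * wrap_to_pi(goal_yaw - yaw)
        return 0.0, max(-MAX_ANG, min(MAX_ANG, angular))
```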
Once navigation is complete and the object has been re-identified, the manipulation node receives the real-world coordinates of the detected object from the perception node in the arm-base reference frame. It then uses the Interbotix LoCoBot API to move the gripper to those coordinates and grasp the object. The manipulation node checks the gripper finger positions to determine whether the grasp succeeded; if so, it instructs the base movement node to return to its original position. If the grasp fails, the manipulation node calls the perception node again to get updated coordinates for the object.
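The grasping sequence might look like the following sketch using the Interbotix Python API; the robot model names, approach offset, pitch angle, and finger-gap threshold are assumptions, and in practice the finger gap would be read from the robot's joint states rather than the placeholder used here.

```python
from interbotix_xs_modules.locobot import InterbotixLocobotXS

# Assumed robot and arm model names; the actual launch configuration may differ.
locobot = InterbotixLocobotXS(robot_model="locobot_wx250s", arm_model="mobile_wx250s")

FINGER_GAP_THRESHOLD = 0.02  # assumed gap (m): fully closed fingers imply a failed grasp

def grasp_succeeded(finger_gap_m):
    """If the fingers closed completely, nothing is being held."""
    return finger_gap_m > FINGER_GAP_THRESHOLD

def attempt_grasp(x, y, z):
    """Move the end effector to the object coordinates (arm-base frame) and grasp it."""
    locobot.gripper.open()
    # Approach from slightly above before descending to the object (offsets are assumed).
    locobot.arm.set_ee_pose_components(x=x, y=y, z=z + 0.05, pitch=0.5)
    locobot.arm.set_ee_pose_components(x=x, y=y, z=z, pitch=0.5)
    locobot.gripper.close()
    locobot.arm.go_to_sleep_pose()
    # Placeholder reading: the real check would use the finger joint positions.
    finger_gap_m = 0.03
    return grasp_succeeded(finger_gap_m)
```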
The integration node acts as the central entry point for all of the code, launching every file in the correct order for the task to be completed. Additionally, the integration work included writing all the publishers and subscribers for each package.
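A simplified sketch of such an integration node is shown below; the topic names and message types are illustrative placeholders, not the project's actual interface.

```python
import rospy
from std_msgs.msg import String, Bool
from geometry_msgs.msg import PoseStamped

class IntegrationNode:
    """Central node that sequences the pipeline by wiring publishers and subscribers."""

    def __init__(self):
        rospy.init_node("integration")
        # Publishers that trigger each stage of the pipeline (hypothetical topics).
        self.record_pub = rospy.Publisher("/audio/record_trigger", Bool, queue_size=1)
        self.detect_pub = rospy.Publisher("/perception/detect_trigger", String, queue_size=1)
        # Subscribers that report each stage's result back to the integration node.
        rospy.Subscriber("/audio/transcript", String, self.on_transcript)
        rospy.Subscriber("/perception/object_pose", PoseStamped, self.on_object_pose)

    def on_transcript(self, msg):
        # Forward the transcribed command to the object detection node.
        self.detect_pub.publish(msg)

    def on_object_pose(self, msg):
        rospy.loginfo("Object located at (%.2f, %.2f, %.2f)",
                      msg.pose.position.x, msg.pose.position.y, msg.pose.position.z)

if __name__ == "__main__":
    IntegrationNode()
    rospy.spin()
```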