Overview of System
Structure of Communication
Depiction of Communication between Nodes
Our architecture consisted of six main ROS2 Nodes:

- Speech-to-Text Node (for processing speech commands)
- Vision Node (for finding pixel coordinates of objects and baskets)
- Palm Detection Node (for finding pixel coordinates of hands)
- Control Node (responsible for keeping track of the robot's state and determining all motions)
- Arm Wrapper (a wrapper written by the TAs that utilized Trossen's functions for opening/closing the gripper and for moving the end effector to a desired pose)
- Base Wrapper (a wrapper written by the TAs for moving the base at a desired velocity)
As shown, the Control Node was the central node, telling all other nodes when to run. At the start of a task (a task being the sequence of grabbing and placing an object), the Control Node would ask the Speech-to-Text Node to record audio via the /record_audio topic. After transcribing the audio, the Speech-to-Text Node would tell the Vision Node the desired object (e.g. red cube) via the /desired_object topic and tell the Control Node where to put the object (e.g. hand or basket) via the /desired_item_location topic.

Since the Vision Node performed all object detection, the Control Node only needed to know whether it was time to find the object or to find the place to put it. When it was time to find the object, the Control Node would tell the Vision Node to scan the image for the object via the /scan_image_request topic. The Vision Node responded over the /object_report topic with a Point message: the x and y coordinates carried the object's pixel coordinates, and the z coordinate served as a found flag (1 if the object was in frame, 0 otherwise). When it was time to detect where to place the object, the Control Node contacted either the Vision Node over the /scan_basket_request topic (if the desired location was a basket) or the Palm Detection Node over the /scan_hand_request topic (if the desired location was a hand). The /hand_report topic was structured identically to the /object_report topic, and the Control Node was designed to be agnostic as to whether an /object_report message referred to an object or a basket.

Finally, the /armpose topic communicated the desired pose of the end effector to the Arm Wrapper, the /gripper topic communicated whether the gripper should be opened or closed, and the /locobot/mobile_base/cmd_vel topic communicated the desired velocity of the base.
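The found-flag convention shared by /object_report and /hand_report can be sketched with a minimal stand-in for geometry_msgs/msg/Point (the helper function names here are ours, not from the project code; a real node would publish the message through rclpy):

```python
from dataclasses import dataclass


@dataclass
class Point:
    """Minimal stand-in for geometry_msgs/msg/Point (float64 x, y, z)."""
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0


def make_report(found: bool, u: int = 0, v: int = 0) -> Point:
    """Encode a detection result using the report convention:
    x, y carry pixel coordinates; z is 1.0 if the target is in frame, else 0.0."""
    return Point(x=float(u), y=float(v), z=1.0 if found else 0.0)


def target_in_frame(report: Point) -> bool:
    """Decode the z-coordinate found flag on the receiving side."""
    return report.z == 1.0
```

Because /hand_report uses the same encoding, the Control Node can handle object, basket, and hand reports with identical decoding logic.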
State Machine
State Machine for System
Description of Actions in State Machine
Since the robot's motion was so closely tied to feedback from the perception nodes, we integrated the system's State Machine directly into the Control Node. The first step in completing a task was to request audio input specifying which object to grab and where to place it. Once this information was received, the Vision Node was asked to look for the desired object. If the object was not found, the robot would rotate 10 degrees and ask the Vision Node to search again, repeating until the Vision Node reported pixel coordinates for the object. After converting these pixel coordinates to real-world coordinates in the Base Frame, the robot would judge whether the object was close enough to grab, with "close enough" defined as lying less than 10 degrees off the robot's X-axis (the forward axis) and no more than 0.5 meters from the base. If the object was not close enough, the robot would rotate and/or drive forward to bring it within range. To verify that these motions had put the object in a reachable location, the Vision Node was again asked for the object's pixel coordinates, and the sequence repeated until the object was within reach. The robot would then execute the grasping sequence.
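The "close enough" test reduces to a bearing and range check on the object's base-frame coordinates. A sketch, assuming X forward and Y to the left, with the rotate-before-drive ordering as one plausible choice (function and constant names are ours):

```python
import math

MAX_BEARING_DEG = 10.0  # object must lie within 10 degrees of the X (forward) axis
MAX_RANGE_M = 0.5       # and no more than 0.5 m from the base


def close_enough(x: float, y: float) -> bool:
    """Return True if an object at (x, y) in the base frame is within reach."""
    bearing = math.degrees(math.atan2(y, x))
    distance = math.hypot(x, y)
    return abs(bearing) < MAX_BEARING_DEG and distance <= MAX_RANGE_M


def approach_command(x: float, y: float):
    """Pick the next corrective motion: rotate to zero the bearing first,
    then drive forward to close the remaining distance."""
    bearing = math.degrees(math.atan2(y, x))
    if abs(bearing) >= MAX_BEARING_DEG:
        return ("rotate", bearing)
    return ("drive", math.hypot(x, y) - MAX_RANGE_M)
```

After each commanded motion the Vision Node is re-queried, so these checks run on fresh coordinates rather than trusting dead reckoning.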
The State Machine was modularized such that the sequence of steps for placing the object was nearly identical to that for grasping it. The only differences were that the Speech-to-Text Node was not called and that the perception query changed: the Control Node asked the Vision Node to look for the basket if a basket was the requested place for the object, or asked the Palm Detection Node instead if a hand was the requested place.
The State Machine was also designed to run repeatedly: after successfully placing an object, the Control Node would request a new task from the Speech-to-Text Node, so the robot could grab and place objects indefinitely.
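The modular structure described above can be sketched as a small state machine in which the grasp and place phases reuse the same search/approach/manipulate sequence, parameterized only by which perception target to query (state names and the stubbed perception callback are illustrative, not the project's actual code):

```python
from enum import Enum, auto


class State(Enum):
    GET_TASK = auto()    # request audio input via /record_audio
    SEARCH = auto()      # ask a perception node for pixel coordinates
    APPROACH = auto()    # rotate/drive until the target is within reach
    MANIPULATE = auto()  # grasp (phase 1) or release (phase 2)


def run_task(within_reach):
    """One grab-and-place task. `within_reach(target)` stands in for the
    perception feedback, returning True once that target ("object", "basket",
    or "hand") is reachable. Both phases run the same state sequence."""
    trace = []
    for target in ("object", "basket"):  # the second target could also be "hand"
        state = State.SEARCH
        while state != State.GET_TASK:
            trace.append((target, state.name))
            if state == State.SEARCH:
                state = State.APPROACH
            elif state == State.APPROACH:
                # rotate 10 degrees / drive, re-scan, and retry until reachable
                state = State.MANIPULATE if within_reach(target) else State.SEARCH
            else:  # MANIPULATE: grasp or place, then this phase is done
                state = State.GET_TASK
    return trace  # returning to GET_TASK lets tasks repeat indefinitely
```

Keeping the phase-specific details (which topic to query, grasp vs. release) as parameters is what makes the placing sequence nearly a free by-product of the grasping sequence.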