The robotic assistant was developed by programming a Locobot using ROS2. The ROS node architecture used is depicted below.
Altogether, the general node architecture above allowed the Locobot to process user speech, identify objects in its environment, navigate to desired objects, and grasp them. In particular, each node accomplished the following:
The Perception Node was the main driver. Its camera callback function ran the state machine described below, which located the desired object in the frame and published its pixel coordinates and depth to the Localizer node. It also published messages to the Manipulation and Navigation nodes. These messages told the Navigation node when to turn to locate an object, when to drive towards an object, and when to stop. Similarly, they told the Manipulation node when to home the arm so as not to block the camera, when to execute a grasp, and when to deliver a retrieved object.
The Localizer Node subscribed to the relevant camera intrinsics and transformation matrices to determine and publish the desired object's coordinates in the base frame. These coordinates were used by the Navigation and Manipulation nodes to determine how to maneuver the base and arm to reach the object.
The Navigation Node controlled the base of the robot in order to bring objects into the camera view and move the Locobot within grasping range of the desired object.
The Manipulation Node controlled the Locobot's arm, moving it out of the camera view during object localization and executing grasps. The complexity of the grasp maneuver varied depending on the task, as described below.
This node processed the camera inputs in order to identify desired objects and coordinate the actions of the Navigation and Manipulation nodes. Thus, the main function in the Perception Node was its camera callback. This function determined whether the image contained the desired object; if so, it determined and published the pixel coordinates and depth of the object's center (u, v, d) to the /target_point topic. Object detection was performed using the Google Vision API. Furthermore, the callback coordinated navigation and grasping through the following state machine:
In this state machine:
The initialization sequence first records and transcribes speech from the user using Google's Speech-to-Text API. Then, using Google Gemini, it parses the transcription to determine which object the user wants and whether that object should simply be delivered or instead poured. It then moves the arm to the home position and out of the camera view by publishing the "wait" state to the Manipulation node. Next, the machine moves to the RotateFind state.
In the RotateFind state, if the desired object is not found, continued base rotation is commanded by publishing the "turn" state to the Navigation node. However, once the desired object is found, its pixel coordinates and depth are published, and driving towards the object is initiated through publication of the "go" state to the Navigation node. The machine then moves to the Drive2Obj state.
In the Drive2Obj state, the "go" state is continuously published until the desired object is found to be within reach of the arm per a distance threshold. Then, the base is stopped by publishing the "stop" state to the Navigation node, and the "grasp" state is published to the Manipulation node.
In the Grasp state, the desired object's coordinates are continuously published for accurate localization.
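The camera-callback logic above can be summarized in a minimal sketch. The command topic names (/nav_cmd, /manip_cmd), the image topic, the reach threshold, and the detect_object() helper standing in for the Google Vision API call are illustrative assumptions, not the project's exact interfaces.

```python
# Minimal sketch of the Perception Node's camera-callback state machine.
import rclpy
from rclpy.node import Node
from std_msgs.msg import String
from geometry_msgs.msg import PointStamped
from sensor_msgs.msg import Image


class PerceptionNode(Node):
    def __init__(self):
        super().__init__('perception')
        # The init sequence (speech transcription, Gemini parsing, "wait" to
        # the Manipulation node) would run before entering RotateFind.
        self.state = 'RotateFind'
        self.reach_threshold = 0.35  # assumed grasping range [m]
        self.target_pub = self.create_publisher(PointStamped, '/target_point', 10)
        self.nav_pub = self.create_publisher(String, '/nav_cmd', 10)      # assumed topic
        self.manip_pub = self.create_publisher(String, '/manip_cmd', 10)  # assumed topic
        self.create_subscription(Image, '/locobot/camera/color/image_raw',
                                 self.camera_cb, 10)

    def camera_cb(self, img):
        found, u, v, d = self.detect_object(img)
        if found:
            msg = PointStamped()
            msg.point.x, msg.point.y, msg.point.z = float(u), float(v), float(d)
            self.target_pub.publish(msg)  # pixel coordinates and depth of the center

        if self.state == 'RotateFind':
            if not found:
                self.nav_pub.publish(String(data='turn'))
            else:
                self.nav_pub.publish(String(data='go'))
                self.state = 'Drive2Obj'
        elif self.state == 'Drive2Obj':
            if found and d < self.reach_threshold:
                self.nav_pub.publish(String(data='stop'))
                self.manip_pub.publish(String(data='grasp'))
                self.state = 'Grasp'
            else:
                self.nav_pub.publish(String(data='go'))
        # In the Grasp state the callback only keeps publishing /target_point.

    def detect_object(self, img):
        # Placeholder for the Google Vision detection of the desired object.
        return False, 0, 0, 0.0


def main():
    rclpy.init()
    rclpy.spin(PerceptionNode())
```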
This node converts the pixel and depth information (u, v, d) for the desired object center posted by the Perception node into spatial coordinates (x, y, z) of the object in the arm's base frame, which are needed by the Manipulation and Navigation nodes for driving to and grasping the object. This was achieved through the following steps:
Retrieval of the transformation matrix T from camera_color_optical_frame to locobot/arm_base_link.
Retrieval of the RGB camera intrinsics K from the /locobot/camera/color/camera_info topic via subscription.
Retrieval of the desired object's pixel and depth coordinates (u, v, d) from the /target_point topic via subscription.
Computation of the desired object's spatial coordinates in the camera frame, p, as x = d(u - K[0, 2]) / K[0, 0], y = d(v - K[1, 2]) / K[1, 1], and z = d.
Conversion to the arm base frame using r = Tp, where p is the vector of camera frame coordinates and r is the vector of arm base frame coordinates.
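These steps can be sketched as follows, assuming d is metric depth and T is the 4x4 homogeneous transform from camera_color_optical_frame to locobot/arm_base_link (so the camera-frame point is augmented with a 1); the intrinsics matrix and transform in the example are made up for illustration.

```python
# Sketch of the Localizer computation: back-project (u, v, d) through the
# intrinsics K, then map the point into the arm base frame.
import numpy as np


def pixel_to_base(u, v, d, K, T):
    x = d * (u - K[0, 2]) / K[0, 0]   # back-projection, pinhole model
    y = d * (v - K[1, 2]) / K[1, 1]
    z = d
    p = np.array([x, y, z, 1.0])      # homogeneous camera-frame point
    r = T @ p                         # arm-base-frame coordinates
    return r[:3]


K = np.array([[610.0, 0.0, 320.0],
              [0.0, 610.0, 240.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)  # identity transform: base frame coincides with the camera frame here
print(pixel_to_base(400, 260, 0.8, K, T))
```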
The Navigation node consists of three phases. The Perception Node determines the object's location and selects one of the three phases, sending the corresponding command.
1. TURN Phase
The robot starts with its arm positioned outside the camera's view.
It executes a TURN command by applying only a nonzero z-rotation, rotating in place.
When an object is detected, the Perception Node issues a STOP command.
2. STOP Phase (After Detection)
Upon detecting an object during the TURN phase, the robot stops by setting its velocity to zero.
This ensures a stable transition before movement begins.
3. GO Phase
The robot computes the angle error between its base x-axis and the object using atan2(y, x).
Movement strategy:
If the angle error exceeds a predefined threshold, the robot rotates toward the object using a nonzero z-rotation in the appropriate direction.
If the angle error is below the threshold, the robot moves straight using a nonzero x-velocity.
Once the object is within a distance threshold, the robot stops by setting its velocity to zero.
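The GO-phase steering described above can be sketched as shown below; the /locobot/cmd_vel and /object_in_base topic names, the velocity values, and both thresholds are assumptions for illustration rather than the values used on the robot.

```python
# Sketch of the GO-phase control: rotate until the bearing to the object is
# within an angle threshold, drive straight otherwise, and stop once within reach.
import math
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import Twist, PointStamped


class NavigationNode(Node):
    def __init__(self):
        super().__init__('navigation')
        self.cmd_pub = self.create_publisher(Twist, '/locobot/cmd_vel', 10)
        self.create_subscription(PointStamped, '/object_in_base', self.go_cb, 10)
        self.angle_threshold = 0.1   # rad
        self.reach_threshold = 0.35  # m

    def go_cb(self, obj):
        x, y = obj.point.x, obj.point.y
        angle_error = math.atan2(y, x)  # bearing of the object from the base x-axis
        cmd = Twist()                   # all fields default to zero (STOP)
        if math.hypot(x, y) >= self.reach_threshold:
            if abs(angle_error) > self.angle_threshold:
                cmd.angular.z = 0.3 if angle_error > 0 else -0.3  # turn toward object
            else:
                cmd.linear.x = 0.15                               # drive straight
        self.cmd_pub.publish(cmd)
```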
The Manipulation Node responded to two commands:
"wait", to which the node responded by moving the arm to home position to prevent blocking of the camera. This was achieved by publishing to the /go_home topic.
"grasp", to which the node responded by moving the arm to grasp an object. The details of the grasp varied depending on the task
For a basic grasp of a small object such as a fruit, the following action sequence was commanded (a sketch follows the list):
The arm was moved above the object with the gripper open and oriented down.
The gripper lowered onto the object.
The gripper was closed to grasp the object.
The arm and object were raised together.
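A hedged sketch of this sequence is given below; move_gripper(x, y, z, pitch) and set_gripper(state) are hypothetical helpers standing in for the arm wrapper described later, and the hover offset is an arbitrary illustrative value.

```python
# Sketch of the basic grasp sequence as a series of end-effector waypoints.
def basic_grasp(x, y, z, move_gripper, set_gripper):
    hover = 0.10  # assumed approach height above the object [m]
    set_gripper('open')
    move_gripper(x, y, z + hover, pitch=1.57)  # above the object, gripper facing down
    move_gripper(x, y, z, pitch=1.57)          # lower onto the object
    set_gripper('closed')                      # grasp
    move_gripper(x, y, z + hover, pitch=1.57)  # lift the arm and object together


# Exercise the sequence with print-only stubs in place of real arm motion.
if __name__ == '__main__':
    basic_grasp(0.3, 0.0, 0.05,
                move_gripper=lambda *a, **k: print('move', a, k),
                set_gripper=lambda s: print('gripper', s))
```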
For a medicine pour, grasping and pouring were achieved through the following action sequence:
The arm was moved above and behind the object with the gripper open and oriented down.
The gripper was reoriented to face forward, toward the front of the robot.
The gripper was lowered down behind the object.
The gripper was moved forward onto the object.
The gripper was closed to grasp the object.
A pour was executed by rotating the gripper 90° about an axis parallel to the floor.
The node controlled the arm by publishing to topics that produced arm motion via the course-wide Locobot arm wrapper.
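The medicine-pour sequence can be sketched with the same hypothetical move_gripper()/set_gripper() helpers as the basic grasp above, here with an added roll argument; the behind-the-object offset, hover height, and the exact pose values are illustrative assumptions, not the trajectory actually commanded through the arm wrapper.

```python
# Sketch of the medicine-pour sequence: approach from behind, slide forward
# onto the object, grasp, then roll 90 degrees about a floor-parallel axis.
import math


def pour_grasp(x, y, z, move_gripper, set_gripper):
    back = 0.08   # assumed offset behind the object [m]
    hover = 0.10  # assumed approach height [m]
    set_gripper('open')
    move_gripper(x - back, y, z + hover, pitch=1.57, roll=0.0)  # above and behind, facing down
    move_gripper(x - back, y, z + hover, pitch=0.0, roll=0.0)   # reorient to face forward
    move_gripper(x - back, y, z, pitch=0.0, roll=0.0)           # lower behind the object
    move_gripper(x, y, z, pitch=0.0, roll=0.0)                  # slide forward onto the object
    set_gripper('closed')                                       # grasp
    move_gripper(x, y, z, pitch=0.0, roll=math.pi / 2)          # pour: 90-degree rotation
```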