For this project, we implemented a brain-like architecture to handle all of the required tasks.
The task manager node acts as the brain of the system, integrating the inputs and outputs of all other nodes.
It decomposes high-level instructions into a sequence of primitive actions and activates the corresponding nodes.
By maintaining a task queue and tracking each node's status, it ensures that information is processed in the correct order and that the nodes do not conflict with one another.
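A minimal sketch of this queue-based dispatch is shown below; the handler names and the "retrieve" decomposition are hypothetical placeholders, and the real system activates the nodes and receives their completion flags over ROS rather than through direct callables.

```python
# Minimal sketch of the task-manager dispatch loop (illustrative only).
from collections import deque

class TaskManager:
    def __init__(self, handlers):
        self.handlers = handlers      # maps a primitive action name to a callable
        self.queue = deque()          # ordered sequence of primitive actions
        self.busy = False             # True while a node is still executing

    def submit(self, instruction):
        # Decompose a high-level instruction into primitive actions,
        # e.g. a "retrieve" command becomes perceive -> navigate -> grasp.
        if instruction["task"] == "retrieve":
            self.queue.extend(["vision", "navigation", "manipulation"])

    def on_node_done(self, success: bool):
        # Called when a node reports completion (a Boolean in the real system).
        self.busy = not success

    def spin_once(self):
        # Activate the next node only when the previous one has finished,
        # so the nodes never run in a conflicting order.
        if self.queue and not self.busy:
            action = self.queue.popleft()
            self.busy = True
            self.handlers[action]()

# Usage with trivial stand-in handlers:
tm = TaskManager({"vision": lambda: tm.on_node_done(True),
                  "navigation": lambda: tm.on_node_done(True),
                  "manipulation": lambda: tm.on_node_done(True)})
tm.submit({"task": "retrieve"})
while tm.queue:
    tm.spin_once()
```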
The speech node transforms the user's audio input into a JSON instruction file.
The Google Cloud Speech-to-Text API outputs a transcript of the recorded audio file.
This transcript is then parsed into a dictionary with four entries: task type, object, color, and destination.
Equivalence classes were implemented to understand most command phrasings from the user. For example, the retrieval task is selected for any of the following verbs: "Retrieve", "Find", "Bring", "Fetch", "Take", ...
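A simplified sketch of this parsing step follows; the verb and color sets are illustrative assumptions rather than the full vocabulary used in the project.

```python
# Simplified sketch of transcript parsing into the four-field instruction
# dictionary. The verb and color sets below are illustrative, not exhaustive.
RETRIEVE_VERBS = {"retrieve", "find", "bring", "fetch", "take"}
KNOWN_COLORS = {"red", "green", "blue", "yellow"}

def parse_transcript(transcript: str) -> dict:
    words = transcript.lower().split()
    instruction = {"task": None, "object": None, "color": None, "destination": None}
    if any(w in RETRIEVE_VERBS for w in words):
        instruction["task"] = "retrieve"        # equivalence class for retrieval
    for w in words:
        if w in KNOWN_COLORS:
            instruction["color"] = w
    # object / destination extraction omitted; the real parser handles these too
    return instruction

print(parse_transcript("Please fetch the red cup"))
# {'task': 'retrieve', 'object': None, 'color': 'red', 'destination': None}
```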
The vision node utilizes the Google Cloud Vision API and processes two types of commands based on speech input:
Object-Based Recognition (when an object is specified):
Detects all objects in the RGB image.
If an identified object's class matches the requested object, it computes the object's 3D coordinates from the depth information using a pinhole camera model (see the back-projection sketch below).
Color-Based Recognition (when a color is specified):
Detects all objects in the RGB image and identifies each object's color by masking it with Otsu's method and converting the masked region to HSV (see the color-classification sketch below).
It returns the first detected object whose color matches the request.
Then, it runs the object-based recognition process.
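A sketch of the color check described above, assuming a cropped image of a detected object; the hue ranges are rough illustrative values, not the thresholds used on the robot.

```python
# Sketch of the color check: Otsu's method masks the object from its
# background, and the masked pixels are converted to HSV to estimate the
# dominant hue. The hue ranges below are illustrative assumptions.
import cv2
import numpy as np

HUE_RANGES = {"red": (0, 10), "yellow": (20, 35), "green": (35, 85), "blue": (90, 130)}

def dominant_color(bgr_crop: np.ndarray) -> str:
    gray = cv2.cvtColor(bgr_crop, cv2.COLOR_BGR2GRAY)
    # Otsu's threshold separates the object pixels from the background
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    hsv = cv2.cvtColor(bgr_crop, cv2.COLOR_BGR2HSV)
    hue = hsv[:, :, 0][mask > 0]
    median_hue = np.median(hue) if hue.size else 0
    for name, (lo, hi) in HUE_RANGES.items():
        if lo <= median_hue <= hi:
            return name
    return "unknown"
```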
Finally, the system transforms the coordinates into the arm's frame and publishes them for further processing.
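The back-projection and frame change are sketched below; the intrinsics (fx, fy, cx, cy) and the 4x4 transform T_arm_cam are placeholders for the calibrated values used on the robot.

```python
# Sketch of the pinhole back-projection and camera-to-arm frame change.
import numpy as np

def pixel_to_camera(u, v, depth, fx, fy, cx, cy):
    # Pinhole model: back-project the pixel (u, v) with its depth (metres)
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth, 1.0])   # homogeneous point in the camera frame

def camera_to_arm(point_cam_h, T_arm_cam):
    # 4x4 homogeneous transform from the camera frame to the arm-base frame
    return (T_arm_cam @ point_cam_h)[:3]

# Example with made-up calibration values:
T_arm_cam = np.eye(4)
p = pixel_to_camera(u=320, v=240, depth=0.8, fx=600, fy=600, cx=320, cy=240)
print(camera_to_arm(p, T_arm_cam))   # -> approximately [0, 0, 0.8]
```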
The navigation node is activated once a target object pose is received by the task manager.
It computes a proportional control input based on the pose error.
The velocity commands are clipped to a maximum of 0.2 m/s linear velocity and 0.2 rad/s angular velocity to ensure smoother movement.
The Locobot stops when it is within a 0.2 m radius of the target pose, facing the object.
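A sketch of this proportional controller with the clipping and stopping radius above; the gains kp_lin and kp_ang are illustrative assumptions that would be tuned on the robot.

```python
# Sketch of the proportional controller with velocity clipping.
import numpy as np

MAX_LIN, MAX_ANG, STOP_RADIUS = 0.2, 0.2, 0.2   # m/s, rad/s, m

def control(dx, dy, yaw, kp_lin=0.5, kp_ang=1.0):
    """Return (v, w) commands from the pose error (dx, dy) in the world frame
    and the robot's current heading yaw."""
    dist = np.hypot(dx, dy)
    if dist < STOP_RADIUS:
        return 0.0, 0.0                            # within the stopping radius
    heading_error = np.arctan2(dy, dx) - yaw
    heading_error = np.arctan2(np.sin(heading_error), np.cos(heading_error))  # wrap to [-pi, pi]
    v = np.clip(kp_lin * dist, -MAX_LIN, MAX_LIN)          # clipped linear velocity
    w = np.clip(kp_ang * heading_error, -MAX_ANG, MAX_ANG) # clipped angular velocity
    return float(v), float(w)
```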
The manipulation node activates when the navigation node signals that the goal has been reached.
It receives the target object's position, transformed from the camera frame to the arm-base frame.
Once the task is complete, it returns a Boolean value indicating the node's completion.
A slight adjustment to the end-effector's pitch and roll is needed to ensure a precise grip on the object.
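A sketch of this grasp sequence is given below; move_arm_to and close_gripper are hypothetical stand-ins for the arm driver calls, and the pitch/roll offsets and pre-grasp height are illustrative values rather than the tuned ones.

```python
# Sketch of the grasp sequence. move_arm_to and close_gripper are hypothetical
# placeholders for the arm driver calls; the offsets are assumed small
# corrections that would be tuned on the real hardware.
import math

PITCH_OFFSET = math.radians(5.0)   # assumed pitch correction
ROLL_OFFSET = math.radians(2.0)    # assumed roll correction

def grasp(target_xyz, move_arm_to, close_gripper) -> bool:
    """Move above the target, apply the orientation correction, grasp,
    and return True so the task manager can mark the step complete."""
    x, y, z = target_xyz                      # already in the arm-base frame
    move_arm_to(x, y, z + 0.05, pitch=PITCH_OFFSET, roll=ROLL_OFFSET)  # pre-grasp
    move_arm_to(x, y, z, pitch=PITCH_OFFSET, roll=ROLL_OFFSET)         # descend
    close_gripper()
    return True                               # Boolean completion flag
```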