State Machine
Pixel-to-Real-World Coordinate Transformations
Control Node
Data Collection and Labelling
YOLO11n and CNN Model Training
Vision and Speech to Text Nodes
Pipeline for Vision/Audio Node
Handling of YOLO Bounding Boxes for CNN Model
OpenCLIP for Label/User-Input Encoding and Comparison
Gemini Speech to Text API for user input command
MediaPipe Hand Landmark for palm detection
For this project, we were tasked with using natural communication methods to convey an intended task to a Locobot robot. The goals of this project were to use robot perception to perceive the environment, use audio to hear the person, and use existing libraries (or ones we develop) to take this input and determine the task the human is requesting. Then, using computer vision, we must perceive the environment, control the motion of the robot, and complete the requested task.
The three tasks that we must complete are:
Object Retrieval: Teams will use natural language to request an object and bring the desired object back to a particular location.
Sequential Task: Teams may use natural language to request that a series of actions be performed in the environment.
Group-Chosen Task: The team is given the freedom to pick a collaborative task.
For the third task, our group chose to implement a drop-off task in which the robot hands the retrieved object to a human's open hand. We successfully programmed the Locobot to parse a spoken command from a human, retrieve the requested object, and place it at a specified location.
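The overall flow described above — hear a command, perceive the scene, retrieve the object, then place or hand it off — can be sketched as a simple state machine. This is an illustrative sketch only; the state and function names below are hypothetical and do not correspond to the project's actual code.

```python
from enum import Enum, auto

class State(Enum):
    """Hypothetical high-level states of the command pipeline."""
    LISTEN = auto()    # speech-to-text produces a transcript
    PERCEIVE = auto()  # vision node matches the command to detected objects
    RETRIEVE = auto()  # robot navigates to and grasps the object
    PLACE = auto()     # robot places or hands off the object
    DONE = auto()

def run_pipeline(transcript: str) -> list:
    """Walk the state machine once for a single spoken command.

    Returns the sequence of states visited, which makes the
    control flow easy to inspect in a test.
    """
    history = []
    state = State.LISTEN
    while state is not State.DONE:
        history.append(state.name)
        if state is State.LISTEN:
            # In the real system the transcript would come from a
            # speech-to-text service; here we just keyword-match it.
            command = transcript.lower()
            actionable = "bring" in command or "place" in command
            state = State.PERCEIVE if actionable else State.DONE
        elif state is State.PERCEIVE:
            # Object detection and command matching would happen here.
            state = State.RETRIEVE
        elif state is State.RETRIEVE:
            state = State.PLACE
        elif state is State.PLACE:
            state = State.DONE
    history.append(state.name)
    return history
```

Modeling the pipeline as explicit states keeps each node (audio, vision, control) independently testable, since every transition depends only on the current state and its input.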
The source code can be found here. The code is hosted on Google Drive (rather than GitHub) because Google disabled our API keys when we pushed code containing them to GitHub.