The first stage of parsing user commands for our robot system is speech-to-text. We implemented a speech-to-text system that captures audio from a microphone, transcribes it with the Google Cloud Speech-to-Text API, and displays the resulting text. It configures the audio recording parameters, records a short, fixed-duration clip using PyAudio, and sends the recorded audio to Google Cloud's speech recognition service. The resulting transcription is then passed along the pipeline for further parsing.
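A minimal sketch of this capture-and-transcribe step is shown below. It assumes the `pyaudio` and `google-cloud-speech` packages are installed and that Google Cloud credentials are configured; the function names and parameter values are illustrative, not our exact implementation.

```python
# Recording parameters (illustrative values; 16 kHz mono LINEAR16 is a
# format the Google Cloud Speech-to-Text API accepts directly).
RATE = 16000          # samples per second
CHUNK = 1024          # frames per buffer
CHANNELS = 1          # mono
RECORD_SECONDS = 5    # short fixed-duration recording

def record_audio(seconds=RECORD_SECONDS):
    """Capture raw 16-bit PCM audio from the default microphone."""
    import pyaudio  # imported lazily so the module loads without the dependency
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=CHANNELS,
                     rate=RATE, input=True, frames_per_buffer=CHUNK)
    frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * seconds))]
    stream.stop_stream()
    stream.close()
    pa.terminate()
    return b"".join(frames)

def transcribe(audio_bytes):
    """Send recorded audio to Google Cloud Speech-to-Text; return the text."""
    from google.cloud import speech  # requires GOOGLE_APPLICATION_CREDENTIALS
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=RATE,
        language_code="en-US",
    )
    audio = speech.RecognitionAudio(content=audio_bytes)
    response = client.recognize(config=config, audio=audio)
    # Concatenate the top hypothesis of each recognized segment.
    return " ".join(r.alternatives[0].transcript for r in response.results)
```

The lazy imports keep the module importable on machines without a microphone or cloud credentials, which is convenient for testing the downstream parsing in isolation.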
After obtaining the text transcription, we pass it to Gemini along with a prompt instructing it to extract the desired object and location. The prompt is as follows:
"Extract the object to be found and the object to place it into from the following text. Return the results in the format: 'object to find: <object>, object to place: <object>'. For example, if the text is 'Find the red cube and place it in a basket', return 'object to find: red cube, object to place: basket'."
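Because the prompt pins Gemini's reply to a fixed format, the two fields can be recovered with a small string parser. The sketch below is illustrative (the helper name is ours, not from our implementation):

```python
import re

def parse_reply(reply):
    """Split a reply of the form 'object to find: X, object to place: Y'
    into a (find, place) tuple; return None if the format is not matched."""
    match = re.search(r"object to find:\s*(.+?),\s*object to place:\s*(.+)", reply)
    if not match:
        return None
    return match.group(1).strip(), match.group(2).strip()

parse_reply("object to find: red cube, object to place: basket")
# → ("red cube", "basket")
```

Returning `None` on a malformed reply lets the pipeline re-prompt the model rather than publish garbage downstream.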
The extracted object and location are then published to the /desired_object and /desired_item_location topics, respectively. Figure 1 below shows an example output:
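Publishing the two fields could look like the following sketch, assuming ROS 1 with `rospy` and `std_msgs/String` messages; the node name is illustrative:

```python
TOPIC_OBJECT = "/desired_object"
TOPIC_LOCATION = "/desired_item_location"

def publish_targets(obj, location):
    """Publish the parsed object and location on their respective topics."""
    import rospy  # lazy import: requires a sourced ROS environment
    from std_msgs.msg import String
    rospy.init_node("command_parser", anonymous=True)  # illustrative node name
    obj_pub = rospy.Publisher(TOPIC_OBJECT, String, queue_size=1, latch=True)
    loc_pub = rospy.Publisher(TOPIC_LOCATION, String, queue_size=1, latch=True)
    rospy.sleep(0.5)  # give subscribers a moment to connect
    obj_pub.publish(String(data=obj))
    loc_pub.publish(String(data=location))
```

Latched publishers re-deliver the last message to late subscribers, which suits these topics since each command sets a single current target.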