Speech-Copilot:
Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation
Summer 2024
Overview of Speech-Copilot
Abstract
In this work, we introduce Speech-Copilot, a modular framework for instruction-oriented speech-processing tasks that minimizes human effort in toolset construction. Unlike end-to-end methods using large audio-language models, Speech-Copilot builds speech processing-specific toolsets by analyzing pre-collected task instructions and breaking tasks into manageable sub-tasks. It features a flexible agent based on large language models that performs tasks through program generation. Our approach achieves state-of-the-art performance on the Dynamic-SUPERB benchmark, demonstrating its effectiveness across diverse speech-processing tasks. Key contributions include: 1) developing an innovative framework for speech processing-specific toolset construction, 2) establishing a high-performing interface based on large language models, and 3) offering a new perspective on addressing challenging instruction-oriented speech-processing tasks. Without additional training processes required by end-to-end approaches, our method provides a flexible and extendable solution for a wide range of speech-processing applications.
Results
Performance comparison of multiple models and our proposed pipeline across the six aspects of the Dynamic-SUPERB benchmark.
In-the-wild Multi-task Examples

Based on the audio on the left, here are the results of different baseline models and Speech-Copilot.
Speech-Copilot
Yes, the speaker verification result confirmed that it is Sarah.
Sarah said: "Hi, it's Sarah. I'm calling to let you know I'll be late to the meeting. I'll be there in about 30 minutes. Sorry for the inconvenience."
The background noise includes traffic sounds, suggesting that she might be outdoors near a road or in a vehicle.
The recognized emotion is sad. Combined with the context of her message (apologizing for being late), it suggests she might be feeling a bit upset or stressed about the delay.
Program generated by LLM.
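The sketch below illustrates the kind of program Speech-Copilot generates for this example; it is not the verbatim generated program. The module names (speech_recognition, speaker_verification, sound_classification, emotion_recognition, query_LLM) are placeholders in the style of our task-decomposition prompt, and the stub definitions stand in for implementations that the framework fills in.

def speech_recognition(audio): ...                 # transcribe speech into text
def speaker_verification(audio, reference): ...    # check whether the voice matches a known speaker
def sound_classification(audio): ...               # classify background sounds / acoustic events
def emotion_recognition(audio): ...                # recognize the speaker's emotional state
def query_LLM(prompt): ...                         # provided module: text-only reasoning with an LLM

def solve(audio, sarah_reference_audio):
    # Run the audio-level sub-tasks first, then let the LLM compose the answer.
    is_sarah = speaker_verification(audio, sarah_reference_audio)
    transcript = speech_recognition(audio)
    background = sound_classification(audio)
    emotion = emotion_recognition(audio)
    prompt = (
        f"Speaker verified as Sarah: {is_sarah}\n"
        f"Transcript: {transcript}\n"
        f"Background sounds: {background}\n"
        f"Emotion: {emotion}\n"
        "Answer: who is speaking, what she said, where she might be, "
        "and whether she seems upset."
    )
    return query_LLM(prompt)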
Other Baselines of Large Audio-Language Models
LTU-AS:
Yes, Sarah is apologizing for being late and will arrive in about 30 minutes.
Mistake: LTU-AS only mentioned the content of Sarah's message without addressing the background noise or the emotion.
Correction: LTU-AS should include information about the background noise (traffic sounds) and the recognized emotion (sad).
SALMONN:
Yes, I can take a look at the voice message you received. Based on the audio, it does sound like Sarah's voice. She says "Hi, it's Sarah. I'm going to let you know I'll be late to the meeting. I'll be there in about thirty minutes. Sorry for the inconvenience." There are no clues in the background about where she is or if she seems upset.
Mistake: SALMONN correctly identified the message content but failed to detect the background noise and the emotion.
Correction: SALMONN should mention the traffic sounds in the background and recognize the sad emotion.
Qwen-Audio-Chat:
The voice message is of a woman speaking, saying, "Hi, it's Sarah. I'm going to let you know I'll be late. I'll be there in about 30 minutes. Sorry for the inconvenience." There is a background noise of a busy city environment, such as traffic and people talking. It is difficult to determine if Sarah sounds upset based on the voice message alone, but it is possible that she is running late and feeling flustered.
Mistake: Qwen didn't identify the emotion as "sad" and didn't determine the background noise correctly.
Correction: Qwen should mention that the emotion is "sad" and provide more details on the background noise, such as traffic sounds.
Supplementary Materials
The prompt of task decomposition
You are given the following task instructions, where each instruction may require you to do a speech processing task.
While the required tasks are diverse, they may have some common sub-tasks.
For example, automatic speech recognition may be helpful for several tasks.
Another example: some tasks are sound classification tasks that may all be solved by a single module.
Now, I want you to analyze the instructions, identify the corresponding speech processing tasks, and solve the tasks in the instructions by writing Python code for each task.
In your code, you should modularize each sub-task as a code module, e.g., speech_recognition(audio), query_LLM(prompt), etc.
Then you should finish the tasks by calling the code modules of the sub-tasks.
However, as you don't have an official module list, you can create modules as you need them. The constraint is that you should reuse modules as much as possible and keep the module list as small as possible.
This means that if two different tasks require the same sub-task, you should use the same module to finish the sub-task instead of creating a similar one again.
Please keep in mind that you don't need to implement the modules, just define them with a placeholder like speech_recognition(audio), and we will complete the implementation for you.
Besides, here is a module named query_LLM(prompt) that can be used to query the language model to extract information with a proper prompt. You should make use of this module and reduce the number of modules you create.
If a task can be correctly solved using automatic speech recognition and query_LLM, you should use these two modules to finish the task instead of creating a new one.
However, the query_LLM module is not omnipotent. It cannot access the audio itself, and speech recognition only transcribes the speech into text, so audio information such as the noise level or speaker identity is lost during transcription.
Hence, if a task requires audio information beyond the transcription, you should use other modules.
Here is an example:
{This is our example.}
...
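The example itself is omitted above. As a loose illustration only (not the example actually used in the prompt, and with hypothetical instruction and module names), a decomposition produced in response might define a small set of reusable placeholder modules and compose them per task:

# Hypothetical sketch of a response to the decomposition prompt.
# Shared modules are defined once as placeholders and reused across tasks;
# the framework completes their implementations later.
def speech_recognition(audio): ...      # transcribe speech into text
def sound_classification(audio): ...    # classify non-speech sounds
def query_LLM(prompt): ...              # provided module for text-only reasoning

# Hypothetical instruction A: decide whether the utterance is sarcastic.
def solve_sarcasm_detection(audio):
    transcript = speech_recognition(audio)
    return query_LLM(f"Is the following utterance sarcastic? Answer yes or no.\n{transcript}")

# Hypothetical instruction B: identify the environmental sound in the recording.
def solve_environment_classification(audio):
    # Needs audio information beyond the transcript, so a dedicated module is used.
    return sound_classification(audio)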
The prompt of reflection
There are still some modules that can be combined.
{...}
Write new code for the instructions again, incorporating the corrections we just discussed.
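As a small, purely hypothetical illustration of what this reflection step encourages (the module names below are not from the paper), two near-duplicate placeholders can be merged into a single shared module:

# Before reflection: two near-duplicate placeholder modules.
def animal_sound_classification(audio): ...
def environment_sound_classification(audio): ...

# After reflection: a single shared module covers both kinds of instructions.
def sound_classification(audio): ...    # classify any non-speech sound event

def solve_animal_sound_task(audio):
    return sound_classification(audio)

def solve_environment_sound_task(audio):
    return sound_classification(audio)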