Code
The model and code will be open-sourced on the due date.
In SpeakerLM, we introduce three types of speaker registration mechanisms for the SDR task: No-Regist, Match-Regist, and Over-Regist. During training, all samples are loaded in a matched registration form by default. For each training batch, we sample a random number from a uniform distribution between 0 and 1 to determine the registration type. If the number is less than 1/3, we retain the matched registration (Match-Regist); if it falls between 1/3 and 2/3, we remove all registered speakers from the prompt (No-Regist); if it exceeds 2/3, we randomly sample 1 to 50 speakers from other sessions and append them as redundant registered speakers (Over-Regist).
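The sampling rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function name `choose_registration`, its arguments, and the capping of the extra-speaker count by the size of the available pool are all hypothetical, and the handling of draws exactly equal to 1/3 or 2/3 is a choice made here because the text does not specify it.

```python
import random


def choose_registration(matched, other_session_speakers, rng=random):
    """Pick a registration type for one training batch.

    matched: speakers registered in the matched (default) form.
    other_session_speakers: candidate speakers drawn from other sessions
        (hypothetical pool; the paper does not describe how it is built).
    Returns (mode, registered_speakers).
    """
    u = rng.random()  # uniform draw in [0, 1)
    if u < 1 / 3:
        # Keep the matched registration unchanged.
        return "Match-Regist", list(matched)
    if u < 2 / 3:
        # Remove all registered speakers from the prompt.
        return "No-Regist", []
    # Over-Regist: append 1 to 50 redundant speakers from other sessions.
    # Assumption: cap the count when the pool is smaller than the draw.
    k = min(rng.randint(1, 50), len(other_session_speakers))
    extras = rng.sample(list(other_session_speakers), k)
    return "Over-Regist", list(matched) + extras
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible across runs.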
1. SD+ASR+LLM
In the SD+ASR+LLM pipeline, we employ a text-based LLM to correct the speaker labels generated by the SD+ASR front-end. The prompt used for the LLM follows previous work.
LLM Prompt in SD+ASR+LLM: You are a helpful assistant. In the speaker diarization transcript below, some words are potentially misplaced. Please correct those words and move them to the right speaker. Directly show the corrected transcript without explaining what changes were made or why you made those changes.
2. SpeakerLM-ASR
In the first training stage of SpeakerLM, we use pure ASR data to enhance the model's ASR performance. We refer to this model as SpeakerLM-ASR. The LLM prompt is:
LLM Prompt in SpeakerLM-ASR: You are a helpful assistant. Transcribe the speech. <start>path to the input speech<end>
3. SpeakerLM
In SpeakerLM, the LLM prompts vary depending on the registration mechanism. Here, we present the prompt designs for the three registration scenarios, i.e., No-Regist, Match-Regist, and Over-Regist. Suppose the ground truth contains three speakers: Mike, Lucy, and Jack. The corresponding prompts are constructed as follows.
No-Regist: You are a helpful assistant. Transcribe by roles. <start>path to the multi-speaker speech<end>
Match-Regist: You are a helpful assistant. Registered Speaker Embeddings: Mike<start>path to the embedding of Mike<end>; Lucy<start>path to the embedding of Lucy<end>; Jack<start>path to the embedding of Jack<end>; Transcribe by roles. <start>path to the multi-speaker speech<end> (There are no specific requirements about the speaker order.)
Over-Regist: You are a helpful assistant. Registered Speaker Embeddings: Mike<start>path to the embedding of Mike<end>; Lucy<start>path to the embedding of Lucy<end>; Jack<start>path to the embedding of Jack<end>; Andy<start>path to the embedding of Andy<end>; Rose<start>path to the embedding of Rose<end>; Frank<start>path to the embedding of Frank<end>; Transcribe by roles. <start>path to the multi-speaker speech<end> (PS: In this case, Andy, Rose and Frank are the over-registered speakers from other sessions.) (There are no specific requirements about the speaker order.)
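The three prompt templates differ only in the list of registered speakers, so they can be assembled by a single helper. The sketch below is illustrative only: `build_prompt`, its arguments, and the file names in the usage example are hypothetical and not taken from the released code. An empty speaker list gives the No-Regist prompt, and a list that includes speakers from other sessions gives the Over-Regist prompt.

```python
def build_prompt(speech_path, registered=None):
    """Assemble a SpeakerLM-style prompt string.

    speech_path: path to the multi-speaker speech.
    registered: list of (name, embedding_path) pairs in any order;
        None or an empty list corresponds to No-Regist.
    """
    parts = ["You are a helpful assistant."]
    if registered:
        # One "Name<start>path<end>;" entry per registered speaker.
        entries = "".join(
            f" {name}<start>{path}<end>;" for name, path in registered
        )
        parts.append("Registered Speaker Embeddings:" + entries)
    parts.append(f"Transcribe by roles. <start>{speech_path}<end>")
    return " ".join(parts)


# Usage with placeholder paths (hypothetical file names).
matched = [("Mike", "mike.emb"), ("Lucy", "lucy.emb"), ("Jack", "jack.emb")]
no_regist = build_prompt("session.wav")
match_regist = build_prompt("session.wav", matched)
over_regist = build_prompt("session.wav", matched + [("Andy", "andy.emb")])
```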