TRANSLATOTRON 2

Translatotron 2 is a robust direct speech-to-speech translation (S2ST) model from Google AI. It is an improved version of Translatotron, the first model able to translate speech directly between two languages while also retaining the source speaker's voice in the translated speech.

Translatotron 2 uses a novel method for transferring the source speakers' voices to the translated speech, which works well even for input speech containing multiple speakers speaking in turns, and which also reduces the potential for misuse. Translatotron 2 outperforms Translatotron by a large margin on translation quality, speech naturalness, and speech robustness.

Need for a better S2ST

Speech-to-speech translation (S2ST) helps break down language barriers between people all over the world. Conventional automatic S2ST systems consist of a cascade of speech recognition, machine translation, and speech synthesis subsystems, but such cascades suffer from longer latency, loss of information (paralinguistic and non-linguistic information), and compounding errors between subsystems.
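To make the cascade approach concrete, here is a minimal sketch of such a pipeline (the function and model names are hypothetical placeholders, not a real API): each stage consumes only the previous stage's text output, which is why recognition errors propagate downstream and voice and prosody information is lost.

```python
# Minimal sketch of a cascaded S2ST pipeline (hypothetical models, not a real API).
def cascaded_s2st(source_audio, asr_model, mt_model, tts_model):
    source_text = asr_model.transcribe(source_audio)   # 1. speech recognition
    target_text = mt_model.translate(source_text)      # 2. machine translation
    target_audio = tts_model.synthesize(target_text)   # 3. speech synthesis
    # Any recognition error is already baked into target_text, and the voice,
    # emotion and prosody of the speaker were discarded at step 1.
    return target_audio
```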

Translatotron, introduced by Google AI in 2019, was a single attentive sequence-to-sequence model for S2ST, trained end to end without relying on an intermediate text representation. However, Translatotron's performance was not on par with a strong baseline cascade S2ST system (e.g., composed of a direct speech-to-text translation model followed by a Tacotron 2 TTS model).


Translatotron 2 - Components

Translatotron 2 has four components:

  • Speech encoder

  • Target phoneme decoder

  • Target speech synthesizer

  • Attention module that connects them together

The combination of the encoder, the attention module, and the decoder is similar to a direct speech-to-text translation (ST) model.

The synthesizer is conditioned on the output from both the decoder and the attention.
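As a rough illustration of how these components fit together, below is a PyTorch-style sketch with simplified, hypothetical module choices (the real architecture choices are described in the paper). It only shows the data flow: the phoneme decoder drives the attention over the encoded source speech, and the synthesizer is conditioned on both the decoder output and the attention context.

```python
import torch
import torch.nn as nn

class Translatotron2Sketch(nn.Module):
    """Highly simplified sketch of the Translatotron 2 data flow
    (module choices here are placeholders, not the real architecture)."""
    def __init__(self, n_mels=80, d_model=256, n_phonemes=100):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, d_model, batch_first=True)           # speech encoder
        self.phoneme_decoder = nn.LSTM(d_model, d_model, batch_first=True)  # target phoneme decoder
        self.attention = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.phoneme_out = nn.Linear(d_model, n_phonemes)
        self.synthesizer = nn.LSTM(2 * d_model, d_model, batch_first=True)  # target speech synthesizer
        self.mel_out = nn.Linear(d_model, n_mels)

    def forward(self, source_mels):
        encoded, _ = self.encoder(source_mels)          # encode the source speech
        dec_hidden, _ = self.phoneme_decoder(encoded)   # decoding collapsed to one pass for brevity
        # The attention over the encoded source speech is driven by the decoder states.
        context, _ = self.attention(dec_hidden, encoded, encoded)
        phoneme_logits = self.phoneme_out(dec_hidden)   # predicted target phonemes
        # The synthesizer is conditioned on both the decoder output and the attention context.
        synth_hidden, _ = self.synthesizer(torch.cat([dec_hidden, context], dim=-1))
        target_mels = self.mel_out(synth_hidden)        # translated speech spectrogram
        return target_mels, phoneme_logits
```

This sketch omits the autoregressive decoding and the duration-based synthesis that Translatotron 2 actually uses; it is only meant to show which component feeds which.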

3 Key Improvements made in Translatotron 2

1. The output from the target phoneme decoder is:

  • Used only as an auxiliary loss in Translatotron

  • Used as one of the inputs to the spectrogram synthesizer in Translatotron 2, making it easier to train and yielding better performance

2. The spectrogram synthesizer is:

  • Attention-based (like Tacotron 2 TTS model) in Translatotron and thus suffers from the robustness issues exhibited by Tacotron 2.

  • Duration-based (like Non-Attentive Tacotron) in Translatotron 2, which improves the robustness of the synthesized speech (see the sketch after this list).

3. Attention-based connection to the encoded source speech is driven by the:

  • Spectrogram synthesizer in Translatotron

  • Phoneme decoder in Translatotron 2. Thus, the acoustic information that the spectrogram synthesizer sees is aligned with the translated content it is synthesizing, which helps retain each speaker's voice across speaker turns.
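To make the duration-based synthesis in improvement 2 concrete, here is a minimal sketch of duration-based upsampling (the real Non-Attentive Tacotron predicts durations and uses a smoother Gaussian upsampling; hard repetition is shown only to illustrate the idea). Each decoder-side representation is expanded to a fixed number of output frames, giving the synthesizer an explicit monotonic alignment so it cannot skip or repeat content the way attention-based synthesizers sometimes do.

```python
import torch

def upsample_by_duration(hidden_states, durations):
    """Duration-based upsampling in the spirit of Non-Attentive Tacotron (simplified):
    repeat the i-th decoder-side representation durations[i] times, so the
    synthesizer receives an explicit monotonic alignment instead of relying on
    a learned soft attention."""
    return torch.repeat_interleave(hidden_states, durations, dim=0)

# Example: 3 phoneme-level states predicted to last 2, 5 and 3 frames.
states = torch.randn(3, 256)
frames = upsample_by_duration(states, torch.tensor([2, 5, 3]))
print(frames.shape)  # torch.Size([10, 256])
```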


VOICE RETENTION

Responsible Voice Retention

Translatotron could retain the source speaker's voice in the translated speech by conditioning its decoder on a speaker embedding generated from a separately trained speaker encoder. This enabled Translatotron to generate the translated speech in a different speaker's voice, if a clip of the target speaker's recording were used as the reference audio to the speaker encoder, or if the embedding of the target speaker were directly available. Though this was a powerful approach, it could be misused to spoof audio with arbitrary content, which was a major concern for deploying the model in production.

Translatotron 2, in contrast, uses only a single speech encoder, which is responsible for both linguistic understanding and voice capture. Thus, the model cannot be misused to reproduce non-source voices. This approach can also be applied to the original Translatotron.


Voice Retention Dataset

To retain speakers' voices across translation, researchers typically train S2ST models on parallel utterances with the same speaker's voice on both sides. Such a dataset, with human recordings on both sides, requires a large number of fluent bilingual speakers and is therefore difficult to collect.

PnG NAT is a text-to-speech (TTS) model capable of cross-lingual voice transfer, which makes it suitable for synthesizing such training targets. The modified PnG NAT model used for Translatotron 2 incorporates a separately trained speaker encoder, in the same way as the original Translatotron, and is hence capable of zero-shot voice transfer; synthesizing the target side of the training data in the source speaker's voice solves the bilingual dataset problem.
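As a rough sketch of how such voice-preserving training targets could be produced (function names are hypothetical placeholders; the exact data pipeline is described in the paper), the snippet below embeds the source speaker's voice with a separately trained speaker encoder and conditions a PnG NAT-style TTS model on that embedding to synthesize the translated text in the same voice.

```python
# Hypothetical sketch of synthesizing voice-preserving training targets with a
# PnG NAT-style TTS model and a separately trained speaker encoder.
# All names are placeholders, not a real API.

def build_training_example(source_audio, target_text, speaker_encoder, tts_model):
    # Embed the source speaker's voice from the source-language recording.
    speaker_embedding = speaker_encoder.embed(source_audio)
    # Zero-shot voice transfer: synthesize the translated text in that same voice.
    target_audio = tts_model.synthesize(text=target_text,
                                        speaker_embedding=speaker_embedding)
    # Source and target speech now share one speaker's voice, giving a parallel
    # utterance pair suitable for training a voice-retaining S2ST model.
    return {"source_speech": source_audio,
            "target_speech": target_audio,
            "target_text": target_text}
```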


Individual Voice Retention with ConcatAug

For input speech with multiple speakers speaking in turns, the S2ST model can retain each speaker's voice in the translated speech thanks to a novel data augmentation technique called ConcatAug.

ConcatAug augments the training data on the fly by:

1. Randomly sampling pairs of training examples, and

2. Forming new training examples by concatenating the

  • source speech

  • target speech

  • target phoneme sequences

The resulting samples contain two speakers' voices in both the source and the target speech, so the model learns from examples with speaker turns.
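A minimal sketch of ConcatAug, assuming each training example is stored as a dict of tensors (the feature names and concatenation details are illustrative, not the exact implementation):

```python
import random
import torch

def concat_aug(dataset):
    """ConcatAug (simplified): concatenate the source speech, target speech and
    target phoneme sequences of two randomly sampled training examples, so the
    new example contains two speakers' voices and a speaker turn."""
    a, b = random.sample(dataset, 2)
    return {
        "source_speech": torch.cat([a["source_speech"], b["source_speech"]], dim=0),
        "target_speech": torch.cat([a["target_speech"], b["target_speech"]], dim=0),
        "target_phonemes": torch.cat([a["target_phonemes"], b["target_phonemes"]], dim=0),
    }
```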


Performance

Translatotron 2 outperforms Translatotron by large margins on:

  • Translation quality (measured by BLEU, where higher is better)

  • Speech naturalness (measured by MOS, where higher is better)

  • Speech robustness (measured by UDR, where lower is better)

It excels especially on the more difficult Fisher corpus. The translation quality and speech quality of Translatotron 2 are on par with those of a strong baseline cascade system, while its speech robustness is better than that of the cascade baseline.
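Because the model outputs audio rather than text, BLEU for S2ST is computed on ASR transcriptions of the synthesized speech, as in the Translatotron papers. A minimal sketch, assuming a hypothetical asr_model and the sacrebleu library:

```python
import sacrebleu

def s2st_bleu(synthesized_audios, reference_translations, asr_model):
    """Transcribe the synthesized speech with an ASR model (placeholder), then
    compute corpus-level BLEU against the reference translations."""
    hypotheses = [asr_model.transcribe(audio) for audio in synthesized_audios]
    return sacrebleu.corpus_bleu(hypotheses, [reference_translations]).score
```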

Figures (from the Google AI blog): translation quality (BLEU, higher is better), speech naturalness (MOS, higher is better), and speech robustness (UDR, lower is better), each evaluated on two Spanish-English corpora.

Multilingual S2ST

In a multilingual set-up, the Translatotron 2 model took speech input in four languages and translated it into English. The language of the input speech was intentionally not provided, forcing the model to detect the language by itself. On this task, it outperformed Translatotron by a large margin. Although results are not directly comparable between S2ST and ST, the close numbers suggest that its translation quality is comparable to that of a baseline speech-to-text translation model. Thus, Translatotron 2 performs well on multilingual S2ST.

Figure (from the Google AI blog): performance of multilingual X => En S2ST on the CoVoST 2 corpus.


I am Sri Lakshmi, AI Practitioner, Developer & Technical Content Producer.


Date : 8 October, 2021

Author : Sri Lakshmi

Reference : Google AI Blog

Further Reading: View the Translatotron 2 paper here
