GSLM - Textless NLP

Generative Spoken Language Model (GSLM) is the first textless NLP model that generates expressive speech directly from raw audio. GSLM lays the foundation for a future of textless NLP applications for all languages, including those without sizeable text datasets.

GSLM is Facebook AI's flagship model, open-sourced on September 9, 2021. It is the first state-of-the-art NLP model that does not depend on text data and instead relies on audio signals alone.


NLP MODELS - OVERVIEW

NLP models are usually based on text datasets, e.g. BERT, RoBERTa, GPT-3, etc. They take written text as input and generate realistic text on almost any topic. They also provide pre-trained models which can be fine-tuned for other NLP applications like translation, summarization, information retrieval, and sentiment analysis, e.g. BART and XLM-R.

But these traditional NLP models share a major limitation: they can be used only for languages with large amounts of trainable text data.


BYE BYE ASR

Connecting an NLP app to speech input usually requires training an ASR (Automatic Speech Recognition) system. ASR has many shortcomings: it is resource-hungry, it introduces recognition errors, it encodes casual linguistic interactions poorly, and it exists for only a handful of languages.

Textless NLP aims to bid farewell to ASR by using speech input directly to produce speech output.


WELCOME TEXTLESS NLP

Pre-school children learn a language just from audio interactions and raw sensory inputs. Similarly, textless language models will be able to learn from audio signals and perform NLP tasks. I warmly welcome you into the world of textless NLP!


NEED for TEXTLESS NLPs

NLP technology uses written text to train models. Though this works well for languages with huge amounts of trainable data (e.g. English), it is of little use for most of the world's languages, which lack such extensive datasets. Hence, a multidisciplinary team of Facebook AI researchers with expertise in signal processing, speech processing, NLP, and psycholinguistics worked together to build GSLM and make NLP technology beneficial to everyone, including non-English speakers.


BENEFITS of TEXTLESS NLPs

1. Textless NLP models make AI more inclusive by covering a rich variety of spoken languages

2. Access to the full expressivity of oral language can make textless NLP models work better than traditional NLP models, even in text-rich languages like English. Oral language

        • Incorporates nuances & intonations

        • Encodes irony, anger, & uncertainty

        • Uses vocalizations like laughter, yawning & mouth clicks

3. Researchers can enjoy Annotation-free & ASR-free training of NLP models. Audio data can be easily extracted from podcasts, radio shows & social audio apps

4. Developmental psychologists and speech & language clinicians will be able to predict the effect of variations in linguistic input on an infant's ability to learn to speak and to understand speech

5. NLP Researchers will have the ability to:

      • pretrain models with a simple, next sound unit prediction task

      • fine-tune them for end-to-end tasks without any need for text

6. Textless NLP opens up the possibility of an entirely novel range of audio applications, such as online expressive translation for multilingual video games, and content search and summarization over archived audio

7. With more improvements, we would have textless versions of standard NLP tasks, such as sentiment analysis, document retrieval, summarization, etc.


COMPONENTS OF GSLM

The Baseline GSLM model consists of 3 main components:

1. Encoder : CPC, wav2vec 2.0 & HuBERT (three encoders were tested to find the best one for model training)

2. Unit-Based Language Model (uLM) : Standard causal Transformer

3. Decoder : Tacotron 2, a standard text-to-speech system


COMPONENTS FUNCTIONS

Encoder : It converts speech into discrete units which represent frequently recurring sounds in spoken language

Then k-means clustering and deduplication (removal of successive identical units) are applied to the encoder output (see the sketch after these descriptions)

Autoregressive uLM : It is trained to predict the next discrete unit based on what it’s seen before

Decoder : It converts units into speech
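To make the pipeline concrete, here is a minimal sketch of the quantization and deduplication steps, assuming frame-level features from a pretrained encoder (e.g. HuBERT) are already available as a NumPy array. The random features, the 100-unit codebook and the library choices are illustrative assumptions, not the exact GSLM tooling.

import numpy as np
from itertools import groupby
from sklearn.cluster import KMeans

# Frame-level features from a speech encoder (placeholder random values:
# 500 frames x 768 dimensions, roughly HuBERT-sized).
features = np.random.randn(500, 768)

# Quantize every frame into one of 100 discrete units with k-means.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)
units = kmeans.predict(features)

# Deduplication: collapse runs of identical successive units into a single
# unit, yielding the "pseudo-text" sequence fed to the unit language model.
pseudo_text = [int(u) for u, _ in groupby(units)]

print(len(units), "frames ->", len(pseudo_text), "deduplicated units")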

TRAINING

Datasets:

Encoder and uLM were trained on 6,000 hours of

  • Libri-Light

  • Librispeech (a large collection of audiobooks)

Decoder was trained on

  • Librispeech

  • LJspeech

The entire stack was trained with self-supervision from raw audio (without any text or labels).

The language model and text-to-speech components were trained on pseudo-text derived from that raw audio.
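For illustration, the following sketch shows what a single next-unit-prediction training step on such pseudo-text could look like with a small causal Transformer in PyTorch. The model size, unit vocabulary and random batch are placeholders, not the actual GSLM configuration.

import torch
import torch.nn as nn

VOCAB = 100                                   # number of discrete units (placeholder)
embed = nn.Embedding(VOCAB, 256)
layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
body = nn.TransformerEncoder(layer, num_layers=2)   # made causal via the mask below
head = nn.Linear(256, VOCAB)

units = torch.randint(0, VOCAB, (8, 64))      # a batch of pseudo-text sequences
inp, tgt = units[:, :-1], units[:, 1:]        # the model predicts the next unit

seq_len = inp.size(1)
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)  # causal mask
logits = head(body(embed(inp), mask=mask))
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), tgt.reshape(-1))
loss.backward()                               # gradients for one optimizer step
print(float(loss))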


EVALUATION

Good models use ≥100 units and encode stretches of speech shorter than phonemes (a phoneme is the smallest unit of speech that distinguishes one word from another, like the element "p" in "tap", which separates "tap" from "tab", "tag" & "tan"). Direct analysis of the pseudo-text was not possible because the units do not map one-to-one to letters or phonemes.

Hence a pretrained ASR was used to convert the generated audio back to text. The ASR helped to measure:

  • Intelligibility of the resynthesized audio, using PER (phoneme error rate, which compares the phonemes of the original input with the phonemes retranscribed by the ASR; a short sketch of this computation follows the list)

  • Linguistic quality and diversity of the generated audio (conditional / unconditional), using an AUC (area under the curve) metric
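As a concrete illustration (the standard definition of the metric, not code from the GSLM release), PER can be computed as the Levenshtein edit distance between the reference phoneme sequence and the ASR-retranscribed one, divided by the reference length:

def phoneme_error_rate(ref, hyp):
    # Edit (Levenshtein) distance between phoneme sequences / reference length.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Original input phonemes vs. ASR retranscription of the resynthesized audio.
print(phoneme_error_rate(["T", "AE", "P"], ["T", "AE", "B"]))  # 0.333...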

How is the AUC obtained? Using a novel measuring technique called the degree of inventivity. To find the degree of inventivity of a language model, sentences are sampled across a range of "temperatures" (a short sampling sketch follows the list below):

      • The lower the temperature, the more rigid a model is

      • The higher the temperature, the more variable the model is
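A minimal sketch of temperature sampling, assuming we already have the unit language model's scores (logits) for the next unit; it only illustrates how the temperature controls variability and is not the exact GSLM sampling code.

import torch

def sample_next_unit(logits, temperature):
    # Scale the scores by 1/temperature, then sample from the softmax.
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

logits = torch.randn(100)                                    # placeholder scores over 100 units
rigid = [sample_next_unit(logits, 0.3) for _ in range(10)]   # low temperature: repetitive
varied = [sample_next_unit(logits, 1.5) for _ in range(10)]  # high temperature: more diverse
print(rigid, varied)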

RESULTS


Unconditioned Samples

The following are the unconditionally generated samples from the best models (CPC or HuBERT on 100 units), trained on Libri-Light 6k. More samples are available here


With a low temperature, sentences are repetitive (the transcriptions are made by an ASR)

Generation (temperature: 0.3) THE PROPERTY BY JAMES RESELL RED FOR LIBERATA OR BY JASON DOWNY THE PROPERTY BY JASON DOWNY THE PROPERTY THE PROPERTY THE PROPERTY THE PROPERTY


With a medium temperature, they become locally coherent (for a few words) and more varied

Generation (temperature: 1.0) BUT IT IS ATTENDANT FROM THE PEOPLE TO DEFEND HIMSELF FROM THIS INFORMATION PRIDE OF THE POTENTIAL IN CRIMINAL ACTIVITY A CURIOSITY AND IMPETUOSITY OF THE WORLD A WAR SOON ACQUIRED


With a high temperature, they are quite varied but become incoherent. Some passages aren’t composed of actual words

Generation (temperature: 1.5) ATION OF PURE BLUE HE SAID AT ONCE A LICKING STREAMY AT HER WARM SPOT OF HALF PERFORMED NOTE WAS A RAGING OATH LET IT AS BIR OF AMOLE IN MOOD STROLLING ER CRASS


Conditioned Samples

Let's take a look at an example of a generated continuation conditioned on the prompt "This reality begins to explain the dark pow[..]" (from the introduction to J. Verne's Twenty Thousand Leagues Under the Sea by P. F. Walter), using a medium temperature (HuBERT, 100 units).


Prompt + continuation: THIS REALITY BEGINS TO EXPLAIN THE DARK POWER OF THE MAGICAL BLACKNESS AND IN THE MIDST OF IT IS MAGICAL AS A SINGLE BLACKNESS OF THE PAIN


The model is able to

  • Complete an incomplete word (pow[..] → POWER)

  • Continue using words in the same general mood (dark→ BLACKNESS).

  • Repeat itself (MAGICAL)


PROSODY

The units discovered by the encoders are not phonemes, but they have phoneme-like properties. They encode phonetic contrasts (like the difference between "pa" and "ba") while ignoring speaker identity, channel information and prosody. Prosody denotes expressive speech properties such as rhythm, stress and intonation.

Capturing Prosody with T2S

To capture prosody, an improved, simplified text-to-speech (T2S) system is used. The T2S includes a variational autoencoder with vector quantization (VQ-VAE) that is fed pitch (F0) as input; the VQ-VAE is trained to learn a quantized latent representation of the pitch (a small sketch of this quantization step follows the description below).

The T2S consists of an encoder region on the left side, which encodes

  • pseudo-text units at the top left

  • quantized pitch units in the middle

  • speaker embeddings at the bottom

and a decoder on the right, which reconstructs the waveform.
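To make the VQ-VAE idea concrete, here is a hedged sketch of its core step: each continuous pitch (F0) frame is mapped to its nearest codebook entry, with a straight-through estimator so gradients can flow during training. The codebook size and dimensions are placeholders, not the values used in the GSLM T2S.

import torch
import torch.nn as nn

class PitchQuantizer(nn.Module):
    # Nearest-codebook lookup (the "VQ" in VQ-VAE) over pitch frames.
    def __init__(self, num_codes=32, dim=16):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                      # z: (batch, frames, dim)
        # Squared distance from every frame to every codebook entry.
        dists = ((z.unsqueeze(-2) - self.codebook.weight) ** 2).sum(-1)
        codes = dists.argmin(dim=-1)           # discrete pitch units
        quantized = self.codebook(codes)
        # Straight-through estimator: copy gradients from quantized back to z.
        quantized = z + (quantized - z).detach()
        return codes, quantized

vq = PitchQuantizer()
codes, q = vq(torch.randn(2, 100, 16))         # two utterances, 100 pitch frames each
print(codes.shape, q.shape)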

T2S Evaluation

This T2S model was evaluated on LJspeech (single speaker) and VCTK (multi-speaker). The HuBERT-based units provide very good results for both objective metrics and subjective evaluation scores.

CONTENT + PROSODY (Joint Modelling)

To jointly model the content and prosodic aspects of speech, a novel multistream causal Transformer is used, in which the input and output layers have multiple heads, one for each channel.

Three speech channels are modelled:

  • Pseudo-phone units

  • Duration

  • Quantized pitch

Like the baseline model, this prosodic GSLM is trained from the raw waveform of audiobooks. The extra channels & tasks improve the model's performance in terms of the perplexity scores of the units (a sketch of the multistream architecture follows).
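A hedged sketch of the multistream idea: a shared causal Transformer body with a separate input embedding and output head for each channel (units, duration, quantized pitch). The sizes and the simple summing of the channel embeddings are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class MultiStreamLM(nn.Module):
    # One causal Transformer body, one input/output head per speech channel.
    def __init__(self, n_units=100, n_dur=32, n_pitch=32, d=256):
        super().__init__()
        sizes = {"unit": n_units, "duration": n_dur, "pitch": n_pitch}
        self.emb = nn.ModuleDict({k: nn.Embedding(n, d) for k, n in sizes.items()})
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.heads = nn.ModuleDict({k: nn.Linear(d, n) for k, n in sizes.items()})

    def forward(self, streams):                # streams: dict of (batch, time) tensors
        x = sum(self.emb[k](v) for k, v in streams.items())   # fuse the channels
        t = x.size(1)
        mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)   # causal mask
        h = self.body(x, mask=mask)
        return {k: head(h) for k, head in self.heads.items()}  # next-step logits per channel

model = MultiStreamLM()
batch = {"unit": torch.randint(0, 100, (2, 50)),
         "duration": torch.randint(0, 32, (2, 50)),
         "pitch": torch.randint(0, 32, (2, 50))}
print({k: v.shape for k, v in model(batch).items()})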

The jointly trained model can now generate:

  • Multiple realistic prosodic "inpaintings" for the same prompt (the phonetic content is imposed; only duration & pitch are sampled).

  • Novel content & prosody congruent with the prompt's expressive style.

Here are continuations of the prompt "When an aristocracy carries on public affairs, its [..]" (from a rather formal rendering of Alexis de Tocqueville’s political essay Democracy in America)

Here are continuations from the prompt "She was quite shocked when I asked her whether wine was allowed [..]" (from an expressive rendering of Jane Austen’s novel Mansfield Park )

More examples can be found here


APPLICATIONS

1. S2ST

GSLM has enabled the first audio-only speech-to-speech translation (S2ST) system

2. Voice Transfer

The speech and prosodic units are largely independent of the speaker. Hence, the model can perform voice transfer by changing the output speaker embedding while preserving the original input's phonetic units & prosody

3. Speech Codec

GSLM can also be used as a speech codec, transmitting only a voice embedding and the discrete codes for units and prosody. GSLM is specialized to speech and cannot encode other audio types (e.g. music).

It performs well even at a far lower bit rate (a rough bit-rate calculation follows the list):

  • 20x compression factor compared with Opus (audio codec)

  • 2x when compared with VQ-VAE (speech codec)
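As a rough back-of-the-envelope illustration of why a discrete-unit codec can be so compact, the sketch below counts the bits needed to transmit the unit and pitch streams. The frame rate and codebook sizes are assumptions for illustration only, not the published GSLM bit rates.

import math

frame_rate = 50                     # assumed unit frames per second (placeholder)
unit_codebook = 100                 # discrete content units
pitch_codebook = 32                 # assumed quantized-pitch codebook (placeholder)

content_bps = frame_rate * math.log2(unit_codebook)    # ~332 bits/s for content
pitch_bps = frame_rate * math.log2(pitch_codebook)     # ~250 bits/s for pitch
print(f"content ~{content_bps:.0f} b/s, pitch ~{pitch_bps:.0f} b/s")
# Even combined, this is far below typical Opus speech bit rates (thousands of bits/s).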

Examples of voice transfer & voice codec use cases are available here

Figure: subjective resynthesis score (MUSHRA, higher is better) as a function of bit rate (lower is better) for different codecs; GSLM's super-low-bit-rate unsupervised codec is shown in green.

Date : 20 September, 2021

Author : Sri Lakshmi

Reference : Facebook AI's GSLM


Further Reading: View the Code + Pretrained models of GSLM here
