SoundStream - AI Codec

SoundStream is an AI audio codec for speech and music, producing higher-quality audio at lower bitrates than conventional codecs. It is the first neural-network audio codec capable of running in real time on a smartphone CPU.

SoundStream was introduced by the Google AI team on August 12, 2021. It can handle different sound types such as clean speech, noisy and reverberant speech, music, and environmental sounds. With just a single trained model, it delivers state-of-the-art quality over a broad range of bitrates.


CODEC

The word 'codec' is derived by merging two words, 'encoder' and 'decoder'. A codec is used to compress files and enable easy distribution.

Encoding a file means removing extra information without compromising quality; this reduces the file size using complex mathematical functions.

Decoding reverses this math and plays back the encoded file.


AUDIO CODEC

An audio codec is a standard or protocol for encoding and decoding audio.

E.g., MP3, WAV, AAC, Opus, EVS, etc.


OPUS & EVS

Opus supports bitrates from 6 to 510 kbps. It is open source and deployed in applications like Google Meet and YouTube.

EVS (Enhanced Voice Services), developed by 3GPP for telephony, supports bitrates from 5.9 to 128 kbps.

In both Opus & EVS, the reconstructed audio quality is

  • excellent at medium-to-low bitrates (12–20 kbps), but

  • degrades sharply at very low bitrates (⪅3 kbps)

Though these traditional audio codecs achieve high quality through carefully handcrafted signal-processing pipelines, AI researchers are trying to replace such pipelines with machine-learning codecs built from data.


COMPONENTS OF SOUNDSTREAM

The SoundStream neural network consists of three components, trained jointly end-to-end as a single network (a toy sketch of this data flow follows the list):

  • Encoder converts the input audio stream into a coded signal

  • Residual Vector Quantizer (RVQ) compresses the coded signal

  • Decoder converts the compressed signal back to audio
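
A toy NumPy sketch of this encode → quantize → decode data flow (the pooling, nearest-codebook lookup, and frame repetition below are illustrative stand-ins for SoundStream's learned convolutional networks):

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME = 320                               # samples per latent frame (illustrative stride)
CODEBOOK = rng.normal(size=(64, 1))       # toy 1-D codebook with 64 entries

def encode(audio: np.ndarray) -> np.ndarray:
    # Toy "encoder": collapse each frame of samples into one latent value.
    # (SoundStream uses a learned convolutional encoder instead.)
    n = len(audio) // FRAME * FRAME
    return audio[:n].reshape(-1, FRAME).mean(axis=1, keepdims=True)

def quantize(latents: np.ndarray) -> np.ndarray:
    # Toy quantizer: snap each latent frame to its nearest codebook entry.
    dists = np.abs(latents - CODEBOOK.T)  # (frames, codebook entries)
    return CODEBOOK[np.argmin(dists, axis=1)]

def decode(quantized: np.ndarray) -> np.ndarray:
    # Toy "decoder": expand each quantized frame back to waveform samples.
    # (SoundStream uses a learned convolutional decoder instead.)
    return np.repeat(quantized[:, 0], FRAME)

audio = rng.normal(size=16000)            # one second of fake 16 kHz audio
reconstructed = decode(quantize(encode(audio)))
print(audio.shape, reconstructed.shape)   # (16000,) (16000,)
```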



TRAINING SOUNDSTREAM

Training involves a discriminator that computes a combination of adversarial and reconstruction loss functions. This pushes the reconstructed audio to sound like the uncompressed original input.
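
As a hedged sketch of how those two terms might be combined on the generator (encoder/quantizer/decoder) side, assuming a simple L2 term in place of SoundStream's spectral reconstruction losses and a hinge-style adversarial term with illustrative weights:

```python
import numpy as np

def reconstruction_loss(original: np.ndarray, reconstructed: np.ndarray) -> float:
    # Simple L2 term standing in for SoundStream's spectral reconstruction losses.
    return float(np.mean((original - reconstructed) ** 2))

def adversarial_loss(disc_scores_on_reconstruction: np.ndarray) -> float:
    # Hinge-style generator loss: push the discriminator's scores on the
    # reconstructed audio towards "real".
    return float(np.mean(np.maximum(0.0, 1.0 - disc_scores_on_reconstruction)))

def generator_loss(original, reconstructed, disc_scores, w_rec=1.0, w_adv=1.0) -> float:
    # Weighted sum of the two terms (weights are illustrative, not the paper's).
    return (w_rec * reconstruction_loss(original, reconstructed)
            + w_adv * adversarial_loss(disc_scores))
```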

Once the entire model is trained, the encoder and decoder can be run on separate clients to efficiently transmit high-quality audio over a network.

SoundStream learns

  • an audio codec directly from data

  • a scalable codec via the RVQ


RVQ

The encoder produces vectors from a continuous (effectively infinite) set; replacing them with vectors drawn from a finite codebook is called vector quantization. The Residual Vector Quantizer (RVQ) has up to 80 layers: the first layer quantizes the code vectors at moderate resolution, and each subsequent layer processes the residual error left by the previous one. Splitting the quantization across several layers keeps the codebook size small.
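
A minimal NumPy sketch of the residual quantization idea (dimensions, codebook sizes, and random codebooks are illustrative; real RVQ codebooks are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, ENTRIES, LAYERS = 8, 64, 5                      # illustrative sizes
codebooks = rng.normal(size=(LAYERS, ENTRIES, DIM))  # one codebook per layer

def rvq_quantize(vector: np.ndarray) -> np.ndarray:
    # Layer 1 quantizes the vector itself; every following layer quantizes
    # the residual error left by the layers before it.
    residual = vector.copy()
    approximation = np.zeros_like(vector)
    for codebook in codebooks:
        idx = np.argmin(np.sum((codebook - residual) ** 2, axis=1))
        approximation += codebook[idx]
        residual -= codebook[idx]
    return approximation

x = rng.normal(size=DIM)
print(np.linalg.norm(x - rvq_quantize(x)))           # error of the 5-layer approximation
```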

E.g., at 3 kbps with 100 vectors per second, 30 bits are available per vector; a single quantizer would need a codebook of roughly 1 billion entries, whereas 5 quantizer layers need only 320 codebook vectors in total.
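
The arithmetic behind that example (numbers taken from the example above):

```python
bitrate = 3000                                   # 3 kbps, in bits per second
frames_per_second = 100                          # latent vectors per second
bits_per_frame = bitrate // frames_per_second    # 30 bits per vector

single_codebook = 2 ** bits_per_frame            # one quantizer: ~1.07 billion entries
layers = 5
per_layer = 2 ** (bits_per_frame // layers)      # 6 bits per layer -> 64 entries
rvq_codebook = layers * per_layer                # 5 x 64 = 320 entries in total

print(bits_per_frame, single_codebook, rvq_codebook)   # 30 1073741824 320
```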

A novel method called “quantizer dropout” was proposed to make SoundStream scalable. Since the number of quantization layers controls the bitrate, the bitrate can be increased or decreased by adding or removing layers. During training, some quantization layers are randomly dropped to simulate a varying bitrate, so the decoder learns to perform well at any bitrate of the incoming audio stream. As a result, a single trained model can operate at any bitrate, adapting to the varying network conditions encountered while transmitting audio.
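
A hedged sketch of the quantizer-dropout idea (the layer count and sampling scheme below are illustrative, not the paper's exact recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
MAX_LAYERS = 8      # illustrative; SoundStream's RVQ can have many more layers

def active_layers(training: bool, target_layers: int = MAX_LAYERS) -> int:
    # During training, use a random number of RVQ layers per example so the
    # model learns to reconstruct audio at every corresponding bitrate.
    if training:
        return int(rng.integers(1, MAX_LAYERS + 1))
    # At inference time, pick the layer count that matches the target
    # bitrate / current network conditions.
    return target_layers

print([active_layers(training=True) for _ in range(5)])   # varies per example
print(active_layers(training=False, target_layers=4))     # fixed, bitrate-driven
```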


PERFORMANCE

SoundStream at 3 kbps outperforms Opus at 12 kbps.

It also approaches the quality of EVS at 9.6 kbps while using 3.2x–4x fewer bits (i.e., lower bandwidth).

At the same bitrate, SoundStream outperforms the current version of Lyra, which is based on an autoregressive network.

A demonstration of SoundStream’s performance compared to Opus, EVS, and the original Lyra codec is available in the Google AI blog post (see Reference).



COMPRESSION + ENHANCEMENT

In SoundStream, compression and enhancement are carried out jointly by the same model, which reduces the overall latency. Compression is combined with background noise suppression, and denoising can be activated or deactivated dynamically.
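
Purely as an illustration of this on/off control (the function and conditioning scheme below are hypothetical, not SoundStream's actual interface), the denoising mode can be thought of as a conditioning value attached to each frame:

```python
import numpy as np

def condition_frame(latent_frame: np.ndarray, denoise: bool) -> np.ndarray:
    # Append a denoising flag to the latent frame; a single model conditioned
    # on this value can switch noise suppression on or off per frame at run time.
    return np.concatenate([latent_frame, [1.0 if denoise else 0.0]])

frame = np.zeros(4)
print(condition_frame(frame, denoise=True))    # enhancement active for this frame
print(condition_frame(frame, denoise=False))   # plain compression only
```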


LYRA + SOUNDSTREAM

Google's Lyra is already deployed and optimized for production usage. SoundStream is still at an experimental stage. In the future, Lyra will incorporate SoundStream components to provide both higher audio quality and reduced complexity.


SOUNDSTREAM RELEASE

SoundStream will be released as part of the improved version of Lyra, providing both flexibility and better sound quality. The SoundStream TensorFlow model will soon be released for experimentation.

SoundStream is an amazing machine-learning-driven audio codec. It outperforms state-of-the-art codecs like Opus and EVS, can enhance audio on demand, and requires deployment of only a single scalable model.


Reference: Google AI's SoundStream
