SoundStream - AI Codec
SoundStream is a neural (AI) audio codec that compresses speech and music efficiently while preserving high audio quality. It is the first neural network codec to work on both speech and music while being able to run in real time on a smartphone CPU.
SoundStream was introduced by the Google AI team on August 12, 2021. It handles different sound types, such as clean speech, noisy and reverberant speech, music, and environmental sounds. With a single trained model, it delivers state-of-the-art quality over a broad range of bitrates.
CODEC
The word codec is a blend of two words: encoder and decoder. A codec compresses files so that they are easier to store and distribute.
Encoding a file means removing redundant or barely perceptible information while keeping the loss in quality as small as possible. Encoding thus reduces the file size using mathematical transforms.
Decoding reverses this math so that the encoded file can be played back.
AUDIO CODEC
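An audio codec is a standard or protocol for encoding and decoding audio.
Eg: MP3, AAC, Opus, EVS, etc. (WAV is often listed alongside these, but it is strictly a container format that usually holds uncompressed PCM audio.)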
OPUS & EVS
Opus supports bitrates from 6 to 510 kbps. It is deployed in applications like Google Meet and YouTube. Opus is open source.
EVS (Enhanced Voice Services) supports bitrates from 5.9 to 128 kbps. It was developed by 3GPP for mobile telephony.
In both Opus & EVS, the reconstructed audio quality is
excellent at medium-to-low bitrates (12–20 kbps), but
degrades sharply at very low bitrates (⪅3 kbps)
Although these handcrafted audio codecs achieve high quality through carefully engineered signal processing pipelines, AI researchers are trying to replace those pipelines with machine learning codecs that are learned from data.
COMPONENTS OF SOUNDSTREAM
The SoundStream neural network consists of 3 components, trained jointly as a single network (end-to-end).
Encoder converts the input audio stream into a coded signal
Residual Vector Quantizer (RVQ) compresses the coded signal
Decoder converts the compressed signal back to audio
TRAINING SOUNDSTREAM
Training involves a discriminator that computes a combination of adversarial and reconstruction loss functions. This induces the reconstructed audio to sound like the uncompressed original input.
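As a rough illustration of how an adversarial term and a reconstruction term can be combined into one training objective, here is a minimal sketch. The article does not give the exact loss formulas, so the hinge form of the adversarial term, the L1 waveform reconstruction term, and the weights `w_adv` and `w_rec` are all assumptions chosen for illustration, not SoundStream's actual configuration.

```python
def hinge_generator_loss(disc_scores_fake):
    """Adversarial term (assumed hinge form): penalize decoded audio
    whose discriminator score falls below 1."""
    return sum(max(0.0, 1.0 - s) for s in disc_scores_fake) / len(disc_scores_fake)


def l1_reconstruction_loss(original, decoded):
    """Reconstruction term (assumed L1 on waveform samples)."""
    return sum(abs(a - b) for a, b in zip(original, decoded)) / len(original)


def codec_loss(original, decoded, disc_scores_fake, w_adv=1.0, w_rec=1.0):
    """Weighted sum of both terms; the weights are hypothetical."""
    return (w_adv * hinge_generator_loss(disc_scores_fake)
            + w_rec * l1_reconstruction_loss(original, decoded))


# Toy values: four audio samples and two discriminator scores.
original = [0.0, 0.5, -0.5, 0.25]
decoded = [0.0, 0.25, -0.5, 0.5]
scores = [0.5, 1.5]
loss = codec_loss(original, decoded, scores)
print(loss)  # 0.375 = 0.25 (adversarial) + 0.125 (reconstruction)
```

Minimizing the first term pushes the decoder to produce audio the discriminator rates as real, while the second term keeps the output close to the original input.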
Once the entire model is trained, the encoder and decoder can be run on separate clients to efficiently transmit high-quality audio over a network.
SoundStream learns
an audio codec from data
a scalable codec through RVQ
RVQ
The encoder can produce an infinite set of vectors. Replacing each of them with the closest vector from a finite set, the codebook, is called Vector Quantization. The Residual Vector Quantizer (RVQ) splits this quantization process across several layers (up to 80), which reduces the size of the codebook. The first layer quantizes the code vectors with moderate resolution. Each subsequent layer processes the residual error left by the previous one.
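The layer-by-layer idea can be sketched in a few lines of plain Python. This is a toy illustration only: the function names are hypothetical, and the two small codebooks are hand-picked here, whereas SoundStream learns its codebooks during training.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))


def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer codes the residual left so far."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected entries; passing fewer indices gives a coarser result."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hand-picked toy codebooks: a coarse first layer and a finer second layer.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]],
]
vec = [1.2, 0.9]
indices = rvq_encode(vec, codebooks)
print(indices)                              # [1, 1]
print(rvq_decode(indices[:1], codebooks))   # [1.0, 1.0]  (first layer only)
print(rvq_decode(indices, codebooks))       # [1.25, 1.0] (both layers)
```

Decoding with both layers lands closer to the input than decoding with the first layer alone, which is what lets later layers be treated as optional refinements.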
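Eg: At 3 kbps with 100 vectors per second, each vector gets 30 bits, so a single quantizer would need a codebook of 2^30 (about 1 billion) entries. With 5 quantizer layers of 6 bits each, the total codebook size drops to 5 x 64 = 320 entries.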
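The arithmetic behind this example can be checked with a short sketch. The helper names are hypothetical, and the code assumes the bits divide evenly across vectors and layers.

```python
def bits_per_vector(bitrate_bps, vectors_per_second):
    """Bit budget available for each encoded vector."""
    return bitrate_bps // vectors_per_second


def single_codebook_size(bits):
    """Entries needed if one quantizer spends the whole bit budget."""
    return 2 ** bits


def rvq_total_entries(bits, num_layers):
    """Total entries when the budget is split evenly across RVQ layers."""
    bits_per_layer = bits // num_layers
    return num_layers * 2 ** bits_per_layer


bits = bits_per_vector(3000, 100)
print(bits)                          # 30
print(single_codebook_size(bits))    # 1073741824 (about 1 billion)
print(rvq_total_entries(bits, 5))    # 320
```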
A novel method called “quantizer dropout” has been proposed to make SoundStream scalable. Because the number of quantization layers controls the bitrate, the bitrate can be increased or decreased by adding or removing quantizer layers. During training, some quantization layers are randomly dropped to simulate a varying bitrate, so the decoder learns to perform well at any bitrate of the incoming audio stream. As a result, a single trained model can operate at any bitrate and adapt to the varying network conditions encountered while transmitting audio.
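A minimal sketch of the quantizer dropout idea is shown below. The article does not say how the number of kept layers is chosen, so drawing it uniformly at random for each training step is an assumption, and the function names and the values of 5 layers, 6 bits per layer, and 100 vectors per second are illustrative only.

```python
import random


def sample_active_layers(num_layers, rng):
    """Pick how many quantizer layers to keep for one training step
    (assumed uniform over 1..num_layers)."""
    return rng.randint(1, num_layers)


def bitrate_bps(active_layers, bits_per_layer, vectors_per_second):
    """Bitrate implied by using only the first active_layers quantizers."""
    return active_layers * bits_per_layer * vectors_per_second


NUM_LAYERS = 5
BITS_PER_LAYER = 6
VECTORS_PER_SECOND = 100

rng = random.Random(0)  # seeded only to make the demo repeatable
for step in range(4):
    kept = sample_active_layers(NUM_LAYERS, rng)
    rate = bitrate_bps(kept, BITS_PER_LAYER, VECTORS_PER_SECOND)
    print(f"step {step}: keep first {kept} layers -> {rate} bps")
```

With these numbers, keeping all 5 layers corresponds to 3000 bps and keeping only the first layer corresponds to 600 bps, so the same trained model covers the whole range.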
PERFORMANCE
SoundStream at 3 kbps outperforms Opus at 12 kbps.
At 3 kbps it also approaches the quality of EVS at 9.6 kbps, which means it uses 3.2x–4x fewer bits (lower bandwidth) than these codecs.
At the same bitrate, SoundStream outperforms the current version of Lyra, which is based on an autoregressive network.
The demonstration of SoundStream’s performance compared to Opus, EVS, and the original Lyra codec is available here
COMPRESSION + ENHANCEMENT
In SoundStream, compression and enhancement are carried out jointly by the same model, which decreases the overall latency. Compression is combined with background noise suppression, and denoising can be activated or deactivated dynamically.
LYRA + SOUNDSTREAM
Google's Lyra is already deployed and optimized for production usage. SoundStream is still at an experimental stage. In the future, Lyra will incorporate SoundStream components to provide both higher audio quality and reduced complexity.
SOUNDSTREAM RELEASE
SoundStream will be released as a part of the improved version of Lyra, providing both flexibility and better sound quality. A SoundStream TensorFlow model will soon be released for experimentation.
SoundStream is a machine learning-driven audio codec that outperforms state-of-the-art codecs like Opus and EVS, can enhance audio on demand, and requires deployment of only a single scalable model.
I am Sri Lakshmi, AI Practitioner, Developer & Technical Content Producer