Listening Samples

This page presents audio samples for a listening comparison between our DNN model and MP3. The samples were generated with a ResGLU-type autoencoder with 1.8M parameters. The training loss includes a perceptual term based on a psychoacoustic model (PAM-1) [1]. Quantization is approximated by white noise with a uniform distribution [2].
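As a rough illustration of the uniform-noise approximation, and not the training code used for these samples, the sketch below compares hard rounding with additive uniform noise on a toy latent vector. The function names, the unit step size, and the Gaussian test data are all assumptions made for this example.

```python
import random


def hard_quantize(values, step=1.0):
    """Round each value to the nearest multiple of `step` (actual quantizer)."""
    return [step * round(v / step) for v in values]


def noisy_quantize(values, step=1.0, rng=None):
    """Proxy quantizer: add white noise drawn uniformly from [-step/2, step/2]."""
    rng = rng or random.Random()
    return [v + rng.uniform(-step / 2, step / 2) for v in values]


def variance(xs):
    """Population variance of a list of floats."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


# Hypothetical latent values; a real codec would take these from the encoder.
rng = random.Random(0)
latent = [rng.gauss(0.0, 3.0) for _ in range(10000)]

# Error introduced by each quantizer relative to the unquantized latent.
q_err = [q - v for q, v in zip(hard_quantize(latent), latent)]
n_err = [n - v for n, v in zip(noisy_quantize(latent, rng=rng), latent)]

# Both error variances should be close to step**2 / 12.
print(variance(q_err), variance(n_err))
```

Rounding has zero gradient almost everywhere, whereas the additive-noise version passes gradients through unchanged while producing an error of similar magnitude. This is the usual motivation for the substitution in end-to-end trained codecs.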

References

[1] Joon Byun, Seungmin Shin, Youngcheol Park, Jongmo Sung, and Seungkwon Beack, “Development of a psychoacoustic loss function for the deep neural network (DNN)-based speech coder,” in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 2021, pp. 1694–1698.

[2] Johannes Ballé, Valero Laparra, and Eero P. Simoncelli, “End-to-end optimization of nonlinear transform codes for perceptual quality,” in Picture Coding Symposium (PCS), 2016.

<Sample-1> Encoded at 48 kbps, fs = 32 kHz

Original

A. Ours

B. MP3

<Sample-2> Encoded at 56 kbps, fs = 32 kHz

Original

A. Ours

B. MP3

<Sample-3> Encoded at 64 kbps, fs = 32 kHz

Original

A. Ours

B. MP3
