Audio signal processing
Wordcloud image generated from research papers on audio separation and sound event detection, Machine Intelligence Lab.
Among the many applications of audio signal processing, the Machine Intelligence Lab focuses on the following areas.
Monaural audio source separation
Separating multiple source signals from a single-channel mixture input. Because it amounts to finding a one-to-many mapping, it is usually among the hardest classes of problems in signal processing and machine learning research.
Various deep learning and machine learning methods are applied to exploit prior knowledge and compensate for the missing information about the source signals, typically music and speech sounds.
Speaker-attributed speech separation: using a small amount of target speech to extract the target speaker's speech from the mixture signal.
Sound event detection
Identifying which sound sources are active, and when, along the time axis.
Various deep learning methods such as U-Net and CNNs are applied to learn the statistical characteristics of audio signals.
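A common post-processing step in sound event detection is turning the per-frame probabilities produced by such networks into onset/offset times. A minimal sketch, assuming a fixed frame rate and a simple threshold (the function name, `frame_sec`, and `thr` are illustrative, not from a specific system):

```python
import numpy as np

def probs_to_events(probs, frame_sec=0.02, thr=0.5):
    """Convert per-frame event probabilities (e.g. CNN outputs) into
    (onset, offset) times by thresholding. frame_sec and thr are
    illustrative assumptions, not values from a specific paper."""
    active = (probs >= thr).astype(int)
    # pad with zeros so events touching either end still produce edges
    edges = np.diff(np.concatenate([[0], active, [0]]))
    onsets = np.flatnonzero(edges == 1) * frame_sec
    offsets = np.flatnonzero(edges == -1) * frame_sec
    return list(zip(onsets.tolist(), offsets.tolist()))

# two active regions: frames 1-2 and frames 4-5
events = probs_to_events(np.array([0.1, 0.8, 0.9, 0.2, 0.7, 0.6]))
```

Real systems usually add a smoothing step (e.g. median filtering of `active`) before extracting edges, which this sketch omits.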
Sound Event Detection
(Under construction)
Monaural audio separation
Single-channel audio separation, or monaural sound source separation, is the problem of extracting multiple sounds from a mixture of sources given as a single-channel input. The problem is mathematically ill-posed because there is inherent information loss from input to outputs: MT unknowns (M sources of T samples each) must be recovered from only T given variables.
To compensate for this information loss, prior knowledge of the sources is necessary. Separation performance therefore depends strongly on the characteristics of the sound sources.
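The counting argument above can be made concrete in a few lines (the sizes are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
M, T = 2, 16000                        # M sources, T samples each (1 s at 16 kHz)
sources = rng.standard_normal((M, T))  # stand-ins for speech/music waveforms

# A single-channel mixture collapses M*T unknowns into only T observations,
# so the inverse mapping cannot be recovered without prior knowledge.
mixture = sources.sum(axis=0)

print(sources.size, "unknowns vs", mixture.size, "observations")
```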
(SCIE) Seungtae Kang, Jeong-Sik Park, Gil-Jin Jang. Improving Singing Voice Separation Using Curriculum Learning on Recurrent Neural Networks. Applied Sciences-Basel (MDPI). Appl. Sci. 2020, 10:7(2465). pp. 1-15.
(Extended Abstract/poster) Seungtae Kang, Gil-Jin Jang. Loss Function Weighting Based on Source Dominance for Monaural Source Separation Using Recurrent Neural Networks. 14th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA). University of Surrey, Guildford, UK. July 2-6, 2018.
(SCI) Han-Gyu Kim, Gil-Jin Jang, Yung-Hwan Oh, and Ho-Jin Choi*. Speech and music pitch trajectory classification using recurrent neural networks for monaural speech segregation. The Journal of Supercomputing, Vol. 76, Issue 10, pp. 8193-8213 (2020)
(SCI) Han-Gyu Kim, Gil-Jin Jang, Jeong-Sik Park, Yung-Hwan Oh, and Ho-Jin Choi*. Single channel blind source separation based on probabilistic matrix factorisation. ELECTRONICS LETTERS, Vol. 53 No. 21, pages 1429-1431, 12th October 2017.
(KCI) Gil-Jin Jang. Audio signal clustering and separation using a stacked autoencoder. The Journal of the Acoustical Society of Korea, Vol. 35, No. 4, pages 303-309, July 2016.
Han-Gyu Kim, Gil-Jin Jang, Jeong-Sik Park, Ji-Hwan Kim, Yung-Hwan Oh. Particle Filtering Based Pitch Sequence Correction for Monaural Speech Segregation. International Journal of Imaging Systems and Technology, Vol. 23, pp. 64-70, 2013.
Han-Gyu Kim, Gil-Jin Jang, Jeong-Sik Park, Ji-Hwan Kim, Yung-Hwan Oh. Speech Segregation based on Pitch Track Correction and Music-Speech Classification. Advances in Electrical and Computer Engineering, Vol. 12, No. 2, pp. 15-20, May 2012.
Gil-Jin Jang, Te-Won Lee. A Maximum Likelihood Approach to Single-channel Source Separation. Journal of Machine Learning Research, Special Issue on Independent Component Analysis, Vol. 4, pp. 1365-1392, December 2003.
Gil-Jin Jang, Te-Won Lee, Yung-Hwan Oh. Single Channel Signal Separation Using MAP-based Subspace Decomposition. Electronics Letters, Vol. 39, No. 24, pp. 1766-1767, 27th November 2003.
Gil-Jin Jang, Te-Won Lee, Yung-Hwan Oh. Single Channel Signal Separation Using Time-Domain Basis Functions. IEEE Signal Processing Letters, Vol. 10, No. 6, pp. 168-171, June 2003.
Sound separation demo
Based on A Maximum Likelihood Approach to Single-channel Source Separation. Gil-Jin Jang and Te-Won Lee. Journal of Machine Learning Research, Volume 4, pages 1365-1392, December 2003.
Acoustic signal characteristics.
Review: The cocktail party problem
Term coined by Colin Cherry, a British engineer working at MIT
"How do we recognise what one person is saying when others are speaking at the same time?" (Cherry, 1953)
For cocktail party-like situations… when all voices are equally loud, speech remains intelligible for normal-hearing listeners even when there are as many as six interfering talkers (Bronkhorst & Plomp, 1992)
A modern (purely acoustic) perspective, from Yost (1997)
Exploiting physical attributes of sounds
Spectral separation
Spectral profile
Harmonicity
Spatial separation
Temporal separation
Temporal onsets/offsets
Temporal modulations
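Among these cues, harmonicity is easy to illustrate: if the target's fundamental frequency is known, spectral bins near its harmonics can be kept and the rest discarded. A toy sketch, assuming a known f0, exact-bin harmonics, and an arbitrary 10 Hz mask width (none of these come from a specific paper):

```python
import numpy as np

sr, T, f0 = 16000, 16000, 200.0        # 1 s of audio; assumed known pitch f0
t = np.arange(T) / sr
# target: a 5-harmonic tone; interference: white noise
target = sum(np.sin(2 * np.pi * f0 * k * t) for k in range(1, 6))
mixture = target + np.random.default_rng(1).standard_normal(T)

spec = np.fft.rfft(mixture)
freqs = np.fft.rfftfreq(T, 1.0 / sr)

# harmonicity cue: keep only bins within 10 Hz of a harmonic of f0
dist = np.abs(freqs[:, None] - f0 * np.arange(1, 6)[None, :]).min(axis=1)
enhanced = np.fft.irfft(spec * (dist < 10.0), n=T)

# the masked signal is much closer to the harmonic target than the mixture is
err_mix = np.mean((mixture - target) ** 2)
err_enh = np.mean((enhanced - target) ** 2)
```

Real systems must of course estimate f0 from the mixture and handle non-stationary pitch, which this sketch sidesteps.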
How to implement in machines: Prior information is necessary
The characteristics of the sources should be known to overcome data insufficiency
Prediction-driven approaches (classical methods)
CASA (computational auditory scene analysis)
Statistical approaches: Building a statistical model which is suitable to the source signals
Human listeners can easily isolate and understand target speech
Cocktail-party effect
Binaural hearing
Context info: visual, linguistic selective attention
Our work
Loss function weighting for RNN-based separation
Quick summary: a loss-weighting method based on the dominance between sources, used to train the RNN-based baseline model more effectively.
Method: the mutual dominance factor is defined as the product of the inverses of the individual dominance values. It is minimal when both sources contribute equally, and increases sharply when one source component dominates. Because the factor grows without bound as the dominance of one component approaches 1, it is scaled by the logarithm and used as a weight.
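A minimal sketch of such a weight, assuming dominance is each source's share of the per-bin magnitude sum (the function name, this definition, and `eps` are assumptions for illustration; the exact formulation is in the LVA/ICA paper):

```python
import numpy as np

def dominance_weight(mag1, mag2, eps=1e-8):
    """Sketch of a dominance-based loss weight.

    d_i is source i's share of each time-frequency bin; the mutual
    dominance factor 1/(d1*d2) is minimal when d1 = d2 = 0.5 and blows
    up as either dominance approaches 1, so it is log-scaled.
    """
    total = mag1 + mag2 + eps
    d1, d2 = mag1 / total, mag2 / total
    return np.log(1.0 / (d1 * d2 + eps))

# equal contributions -> minimum weight (about log 4);
# one source dominant -> much larger weight
w_eq = dominance_weight(np.array([1.0]), np.array([1.0]))
w_dom = dominance_weight(np.array([1.0]), np.array([0.01]))
```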
(Extended Abstract/poster) Seungtae Kang, Gil-Jin Jang. Loss Function Weighting Based on Source Dominance for Monaural Source Separation Using Recurrent Neural Networks. 14th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA). University of Surrey, Guildford, UK. July 2-6, 2018.
(SCIE) Seungtae Kang, Jeong-Sik Park, Gil-Jin Jang. Improving Singing Voice Separation Using Curriculum Learning on Recurrent Neural Networks. Applied Sciences-Basel (MDPI). Appl. Sci. 2020, 10:7(2465). pp. 1-15.
Guided Source Separation
Guided Training: A Simple Method for Single-channel Speaker Separation. Hao Li, Xueliang Zhang, Guanglai Gao. [https://arxiv.org/abs/2103.14330]
Attention is All You Need in Speech Separation (SepFormer). Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, Jianyuan Zhong. [https://arxiv.org/abs/2010.13154]
Guided source separation: the guide speech is prepended to the mixture to indicate which type of source should be extracted.
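The input construction itself is simple; a sketch of the assumed layout (the function name and the optional silent gap are illustrative, not from the cited paper):

```python
import numpy as np

def build_guided_input(guide, mixture, gap=160):
    """Prepend the guide utterance to the mixture, with a short silent
    gap in between, so the network knows which speaker to extract.
    The gap length (here 160 samples) is an assumption for illustration."""
    silence = np.zeros(gap, dtype=mixture.dtype)
    return np.concatenate([guide, silence, mixture])

guide = np.ones(320)      # stand-in for a short target-speaker utterance
mixture = np.zeros(1600)  # stand-in for the multi-speaker mixture
x = build_guided_input(guide, mixture)
```

The separation network then processes `x` and only the portion corresponding to the mixture is scored.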
Overview of SepFormer
Conv-TasNet
Catching up with the state-of-the-art source separation method, Conv-TasNet.
Yi Luo, Nima Mesgarani. Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation. IEEE/ACM Transactions on Audio, Speech and Language Processing. 2019. https://arxiv.org/abs/1809.07454
Figure 4. Conv-TasNet Block Diagram, https://deeesp.github.io/speech/Conv-TasNet-1/
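The data flow in the block diagram (learned encoder, mask estimation in the latent domain, overlap-add decoder) can be sketched with NumPy. This is not Conv-TasNet itself: the real model learns the encoder/decoder bases end-to-end and estimates the mask with a temporal convolutional network, whereas here all weights are random placeholders chosen only to show the shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, T = 64, 16, 1600        # basis filters, window length, waveform samples

def encode(x, B, hop=L // 2):
    """Frame the waveform with 50% overlap and project onto N basis filters."""
    frames = np.stack([x[i:i + L] for i in range(0, len(x) - L + 1, hop)])
    return np.maximum(frames @ B.T, 0.0)          # non-negative latent, (K, N)

def decode(W, D, hop=L // 2):
    """Map latent frames back to waveform segments and overlap-add."""
    out = np.zeros(hop * (len(W) - 1) + L)
    for k, seg in enumerate(W @ D):               # D: (N, L) decoder basis
        out[k * hop:k * hop + L] += seg
    return out

B = rng.standard_normal((N, L))                   # stand-in encoder weights
x = rng.standard_normal(T)
latent = encode(x, B)
# placeholder for the TCN separator: a sigmoid mask per latent bin
mask = 1.0 / (1.0 + np.exp(-rng.standard_normal(latent.shape)))
estimate = decode(latent * mask, np.linalg.pinv(B).T)
```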
3D audio signal processing
3D effect generation using VBAP (vector-base amplitude panning) and DBAP (distance-base amplitude panning)
VBAP: vector-base amplitude panning
DBAP: distance-base amplitude panning
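For a two-loudspeaker pair, VBAP reduces to expressing the source direction vector as a combination of the two loudspeaker unit vectors and normalizing the gains. A minimal 2-D sketch (function name and the constant-power normalization choice are assumptions):

```python
import numpy as np

def vbap_gains_2d(source_deg, spk1_deg, spk2_deg):
    """2-D VBAP: solve p = g L for the gain vector g, where the rows of L
    are the loudspeaker unit vectors and p is the source direction, then
    normalize for constant power."""
    def unit(deg):
        a = np.deg2rad(deg)
        return np.array([np.cos(a), np.sin(a)])
    L = np.stack([unit(spk1_deg), unit(spk2_deg)])   # rows = speaker vectors
    g = unit(source_deg) @ np.linalg.inv(L)          # p = g L  ->  g = p L^-1
    return g / np.linalg.norm(g)

# source midway between speakers at +-30 degrees -> equal gains
g = vbap_gains_2d(0.0, 30.0, -30.0)
```

A source exactly at one loudspeaker yields gains (1, 0), and panning between the pair moves the gains smoothly in between.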