Audio signal processing
Wordcloud image generated from research papers on audio separation and sound event detection, Machine Intelligence Lab.
Among the many applications of audio signal processing, the Machine Intelligence Lab focuses on the following areas.
Monaural audio source separation
Separating multiple source signals from a single-channel mixture input. Because it amounts to finding a one-to-many mapping, it is usually among the hardest classes of problems in signal processing and machine learning research.
Various deep learning and machine learning methods are applied to exploit prior knowledge and compensate for the missing information about the source signals, typically music and speech sounds.
Speaker-attributed speech separation: using a small amount of target speech to extract the target speaker's speech from the mixture signal.
Sound event detection
Identifying which sound sources are active, and when, along the time axis.
Various deep learning methods such as U-Net and CNNs are applied to learn the statistical characteristics of audio signals.
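A common post-processing step in sound event detection is turning the per-frame probabilities produced by such networks into onset/offset times. A minimal sketch, assuming a fixed frame rate and a simple threshold (the function name, `frame_sec`, and `thr` are illustrative, not from a specific system):

```python
import numpy as np

def probs_to_events(probs, frame_sec=0.02, thr=0.5):
    """Convert per-frame event probabilities (e.g. CNN outputs) into
    (onset, offset) times by thresholding. frame_sec and thr are
    illustrative assumptions, not values from a specific paper."""
    active = (probs >= thr).astype(int)
    # pad with zeros so events touching either end still produce edges
    edges = np.diff(np.concatenate([[0], active, [0]]))
    onsets = np.flatnonzero(edges == 1) * frame_sec
    offsets = np.flatnonzero(edges == -1) * frame_sec
    return list(zip(onsets.tolist(), offsets.tolist()))

# two active regions: frames 1-2 and frames 4-5
events = probs_to_events(np.array([0.1, 0.8, 0.9, 0.2, 0.7, 0.6]))
```

Real systems usually add a smoothing step (e.g. median filtering of `active`) before extracting edges, which this sketch omits.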
Sound Event Detection
(Under construction)
Monaural audio separation
Single-channel audio separation, or monaural sound source separation, is the problem of extracting multiple sounds from a mixture of sources given as a single-channel input. The problem is mathematically ill-posed because there is inherent information loss from input to outputs: MT unknowns (M sources of T samples each) must be recovered from only T given variables.
To compensate for this information loss, prior knowledge of the sources is necessary. Separation performance therefore depends strongly on the characteristics of the sound sources.
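The counting argument above can be made concrete in a few lines (the sizes are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
M, T = 2, 16000                        # M sources, T samples each (1 s at 16 kHz)
sources = rng.standard_normal((M, T))  # stand-ins for speech/music waveforms

# A single-channel mixture collapses M*T unknowns into only T observations,
# so the inverse mapping cannot be recovered without prior knowledge.
mixture = sources.sum(axis=0)

print(sources.size, "unknowns vs", mixture.size, "observations")
```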
(SCIE) Seungtae Kang, Jeong-Sik Park, Gil-Jin Jang. Improving Singing Voice Separation Using Curriculum Learning on Recurrent Neural Networks. Applied Sciences-Basel (MDPI). Appl. Sci. 2020, 10:7(2465). pp. 1-15.
(Extended Abstract/poster) Seungtae Kang, Gil-Jin Jang. Loss Function Weighting Based on Source Dominance for Monaural Source Separation Using Recurrent Neural Networks. 14th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA). University of Surrey, Guildford, UK. July 2-6, 2018.
(SCI) Han-Gyu Kim, Gil-Jin Jang, Yung-Hwan Oh, and Ho-Jin Choi*. Speech and music pitch trajectory classification using recurrent neural networks for monaural speech segregation. The Journal of Supercomputing, Vol. 76, Issue 10, pp. 8193-8213 (2020)
(SCI) Han-Gyu Kim, Gil-Jin Jang, Jeong-Sik Park, Yung-Hwan Oh, and Ho-Jin Choi*. Single channel blind source separation based on probabilistic matrix factorisation. ELECTRONICS LETTERS, Vol. 53 No. 21, pages 1429-1431, 12th October 2017.
(KCI) Gil-Jin Jang. Audio signal clustering and separation using a stacked autoencoder. The Journal of the Acoustical Society of Korea, Vol. 35, No. 4, pages 303-309, July 2016.
Han-Gyu Kim, Gil-Jin Jang, Jeong-Sik Park, Ji-Hwan Kim, Yung-Hwan Oh. Particle Filtering Based Pitch Sequence Correction for Monaural Speech Segregation. International Journal of Imaging Systems and Technology, Vol. 23, pp. 64-70, 2013.
Han-Gyu Kim, Gil-Jin Jang, Jeong-Sik Park, Ji-Hwan Kim, Yung-Hwan Oh. Speech Segregation based on Pitch Track Correction and Music-Speech Classification. Advances in Electrical and Computer Engineering, Vol. 12, No. 2, pp. 15-20, May 2012.
Gil-Jin Jang, Te-Won Lee. A Maximum Likelihood Approach to Single-channel Source Separation. Journal of Machine Learning Research, Special Issue on Independent Component Analysis, Vol. 4, pp. 1365-1392, December 2003.
Gil-Jin Jang, Te-Won Lee, Yung-Hwan Oh. Single Channel Signal Separation Using MAP-based Subspace Decomposition. Electronics Letters, Vol. 39, No. 24, pp. 1766-1767, 27th November 2003.
Gil-Jin Jang, Te-Won Lee, Yung-Hwan Oh. Single Channel Signal Separation Using Time-Domain Basis Functions. IEEE Signal Processing Letters, Vol. 10, No. 6, pp. 168-171, June 2003.
Sound separation demo
Based on A Maximum Likelihood Approach to Single-channel Source Separation. Gil-Jin Jang and Te-Won Lee. Journal of Machine Learning Research, Volume 4, pages 1365-1392, December 2003.
Acoustic signal characteristics.
Review: The cocktail party problem
Term coined by Colin Cherry, a British engineer working at MIT
"How do we recognise what one person is saying when others are speaking at the same time?" (Cherry, 1953)
For cocktail party-like situations… when all voices are equally loud, speech remains intelligible for normal-hearing listeners even when there are as many as six interfering talkers (Bronkhorst & Plomp, 1992)
A modern (purely acoustic) perspective, from Yost (1997)
Exploiting physical attributes of sounds
Spectral separation
Spectral profile
Harmonicity
Spatial separation
Temporal separation
Temporal onsets/offsets
Temporal modulations
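Among these cues, harmonicity is easy to illustrate: if the target's fundamental frequency is known, spectral bins near its harmonics can be kept and the rest discarded. A toy sketch, assuming a known f0, exact-bin harmonics, and an arbitrary 10 Hz mask width (none of these come from a specific paper):

```python
import numpy as np

sr, T, f0 = 16000, 16000, 200.0        # 1 s of audio; assumed known pitch f0
t = np.arange(T) / sr
# target: a 5-harmonic tone; interference: white noise
target = sum(np.sin(2 * np.pi * f0 * k * t) for k in range(1, 6))
mixture = target + np.random.default_rng(1).standard_normal(T)

spec = np.fft.rfft(mixture)
freqs = np.fft.rfftfreq(T, 1.0 / sr)

# harmonicity cue: keep only bins within 10 Hz of a harmonic of f0
dist = np.abs(freqs[:, None] - f0 * np.arange(1, 6)[None, :]).min(axis=1)
enhanced = np.fft.irfft(spec * (dist < 10.0), n=T)

# the masked signal is much closer to the harmonic target than the mixture is
err_mix = np.mean((mixture - target) ** 2)
err_enh = np.mean((enhanced - target) ** 2)
```

Real systems must of course estimate f0 from the mixture and handle non-stationary pitch, which this sketch sidesteps.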
How to implement in machines: Prior information is necessary
The characteristics of the sources should be known to overcome data insufficiency
Prediction-driven approaches (classical methods)
CASA (computational auditory scene analysis)
Statistical approaches: Building a statistical model which is suitable to the source signals
Human listeners can easily isolate and understand target speech
Cocktail-party effect
Binaural hearing
Context info: visual, linguistic selective attention
Our work
Loss function weighting for RNN-based separation
Quick summary: a loss-weighting method based on the dominance between sources, used to train the RNN-based baseline model more effectively.
Method: the mutual dominance factor is defined as the product of the inverses of the individual dominance values. It is minimal when both sources contribute equally, and increases sharply when one source component dominates. Because the factor grows without bound as the dominance of one component approaches 1, it is scaled by the logarithm and used as a weight.
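A minimal sketch of such a weight, assuming dominance is each source's share of the per-bin magnitude sum (the function name, this definition, and `eps` are assumptions for illustration; the exact formulation is in the LVA/ICA paper):

```python
import numpy as np

def dominance_weight(mag1, mag2, eps=1e-8):
    """Sketch of a dominance-based loss weight.

    d_i is source i's share of each time-frequency bin; the mutual
    dominance factor 1/(d1*d2) is minimal when d1 = d2 = 0.5 and blows
    up as either dominance approaches 1, so it is log-scaled.
    """
    total = mag1 + mag2 + eps
    d1, d2 = mag1 / total, mag2 / total
    return np.log(1.0 / (d1 * d2 + eps))

# equal contributions -> minimum weight (about log 4);
# one source dominant -> much larger weight
w_eq = dominance_weight(np.array([1.0]), np.array([1.0]))
w_dom = dominance_weight(np.array([1.0]), np.array([0.01]))
```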
(Extended Abstract/poster) Seungtae Kang, Gil-Jin Jang. Loss Function Weighting Based on Source Dominance for Monaural Source Separation Using Recurrent Neural Networks. 14th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA). University of Surrey, Guildford, UK. July 2-6, 2018.
(SCIE) Seungtae Kang, Jeong-Sik Park, Gil-Jin Jang. Improving Singing Voice Separation Using Curriculum Learning on Recurrent Neural Networks. Applied Sciences-Basel (MDPI). Appl. Sci. 2020, 10:7(2465). pp. 1-15.
Guided Source Separation
Guided Training: A Simple Method for Single-channel Speaker Separation. Hao Li, Xueliang Zhang, Guanglai Gao. [https://arxiv.org/abs/2103.14330]
Attention is All You Need in Speech Separation (SepFormer). Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, Jianyuan Zhong. [https://arxiv.org/abs/2010.13154]
Guided source separation: the guide speech is prepended to the mixture to indicate which type of source should be extracted.
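The input construction itself is simple; a sketch of the assumed layout (the function name and the optional silent gap are illustrative, not from the cited paper):

```python
import numpy as np

def build_guided_input(guide, mixture, gap=160):
    """Prepend the guide utterance to the mixture, with a short silent
    gap in between, so the network knows which speaker to extract.
    The gap length (here 160 samples) is an assumption for illustration."""
    silence = np.zeros(gap, dtype=mixture.dtype)
    return np.concatenate([guide, silence, mixture])

guide = np.ones(320)      # stand-in for a short target-speaker utterance
mixture = np.zeros(1600)  # stand-in for the multi-speaker mixture
x = build_guided_input(guide, mixture)
```

The separation network then processes `x` and only the portion corresponding to the mixture is scored.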
Overview of SepFormer
Conv-TasNet
Catching up with the state-of-the-art source separation method, Conv-TasNet.
Yi Luo, Nima Mesgarani. Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation. IEEE/ACM Transactions on Audio, Speech and Language Processing. 2019. https://arxiv.org/abs/1809.07454
Figure 4. Conv-TasNet Block Diagram, https://deeesp.github.io/speech/Conv-TasNet-1/
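The data flow in the block diagram (learned encoder, mask estimation in the latent domain, overlap-add decoder) can be sketched with NumPy. This is not Conv-TasNet itself: the real model learns the encoder/decoder bases end-to-end and estimates the mask with a temporal convolutional network, whereas here all weights are random placeholders chosen only to show the shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, T = 64, 16, 1600        # basis filters, window length, waveform samples

def encode(x, B, hop=L // 2):
    """Frame the waveform with 50% overlap and project onto N basis filters."""
    frames = np.stack([x[i:i + L] for i in range(0, len(x) - L + 1, hop)])
    return np.maximum(frames @ B.T, 0.0)          # non-negative latent, (K, N)

def decode(W, D, hop=L // 2):
    """Map latent frames back to waveform segments and overlap-add."""
    out = np.zeros(hop * (len(W) - 1) + L)
    for k, seg in enumerate(W @ D):               # D: (N, L) decoder basis
        out[k * hop:k * hop + L] += seg
    return out

B = rng.standard_normal((N, L))                   # stand-in encoder weights
x = rng.standard_normal(T)
latent = encode(x, B)
# placeholder for the TCN separator: a sigmoid mask per latent bin
mask = 1.0 / (1.0 + np.exp(-rng.standard_normal(latent.shape)))
estimate = decode(latent * mask, np.linalg.pinv(B).T)
```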
3D audio signal processing
3D effect generation using VBAP (vector-base amplitude panning) and DBAP (distance-base amplitude panning)
VBAP: vector-base amplitude panning
DBAP: distance-base amplitude panning
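For a two-loudspeaker pair, VBAP reduces to expressing the source direction vector as a combination of the two loudspeaker unit vectors and normalizing the gains. A minimal 2-D sketch (function name and the constant-power normalization choice are assumptions):

```python
import numpy as np

def vbap_gains_2d(source_deg, spk1_deg, spk2_deg):
    """2-D VBAP: solve p = g L for the gain vector g, where the rows of L
    are the loudspeaker unit vectors and p is the source direction, then
    normalize for constant power."""
    def unit(deg):
        a = np.deg2rad(deg)
        return np.array([np.cos(a), np.sin(a)])
    L = np.stack([unit(spk1_deg), unit(spk2_deg)])   # rows = speaker vectors
    g = unit(source_deg) @ np.linalg.inv(L)          # p = g L  ->  g = p L^-1
    return g / np.linalg.norm(g)

# source midway between speakers at +-30 degrees -> equal gains
g = vbap_gains_2d(0.0, 30.0, -30.0)
```

A source exactly at one loudspeaker yields gains (1, 0), and panning between the pair moves the gains smoothly in between.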