Deep Learning For Monaural Source Separation

Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, Paris Smaragdis



Abstract

Monaural source separation is important for many real-world applications. It is challenging because, with only single-channel information available, there are infinitely many solutions without proper constraints.
In this paper, we explore joint optimization of masking functions and deep recurrent neural networks for monaural source separation tasks, including monaural speech separation, monaural singing voice separation, and speech denoising. Jointly optimizing the deep recurrent neural networks with an extra masking layer enforces a reconstruction constraint, so that the separated sources sum to the mixture. Moreover, we explore a discriminative training criterion for the neural networks to further enhance separation performance. We evaluate the proposed system on the TSP, MIR-1K, and TIMIT datasets for speech separation, singing voice separation, and speech denoising, respectively.
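The masking layer and the discriminative criterion can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' code: it assumes the two network outputs (`y1_hat`, `y2_hat`) and the mixture magnitude spectrogram `z` are already available as (frames x frequency bins) arrays. The function names, the `eps` guard and the weight `gamma = 0.05` are illustrative assumptions. The loss follows the form of the discriminative objective as we understand it from reference 1: squared error to the correct source minus `gamma` times squared error to the competing source.

```python
import numpy as np


def masking_layer(y1_hat, y2_hat, z, eps=1e-8):
    """Soft time-frequency masking of the mixture magnitude z.

    y1_hat, y2_hat: raw network outputs for the two sources, shape (frames, bins).
    z: non-negative mixture magnitude spectrogram with the same shape.
    """
    a1, a2 = np.abs(y1_hat), np.abs(y2_hat)
    mask = a1 / (a1 + a2 + eps)   # eps (an assumption) avoids division by zero
    s1 = mask * z                 # masked estimate of source 1
    s2 = (1.0 - mask) * z         # complementary estimate of source 2
    return s1, s2


def discriminative_loss(s1, s2, y1, y2, gamma=0.05):
    """Squared error to the correct target minus gamma times squared error
    to the competing target (lower is better).

    gamma = 0.05 is an illustrative value, not a setting from the paper.
    """
    return 0.5 * (
        np.sum((s1 - y1) ** 2) - gamma * np.sum((s1 - y2) ** 2)
        + np.sum((s2 - y2) ** 2) - gamma * np.sum((s2 - y1) ** 2)
    )


# Toy usage with random arrays standing in for spectrograms and network outputs.
rng = np.random.default_rng(0)
frames, bins = 10, 513
z = np.abs(rng.standard_normal((frames, bins)))
y1_hat = rng.standard_normal((frames, bins))
y2_hat = rng.standard_normal((frames, bins))
s1, s2 = masking_layer(y1_hat, y2_hat, z)
print(np.allclose(s1 + s2, z))  # True: the estimates sum back to the mixture
```

Computing the second estimate as `(1 - mask) * z` makes the two outputs reproduce the mixture up to floating-point rounding, which is the reconstruction constraint. In the actual system the mask is a layer inside the network, so its gradients flow back into the recurrent layers during joint training.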
Our approaches achieve a 2.30–4.98 dB SDR gain over NMF models in the speech separation task, a 2.30–2.48 dB GNSDR gain and a 4.32–5.42 dB GSIR gain over previous models in the singing voice separation task, and outperform NMF and DNN baselines in the speech denoising task.
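GNSDR and GSIR are clip-level scores aggregated over a whole test set. A minimal sketch, assuming the usual MIR-1K convention as we understand it: NSDR is the SDR of the separated source minus the SDR of the unprocessed mixture (both against the same reference), and the "global" scores are averages weighted by clip length. Per-clip SDR and SIR are assumed to come from a BSS Eval implementation elsewhere; the helper names and the numbers below are hypothetical toy values, not results from the paper.

```python
import numpy as np


def nsdr(sdr_estimate, sdr_mixture):
    """Normalized SDR: SDR of the separated source minus SDR of the unprocessed mixture."""
    return np.asarray(sdr_estimate, dtype=float) - np.asarray(sdr_mixture, dtype=float)


def length_weighted_mean(values, lengths):
    """Average per-clip scores, weighting each clip by its length (e.g. in samples)."""
    values = np.asarray(values, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    return float(np.sum(lengths * values) / np.sum(lengths))


def gnsdr(sdr_estimates, sdr_mixtures, lengths):
    """Global NSDR: length-weighted mean of per-clip NSDR."""
    return length_weighted_mean(nsdr(sdr_estimates, sdr_mixtures), lengths)


# Toy per-clip values (made up for illustration, not results from the paper).
sdr_est = [10.0, 6.0]   # SDR of the separated voice for two clips, in dB
sdr_mix = [4.0, 2.0]    # SDR of the raw mixture against the same reference, in dB
lengths = [3.0, 1.0]    # clip lengths
print(gnsdr(sdr_est, sdr_mix, lengths))  # 5.5
# GSIR is the same weighted mean applied directly to per-clip SIR values.
```

Weighting by length keeps short clips from dominating the average; a "GNSDR gain" is then simply the difference between the GNSDR of two systems on the same test set.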




Demo

Speech separation

TIMIT (speech separation) results

TIMIT mixture

TIMIT ground truth female

TIMIT ground truth male

TIMIT separated female

TIMIT separated male


TSP (speech separation) results

TSP mixture

TSP ground truth female

TSP ground truth male

TSP separated female

TSP separated male


Singing voice separation

MIR-1K fdps_4_03 mixture

MIR-1K ground truth singing

MIR-1K ground truth music

MIR-1K separated singing voice

MIR-1K separated music


Speech denoising

TIMIT noisy speech

TIMIT ground truth speech

TIMIT ground truth noise

TIMIT separated speech

TIMIT separated noise


ICASSP 2014 Video Demo



Reference
  1. Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, Paris Smaragdis
    Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation
    IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 12, pp. 2136–2147, Dec. 2015
  2. Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, Paris Smaragdis
    Singing-Voice Separation from Monaural Recordings using Deep Recurrent Neural Networks
    Proc. of the International Society for Music Information Retrieval Conference (ISMIR), 2014
  3. Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, Paris Smaragdis
    Deep Learning for Monaural Speech Separation
    Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014
    [Starkey Signal Processing Research Student Grant]