Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention

Abstract

This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features; we extract the speaker representation used for adaptation directly from the test utterance. Conventional studies of deep neural network (DNN)-based speech enhancement mainly focus on building a speaker-independent model. Meanwhile, in speech applications such as speech recognition and synthesis, adapting the model to the target speaker is known to improve accuracy. Our research question is whether a DNN for speech enhancement can be adapted to unknown speakers without any auxiliary guidance signal in the test phase. To achieve this, we adopt multi-task learning of speech enhancement and speaker identification, and use the bottleneck feature of the speaker identification branch as the auxiliary feature. In addition, we use multi-head self-attention to capture long-term dependencies in the speech and noise. Experimental results on a public dataset show that our strategy achieves state-of-the-art performance and also outperforms conventional methods in terms of subjective quality.
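To make the self-adaptation scheme more concrete, the following is a minimal PyTorch sketch: an auxiliary speaker-identification branch produces a bottleneck embedding from the noisy test utterance itself, which then conditions a multi-head self-attention mask estimator. All layer sizes, module names, and the mask-based output here are illustrative assumptions, not the exact architecture of [4].

```python
# Minimal sketch of the self-adaptation idea described above.
# Layer sizes, module names, and the mask-based output are illustrative
# assumptions, not the exact architecture of Koizumi et al. [4].
import torch
import torch.nn as nn


class SpeakerBottleneck(nn.Module):
    """Auxiliary speaker-ID branch; its bottleneck is the speaker-aware feature."""

    def __init__(self, n_freq=257, bottleneck=128, n_speakers=100):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_freq, 256), nn.ReLU(), nn.Linear(256, bottleneck)
        )
        # Classifier head used only for the speaker-ID loss during training.
        self.classifier = nn.Linear(bottleneck, n_speakers)

    def forward(self, spec):              # spec: (batch, time, n_freq)
        frame_emb = self.encoder(spec)    # per-frame bottleneck features
        utt_emb = frame_emb.mean(dim=1)   # utterance-level speaker embedding
        return utt_emb, self.classifier(utt_emb)


class SelfAdaptiveEnhancer(nn.Module):
    """Mask estimator conditioned on the speaker embedding of the same utterance."""

    def __init__(self, n_freq=257, bottleneck=128, d_model=256, n_heads=4):
        super().__init__()
        self.speaker = SpeakerBottleneck(n_freq, bottleneck)
        self.proj = nn.Linear(n_freq + bottleneck, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(d_model, n_freq), nn.Sigmoid())

    def forward(self, noisy_spec):                      # (batch, time, n_freq)
        spk, spk_logits = self.speaker(noisy_spec)      # self-adaptation: no enrollment data
        spk = spk.unsqueeze(1).expand(-1, noisy_spec.size(1), -1)
        h = self.proj(torch.cat([noisy_spec, spk], dim=-1))
        h, _ = self.attn(h, h, h)                       # long-term dependencies in speech/noise
        mask = self.mask(h)                             # time-frequency mask
        return mask * noisy_spec, spk_logits            # enhanced spectrogram + ID logits


# Multi-task training would combine an enhancement loss on the masked output
# with a speaker-ID loss on spk_logits; at test time only the enhanced
# spectrogram is used, so no speaker label is needed for unknown speakers.
```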

Audio examples

This section provides the audio samples used in the subjective evaluation of this paper. Training and testing were conducted on the dataset published by Valentini-Botinhao et al. [1]. The proposed method was compared with two conventional methods, SEGAN [2] and Deep Feature Loss [3], because speech samples for both methods are openly available on the web-page [Link]. All samples can be downloaded from [Link].

References

[1] C. Valentini-Botinhao, et al., "Investigating RNN-based Speech Enhancement Methods for Noise-Robust Text-to-Speech," Proc. of 9th ISCA Speech Synth. Workshop (SSW), 2016.

[2] S. Pascual, et al., "SEGAN: Speech Enhancement Generative Adversarial Network," Proc. of Interspeech, 2017.

[3] F. G. Germain, et al., "Speech Denoising with Deep Feature Losses," Proc. of Interspeech, 2019.

[4] Y. Koizumi, et al., "Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention," Proc. of ICASSP, 2020.