Deep Griffin-Lim iteration: Trainable iterative phase reconstruction using neural network

Abstract

In this paper, we propose a phase reconstruction framework named Deep Griffin--Lim Iteration (DeGLI). Phase reconstruction is a fundamental technique for improving the quality of a sound signal obtained through processing in the time-frequency domain. Recent methods based on deep neural networks (DNNs) have been shown to outperform conventional iterative phase reconstruction methods such as the Griffin--Lim algorithm (GLA). However, the computational cost of DNN-based methods is not adjustable at inference time, which may limit their range of applications. To address this problem, we combine the iterative structure of GLA with a DNN so that the computational cost becomes adjustable by changing the number of iterations of the proposed DNN-based component. We also propose a training method that is independent of the number of iterations used at inference, which minimizes the training cost. This training method, named sub-block training by denoising (SBTD), avoids recursive use of the DNN and enables training of DeGLI with a single sub-block (corresponding to one GLA iteration). Furthermore, we propose a complex DNN based on complex convolution layers with gated mechanisms and investigate its performance within the proposed framework. Through several experiments, we found that DeGLI significantly improved both objective and subjective measures over GLA by incorporating the DNN, and its sound quality was comparable to that of neural vocoders.
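The structure described above can be sketched as follows. One DeGLI sub-block applies the two GLA projections (amplitude substitution and STFT consistency) and then subtracts a DNN-estimated residual. This is an illustrative NumPy/SciPy sketch, not the authors' TensorFlow implementation; the window/shift sample counts assume LJ Speech's 22.05 kHz sampling rate, and `dnn` is a placeholder for the trained denoiser.

```python
import numpy as np
from scipy.signal import stft, istft

# Hypothetical sample counts for a 46.4 ms window / 11.6 ms shift at 22.05 kHz
NPERSEG, NOVERLAP = 1024, 768

def P_A(spec, amplitude):
    """Amplitude projection: replace the magnitude with the given one, keep the phase."""
    return amplitude * np.exp(1j * np.angle(spec))

def P_C(spec):
    """Consistency projection: iSTFT followed by STFT."""
    _, x = istft(spec, window='hann', nperseg=NPERSEG, noverlap=NOVERLAP)
    _, _, z = stft(x, window='hann', nperseg=NPERSEG, noverlap=NOVERLAP)
    return z

def degli_subblock(x, amplitude, dnn):
    """One DeGLI sub-block: a GLA step followed by DNN-based residual denoising."""
    y = P_A(x, amplitude)
    z = P_C(y)
    return z - dnn(x, y, z)  # the DNN estimates the residual noise in z
```

With `dnn` returning zeros, repeated application of `degli_subblock` reduces exactly to GLA; the trainable component only refines each GLA step, which is why the number of iterations stays adjustable at inference time.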

Audio samples

Experimental condition

  • Signals were reconstructed from linear or mel spectrograms of utterances in the LJ Speech dataset [1].

  • STFT was implemented with a 46.4 ms Hann window and an 11.6 ms shift.
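For reference, assuming LJ Speech's 22.05 kHz sampling rate (an assumption; the page does not state it), these window settings correspond to near power-of-two sample counts:

```python
fs = 22050                        # LJ Speech sampling rate (assumed)
win_ms, hop_ms = 46.4, 11.6       # window length and shift from the条件 above
win = round(win_ms * 1e-3 * fs)   # 1023 samples, i.e. a 1024-sample window in practice
hop = round(hop_ms * 1e-3 * fs)   # 256 samples
```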


Systems used for comparison

  • Griffin-Lim algorithm* [2]

  • Open source WaveNet* [3]

  • Official WaveGlow* [4]

  • Proposed method (DeGLI w/ complex DNN) [5]

*Audio samples were taken from the public folder of WaveGlow.

Code

  • Our TensorFlow implementation is available [Here] .

References

[1] Keith Ito, "The LJ Speech Dataset." [Link]

[2] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 2, pp. 236--243, Apr. 1984. [Link]

[3] R. Yamamoto, “WaveNet vocoder.” [Link]

[4] R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flow-based generative network for speech synthesis,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), May 2019, pp. 3617--3621. [Link]

[5] Y. Masuyama, K. Yatabe, Y. Koizumi, Y. Oikawa, and N. Harada, "Deep Griffin-Lim iteration: Trainable iterative phase reconstruction using neural network," IEEE J. Sel. Top. Signal Process., (accepted)