Phase reconstruction based on recurrent phase unwrapping with deep neural networks

Abstract

Phase reconstruction, which estimates phase from a given amplitude spectrogram, is an active research field in acoustical signal processing with many applications, including audio synthesis. To take advantage of the rich knowledge available from data, several studies have presented deep neural network (DNN)-based phase reconstruction methods. However, training a DNN for phase reconstruction is not easy because phase is periodic and sensitive to the time shift of a waveform. To overcome this problem, we propose a DNN-based two-stage phase reconstruction method. In the proposed method, phase derivatives are estimated by DNNs instead of phase itself, which allows us to avoid the sensitivity problem. Then, phase is recursively estimated from its derivatives, a procedure named recurrent phase unwrapping (RPU). Experimental results confirm that the proposed method outperforms direct phase estimation by a DNN.
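
To illustrate the second stage, the sketch below reconstructs phase frame by frame from estimated derivatives in a least-squares fashion, in the spirit of RPU. The function name, the weighting, and the sign conventions are illustrative assumptions, not the exact formulation; see [5] for the precise method.

```python
import numpy as np

def rpu_sketch(inst_freq, group_delay, weight=1.0):
    """Hypothetical sketch of recurrent phase unwrapping (RPU).

    inst_freq[k, m]   : estimated time-direction phase difference
                        (instantaneous frequency) at bin k, frame m
    group_delay[k, m] : estimated frequency-direction phase derivative
                        at bin k, frame m (sign convention assumed)
    """
    K, M = inst_freq.shape
    phase = np.zeros((K, M))
    # Frequency-difference operator: (D @ phi)[k] = phi[k + 1] - phi[k]
    D = (np.eye(K, k=1) - np.eye(K))[:-1]            # shape (K - 1, K)
    A = np.vstack([np.eye(K), np.sqrt(weight) * D])  # stacked LS system
    for m in range(1, M):
        # Each frame's phase balances two constraints in a least-squares
        # sense: advance the previous frame's phase by the instantaneous
        # frequency, and match the within-frame phase differences
        # implied by the group delay.
        b = np.concatenate([
            phase[:, m - 1] + inst_freq[:, m],
            -np.sqrt(weight) * group_delay[:-1, m],
        ])
        phase[:, m] = np.linalg.lstsq(A, b, rcond=None)[0]
    return phase
```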

Example audio files

  • Reconstructed signals from amplitude spectrograms of clean utterances in the JSUT corpus [1] are compared.
  • The Griffin-Lim algorithm (GLA) [2] was applied for 10 or 100 iterations (a sketch follows this list).
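
GLA is available in librosa, so a baseline of this kind can be produced with a few lines. The file name and STFT parameters below are placeholders, not the exact setup used for these examples.

```python
import numpy as np
import librosa
import soundfile as sf

# Hypothetical file name and STFT parameters.
y, sr = librosa.load("utterance.wav", sr=None)
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# GLA alternates iSTFT and STFT, keeping the given amplitude and
# updating only the phase at each iteration.
y_gla = librosa.griffinlim(S, n_iter=100, n_fft=1024, hop_length=256)
sf.write("utterance_gla100.wav", y_gla, sr)
```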

Methods

  • Amplitude (zero phase): Apply the iSTFT to the given amplitude spectrogram with all-zero phase (see the sketch after this list).
  • Direct phase estimation: Reconstruct phase directly with the von Mises DNN [3].
  • Instantaneous freq. integration: Reconstruct phase by integrating the instantaneous frequency estimated by a DNN, as in [4] (see the sketch after this list).
  • Proposed method: Reconstruct phase from the DNN-estimated instantaneous frequency and group delay by recurrent phase unwrapping (RPU) [5] (sketched after the abstract).
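
Two of these baselines admit short sketches. Both assume a magnitude spectrogram S from librosa's STFT; the parameter values and the initial-phase handling are illustrative assumptions.

```python
import numpy as np
import librosa

def zero_phase_istft(S, hop_length=256):
    # "Amplitude (zero phase)": treat the magnitude as a complex
    # spectrogram whose phase is all zeros, then invert it.
    return librosa.istft(S.astype(complex), hop_length=hop_length)

def phase_from_inst_freq(inst_freq, init_phase=None):
    # Instantaneous-frequency integration in the spirit of [4]:
    # accumulate the estimated per-frame phase increments over time.
    K, M = inst_freq.shape
    phi0 = np.zeros(K) if init_phase is None else init_phase
    return phi0[:, None] + np.cumsum(inst_freq, axis=1)
```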

[Audio players (two example sets): Original, Direct phase estimation [3], Instantaneous freq. integration [4]]

References

[1] R. Sonobe and S. Takamichi, “JSUT corpus: Free large-scale Japanese speech corpus for end-to-end speech synthesis,” arXiv:1711.00354, 2017.

[2] D. Griffin and J. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 2, pp. 236--243, Apr. 1984.

[3] S. Takamichi, Y. Saito, N. Takamune, D. Kitamura, and H. Saruwatari, “Phase reconstruction from amplitude spectrograms based on von-Mises-distribution deep neural network,” in Int. Workshop Acoust. Signal Enhance. (IWAENC), Sept. 2018, pp. 286--290.

[4] J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts, “GANSynth: Adversarial neural audio synthesis,” in Int. Conf. Learn. Represent. (ICLR), 2019.

[5] Y. Masuyama, K. Yatabe, Y. Koizumi, Y. Oikawa, and N. Harada, “Phase reconstruction based on recurrent phase unwrapping with deep neural networks,” submitted to ICASSP 2020.