RESEARCH

Acoustic & Speech

We investigate various speech signal processing schemes for acoustic modeling so that more robust speech recognition can be achieved. Our aim is to perform state-of-the-art research providing effective means for achieving:

Voice Conversion

Admin 2021-04-08

1. Introduction

- This research aims to convert the voice style of a source speaker into that of another speaker while preserving the linguistic content. Voice conversion can be used in various applications (e.g., AI avatars, text-to-speech, singing).

2. Relevant algorithms

Fig. 1 Voice Conversion Framework

(1) Feature extraction

The mel spectrogram is an acoustic time-frequency representation of speech. This feature is widely used in the speech field and can be obtained by applying the Short-Time Fourier Transform (STFT) followed by a mel filter bank.
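As a rough illustration of the STFT plus mel filter bank step, the sketch below computes a log-mel spectrogram with NumPy only. It is a minimal example under stated assumptions, not the feature extractor used in this work: the function names, the sampling rate (16 kHz), the FFT size (1024), the hop length (256), the number of mel bands (80), and the HTK-style mel formula are all choices made for illustration.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (an assumed convention; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(sr, n_fft, n_mels):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    # n_mels + 2 edge points, equally spaced on the mel scale.
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        left, center, right = hz_points[i:i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def mel_spectrogram(wave, sr=16000, n_fft=1024, hop=256, n_mels=80):
    """Log-mel spectrogram of a 1-D waveform, shape (n_mels, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    # Short-time Fourier transform: windowed frames, then a real FFT per frame.
    frames = np.stack([wave[t * hop:t * hop + n_fft] * window for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Project the linear-frequency power spectrum onto the mel bands.
    mel = power @ mel_filter_bank(sr, n_fft, n_mels).T
    return np.log(mel + 1e-10).T
```

Calling `mel_spectrogram` on one second of 16 kHz audio with these defaults yields an array with 80 rows (mel bands) and one column per analysis frame.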

(2) Encoder

To improve voice conversion performance, it is important to disentangle content from style [1]. The content encoder extracts the linguistic content of the source speech. The style encoder extracts the style of the reference speech, which distinguishes it from other target speakers.

(3) Decoder

The decoder uses adaptive instance normalization (AdaIN) [2] to apply the style to the content representation. In this way, the conversion changes only the style while preserving the linguistic content.
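To make the AdaIN operation concrete, here is a minimal NumPy sketch. It normalizes each channel of a content feature map over time and then rescales it with per-channel style parameters. The function name `adain`, the (channels, frames) layout, and the `gamma`/`beta` arguments are assumptions for illustration; in a trained model these scale and shift values would typically be predicted from the style encoder output rather than supplied by hand.

```python
import numpy as np

def adain(content, gamma, beta, eps=1e-5):
    """Adaptive instance normalization over the time axis.

    content: content features, shape (channels, frames)
    gamma:   per-channel style scale, shape (channels,)  (assumed input)
    beta:    per-channel style shift, shape (channels,)  (assumed input)
    """
    # Remove the channel-wise statistics of the content features.
    mu = content.mean(axis=1, keepdims=True)
    sigma = content.std(axis=1, keepdims=True)
    normalized = (content - mu) / (sigma + eps)
    # Re-apply statistics that come from the style.
    return gamma[:, None] * normalized + beta[:, None]
```

After this operation each channel has approximately mean `beta` and standard deviation `gamma`, while the temporal pattern within each channel, which carries the content, is kept.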

(4) Vocoder

A vocoder then generates an audible speech waveform from the converted mel spectrogram [3].
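The data flow through the four stages can be summarized in a few lines of Python. This is only a wiring sketch: `convert_voice` and its four callable arguments are hypothetical names, and no real encoder, decoder, or vocoder model is implied or provided here.

```python
from typing import Callable

import numpy as np

def convert_voice(
    source_mel: np.ndarray,
    reference_mel: np.ndarray,
    content_encoder: Callable,  # hypothetical: mel -> content features
    style_encoder: Callable,    # hypothetical: mel -> style representation
    decoder: Callable,          # hypothetical: (content, style) -> converted mel
    vocoder: Callable,          # hypothetical: mel -> waveform
) -> np.ndarray:
    # Linguistic content comes from the source speech only.
    content = content_encoder(source_mel)
    # Style comes from the reference speech only.
    style = style_encoder(reference_mel)
    # The decoder combines both into a converted mel spectrogram.
    converted_mel = decoder(content, style)
    # The vocoder turns the converted mel spectrogram into a waveform.
    return vocoder(converted_mel)
```

Keeping the stages as separate callables mirrors the framework description: the source speech only reaches the content encoder and the reference speech only reaches the style encoder.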

3. Implementation

1. Record the source speech (Record -> Stop)

2. Select the reference speech.

3. Run the voice conversion.

References

[1] Yinghao Aaron Li, Ali Zare, and Nima Mesgarani, “StarGANv2-VC: A diverse, unsupervised, non-parallel framework for natural-sounding voice conversion,” in Proc. Interspeech 2021, 2021.

[2] Ju-chieh Chou, Cheng-chieh Yeh, and Hung-yi Lee, “One-shot voice conversion by separating speaker and content representations with instance normalization,” in Proc. Interspeech 2019, 2019, pp. 664–668.

[3] Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6199–6203.
