Blizzard Challenge 2015

Abstract

The Blizzard Challenge is an international workshop that compares text-to-speech systems on naturalness, intelligibility, and speaker similarity. In 2015, text and speech waveforms for 6 Indian languages were distributed to participants. We built text-to-speech synthesizers from this data and used them to synthesize speech.

Our system

Our system is basically HMM-based speech synthesis, with two additions:

Trajectory smoothing

Generation considering modulation spectra

The toolkits we used are:

HTS ... acoustic model (HMM) training

Festival ... text analysis

STRAIGHT ... spectrum and aperiodicity extraction

WORLD ... F0 extraction

The context extractor is available on GitHub. Everything except generation considering modulation spectra can be built with open-source software.

Key points

The modulation spectrum (MS) is the power spectrum of a temporal parameter sequence (e.g., a mel-cepstral coefficient sequence) [Takamichi, ICASSP 2014]. Here,

Lower modulation frequency components ... the part of the sequence that changes slowly over time

Higher modulation frequency components ... the part of the sequence that fluctuates rapidly over time
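As a rough illustration of the definition above, the MS of a parameter sequence can be sketched with NumPy as the squared magnitude of the FFT taken along the time axis. This is a minimal sketch, not the code used in our system: the function name `modulation_spectrum`, the `n_fft` parameter, and the toy sine trajectories are assumptions made for this example, and no log scaling or normalization is applied.

```python
import numpy as np

def modulation_spectrum(params, n_fft=None):
    """Power spectrum of each parameter trajectory along the time axis.

    params: array of shape (T, D), e.g. T frames of D mel-cepstral coefficients.
    Returns an array of shape (n_fft // 2 + 1, D).
    Hypothetical helper for illustration only.
    """
    params = np.asarray(params, dtype=float)
    if n_fft is None:
        n_fft = params.shape[0]
    spectrum = np.fft.rfft(params, n=n_fft, axis=0)
    return np.abs(spectrum) ** 2

# Toy one-dimensional trajectories, 100 frames long (assumed data).
t = np.arange(100)
slow = np.sin(2 * np.pi * 2 * t / 100)[:, None]    # 2 cycles over the sequence
fast = np.sin(2 * np.pi * 30 * t / 100)[:, None]   # 30 cycles over the sequence

ms_slow = modulation_spectrum(slow)
ms_fast = modulation_spectrum(fast)

# Index of the strongest modulation frequency bin in each case.
slow_peak = int(np.argmax(ms_slow[:, 0]))
fast_peak = int(np.argmax(ms_fast[:, 0]))
print(slow_peak, fast_peak)
```

The slowly varying trajectory concentrates its energy in a low modulation frequency bin, and the rapidly fluctuating one in a high bin.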

Three points matter when introducing the MS into HMM-based speech synthesis:

1. HMMs are poorly suited to modeling rapidly fluctuating sequences.

2. Higher modulation frequency components have little effect on perceived speech quality [Takamichi, ICASSP 2015].

3. Bringing the MS of synthetic speech closer to that of natural speech improves speech quality [Takamichi, ICASSP 2014].

Trajectory smoothing and generation considering the MS build on these three points:

Trajectory smoothing ... removes the higher modulation frequency components of the training data with a low-pass filter

Generation considering the MS ... generates synthetic speech parameters so that their lower modulation frequency components are close to the MS of natural speech
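The idea behind trajectory smoothing can be sketched as a low-pass filter applied to each parameter trajectory along time. The example below is only an illustration under stated assumptions: it uses an ideal FFT-domain filter, and the function name `smooth_trajectory`, the 200 Hz frame rate, and the 50 Hz cutoff are hypothetical values chosen for the example, not the settings of our system.

```python
import numpy as np

def smooth_trajectory(params, cutoff_hz, frame_rate_hz):
    """Remove modulation frequency components above cutoff_hz.

    params: array of shape (T, D), one row per frame.
    Uses an ideal (brick-wall) low-pass filter in the FFT domain.
    Hypothetical helper; the filter design is an assumption.
    """
    params = np.asarray(params, dtype=float)
    n_frames = params.shape[0]
    spectrum = np.fft.rfft(params, axis=0)
    mod_freqs = np.fft.rfftfreq(n_frames, d=1.0 / frame_rate_hz)
    spectrum[mod_freqs > cutoff_hz] = 0.0   # zero the higher modulation frequencies
    return np.fft.irfft(spectrum, n=n_frames, axis=0)

# Toy trajectory: a slow 3 Hz component plus a fast 70 Hz fluctuation (assumed data).
frame_rate_hz = 200.0                       # assumed 5 ms frame shift
t = np.arange(200) / frame_rate_hz          # 1 second of frames
slow = np.sin(2 * np.pi * 3 * t)
fast = 0.5 * np.sin(2 * np.pi * 70 * t)
noisy = (slow + fast)[:, None]

smoothed = smooth_trajectory(noisy, cutoff_hz=50.0, frame_rate_hz=frame_rate_hz)

# After smoothing, only the slow component should remain.
max_error = float(np.max(np.abs(smoothed[:, 0] - slow)))
```

In practice a filter with a smoother roll-off may be preferable to the brick-wall filter used here, which was chosen only to keep the sketch short.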

Results

Here we show the results for one language only, Marathi. The remaining results can be found in the paper and slides. Our team (NAIST) is system "J". Synthetic speech was evaluated with 1) a 5-point MOS test on naturalness, 2) WER for intelligibility, and 3) a 5-point DMOS test on similarity to the original speaker.
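For readers unfamiliar with WER (word error rate), it is commonly computed as the word-level edit distance between a reference transcript and what a listener typed, divided by the number of reference words. The following is a minimal sketch of that common definition, not the scoring script used in the challenge; the function name `word_error_rate` and the example sentences are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; real evaluations may normalize
    text (case, punctuation, spelling variants) before scoring.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") against 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

A lower WER means listeners transcribed the speech more accurately, so lower is better for intelligibility.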

Naturalness

Our system achieved the best naturalness score among the synthetic speech systems. It also took first place in 3 of the 6 languages.

Intelligibility

In Marathi, our system achieved the best score of all the speech evaluated, outperforming even natural speech. However, intelligibility results varied considerably from language to language.

Similarity

Our system ranked in the middle for all languages. This is an issue we need to address.