Tomoki Toda

Research

Tomoki Toda is interested in speech, music, and sound information processing. His research interests include statistical approaches to speech processing such as voice conversion, speech synthesis, speech analysis, speech recognition, and spoken dialogue, music processing such as music source separation and music signal generation, and sound information processing such as polyphonic sound event detection and sound event sympolization.

Selected papers

- Speech analysis
  - A. Miyashita, T. Toda. Differentiable representation of warping based on Lie group theory. Proc. IEEE WASPAA, 5 pages, Oct. 2023. [Paper]
  - A. Miyashita, T. Toda. Representation of vocal tract length transformation based on group theory. Proc. IEEE ICASSP, 5 pages, June 2023. [Paper]
  - T. Toda, K. Tokuda. Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM. Proc. IEEE ICASSP, pp. 3925-3928, Apr. 2008. [Paper]
- Speech synthesis
  - R. Yoneyama, Y.-C. Wu, T. Toda. Source-Filter HiFiGAN: fast and pitch controllable high-fidelity neural vocoder. Proc. IEEE ICASSP, 5 pages, June 2023. [Paper]
  - R. Yoneyama, Y.-C. Wu, T. Toda. High-fidelity and pitch-controllable neural vocoder based on unified source-filter networks. IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 31, pp. 3717-3729, Oct. 2023. [Paper]
  - Y.-C. Wu, T. Hayashi, T. Okamoto, H. Kawai, T. Toda. Quasi-periodic parallel WaveGAN: a non-autoregressive raw waveform generative model with pitch-dependent dilated convolution neural network. IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 29, pp. 792-806, Feb. 2021. [Paper]
- Voice conversion
  - C. Xie, T. Toda. Noisy-to-noisy voice conversion under variations of noisy condition. IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 31, pp. 3871-3882, Oct. 2023. [Paper]
  - W.-C. Huang, S.-W. Yang, T. Hayashi, T. Toda. A comparative study of self-supervised speech representation based voice conversion. IEEE Journal of Selected Topics in signal Processing, Vol. 16, No. 6, pp. 1308-1318, Oct. 2022. [Preprint]
  - W.-C. Huang, T. Hayashi, Y.-C. Wu, H. Kameoka, T. Toda. Pretraining techniques for sequence-to-sequence voice conversion. IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 29, pp. 745-755, Feb. 2021. [Paper]
  - T. Toda, L.-H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, J. Yamagishi. The Voice Conversion Challenge 2016. Proc. INTERSPEECH, pp. 1632-1636, Sep. 2016. [Paper]
  - T. Toda, A.W. Black, K. Tokuda. Voice conversion based on maximum likelihood estimation of spectral parameter trajectory. IEEE Transactions on Audio, Speech and Language Processing, Vol. 15, No. 8, pp. 2222-2235, Nov. 2007. [Paper]
- Text-to-speech
  - T. Hayashi, R. Yamamoto, K. Inoue, T. Yoshimura, S. Watanabe, T. Toda, K. Takeda, Y. Zhang, X. Tan. ESPNET-TTS: Uunified, reproducible, and integratable open source end-to-end text-to-speech toolkit. Proc. IEEE ICASSP, pp. 7654-7658, May 2020. [Paper]
  - S. Takamichi, T. Toda, A.W. Black, G. Neubig, S. Sakti, S. Nakamura. Post-filters to modify the modulation spectrum for statistical parametric speech synthesis. IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 24, No. 4, pp. 755-767, Apr. 2016. [Paper]
  - K. Tokuda, Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, K. Oura. Speech synthesis based on hidden Markov models. Proceedings of the IEEE, Vol. 101, No. 5, pp. 1234-1252, May 2013. [Link]
  - T. Toda, K. Tokuda. A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Transactions, Vol. E90-D, No. 5, pp. 816-824, May 2007. [Link]
  - H. Kawai, T. Toda, J. Ni, M. Tsuzaki, K. Tokuda. XIMERA: a new TTS from ATR based on corpus-based technologies. Proc. 5th ISCA Speech Synthesis Workshop (SSW5), pp. 179-184, Pittsburgh, USA, June 2004. [Paper]
- Speech recognition
  - J. He, Z. Yang, T. Toda. ED-CEC: improving rare word recognition using ASR post-processing based on error detection and context-aware error correction. Proc. IEEE ASRU, 6 pages, Dec. 2023. [Paper]
  - T. Hayashi, S. Watanabe, Y. Zhang, T. Toda, T. Hori, R. Astudillo, K. Takeda. Back-translation-style data augmentation for end-to-end ASR. Proc. IEEE SLT, pp. 426-433, Dec. 2018. [Paper]
- Speech emotion recognition
  - J. He, X. Shi, X. Li, T. Toda. MF-AED-AEC: speech emotion recognition by leveraging multimodal fusion, ASR error detection, and ASR error correction. IEEE ICASSP, pp. 11066-11070, Seoul, Korea, Apr. 2024. [Paper]
  - X. Shi, X. Li, T. Toda. Emotion awareness in multi-utterance turn for improving emotion prediction in multi-speaker conversation. Proc. INTERSPEECH, pp. 765-769, Aug. 2023. [Paper]
  - A. Ando, T. Mori, S. Kobashikawa, T. Toda. Speech emotion recognition based on listener-dependent emotion perception models. APSIPA Transactions on Signal and Information Processing, Vol. 10, e6, pp. 1-11, Apr. 2021. [Paper]
- Spoken language processing
  - Y. Yasuda, T. Toda. Investigation of Japanese Png BERT language model in text-to-speech synthesis for pitch accent language. IEEE Journal of Selected Topics in signal Processing, Vol. 16, No. 6, pp. 1319-1328, Oct. 2022. [Paper]
  - T. Hayashi, S. Watanabe, T. Toda, K. Takeda, S. Toshniwal, K. Livescu. Pre-trained text embeddings for enhanced text-to-speech synthesis. Proc. INTERSPEECH, pp. 4430-4434, Sep. 2019. [Paper]
- Spoken dialog
  - T. Hiraoka, G. Neubig, S. Sakti, T. Toda, S. Nakamura. Learning cooperative persuasive dialogue policies using framing. Speech Communication, Vol. 84, pp. 83-96, Nov. 2016. [Link]
- Speech translation
  - Q. Truong Do, T. Toda, G. Neubig, S. Sakti, S. Nakamura. Preserving word-level emphasis in speech-to-speech translation. IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 25, No. 3, pp. 544-556, Mar. 2017. [Link]
  - Y. Oda, G. Neubig, S. Sakti, T. Toda, S. Nakamura. Optimizing segmentation strategies for simultaneous speech translation. Proc. ACL, pp. 551-556, June 2014. [Paper]
- Speaking-aid
  - L.P. Violeta, D. Ma, W.-C. Huang, T. Toda. Pretraining and adaptation techniques for electrolaryngeal speech recognition. IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 32, pp. 2777-2789, May 2024. [Paper]
  - D. Ma, L.P. Violeta, K. Kobayashi, T. Toda. Two-stage training method for Japanese electrolaryngeal speech enhancement based on sequence-to-sequence voice conversion. Proc. IEEE SLT, pp. 949-954, Jan. 2023. [Paper]
  - K. Kobayashi, T. Toda. Implementation of low-latency electrolaryngeal speech enhancement based on multi-task CLDNN. Proc. EUSIPCO, pp. 396-400, Aug. 2020. [Paper]
  - K. Morikawa, T. Toda. Electrolaryngeal speech modification towards singing aid system for laryngectomees. Proc. APSIPA, 4 pages, Dec. 2017. [Paper]
  - H. Doi, T. Toda, K. Nakamura, H. Saruwatari, K. Shikano. Alaryngeal speech enhancement based on one-to-many eigenvoice conversion. IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 22, No. 1, pp. 172-183, Jan. 2014. [Paper]
  - K. Nakamura, T. Toda, H. Saruwatari, K. Shikano. Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech. Speech Communication, Vol. 54, No. 1, pp. 134-146, Jan. 2012. [Link]
- Body conducted speech processing
  - Y. Tajiri, T. Toda. Nonaudible murmur enhancement based on statistical voice conversion and noise suppression with external noise monitoring. Proc. 9th ISCA Speech Synthesis Workshop (SSW9), pp. 54-60, Sep. 2016. [Paper]
  - T. Toda, M. Nakagiri, K. Shikano. Statistical voice conversion techniques for body-conducted unvoiced speech enhancement. IEEE Transactions on Audio, Speech and Language Processing, Vol. 20, No. 9, pp. 2505-2517, Sep. 2012. [Paper]
  - T. Toda, K. Nakamura, T. Nagai, T. Kaino, Y. Nakajima, K. Shikano. Technologies for processing body-conducted speech detected with non-audible murmur microphone. Proc. INTERSPEECH, pp. 632-635, Sep. 2009. [Paper]
- Articulatory-acoustic mapping
  - P.L. Tobing, K. Kobayashi, T. Toda. Articulatory controllable speech modification based on statistical inversion and production mappings. IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 25, No. 12, pp. 2337-2350, Dec. 2017. [Paper]
  - T. Toda, A.W. Black, K. Tokuda. Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model. Speech Communication, Vol. 50, No. 3, pp. 215-227, Mar. 2008. [Paper]
- Speech quality assessment
  - Y. Yasuda, T. Toda. Analysis of mean opinion scores in subjective evaluation of synthetic speech based on tail probabilities. Proc. INTERSPEECH, pp. 5491-5495, Aug. 2023. [Paper]
  - W.-C. Huang, E. Cooper, Y. Tsao, H.-M. Wang, T. Toda, J. Yamagishi. The VoiceMOS Challenge 2022. Proc. INTERSPEECH, pp. 4536-4540, Sep. 2022. [Paper]
  - W.-C. Huang, E. Cooper, J. Yamagishi, T. Toda. LDNet: unified listener dependent modeling in MOS prediction for synthetic speech. Proc. IEEE ICASSP, pp. 896-900, May 2022. [Paper]
  - E. Cooper, W.-C. Huang, T. Toda, J. Yamagishi. Generalization ability of MOS prediction networks. Proc. IEEE ICASSP, pp. 8442-8446, May 2022. [Paper]
- Spoofing detection
  - X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch, N. Evans, M. Sahidullah, V. Vestman, T. Kinnunen, K.A. Lee, L. Juvela, P. Alku, Y.-H. Peng, H.-T. Hwang, Y. Tsao, H.-M. Wang, S. Le Maguer, M. Becker, F. Henderson, R. Clark, Y. Zhang, Q. Wang, Y. Jia, K. Onuma, K. Mushika, T. Kaneda, Y. Jiang, L.-J. Liu, Y.-C. Wu, W.-C. Huang, T. Toda, K. Tanaka, H. Kameoka, I. Steiner, D. Matrouf, J.-F. Bonastre, A. Govender, S. Ronanki, J.-X. Zhang, Z.-H. Ling. ASVspoof 2019: a large-scale public database of synthetic, converted and replayed speech. Computer Speech and Language, Vol. 64, Article 101114, 25 pages, Nov. 2020. [Link]
  - T. Kinnunen, J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, Z. Ling. A spoofing benchmark for the 2018 voice conversion challenge: leveraging from spoofing countermeasures for speech artifact assessment. Proc. Odyssey 2018, pp. 187-194, June 2018. [Paper]
  - Z. Wu, P. De Leon, C. Demiroglu, A. Khodabakhsh, S. King, Z.-H. Ling, D. Saito, B. Stewart, T. Toda, M. Wester, J. Yamagishi. Anti-spoofing for text-independent speaker verification: an initial database, comparison of countermeasures, and human performance. IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 24, No. 4, pp. 768-783, Apr. 2016. [Link]
- Singing voice conversion and synthesis
  - W.-C. Huang, L.P. Violeta, S. Liu, J. Shi, T. Toda. The Singing Voice Conversion Challenge 2023. Proc. IEEE ASRU, 8 pages, Dec. 2023. [Preprint]
  - R. Yamamoto, R. Yoneyama, T. Toda. NNSVS: a neural network based singing voice synthesis toolkit. Proc. IEEE ICASSP, 5 pages, June 2023. [Paper]
  - K. Kobayashi, T. Toda, S. Nakamura. Intra-gender statistical singing voice conversion with direct waveform modification using log-spectral differential. Speech Communication, Vol. 99, pp. 211-220, May 2018. [Paper]
  - K. Kobayashi, T. Toda, H. Doi, T. Nakano, M. Goto, G. Neubig, S. Sakti, S. Nakamura. Voice timbre control based on perceived age in singing voice conversion. IEICE Transactions on Information and Systems, Vol. E97-D, No. 6, pp. 1419-1428, June 2014. [Paper]
  - H. Doi, T. Toda, T. Nakano, M. Goto, S. Nakamura. Singing voice conversion method based on many-to-many eigenvoice conversion and training data generation using a singing-to-singing synthesis system. Proc. APSIPA ASC, Nov. 2012. [Paper]
- Automatic music transcription
  - S. Kim, K. Takeda, T. Toda. Sequence-to-sequence network training methods for automatic guitar transcription with tokenized outputs. Proc. ISMIR, pp. 524-531, Nov. 2023. [Paper]
  - S. Kim, T. Hayashi, T. Toda. Note-level automatic guitar transcription using attention mechanism. Proc. EUSIPCO, pp. 229-233, Aug.-Sep. 2022. [Paper]
- Music analysis
  - Y. Hashizume, L. Li, T. Toda. Music similarity calculation of individual instrumental sounds using metric learning. Proc. APSIPA ASC, pp. 33-38, Chiang Mai, Thailand, Nov. 2022. [Paper]
- Music source separation
  - S. Seki, T. Toda, K. Takeda. Stereophonic music separation based on non-negative tensor factorization with cepstral distance regularization. IEICE Transactions on Fundamentals, Vol. E101-A, No. 7, pp. 1057-1064, July 2018. [Link]
- Source separation and target source extraction
  - R. Wang, T. Toda. Directional target speaker extraction under noisy underdetermined conditions through conditional variational autoencoder with global style tokens. Proc. IEEE WASPAA, 5 pages, Oct. 2023. [Paper]
  - S. Seki, H. Kameoka, L. Li, T. Toda, K. Takeda. Underdetermined source separation based on generalized multichannel variational autoencoder. IEEE Access, Vol. 7, No. 1, pp. 168104-168115, Nov. 2019. [Paper]
- Multichannel signal processing
  - S. Luan, Y. Wakabayashi, T. Toda. Sound field interpolation with unsupervised calibration for freely spaced circular microphone array in rotation-robust beamforming Proc. EUSIPCO, pp. 21-25, Sep. 2023. [Paper]
  - S. Luan, Y. Wakabayashi, T. Toda. Modified sound field interpolation method for rotation-robust beamforming with unequally spaced circular microphone array. Proc. EUSIPCO, pp. 344-348, Aug.-Sep. 2022. [Paper]
  - H. Maki, T. Toda, S. Sakti, G. Neubig, S. Nakamura. Enhancing event-related potentials based on maximum a posteriori estimation with a spatial correlation prior. IEICE Transactions on Information and Systems, Vol. E99-D, No. 6, pp. 1410-1419, June 2016. [Paper]
- Sound event detection
  - K. Miyazaki, T. Komatsu, T. Hayashi, S. Watanabe, T. Toda, K. Takeda. Weakly-supervised sound event detection with self-attention. Proc. IEEE ICASSP, pp. 66-70, May 2020. [Paper]
  - T. Hayashi, S. Watanabe, T. Toda, T. Hori, J. Le Roux, K. Takeda. Duration-controlled LSTM for polyphonic sound event detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 25, No. 11, pp. 2059-2070, Nov. 2017. [Paper]
- Audio captioning and sound event descriptions
  - T. Komatsu, Y. Fujita, K. Takeda, T. Toda. Audio difference learning for audio captioning. IEEE ICASSP, Apr. 2024. [Preprint]
  - K. Miyazaki, T. Hayashi, T. Toda, K. Takeda. Connectionist temporal classification-based sound event encoder for converting sound events into onomatopoeia representations. Proc. EUSIPCO, pp. 857-861, Sep. 2018. [Paper]
- Anomalous sound detection
  - I. Kuroyanagi, T. Hayashi, K. Takeda, T. Toda. Improvement of serial approach to anomalous sound detection by incorporating two binary cross-entropies for outlier exposure. Proc. EUSIPCO, pp. 294-298, Aug.-Sep. 2022. [Paper]
  - I. Kuroyanagi, T. Hayashi, K. Takeda, T. Toda. Anomalous sound detection using a binary classification model and class centroids. Proc. EUSIPCO, pp. 1995-1999, Aug. 2021. [Paper]
  - T. Hayashi, T. Komatsu, R. Kondo, T. Toda, K. Takeda. Anomalous sound event detection based on WaveNet. Proc. EUSIPCO, pp. 2508-2512, Sep. 2018. [Paper]

Theses

Graduation Thesis [Itakura Lab. (Acoustic & Speech Processing Group)]
- ``Improvement of STRAIGHT Analysis-Synthesis Method under Noisy Conditions'' (in Japanese)
Master's Thesis [Shikano Lab. (Speech and Acoustics Laboratory)]
- ``High Quality Voice Conversion Based on STRAIGHT Analysis-synthesis Method'' (in Japanese)
Doctoral Thesis [Shikano Lab.]
- ``High-Quality and Flexible Speech Synthesis with Segment Selection and Voice Conversion''

[Tomoki Toda]

Page updated

Google Sites

Report abuse