Wangyou Zhang, Yanmin Qian. Unified Speech Enhancement Technique for Diverse Input Conditions (in Chinese). Journal of Signal Processing, pp. 1–22, 2025. [paper]
Yoshiki Masuyama, Xuankai Chang, Wangyou Zhang, Samuele Cornell, Zhong-Qiu Wang, Nobutaka Ono, Yanmin Qian, Shinji Watanabe. An End-to-End Integration of Speech Separation and Recognition With Self-Supervised Learning Representation. Computer Speech & Language, pp. 101813, 2025. [paper]
Yanmin Qian, Chenda Li, Wangyou Zhang, Shaoxiong Lin. Contextual Understanding With Contextual Embeddings for Multi-Talker Speech Separation and Recognition in a Cocktail Party. npj Acoustics, vol. 1, no. 1, pp. 3, 2025. [paper]
Jee-weon Jung, Yihan Wu, Xin Wang, Ji-Hoon Kim, Soumi Maiti, Yuta Matsunaga, Hye-jin Shim, Jinchuan Tian, Nicholas Evans, Joon Son Chung, Wangyou Zhang, Seyun Um, Shinnosuke Takamichi, Shinji Watanabe. SpoofCeleb: Speech Deepfake Detection and SASV in the Wild. IEEE Open Journal of Signal Processing, vol. 6, pp. 68–77, 2025. [paper]
Xuankai Chang, Shinji Watanabe, Marc Delcroix, Tsubasa Ochiai, Wangyou Zhang, Yanmin Qian. Module-Based End-to-End Distant Speech Processing: A Case Study of Far-Field Automatic Speech Recognition [Special Issue on Model-Based and Data-Driven Audio Signal Processing]. IEEE Signal Processing Magazine, vol. 41, no. 6, pp. 39–50, 2024. [paper]
Wangyou Zhang, Kohei Saijo, Samuele Cornell, Robin Scheibler, Chenda Li, Zhaoheng Ni, Anurag Kumar, Marvin Sach, Wei Wang, Yihui Fu, Shinji Watanabe, Tim Fingscheidt, Yanmin Qian. Lessons Learned from the URGENT 2024 Speech Enhancement Challenge. Proc. Interspeech, pp. 853–857, 2025. [paper] [slides]
Kohei Saijo, Wangyou Zhang, Samuele Cornell, Robin Scheibler, Chenda Li, Zhaoheng Ni, Anurag Kumar, Marvin Sach, Yihui Fu, Wei Wang, Tim Fingscheidt, Shinji Watanabe. Interspeech 2025 URGENT Speech Enhancement Challenge. Proc. Interspeech, pp. 858–862, 2025. [paper]
Jee-weon Jung, Wangyou Zhang, Soumi Maiti, Yihan Wu, Xin Wang, Ji-Hoon Kim, Yuta Matsunaga, Seyun Um, Jinchuan Tian, Hye-jin Shim, Nicholas Evans, Joon Son Chung, Shinnosuke Takamichi, Shinji Watanabe. The Text-to-speech in the Wild (TITW) Database. Proc. Interspeech, pp. 4798–4802, 2025. [paper]
Xun Gong, Anqi Lv, Wangyou Zhang, Zhiming Wang, Huijia Zhu, Yanmin Qian. BR-ASR: Efficient and Scalable Bias Retrieval Framework for Contextual Biasing ASR in Speech LLM. Proc. Interspeech, pp. 4043–4047, 2025. [paper]
Haoxiang Hou, Xun Gong, Wangyou Zhang, Wei Wang, Yanmin Qian. Ranking and Selection of Bias Words for Contextual Bias Speech Recognition. Proc. Interspeech, pp. 5183–5187, 2025. [paper]
Leying Zhang, Wangyou Zhang, Zhengyang Chen, Yanmin Qian. Advanced Zero-Shot Text-to-Speech for Background Removal and Preservation with Controllable Masked Speech Prediction. Proc. ICASSP, 2025. [paper]
Jiatong Shi, Hye-jin Shim, Jinchuan Tian, Siddhant Arora, Haibin Wu, Darius Petermann, Jia Qi Yip, You Zhang, Yuxun Tang, Wangyou Zhang, Dareen Safar Alharthi, Yichen Huang, Koichi Saito, Jionghao Han, Yiwen Zhao, Chris Donahue, Shinji Watanabe. VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations), pp. 191–209, 2025. [paper]
William Chen, Wangyou Zhang, Yifan Peng, Xinjian Li, Jinchuan Tian, Jiatong Shi, Xuankai Chang, Soumi Maiti, Karen Livescu, and Shinji Watanabe. Towards Robust Speech Representation Learning for Thousands of Languages. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024, pp. 10205–10224. (Best Paper Award) [paper]
Xin Zhou, Wangyou Zhang, Chenda Li, and Yanmin Qian. Insights from Hyperparameter Scaling of Online Speech Separation. In 14th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2024, pp. 563–567. [paper]
Wangyou Zhang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Chenda Li, Zhaoheng Ni, Anurag Kumar, Jan Pirklbauer, Marvin Sach, Shinji Watanabe, Tim Fingscheidt, and Yanmin Qian. URGENT Challenge: Universality, Robustness, and Generalizability for Speech Enhancement. In 25th Annual Conference of the International Speech Communication Association (INTERSPEECH), Kos, Greece, 2024, pp. 4868–4872. [paper] [slides]
Wangyou Zhang, Kohei Saijo, Jee-weon Jung, Chenda Li, Shinji Watanabe, and Yanmin Qian. Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement. In 25th Annual Conference of the International Speech Communication Association (INTERSPEECH), Kos, Greece, 2024, pp. 1740–1744. [paper] [poster]
Jee-weon Jung, Wangyou Zhang, Jiatong Shi, Zakaria Aldeneh, Takuya Higuchi, Barry-John Theobald, Ahmed Hussen Abdelaziz, Shinji Watanabe. ESPnet-SPK: Full Pipeline Speaker Embedding Toolkit With Reproducible Recipes, Self-Supervised Front-Ends, and off-the-Shelf Models. In 25th Annual Conference of the International Speech Communication Association (INTERSPEECH), Kos, Greece, 2024, pp. 4278–4282. [paper]
Wangyou Zhang, Jee-weon Jung, Yanmin Qian. Improving Design of Input Condition Invariant Speech Enhancement. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Korea, 2024, pp. 10696-10700. [paper] [poster]
Linfeng Yu, Wangyou Zhang, Chenpeng Du, Leying Zhang, Zheng Liang, Yanmin Qian. Generation-Based Target Speech Extraction with Speech Discretization and Vocoder. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Korea, 2024, pp. 12612-12616. [paper]
Shaoxiong Lin, Wangyou Zhang, Yanmin Qian. Two-Stage Single-Channel Speech Enhancement with Multi-Frame Filtering. Applied Sciences, vol. 13, no. 8, article 4926, 2023. [paper]
Wangyou Zhang, Kohei Saijo, Zhong-Qiu Wang, Shinji Watanabe, Yanmin Qian. Toward Universal Speech Enhancement For Diverse Input Conditions. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan, 2023. [paper] [slides] [poster]
Wangyou Zhang, Lei Yang, Yanmin Qian. Exploring Time-Frequency Domain Target Speaker Extraction For Causal and Non-Causal Processing. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan, 2023. [paper] [slides] [poster]
Kohei Saijo, Wangyou Zhang, Zhong-Qiu Wang, Shinji Watanabe, Tetsunori Kobayashi, Tetsuji Ogawa. A Single Speech Enhancement Model Unifying Dereverberation, Denoising, Speaker Counting, Separation, and Extraction. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan, 2023. [paper]
Yifan Peng, Jinchuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel, Jee-weon Jung, Soumi Maiti, Shinji Watanabe. Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan, 2023. [paper]
William Chen, Jiatong Shi, Brian Yan, Dan Berrebbi, Wangyou Zhang, Yifan Peng, Xuankai Chang, Soumi Maiti, Shinji Watanabe. Joint Prediction and Denoising for Large-Scale Multilingual Self-Supervised Learning. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan, 2023. [paper]
Yoshiki Masuyama, Xuankai Chang, Wangyou Zhang, Samuele Cornell, Zhong-Qiu Wang, Nobutaka Ono, Yanmin Qian, Shinji Watanabe. Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2023. [paper]
Wangyou Zhang, Yanmin Qian. Weakly-Supervised Speech Pre-training: A Case Study on Target Speech Recognition. In 24th Annual Conference of the International Speech Communication Association (INTERSPEECH), Dublin, Ireland, 2023, pp. 3517–3521. [paper] [slides] [poster]
Linfeng Yu, Wangyou Zhang, Chenda Li, Yanmin Qian. Overlap Aware Continuous Speech Separation without Permutation Invariant Training. In 24th Annual Conference of the International Speech Communication Association (INTERSPEECH), Dublin, Ireland, 2023, pp. 3512–3516. [paper] [poster]
Robin Scheibler, Wangyou Zhang, Xuankai Chang, Shinji Watanabe, Yanmin Qian. End-to-End Multi-speaker ASR with Independent Vector Analysis. In IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 2023, pp. 496-501. [paper]
Wangyou Zhang, Xuankai Chang, Christoph Boeddeker, Tomohiro Nakatani, Shinji Watanabe, Yanmin Qian. End-to-End Dereverberation, Beamforming, and Speech Recognition in A Cocktail Party. In IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 3173–3188, 2022. [paper]
Wei Wang, Wangyou Zhang, Shaoxiong Lin, Yanmin Qian. Text-Informed Knowledge Distillation for Robust Speech Enhancement and Recognition. In 13th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2022, pp. 334-338. [paper]
Wangyou Zhang, Zhuo Chen, Naoyuki Kanda, Shujie Liu, Jinyu Li, Sefik Emre Eskimez, Takuya Yoshioka, Xiong Xiao, Zhong Meng, Yanmin Qian, Furu Wei. Separating Long-form Speech with Group-wise Permutation Invariant Training. In 23rd Annual Conference of the International Speech Communication Association (INTERSPEECH), Incheon, Korea, 2022, pp. 5383-5387. [paper] [slides] [video] [poster]
Yen-Ju Lu, Xuankai Chang, Chenda Li, Wangyou Zhang, Samuele Cornell, Zhaoheng Ni, Yoshiki Masuyama, Brian Yan, Robin Scheibler, Zhong-Qiu Wang, Yu Tsao, Yanmin Qian, Shinji Watanabe. ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding. In 23rd Annual Conference of the International Speech Communication Association (INTERSPEECH), Incheon, Korea, 2022, pp. 5458-5462. [paper]
Wei Wang, Xun Gong, Yifei Wu, Zhikai Zhou, Chenda Li, Wangyou Zhang, Bing Han, Yanmin Qian. The SJTU System for Multimodal Information Based Speech Processing Challenge 2021. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 2022, pp. 9261-9265. [paper]
Yen-Ju Lu, Samuele Cornell, Xuankai Chang, Wangyou Zhang, Chenda Li, Zhaoheng Ni, Zhong-Qiu Wang, Shinji Watanabe. Towards Low-Distortion Multi-Channel Speech Enhancement: The ESPnet-SE Submission to the L3DAS22 Challenge. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 2022, pp. 9201-9205. [paper]
Yu Xi, Tian Tan, Wangyou Zhang, Baochen Yang, Kai Yu. Text Adaptive Detection for Customizable Keyword Spotting. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 2022, pp. 6652-6656. [paper]
Zhikai Zhou, Wei Wang, Wangyou Zhang, Yanmin Qian. Exploring Effective Data Utilization for Low-Resource Speech Recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 2022, pp. 8192-8196. [paper]
Wangyou Zhang, Jing Shi, Chenda Li, Shinji Watanabe, Yanmin Qian. Closing the Gap Between Time-domain Multi-channel Speech Enhancement on Real and Simulation Conditions. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, 2021, pp. 146-150. [paper] [slides] [video] [poster]
Shinji Watanabe, Florian Boyer, Xuankai Chang, Pengcheng Guo, Tomoki Hayashi, Yosuke Higuchi, Takaaki Hori, Wen-Chin Huang, Hirofumi Inaguma, Naoyuki Kamo, Shigeki Karita, Chenda Li, Jing Shi, Aswin Shanmugam Subramanian, Wangyou Zhang. The 2020 ESPnet Update: New Features, Broadened Applications, Performance Improvements, and Future Plans. IEEE Data Science and Learning Workshop (DSLW), 2021, pp. 1-6. [paper]
Wangyou Zhang, Christoph Boeddeker, Shinji Watanabe, Tomohiro Nakatani, Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Naoyuki Kamo, Reinhold Haeb-Umbach and Yanmin Qian. End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Ontario, Canada, 2021, pp. 6898-6902. [paper] [slides] [video] [poster]
Christoph Boeddeker, Wangyou Zhang, Tomohiro Nakatani, Keisuke Kinoshita, Tsubasa Ochiai, Marc Delcroix, Naoyuki Kamo, Yanmin Qian and Reinhold Haeb-Umbach. Convolutive Transfer Function Invariant SDR Training Criteria for Multi-Channel Reverberant Speech Separation. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Ontario, Canada, 2021, pp. 8428-8432. [paper]
Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang and Yuekai Zhang. Recent Developments on ESPnet Toolkit Boosted by Conformer. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Ontario, Canada, 2021, pp. 5874-5878. [paper]
Chenda Li, Jing Shi, Wangyou Zhang, Aswin Shanmugam Subramanian, Xuankai Chang, Naoyuki Kamo, Moto Hira, Tomoki Hayashi, Christoph Boeddeker, Zhuo Chen and Shinji Watanabe. ESPnet-SE: End-to-End Speech Enhancement and Separation Toolkit Designed for ASR Integration. In IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 2021, pp. 785-792. [paper] [slides]
Wangyou Zhang, Xuankai Chang, Yanmin Qian and Shinji Watanabe. Improving End-to-End Single-Channel Multi-Talker Speech Recognition. In IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1385-1394, 2020. [paper]
Wangyou Zhang, Aswin Shanmugam Subramanian, Xuankai Chang, Shinji Watanabe and Yanmin Qian. End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming. In 21st Annual Conference of the International Speech Communication Association (INTERSPEECH), Shanghai, China, 2020, pp. 324-328. [paper] [slides] [video]
Wangyou Zhang and Yanmin Qian. Learning Contextual Language Embeddings for Monaural Multi-Talker Speech Recognition. In 21st Annual Conference of the International Speech Communication Association (INTERSPEECH), Shanghai, China, 2020, pp. 304-308. [paper] [slides] [video]
Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux and Shinji Watanabe. End-To-End Multi-Speaker Speech Recognition With Transformer. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 6129-6133. [paper]
Wangyou Zhang, Xuankai Chang and Yanmin Qian. Knowledge Distillation for End-to-End Monaural Multi-talker ASR System. In 20th Annual Conference of the International Speech Communication Association (INTERSPEECH), Graz, Austria, 2019, pp. 2633–2637. [paper] [slides]
Wangyou Zhang, Ying Zhou and Yanmin Qian. Robust DOA Estimation Based on Convolutional Neural Network and Time-Frequency Masking. In 20th Annual Conference of the International Speech Communication Association (INTERSPEECH), Graz, Austria, 2019, pp. 2703–2707. [paper] [slides]
Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux and Shinji Watanabe. MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Sentosa, Singapore, 2019, pp. 237-244. (Best Paper Award) [paper] [poster]
Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura and Wangyou Zhang. A Comparative Study on Transformer vs RNN in Speech Applications. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Sentosa, Singapore, 2019, pp. 449-456. [paper]
Wangyou Zhang, Man Sun, Lan Wang and Yanmin Qian. End-to-End Overlapped Speech Detection and Speaker Counting with Raw Waveform. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Sentosa, Singapore, 2019, pp. 660-666. [paper] [poster]