Dr. Yong XU

Bio: 

I am a Principal Researcher at Tencent America LLC, Bellevue, WA, USA, supervised by Dr. Dong Yu. Previously, I worked for two years as a Research Fellow at the University of Surrey, UK, supervised by Prof. Mark D. Plumbley and Prof. Wenwu Wang. I received my Ph.D. from the University of Science and Technology of China (USTC) in 2015 and studied under a joint Ph.D. program at the Georgia Institute of Technology (Georgia Tech, USA) during 2014-2015. My Ph.D. supervisors were Prof. Chin-Hui Lee (Georgia Tech, USA), Prof. Jun Du (USTC) and Prof. Li-Rong Dai (USTC). I won first prize in the DCASE 2017 challenge task "Large-scale weakly supervised sound event detection for smart cars". I have two ESI highly cited IEEE journal papers. I received the 2018 IEEE SPS Best Paper Award for my work on deep learning based speech enhancement. I am an elected member of the IEEE Signal Processing Society Speech and Language Processing Technical Committee (SLTC), 2023-2025. I was listed among the World's Top 2% Scientists (2022) ranked by Stanford University, and among the Elsevier Most Cited Chinese Researchers in 2022 and 2023. I am an IEEE Senior Member.

My Google Scholar: https://scholar.google.com/citations?user=nCmKPM4AAAAJ&hl=en (6000+ citations)

Email: yong.xu.ustc@gmail.com

News: 

Publications:

Google Scholar: https://scholar.google.co.uk/citations?user=nCmKPM4AAAAJ&hl=en (total citations: 6000+, h-index: 38, i10-index: 59)

Journal papers:

[1] Multi-channel Multi-frame ADL-MVDR for Target Speech Separation 

Zhuohuang Zhang, Yong Xu, Meng Yu, Shi-Xiong Zhang, Lianwu Chen, Donald S. Williamson, Dong Yu, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021

[2] Audio-Visual Speech Separation and Dereverberation with a Two-Stage Multimodal Network

Ke Tan, Yong Xu, Shi-Xiong Zhang, Meng Yu, Dong Yu, IEEE Journal of Selected Topics in Signal Processing, 2020

[3] Multi-modal Multi-channel Target Speech Separation,

Rongzhi Gu, Shi-Xiong Zhang, Yong Xu, Lianwu Chen, Yuexian Zou, Dong Yu, IEEE Journal of Selected Topics in Signal Processing, 2020

[4] Sound Event Detection of Weakly Labelled Data with CNN-Transformer and Automatic Threshold Optimization

Qiuqiang Kong, Yong Xu, Wenwu Wang, Mark D. Plumbley, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020

[5] Weakly Labelled AudioSet Tagging with Attention Neural Networks

Qiuqiang Kong, Changsong Yu, Yong Xu (corresponding author), Turab Iqbal, Wenwu Wang, Mark D. Plumbley, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019

[6] Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging

Yong Xu, Qiang Huang, Wenwu Wang, Peter Foster, Siddharth Sigtia, Philip J. B. Jackson, Mark D. Plumbley, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016

[7] A Regression Approach to Speech Enhancement Based on Deep Neural Networks [2018 IEEE SPS Best Paper Award] [citations: 1000+] [ESI highly cited paper]

Yong Xu, Jun Du, Li-Rong Dai and Chin-Hui Lee, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7-19, 2015

[8] Sound Event Detection and Time-Frequency Segmentation from Weakly Labelled Data

Qiuqiang Kong*, Yong Xu* (equal contribution), Iwona Sobieraj, Wenwu Wang, Mark D. Plumbley, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019

[9] An Experimental Study on Speech Enhancement Based on Deep Neural Networks [citations: 800+] [ESI highly cited paper]

Yong Xu, Jun Du, Li-Rong Dai and Chin-Hui Lee, IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65-68, 2014

[10] Hierarchical deep neural network for multivariate regression

Jun Du and Yong Xu, Pattern Recognition, vol. 63, pp. 149-157, 2017

[11] Joint training of DNNs by incorporating an explicit dereverberation structure for distant speech recognition

Tian Gao, Jun Du, Yong Xu, Cong Liu, Li-Rong Dai, Chin-Hui Lee, EURASIP Journal on Advances in Signal Processing, 2016

[12] Auxiliary Features from Laser-Doppler Vibrometer Sensor for Deep Neural Network Based Robust Speech Recognition

Lei Sun, Jun Du, Zhipeng Xie, Yong Xu, Journal of Signal Processing Systems, Springer, 2017


Conference papers:

[58] SpatialCodec: Neural Spatial Speech Coding

Zhongweiyang Xu, Yong Xu, Vinay Kothapally, Heming Wang, Muqiao Yang, Dong Yu, accepted to ICASSP2024

[57] uSee: Unified Speech Enhancement and Editing with Conditional Diffusion Models

Muqiao Yang, Chunlei Zhang, Yong Xu, Zhongweiyang Xu, Heming Wang, Bhiksha Raj, Dong Yu, accepted to ICASSP2024

[56] NeuralEcho: Hybrid of Full-Band and Sub-Band Recurrent Neural Network for Acoustic Echo Cancellation and Speech Enhancement

Meng Yu, Yong Xu, Chunlei Zhang, Shi-Xiong Zhang, Dong Yu, accepted to ASRU2023

[55] Zoneformer: On-device Neural Beamformer For In-car Multi-zone Speech Separation, Enhancement and Echo Cancellation

Yong Xu, Vinay Kothapally, Meng Yu, Shi-Xiong Zhang, Dong Yu, accepted to Interspeech2023 (Dublin, Ireland)

[54] Deep Neural Mel-Subband Beamformer for In-car Speech Separation 

Vinay Kothapally, Yong Xu, Meng Yu, Shi-Xiong Zhang, Dong Yu, accepted to ICASSP2023

[53] EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers

Soumi Maiti, Yushi Ueda, Shinji Watanabe, Chunlei Zhang, Meng Yu, Shi-Xiong Zhang, Yong Xu, SLT2022

[52] Joint AEC and Beamforming with Double-Talk Detection using RNN-Transformer

Vinay Kothapally, Yong Xu, Meng Yu, Shi-Xiong Zhang, Dong Yu, accepted to Interspeech2022

[51] Audio Visual Multi-Speaker Tracking with Improved GCF and PMBM Filter   

Jinzheng Zhao, Peipei Wu, Xubo Liu, Shidrokh Goudarzi, Haohe Liu, Yong Xu and Wenwu Wang, accepted to Interspeech2022

[50] Audio-Visual Tracking of Multiple Speakers via a PMBM Filter 

Jinzheng Zhao, Peipei Wu, Xubo Liu, Yong Xu, Lyudmila Mihaylova, Simon Godsill, Wenwu Wang, ICASSP2022

[49] Generalized Spatio-Temporal RNN Beamformer for Target Speech Separation 

Yong Xu, Zhuohuang Zhang, Meng Yu, Shi-Xiong Zhang, Dong Yu, Interspeech2021

[48] MIMO Self-attentive RNN Beamformer for Multi-speaker Speech Separation 

Xiyun Li, Yong Xu, Meng Yu, Shi-Xiong Zhang, Jiaming Xu, Bo Xu, Dong Yu, Interspeech2021

[47] TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation

Helin Wang, Bo Wu, Lianwu Chen, Meng Yu, Jianwei Yu, Yong Xu, Shi-Xiong Zhang, Chao Weng, Dan Su, Dong Yu, Interspeech2021

[46] MetricNet: Towards Improved Modeling For Non-Intrusive Speech Quality Assessment

Meng Yu, Chunlei Zhang, Yong Xu, Shi-Xiong Zhang, Dong Yu, Interspeech2021

[45] WPD++: an improved neural beamformer for simultaneous speech separation and dereverberation

Zhaoheng Ni, Yong Xu, Meng Yu, Bo Wu, Shi-Xiong Zhang, Dong Yu, Michael I. Mandel, accepted to SLT2021

[44] Neural Mask based Multi-channel Convolutional Beamforming for Joint Dereverberation, Echo Cancellation and Denoising

Jianming Liu, Meng Yu, Yong Xu, Chao Weng, Shi-Xiong Zhang, Lianwu Chen, Dong Yu, accepted to SLT2021

[43] ADL-MVDR: All deep learning MVDR beamformer for target speech separation, [PDF] [Demo]

Zhuohuang Zhang, Yong Xu, Meng Yu, Shi-Xiong Zhang, Lianwu Chen, Dong Yu, accepted to ICASSP2021

[42] Neural Spatio-Temporal Beamformer for Target Speech Separation, [PDF] [Demo]

Yong Xu, Meng Yu, Shi-Xiong Zhang, Lianwu Chen, Chao Weng, Jianming Liu, Dong Yu, accepted to Interspeech2020

[41] Audio-visual Multi-channel Recognition of Overlapped Speech

Jianwei Yu, Bo Wu, Rongzhi Gu, Shi-Xiong Zhang, Lianwu Chen, Yong Xu, Meng Yu, Dan Su, Dong Yu, Xunying Liu, Helen Meng, accepted to Interspeech2020

[40] Far-Field Location Guided Target Speech Extraction using End-to-End Speech Recognition Objectives

Aswin Shanmugam Subramanian, Chao Weng, Meng Yu, Shi-Xiong Zhang, Yong Xu, Shinji Watanabe and Dong Yu, ICASSP2020

[39] Enhancing End-To-End Multi-channel Speech Separation via Spatial Feature Learning,

Rongzhi Gu, Shi-Xiong Zhang, Lianwu Chen, Yong Xu, Meng Yu, Dan Su, Yuexian Zou, Dong Yu, ICASSP2020

[38] Self-supervised learning for audio-visual speaker diarization, 

Yifan Ding, Yong Xu, Shi-Xiong Zhang, Yahuan Cong, Liqiang Wang, ICASSP2020

[37] Time Domain Audio Visual Speech Separation,

Jian Wu, Yong Xu, Shi-Xiong Zhang, Lianwu Chen, Meng Yu, Lei Xie, Dong Yu, ASRU2019

[36] Improved Speaker-Dependent Separation for CHiME-5 Challenge,

Jian Wu, Yong Xu, Shi-Xiong Zhang, Lianwu Chen, Meng Yu, Lei Xie, Dong Yu, Interspeech2019

[35] Neural Spatial Filter: Target Speaker Speech Separation Assisted with Directional Information

Rongzhi Gu, Lianwu Chen, Shi-Xiong Zhang, Jimeng Zheng, Yong Xu, Meng Yu, Dan Su, Yuexian Zou, Dong Yu, Interspeech2019

[34] A comprehensive study of speech separation: spectrogram vs waveform separation

Fahimeh Bahmaninezhad, Jian Wu, Rongzhi Gu, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Interspeech2019

[33] Single-Channel Signal Separation and Deconvolution with Generative Adversarial Networks,

Qiuqiang Kong, Yong Xu, Wenwu Wang, Mark D. Plumbley, Philip Jackson, accepted to IJCAI2019 (acceptance rate: 18%)

[32] Joint training of complex ratio mask based beamformer and acoustic model for noise robust ASR

Yong Xu, Chao Weng, Like Hui, Jianming Liu, Meng Yu, Dan Su, Dong Yu, accepted to ICASSP2019

[31] Acoustic scene generation with conditional SampleRNN

Qiuqiang Kong, Yong Xu, Turab Iqbal, Yin Cao, Wenwu Wang, Mark D. Plumbley, accepted to ICASSP2019

[30] An attention-based neural network approach for single channel speech enhancement

Xiang Hao, Changhao Shan, Yong Xu, Sining Sun, Lei Xie, accepted to ICASSP2019

[29] Large-scale weakly supervised audio classification using gated convolutional neural network [PDF] [1st-ranked system in the DCASE2017 challenge]

Yong Xu, Qiuqiang Kong, Wenwu Wang and Mark D. Plumbley, accepted to ICASSP2018

[28] A joint separation-classification model for sound event detection of weakly labelled data

Qiuqiang Kong, Yong Xu (equal contribution), Wenwu Wang and Mark D. Plumbley, accepted to ICASSP2018

[27] Audio Set classification with attention model: A probabilistic perspective

Qiuqiang Kong, Yong Xu (equal contribution), Wenwu Wang and Mark D. Plumbley, accepted to ICASSP2018

[26] Iterative deep neural networks for speaker-independent binaural blind speech separation

Qingju Liu, Yong Xu, Philip Coleman, Philip Jackson, Wenwu Wang, accepted to ICASSP2018

[25] Intelligent signal processing mechanisms for nuanced anomaly detection in action audio-visual data streams

Josef Kittler, Ioannis Kaloskampis, Cemre Zor, Yong Xu*, Yulia Hicks and Wenwu Wang, accepted to ICASSP2018

[24] Attention and Localization based on a Deep Convolutional Recurrent Model for Weakly Supervised Audio Tagging, 

Yong Xu, Qiuqiang Kong, Qiang Huang, Wenwu Wang and Mark D. Plumbley, accepted to Interspeech2017

[23] Joint Detection and Classification Convolutional Neural Network (JDC-CNN) on Weakly Labelled Bird Audio Data (BAD)

Qiuqiang Kong, Yong Xu, Mark D. Plumbley, accepted to EUSIPCO2017

[22] Convolutional Gated Recurrent Neural Network Incorporating Spatial Features for Audio Tagging

Yong Xu, Qiuqiang Kong, Qiang Huang, Wenwu Wang and Mark D. Plumbley, IJCNN2017

[21] A joint detection-classification model for audio tagging of weakly labelled data

Qiuqiang Kong, Yong Xu, Wenwu Wang, Mark D. Plumbley, ICASSP2017

[20] Fast Tagging of Natural Sounds Using Marginal Co-regularization

Qiang Huang, Yong Xu, Philip J. B. Jackson, Wenwu Wang, Mark D. Plumbley, ICASSP2017

[19] Binaural and Log-Power Spectra Features with Deep Neural Networks for Speech-Noise Separation

Alfredo Zermini, Qingju Liu, Yong Xu, Mark D. Plumbley, Dave Betts, Wenwu Wang, MMSP2017

[18] Deep neural network based audio source separation

Alfredo Zermini, Y. Yu, Yong Xu, Wenwu Wang and Mark D. Plumbley, 11th IMA International Conference on Mathematics in Signal Processing, 2016

[17] Fully DNN-based Multi-label regression for audio tagging.

Yong Xu, Qiang Huang, Wenwu Wang, Philip J. B. Jackson, Mark D. Plumbley, accepted by DCASE2016 workshop, July 2016

[16] Hierarchical learning for DNN-based acoustic scene classification

Yong Xu, Qiang Huang, Wenwu Wang, Mark D. Plumbley, accepted by DCASE2016 workshop, July 2016

[15] Deep Neural Network for Robust Speech Recognition With Auxiliary Features From Laser-Doppler Vibrometer Sensor

Zhi-Peng Xie, Jun Du, Ian Vince McLoughlin, Yong Xu, Feng Ma, Haikun Wang, ISCSLP2016

[14] Multi-objective learning and Mask-based Post-processing for Deep Neural Network based Speech Enhancement.

Yong Xu, Jun Du, Zhen Huang, Li-Rong Dai, Chin-Hui Lee, Interspeech2015, Dresden, Germany

[13] DNN-Based Speech Bandwidth Expansion and Its Application to Adding High Frequency Missing Features for Automatic Speech Recognition of Narrowband Speech.

Kehuang Li, Zhen Huang, Yong Xu and Chin-Hui Lee, Interspeech2015, Dresden, Germany

[12] Dynamic Noise Aware Training for Speech Enhancement Based on Deep Neural Networks.

Yong Xu, Jun Du, Li-Rong Dai and Chin-Hui Lee, Interspeech2014, Singapore

[11] Improving Deep Neural Network Based Speech Enhancement in Low SNR Environments. (Best paper candidate)

Tian Gao, Jun Du, Yong Xu, Cong Liu, Li-Rong Dai, Chin-Hui Lee, LVA/ICA 2015, Liberec, Czech Republic

[10] Robust Speech Recognition with Speech Enhanced Deep Neural Networks

Jun Du, Qing Wang, Tian Gao, Yong Xu, Li-Rong Dai and Chin-Hui Lee, Interspeech2014, Singapore

[9] Cross-language Transfer Learning for Deep Neural Network Based Speech Enhancement

Yong Xu, Jun Du, Li-Rong Dai and Chin-Hui Lee, ISCSLP2014, Singapore

[8] Speech Separation Based on Improved Deep Neural Networks with Dual Outputs of Speech Features for both Target and Interfering Speakers

Yanhui Tu, Jun Du, Yong Xu, Li-Rong Dai and Chin-Hui Lee, ISCSLP2014, Singapore

[7] Speech separation of a target speaker based on deep neural networks.

Jun Du, Yanhui Tu, Yong Xu, Li-Rong Dai and Chin-Hui Lee, pp. 532-536, ICSP2014, Hangzhou, China

[6] Deep neural network based speech separation for robust speech recognition.

Yanhui Tu, Jun Du, Yong Xu, Li-Rong Dai and Chin-Hui Lee, ICSP2014, Hangzhou, China

[5] Global Variance Equalization for Improving Deep Neural Network Based Speech Enhancement.

Yong Xu, Jun Du, Li-Rong Dai and Chin-Hui Lee, ChinaSIP2014, Xi'an, China

[4] Spoken Term Detection for OOV Terms Based on Phone Fragment.

Yong Xu, Wu Guo, Shan Su and Li-Rong Dai, ICALIP2012, Shanghai, China

[3] Improved Spoken Term Detection by Template-based Confidence Measure.

Shan Su, Wu Guo, Yong Xu and Li-Rong Dai, ICALIP2012, Shanghai, China

[2] A hybrid fragment / syllable-based system for improved OOV term detection.

Yong Xu, Wu Guo and Li-Rong Dai, ISCSLP2012, Hong Kong

[1] Spoken term detection for OOV terms based on tri-phone confusion matrix.

Yong Xu, Wu Guo and Li-Rong Dai, ISCSLP2012, Hong Kong


Patent:

[1] Speech separation method and system, US patent, US 20160189730A1

Jun Du, Yong Xu, Yanhui Tu, Li-Rong Dai, Zhiguo Wang, Yu Hu, Qingfeng Liu, June 2016


Research Experience:

Tencent America LLC, Bellevue, WA, USA    Principal Research Scientist   2021 – present

Multi-channel speech enhancement, separation, dereverberation and speech recognition; I proposed the ADL-MVDR and RNN beamformers.
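For context, the conventional mask-based MVDR beamformer that the ADL-MVDR line of work makes fully differentiable can be written as follows (a standard formulation; the notation here is assumed rather than quoted from the papers above):

$$
\mathbf{w} = \frac{\boldsymbol{\Phi}_{NN}^{-1}\,\boldsymbol{\Phi}_{SS}}{\operatorname{tr}\!\left(\boldsymbol{\Phi}_{NN}^{-1}\,\boldsymbol{\Phi}_{SS}\right)}\,\mathbf{u}
$$

where $\boldsymbol{\Phi}_{SS}$ and $\boldsymbol{\Phi}_{NN}$ are the speech and noise spatial covariance matrices estimated from network-predicted time-frequency masks, and $\mathbf{u}$ is a one-hot vector selecting the reference microphone. ADL-MVDR replaces the explicit covariance estimation and matrix inversion with recurrent networks, so the beamformer can be trained end to end without the numerical instability of inverting $\boldsymbol{\Phi}_{NN}$ at every frame.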

Tencent America LLC, Bellevue, WA, USA    Senior Research Scientist   2018 – 2021

Multi-modal speech enhancement, separation, dereverberation and speech recognition.

University of Surrey, Guildford, UK    Full-time Research Fellow   2016 – 2018

Deep learning (DNN, CNN, LSTM, etc.) based environmental sound classification and analysis.

Georgia Institute of Technology, Atlanta, GA, USA    Visiting Student   2014 – 2015

Deep neural network based speech enhancement applied to automatic speech recognition (ASR); my advisor was Prof. Chin-Hui Lee.

Bosch Research Center, CA, USA    Short Internship   Oct. 2014 – Nov. 2014

Deep neural network based speech enhancement applied to automatic speech recognition (ASR).

Speech Lab, USTC, Hefei, China    Ph.D. Student   Jul. 2012 – Jun. 2015

--- DNN-based speech enhancement, in cooperation with Prof. Chin-Hui Lee (Georgia Tech); see the sketch after this list.

--- I developed a Large Vocabulary Continuous Speech Recognition (LVCSR) system trained on a 2300-hour English speech database and built a baseline for OOV term detection; MLE, DT and Tandem systems were built.
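The following is a minimal sketch of the regression-style DNN speech enhancement studied in this period (cf. journal paper [7]): a feed-forward network maps noisy log-power spectral features, spliced with context frames, to clean log-power spectra under an MSE loss. It is written in PyTorch purely for illustration; the layer sizes, feature dimensions and names are assumptions, not details from the original papers (which predate PyTorch).

```python
import torch
import torch.nn as nn

N_BINS = 257   # log-power spectrum size for a 512-point FFT (assumed)
CONTEXT = 7    # 3 left + current + 3 right frames (assumed)

class RegressionEnhancer(nn.Module):
    """Feed-forward regression network: noisy log-power spectra -> clean."""
    def __init__(self, n_bins=N_BINS, context=CONTEXT, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bins * context, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bins),  # linear output layer for regression
        )

    def forward(self, noisy_context):
        # noisy_context: (batch, n_bins * context) spliced noisy features
        return self.net(noisy_context)

model = RegressionEnhancer()
loss_fn = nn.MSELoss()  # regression target: clean log-power spectra
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One illustrative training step on random stand-in data.
noisy = torch.randn(32, N_BINS * CONTEXT)   # spliced noisy features
clean = torch.randn(32, N_BINS)             # clean log-power targets
loss = loss_fn(model(noisy), clean)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

At inference time, the enhanced log-power spectrum would be combined with the noisy phase to resynthesize the waveform; post-processing such as global variance equalization (conference paper [5]) can further reduce the over-smoothing typical of MSE-trained regression.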

Speech Lab, USTC, Hefei, China    Graduate Student   Sept. 2010 – Jul. 2012

Working on Spoken Term Detection (STD) for Out-Of-Vocabulary (OOV) words, I used a tri-phone confusion matrix and a hybrid fragment/syllable system to improve OOV term detection performance.

Speech Lab, USTC, Hefei, China    Undergraduate Student   Mar. 2010 – Jul. 2010

My undergraduate thesis project was on room acoustic impulse responses.