Publications

Journal Papers:

  1. John H.L. Hansen, Gang Liu, “Unsupervised accent classification for deep data fusion of acoustic and language information,” Speech Communication, vol. 78, Apr. 2016, pp. 19-33. [pdf] [bib][corpus][Impact Factor: 1.256][Download Data][Link]
  2. Gang Liu, John H. L. Hansen, "An Investigation into Back-end Advancements for Speaker Recognition in Multi-Session and Noisy Enrollment Scenarios," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1978-1992, Dec. 2014. [pdf] [bib] [cover article] [Rank No. 4 in Dec. 2014][Impact Factor: 2.475]
  3. Gang Liu, Datian Ye, "Research on the Chinese Initial Pronunciation Computer-Aided Language Learning System," Beijing Biomedical Engineering J., 2008, 27(1):88-92 [pdf] [bib]

Conference Papers:

  1. Siqi Zheng, Gang Liu, Hongbin Suo and Yun Lei, "Towards A Fault-tolerant Speaker Verification System: A Regularization Approach To Reduce The Condition Number", accepted to INTERSPEECH 2019, Graz, Austria, Sep. 15-19, 2019.
  2. Siqi Zheng, Gang Liu, Hongbin Suo and Yun Lei, "Autoencoder-based Semi-Supervised Curriculum Learning For Out-of-domain Speaker Verification", accepted to INTERSPEECH 2019, Graz, Austria, Sep. 15-19, 2019.
  3. Fei Tao, Gang Liu, Qingen Zhao, “An Ensemble Framework of Voice-based Emotion Recognition System,” in Proc. ACII Asia 2018, Beijing, China, May 20-22, 2018. [Link]
  4. Fei Tao and Gang Liu, “Advanced LSTM: A study about better time dependency modeling in emotion recognition,” in arXiv preprint arXiv:1710.10197, 2017. [arxiv][Accepted by ICASSP2018]
  5. Fei Tao, Gang Liu, Qingen Zhao, "An Ensemble Framework of Voice-based Emotion Recognition System for Films and TV Programs," [arxiv] Accepted by ICASSP 2018.
  6. Gang Liu, Qi Qian, Zhibin Wang, Qingen Zhao, Tianzhou Wang, Hao Li, Jian Xue, Shenghuo Zhu, Rong Jin and Tuo Zhao, "The Opensesame NIST 2016 speaker Recognition Evaluation System", in Proc. INTERSPEECH, Stockholm, Sweden, Aug. 20-24, 2017, pp.2854-2858 [pdf] [bib] [poster]
  7. Kong Aik Lee, Ville Hautamaki, Tomi Kinnunen, Anthony Larcher, Chunlei Zhang, Andreas Nautsch, Themos Stafylakis, Gang Liu, Mickael Rouvier, Wei Rao, Federico Alegre, Jianbo Ma, Manwai Mak, Achintya Kumar Sarkar, Héctor Delgado, Rahim Saeidi, Hagai Aronowitz, Aleksandr Sizov, Hanwu Sun, Guangsen Wang, Trung Hieu Nguyen, Bin Ma, Ville Vestman, Md Sahidullah, Miikka Halonen, Anssi Kanervisto, Gael Le Lan, Fahimeh Bahmaninezhad, Sergey Isadskiy, Christian Rathgeb, Christoph Busch, Georgios Tzimiropoulos, Qi Qian, Zhibin Wang, Qingen Zhao, Tianzhou Wang, Hao Li, Jian Xue, Shenghuo Zhu, Rong Jin, Tuo Zhao, Pierre-Michel Bousquet, Moez Ajili, Waad Ben Kheder, Driss Matrouf, Zhi Hao Lim, Chenglin Xu, Haihua Xu, Xiong Xiao, Eng Siong Chng, Benoit Fauve, Vidhyasaharan Sethu, Kaavya Sriskandaraja, W. W. Lin, Zheng-Hua Tan, Dennis Alexander Lehmann Thomsen, Massimiliano Todisco, Nicholas Evans, Haizhou Li, John H.L. Hansen, Jean-Francois Bonastre and Eliathamby Ambikairajah, "The I4U Mega Fusion and Collaboration for NIST Speaker Recognition Evaluation 2016", in Proc. INTERSPEECH, Stockholm, Sweden, Aug. 20-24, 2017, pp.1328-1332 [pdf] [bib]
  8. Chunlei Zhang, Shivesh Ranjan, Mahesh Kumar Nandwana, Qian Zhang, Abhinav Misra, Gang Liu, Finnian Kelly, John Hansen, "Joint information from Nonlinear and linear features for spoofing detection: an i-vector/DNN based approach", in Proc. ICASSP, Shanghai, China, Mar. 2016 (the IEEE Ganesh N. Ramaswamy Memorial Student Grant) [pdf][bib]
  9. Shivesh Ranjan, Gang Liu, John H. L. Hansen, "An I-Vector PLDA Based Gender Identification Approach for Severely Distorted and Multilingual DARPA RATS Data," in Proc. ASRU, Scottsdale, AZ, Dec. 13-17, 2015. [pdf] [bib] [poster]
  10. Chunlei Zhang, Gang Liu, Chengzhu Yu, John H.L. Hansen, "i-Vector Based Physical Task Stress Detection with Different Fusion Strategies," in Proc. INTERSPEECH, Dresden, Germany, Sep. 2015 [pdf] [bib]
  11. Hua Xing, Gang Liu, John H.L. Hansen, "Frequency Offset Correction in Single Sideband (SSB) Speech Based on Deep Neural Networks for Speaker Verification," in Proc. INTERSPEECH, Dresden, Germany, Sep. 2015 [pdf] [bib]
  12. Muhammad Muneeb Saleem, Gang Liu, John H.L. Hansen, “Weighted Training for Speech under Lombard Effect for Speaker Recognition,” in Proc. ICASSP, Brisbane, Australia, Apr. 2015 [pdf][bib]
  13. Chunlei Zhang, Qian Zhang, Shivesh Ranjan, Mahesh Kumar Nandwana, Abhinav Misra, Gang Liu, Finnian Kelly, John H. L. Hansen, "ASVspoof 2015: UTD-CRSS System Description".
  14. Gang Liu, Chengzhu Yu, Navid Shokouhi, Abhinav Misra, Hua Xing, John Hansen, “Utilization of unlabeled development data for speaker verification”, in Proc. IEEE Spoken Language Technology Workshop (SLT 2014), South Lake Tahoe, Nevada, Dec 7-10, 2014, pp.418-423. [PDF][bib]
  15. Chengzhu Yu, Gang Liu, John H. L. Hansen,“Acoustic Feature Transformation using UBM-based LDA for Speaker Recognition,” in Proc. Interspeech 2014, Singapore, Sep. 2014, pp. 1851–1854. [PDF] [bib]
  16. Gang Liu, Chengzhu Yu, Abhinav Misra, Navid Shokouhi and John H.L. Hansen, "Investigating State-of-the-Art Speaker Verification in the Case of Unlabeled Development Data," in Proc. Odyssey 2014, The speaker and language recognition workshop, Joensuu, Finland, June 2014, pp. 118-122. [PDF][bib]
  17. Gang Liu and John H.L. Hansen, "Supra-Segmental Feature Based Speaker Trait Detection," in Proc. Odyssey 2014, The speaker and language recognition workshop, Joensuu, Finland, June 2014.[PDF] [bib]
  18. Qian Zhang, Gang Liu, and John H. L. Hansen, “Robust Language Recognition Based on Diverse Features,” in Proc. Odyssey 2014, The speaker and language recognition workshop, Joensuu, Finland, June 2014. [PDF] [bib]
  19. Chengzhu Yu, Gang Liu, Seongjun Hahm, and John H.L. Hansen, "Uncertainty Propagation in Front End Factor Analysis For Noise Robust Speaker Recognition," in Proc. ICASSP, Florence, Italy, May 2014, pp. 4045-4049. [PDF] [bib]
  20. Gang Liu, Dimitrios Dimitriadis and Enrico Bocchieri, "Robust speech enhancement techniques for ASR in non-stationary noise and dynamic environments", in Proc. INTERSPEECH, Lyon, France, Aug. 25-29, 2013, pp. 3017-3021 [PDF] [bib] [Demo][Poster]
  21. Ville Hautamaki, Kong Aik Lee, David van Leeuwen, Rahim Saeidi, Anthony Larcher, Tomi Kinnunen, Taufiq Hasan, Seyed Omid Sadjadi, Gang Liu, Hynek Boril, John H.L. Hansen and Benoit Fauve, "Automatic regularization of cross-entropy cost for speaker recognition fusion", in Proc. INTERSPEECH, Lyon, France, Aug. 25-29, 2013. [PDF][bib]
  22. Rahim Saeidi, Kong Aik Lee, Tomi Kinnunen, Taufiq Hasan, Benoit Fauve, Pierre-Michel Bousquet, Elie Khoury, Pablo L. Sordo Martinez, Karen Kua, Changhuai You, Hanwu Sun, Anthony Larcher, Paddy Rajan, Ville Hautamaki, Cemal Hanilci, Billy Braithwaite, Rosa Gonzales-Hautamaki, Seyed Omid Sadjadi, Gang Liu, and Hynek Boril, "I4U submission to NIST SRE 2012: A large-scale collaborative effort for noise-robust speaker verification", in Proc. INTERSPEECH, Lyon, France, Aug. 25-29, 2013. [PDF][bib]
  23. Chenren Xu, Sugang Li, Gang Liu, Yanyong Zhang, Emiliano Miluzzo, Yih-Farn Chen, Jun Li, Bernhard Firner, "Crowd++: Unsupervised Speaker Count with Smartphones," The 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing (ACM UbiComp), Zurich, Switzerland, September 9-12, 2013. pp.43-52. Acceptance Rate: 18% (71 out of 395). [PDF][bib][slides][poster][code][UbiComp Ranking: Top 4 conferences in human-computer interaction]
  24. Gang Liu, Taufiq Hasan, Hynek Boril, John H.L. Hansen, "An investigation on back-end for speaker recognition in multi-session enrollment", in Proc. ICASSP, Vancouver, Canada, May 25-31, 2013. pp. 7755-7759. [PDF] [bib][talk][code]
  25. Taufiq Hasan, Seyed O. Sadjadi, Gang Liu, Navid Shokouhi, Hynek Boril, John H.L. Hansen, "CRSS systems for 2012 NIST speaker recognition evaluation", in Proc. ICASSP, Vancouver, Canada, pp. 6783-6787, 2013. (Best Paper Award) [PDF] [bib] (LINK)
  26. Gang Liu, Chi Zhang, John H.L. Hansen, "A Linguistic Data Acquisition Front-End for Language Recognition Evaluation", in Proc. Odyssey, Singapore, pp. 224-228, 25-28 June 2012. [pdf] [bib]
  27. Gang Liu, Jun-Won Suh, John H.L. Hansen, "A fast speaker verification with universal background support data selection", in Proc. ICASSP2012, Kyoto, Japan, pp.4793-4796, 2012. [pdf] [bib]
  28. Gang Liu, Yun Lei, John H.L. Hansen, "Robust feature front-end for speaker identification", in Proc. ICASSP, Kyoto, Japan, pp.4233-4236, 2012. [pdf] [bib]
  29. Jun-Won Suh, Seyed O. Sadjadi, Gang Liu, Taufiq Hasan, Keith W. Godin, and John H.L. Hansen, "Exploring Hilbert envelope based acoustic features in i-vector speaker verification using HT-PLDA", SRE2011 Workshop, Atlanta, USA [pdf] [bib]
  30. Tauhidur Rahman, Soroosh Mariooryad, Shalini Keshavamurthy, Gang Liu, John H.L. Hansen, and Carlos Busso, "Detecting sleepiness by fusing classifiers trained with novel acoustic features", in Proc. INTERSPEECH, Florence, Italy, Aug. 2011, pp. 3285-3288 [pdf] [bib]
  31. Gang Liu, John H. L. Hansen. "A systematic strategy for robust automatic dialect identification", EUSIPCO2011, Barcelona, Spain, 2011. pp.2138-2141 [pdf] [bib]
  32. Gang Liu, Yun Lei, John H.L. Hansen, "Dialect Identification: Impact of difference between Read versus spontaneous speech", EUSIPCO-2010. Aalborg, Denmark, 2010. pp.2003-2006 [pdf] [bib]
  33. Gang Liu, Yun Lei, John H.L. Hansen, "A Novel Feature Extraction Strategy for Multi-stream Robust Emotion Identification", INTERSPEECH-2010. Makuhari Messe, Japan, 2010. pp.482-485 [pdf] [bib]
  34. Yang Xiao, Gang Liu, "A New Solution for the Design of Sliding Mode Control", 2001 ICII Proceedings, IEEE Press, Beijing, China, pp. 221-226 [pdf] [bib]

Workshop Presentations:

  1. Gang Liu, Qi Qian, Zhibin Wang, Qingen Zhao, Tianzhou Wang, Hao Li, Jian Xue, Shenghuo Zhu, Rong Jin, Tuo Zhao, "OpenSesame: Alibaba System for NIST SRE2016", NIST 2016 Speaker Recognition Evaluation Workshop, San Diego, CA, Dec. 11-12, 2016.
  2. Gang Liu, Chengzhu Yu, Navid Shokouhi, Abhinav Misra, Hua Xing, John H. L. Hansen, "CRSS systems for the NIST i-Vector Machine Learning Challenge", Odyssey 2014, The speaker and language recognition workshop, Joensuu, Finland, June 2014.
  3. Taufiq Hasan, Gang Liu, Seyed Sadjadi, Navid Shokouhi, Ali Ziaei, Abhinav Misra, Keith W. Godin, and John H.L. Hansen, “UTD-CRSS Systems for 2012 NIST Speaker Recognition Evaluation”, NIST 2012 SRE Workshop, Orlando, USA, 11-12 Dec. 2012. [pdf] [bib]
  4. Gang Liu, Seyed Omid Sadjadi, Taufiq Hasan, Jun-Won Suh, Chi Zhang, Mahnoosh Mehrabani, Hynek Bořil, Abhijeet Sangwan, and John H. L. Hansen, "UTD-CRSS systems for NIST language recognition evaluation 2011", NIST 2011 Language Recognition Evaluation Workshop, Atlanta, USA, 6-7 Dec. 2011. [pdf] [bib]
  5. Yun Lei, Taufiq Hasan, Jun-Won Suh, Abhijeet Sangwan, Hynek Boril, Gang Liu, Keith Godin, Chi Zhang, and John H. L. Hansen, (2010): “The CRSS Systems for the 2010 NIST Speaker Recognition Evaluation,” NIST 2010 Speaker Recognition Evaluation Workshop, Brno, Czech Republic, 24-25 Jun. 2010. [pdf] [bib]

Master's Thesis:

Gang Liu, Datian Ye, "Research on Mandarin Computer Aided Language Learning System for Deaf Children", Tsinghua University, Beijing, China, 2007. [pdf] [talk]

(The following is only a re-arrangement of the above publications by focus)

Publications sorted by focus:

Data Selection

  • Gang Liu, Chi Zhang, John H.L. Hansen, "A Linguistic Data Acquisition Front-End for Language Recognition Evaluation", in Proc. Odyssey, Singapore, pp. 224-228, 25-28 June 2012. [pdf] [bib]
  • Gang Liu, Jun-Won Suh, John H.L. Hansen, "A fast speaker verification with universal background support data selection", in Proc. ICASSP2012, Kyoto, Japan, 2012, pp. 4793-4796 [pdf]
  • Gang Liu, Yun Lei, John H.L. Hansen, "Dialect Identification: Impact of difference between Read versus spontaneous speech", EUSIPCO-2010. Aalborg, Denmark, 2010. pp.2003-2006 [pdf]
          • Abstract: One of the major challenges of a language identification (LID) system comes from sparse training data. Manually collecting linguistic data in a controlled studio is usually expensive and impractical, but multilingual broadcast programs (Voice of America, for instance) can be collected as a reasonable alternative for linguistic data acquisition. However, unlike studio-collected linguistic data, broadcast programs usually contain much content other than pure linguistic data: musical content in the foreground/background, commercials, and noise from everyday life. In this study, a systematic processing approach is proposed to extract the linguistic data from broadcast media. Experimental results obtained on NIST LRE 2009 data show that the proposed method provides a 22.2% relative improvement in segmentation accuracy and a 20.5% relative improvement in LID accuracy.
          • Keywords: Data acquisition, purification, language identification
          • Abstract: In this study, a fast universal background support imposter data selection method is proposed, which is integrated within a support vector machine (SVM) based speaker verification system. Selection of an informative background dataset is crucial in constructing a discriminative decision super-plane between the enrollment and imposter speakers. Previous studies generally derive the optimal number of imposter examples from development data and apply to the evaluation data, which cannot guarantee consistent performance and often necessitate expensive searching. In the proposed method, the universal background dataset is derived so as to embed imposter knowledge in a more balanced way. Next, the derived dataset is taken as the imposter set in the SVM modeling process for each enrollment speaker. By using imposter adaptation, a more detailed subspace per target speaker can be constructed. Compared to the popular support-vector frequency based method, the proposed method can not only avoid parameter searching but offers a significant improvement and generalizes better on the unseen data.
          • Keywords: speaker verification, universal background dataset selection, adaptation, SVM, UBS
          • Abstract: Automatic Dialect Classification (ADC) has recently gained substantial interest in the field of speech processing. Dialects of a language normally are reflected in terms of their phoneme space, word pronunciation/selection, and prosodic traits. These traits are clearly visible in natural speaker-to-speaker spontaneous conversations. However, dialect cues in prompted/read speech are often neglected by the community. In this study, we consider a systematic assessment of the differences between the acoustic characteristics of spontaneous and read speech and their effects on dialect identification performance. By examining both the model space and phoneme space of read and spontaneous dialect speech, we observe that each spans different dialect spaces and with distinct characteristics that need to be addressed respectively. From this comparison, we find useful clues to design more efficient identification systems. Finally, we also propose a novel feature extraction technique, PMVDR-SDC, and obtain a +26.4% relative improvement in dialect recognition rate.
          • Keywords: Automatic Dialect Classification, PMVDR, Read, Spontaneous
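Several of the abstracts above rely on shifted delta cepstral (SDC) stacking to add temporal context to frame-level features (e.g., the PMVDR-SDC front-end). As a rough illustration only, assuming the standard N-d-P-k SDC scheme (the function name and defaults below are mine, not taken from any of the listed papers):

```python
import numpy as np

def shifted_delta_cepstra(cep, d=1, p=3, k=7):
    """Stack k delta blocks onto every frame (the N-d-P-k scheme):
    block i of frame t is cep[t + i*p + d] - cep[t + i*p - d]."""
    num_frames, _ = cep.shape
    # edge-pad so every shifted index stays inside the array
    pad = np.pad(cep, ((d, d + (k - 1) * p), (0, 0)), mode="edge")
    blocks = []
    for i in range(k):
        shift = i * p
        delta = (pad[2 * d + shift : 2 * d + shift + num_frames]
                 - pad[shift : shift + num_frames])
        blocks.append(delta)
    # shape (num_frames, num_ceps * k)
    return np.hstack(blocks)
```

Each frame thus carries k deltas spaced p frames apart, which is how SDC folds longer-span temporal information into an otherwise frame-local feature.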

Robust Acoustic Feature

  • Gang Liu, Yun Lei, John H.L. Hansen, "Robust feature front-end for speaker identification", ICASSP2012, Kyoto, Japan, 2012. pp.4233-4236 [pdf]
  • Jun-Won Suh, Seyed Omid Sadjadi, Gang Liu, Taufiq Hasan, Keith W. Godin, and John H.L. Hansen, "Exploring Hilbert envelope based acoustic features in i-vector speaker verification using HT-PLDA", SRE2011 Workshop, Atlanta, USA [pdf]
  • Tauhidur Rahman, Soroosh Mariooryad, Shalini Keshavamurthy, Gang Liu, John H.L. Hansen, and Carlos Busso, "Detecting sleepiness by fusing classifiers trained with novel acoustic features", INTERSPEECH-2011, pp. 3285-3288 [pdf]
  • Gang Liu, John H. L. Hansen. "A systematic strategy for robust automatic dialect identification", EUSIPCO2011, Barcelona, Spain, 2011. pp.2138-2141 [pdf]
  • Gang Liu, Yun Lei, John H.L. Hansen, "A Novel Feature Extraction Strategy for Multi-stream Robust Emotion Identification", INTERSPEECH-2010. Makuhari Messe, Japan, 2010. pp.482-485 [pdf]

CALL (Computer Aided Language Learning system)

  • Gang Liu, Datian Ye, "Research on the Chinese Initial Pronunciation Computer-Aided Language Learning System", Beijing Biomedical Engineering J., 2008, 27(1):88-92 [pdf]
  • Gang Liu, Datian Ye, "Research on Mandarin Computer Aided Language Learning System for Deaf Children", Tsinghua University, Beijing, China 2007 - [pdf] [talk]

Automation

  • Yang Xiao, Gang Liu, "A New Solution for the Design of Sliding Mode Control", 2001 ICII Proceedings, IEEE Press, Beijing, China, pp. 221-226 [pdf]

Speaker Identification

  • Chunlei Zhang, Gang Liu, Chengzhu Yu, John H.L. Hansen, "i-Vector Based Physical Task Stress Detection with Different Fusion Strategies," in Proc. INTERSPEECH, Dresden, Germany, Sep. 2015
  • Hua Xing, Gang Liu, John H.L. Hansen, "Frequency Offset Correction in Single Sideband (SSB) Speech Based on Deep Neural Networks for Speaker Verification," in Proc. INTERSPEECH, Dresden, Germany, Sep. 2015
  • Muhammad Muneeb Saleem, Gang Liu, John H.L. Hansen, “Weighted Training for Speech under Lombard Effect for Speaker Recognition,” in Proc. ICASSP, Brisbane, Australia, Apr. 2015 [pdf][bib]
  • Chunlei Zhang, Qian Zhang, Shivesh Ranjan, Mahesh Kumar Nandwana, Abhinav Misra, Gang Liu, Finnian Kelly, John H. L. Hansen, "ASVspoof 2015: UTD-CRSS System Description".
  • Gang Liu, Chengzhu Yu, Navid Shokouhi, Abhinav Misra, Hua Xing, John Hansen, “Utilization of unlabeled development data for speaker verification”, in Proc. IEEE Spoken Language Technology Workshop (SLT 2014), South Lake Tahoe, Nevada, Dec 7-10, 2014, pp.418-423. [PDF][bib]
  • Chengzhu Yu, Gang Liu, John H. L. Hansen,“Acoustic Feature Transformation using UBM-based LDA for Speaker Recognition,” in Proc. Interspeech 2014, Singapore, Sep. 2014, pp. 1851–1854. [PDF] [bib]
  • Gang Liu, Chengzhu Yu, Abhinav Misra, Navid Shokouhi and John H.L. Hansen, "Investigating State-of-the-Art Speaker Verification in the Case of Unlabeled Development Data," in Proc. Odyssey 2014, The speaker and language recognition workshop, Joensuu, Finland, June 2014, pp. 118-122. [PDF][bib]
  • Gang Liu and John H.L. Hansen, "Supra-Segmental Feature Based Speaker Trait Detection," in Proc. Odyssey 2014, The speaker and language recognition workshop, Joensuu, Finland, June 2014.[PDF] [bib]
  • Chengzhu Yu, Gang Liu, Seongjun Hahm, and John H.L. Hansen, "Uncertainty Propagation in Front End Factor Analysis For Noise Robust Speaker Recognition," in Proc. ICASSP, Florence, Italy, May 2014, pp. 4045-4049. [PDF] [bib]
  • Ville Hautamaki, Kong Aik Lee, David van Leeuwen, Rahim Saeidi, Anthony Larcher, Tomi Kinnunen, Taufiq Hasan, Seyed Omid Sadjadi, Gang Liu, Hynek Boril, John H.L. Hansen and Benoit Fauve, "Automatic regularization of cross-entropy cost for speaker recognition fusion", in Proc. INTERSPEECH, Lyon, France, Aug. 25-29, 2013. [PDF][bib]
  • Rahim Saeidi, Kong Aik Lee, Tomi Kinnunen, Taufiq Hasan, Benoit Fauve, Pierre-Michel Bousquet, Elie Khoury, Pablo L. Sordo Martinez, Karen Kua, Changhuai You, Hanwu Sun, Anthony Larcher, Paddy Rajan, Ville Hautamaki, Cemal Hanilci, Billy Braithwaite, Rosa Gonzales-Hautamaki, Seyed Omid Sadjadi, Gang Liu, and Hynek Boril, "I4U submission to NIST SRE 2012: A large-scale collaborative effort for noise-robust speaker verification", in Proc. INTERSPEECH, Lyon, France, Aug. 25-29, 2013. [PDF][bib]
  • Gang Liu, Taufiq Hasan, Hynek Boril, John H.L. Hansen, "An investigation on back-end for speaker recognition in multi-session enrollment", in Proc. ICASSP, Vancouver, Canada, pp. 7755-7759, 2013. [PDF] [bib][talk][code]
  • Chenren Xu, Sugang Li, Gang Liu, Yanyong Zhang, Emiliano Miluzzo, Yih-Farn Chen, Jun Li, Bernhard Firner, "Crowd++: Unsupervised Speaker Count with Smartphones," The 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing (ACM UbiComp), Zurich, Switzerland, September 9-12, 2013. pp.43-52. Acceptance Rate: 18% (71 out of 395). [PDF][bib][slides][poster][code]
  • Taufiq Hasan, Seyed O. Sadjadi, Gang Liu, Navid Shokouhi, Hynek Boril, John H.L. Hansen, "CRSS systems for 2012 NIST speaker recognition evaluation", in Proc. ICASSP, Vancouver, Canada, pp. 6783-6787, 2013. (Best Paper Award) [PDF] [bib] (LINK)
  • Gang Liu, Jun-Won Suh, John H.L. Hansen, "A fast speaker verification with universal background support data selection", Proc. ICASSP2012, Kyoto, Japan, 2012. pp.4793-4796 [pdf] [bib]
  • Gang Liu, Yun Lei, John H.L. Hansen, "Robust feature front-end for speaker identification", ICASSP2012, Kyoto, Japan, 2012. pp.4233-4236 [pdf] [bib]
  • Jun-Won Suh, Seyed O. Sadjadi, Gang Liu, Taufiq Hasan, Keith W. Godin, and John H.L. Hansen, "Exploring Hilbert envelope based acoustic features in i-vector speaker verification using HT-PLDA", SRE2011 Workshop, Atlanta, USA [pdf] [bib]
          • Abstract: This paper describes the systems developed by the Center for Robust Speech Systems (CRSS), for the 2012 National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE). Given that the emphasis of SRE’12 is on noisy and short duration test conditions, our system development focused on: (i) novel robust acoustic features, (ii) new feature normalization schemes, (iii) various back-end strategies utilizing multi-session and multi-condition training, and (iv) quality measure based system fusion. Noisy and short duration training/test conditions are artificially generated and effectively utilized. Active speech duration and signal-to-noise-ratio (SNR) estimates are successfully employed as quality measures for system calibration and fusion. Overall system performance was very successful for the given test conditions.
          • Keywords: Feature normalization, NIST SRE, robust features,speaker verification, quality measure fusion, back-end
          • Abstract: This study explores various back-end classifiers for robust speaker recognition in multi-session enrollment, with emphasis on optimal utilization and organization of speaker information present in the development data. Our objective is to construct a highly discriminative back-end framework by fusing several back-ends on an i-vector system framework. It is demonstrated that, by using different information/data configuration and modeling schemes, performance of the fused system can be significantly improved compared to an individual system using a single front-end and back-end. Averaged across both genders, we obtain a relative improvement in EER and minDCF by 56.5% and 49.4%, respectively. Consistent performance gains obtained using the proposed strategy validates its effectiveness. This system is part of the CRSS’ NIST SRE 2012 submission system.
          • Keywords: Universal Background Support, PLDA, speaker recognition, GCDS, classification algorithms, UBSSVM
          • Abstract: In this study, a fast universal background support imposter data selection method is proposed, which is integrated within a support vector machine (SVM) based speaker verification system. Selection of an informative background dataset is crucial in constructing a discriminative decision super-plane between the enrollment and imposter speakers. Previous studies generally derive the optimal number of imposter examples from development data and apply to the evaluation data, which cannot guarantee consistent performance and often necessitate expensive searching. In the proposed method, the universal background dataset is derived so as to embed imposter knowledge in a more balanced way. Next, the derived dataset is taken as the imposter set in the SVM modeling process for each enrollment speaker. By using imposter adaptation, a more detailed subspace per target speaker can be constructed. Compared to the popular support-vector frequency based method, the proposed method can not only avoid parameter searching but offers a significant improvement and generalizes better on the unseen data.
          • Keywords: speaker verification, universal background dataset selection, adaptation, SVM, UBS, UBSSVM
          • Abstract: One important challenge for a speaker identification (SID) system is sustained performance in diverse conditions. This study presents a novel front-end feature extraction method for SID in clean, noisy, and channel-mismatched acoustic conditions. To address the problem, the perceptual minimum variance distortionless response (PMVDR) feature is employed. While PMVDR has been successfully used for noisy ASR, it has not been considered for SID. We also incorporate longer temporal speaker knowledge based on the shifted delta cepstral (SDC) algorithm. Evaluations over YOHO and a new diversified Robust Open-Set Speaker Identification (ROSSI) database show that both PMVDR and its union with SDC can improve performance significantly. Compared with traditional feature extraction, PMVDR and PMVDR-SDC always give improvement across diverse adverse conditions. Also, PMVDR-SDC can contribute additional improvement in the presence of noise and channel mismatch.
          • Keywords: PMVDR, SDC, speaker identification, noise, robustness, robust front-end
          • Abstract: In this study we evaluate the effectiveness of our recently introduced Mean Hilbert Envelope Coefficients (MHEC) in i-vector speaker verification using heavy-tailed probabilistic linear discriminant analysis (HT-PLDA) as the compensation/backend framework. The i-vectors are estimated for MHECs, and also the conventional and widely used MFCCs for comparison. The linear discriminant analysis (LDA) is employed for dimensionality reduction, and followed by the within class covariance normalization (WCCN) scheme to reduce the intra-speaker variability. Finally, scoring the i-vectors is accomplished through: 1) the cosine distance (CD) measure, and 2) the HT-PLDA framework. The impact of i-vector dimension on system performance is explored with the simple yet effective CD scoring. We report speaker verification performance on NIST SRE-2010 extended telephone and microphone trials. Experimental results confirm superiority of MHECs to traditional MFCCs in i-vector speaker verification. Finally, HT-PLDA framework provides significant performance improvement by effectively modeling total space of i-vectors.
          • Keywords: HT-PLDA, heavy-tailed PLDA, language identification
          • Abstract: In this paper we study automatic regularization techniques for the fusion of automatic speaker recognition systems. Parameter regularization could dramatically reduce the fusion training time. In addition, there will not be any need for splitting the development set into different folds for cross-validation. We utilize majorization-minimization approach to automatic ridge regression learning and design a similar way to learn LASSO regularization parameter automatically. By experiments we show improvement in using automatic regularization.
          • Keywords: cross-entropy, speaker recognition fusion, speaker verification
          • Abstract: I4U is a joint entry of nine research Institutes and Universities across 4 continents to NIST SRE 2012. It started with a brief discussion during the Odyssey 2012 workshop in Singapore. An online discussion group was soon set up, providing a discussion platform for different issues surrounding NIST SRE’12. Noisy test segments, uneven multi-session training, variable enrollment duration, and the issue of open-set identification were actively discussed leading to various solutions integrated to the I4U submission. The joint submission and several of its 17 subsystems were among top-performing systems. We summarize the lessons learnt from this large-scale effort.
          • Keywords: Speaker Verification, NIST SRE 2012, I4U, ivector
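The HT-PLDA abstract above scores i-vectors with, among other back-ends, the cosine distance (CD) measure. A minimal sketch of CD scoring after length normalization (function and argument names are illustrative, and the LDA/WCCN steps the abstract mentions are omitted here):

```python
import numpy as np

def cosine_score(enroll, test):
    """Cosine distance scoring of two i-vectors: length-normalize
    each vector, then take the inner product. Higher = more similar."""
    e = enroll / np.linalg.norm(enroll)
    t = test / np.linalg.norm(test)
    return float(e @ t)
```

In practice the i-vectors would first be projected by LDA and whitened by WCCN; CD scoring is attractive because, unlike PLDA, it needs no trained scoring model.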

Language Identification

  • John H.L. Hansen, Gang Liu, “Unsupervised accent classification for deep data fusion of acoustic and language information,” Speech Communication, vol. 78, Apr. 2016, pp. 19-33. [pdf] [bib][corpus][Impact Factor: 1.256][Download Data][Link]
  • Qian Zhang, Gang Liu, and John H. L. Hansen, “Robust Language Recognition Based on Diverse Features,” in Proc. Odyssey 2014, The speaker and language recognition workshop, Joensuu, Finland, June 2014. [PDF] [bib]
  • Gang Liu, Chi Zhang, John H.L. Hansen, "A Linguistic Data Acquisition Front-End for Language Recognition Evaluation", in Proc. Odyssey, Singapore, 25-28 June 2012 [pdf] [bib]
  • Gang Liu, John H. L. Hansen, "A systematic strategy for robust automatic dialect identification", EUSIPCO2011, Barcelona, Spain, 2011. pp.2138-2141 [pdf] [bib]
  • Gang Liu, Yun Lei, John H.L. Hansen, "Dialect Identification: Impact of difference between Read versus spontaneous speech", EUSIPCO-2010. Aalborg, Denmark, 2010. pp.2003-2006 [pdf] [bib]
          • Abstract: One of the major challenges of a language identification (LID) system comes from sparse training data. Manually collecting linguistic data in a controlled studio is usually expensive and impractical, but multilingual broadcast programs (Voice of America, for instance) can be collected as a reasonable alternative for linguistic data acquisition. However, unlike studio-collected linguistic data, broadcast programs usually contain much content other than pure linguistic data: musical content in the foreground/background, commercials, and noise from everyday life. In this study, a systematic processing approach is proposed to extract the linguistic data from broadcast media. Experimental results obtained on NIST LRE 2009 data show that the proposed method provides a 22.2% relative improvement in segmentation accuracy and a 20.5% relative improvement in LID accuracy.
          • Keywords: Data acquisition, purification, language identification
          • Abstract: Automatic Dialect Classification is very important for speech-based human computer interfaces and consumer electronic products. Although many studies have been performed in ideal environments, little work has been done with noisy or small data corpora, both of which are critical for the survival of a dialect identification system. This paper investigates a series of strategies to address the question of small and noisy dataset dialect classification. A novel hierarchical universal background model is proposed to address the question of a limited training dataset. To address the noise question, we initiate the use of perceptual minimum variance distortionless response (PMVDR), combined with the shifted delta cepstral (SDC) algorithm. Rotation forest is also explored to further improve system performance. Finally, compared with the baseline system, the proposed best system shows relative gains of 31.8% and 28.7% in the worst-noise and clean conditions on a small data set, respectively.
          • Keywords: Automatic dialect Classification, PMVDR, SDC, systematic strategy
          • Abstract: Automatic Dialect Classification (ADC) has recently gained substantial interest in the field of speech processing. Dialects of a language normally are reflected in terms of their phoneme space, word pronunciation/selection, and prosodic traits. These traits are clearly visible in natural speaker-to-speaker spontaneous conversations. However, dialect cues in prompted/read speech are often neglected by the community. In this study, we consider a systematic assessment of the differences between the acoustic characteristics of spontaneous and read speech and their effects on dialect identification performance. By examining both the model space and phoneme space of read and spontaneous dialect speech, we observe that each spans different dialect spaces and with distinct characteristics that need to be addressed respectively. From this comparison, we find useful clues to design more efficient identification systems. Finally, we also propose a novel feature extraction technique, PMVDR-SDC, and obtain a +26.4% relative improvement in dialect recognition rate.
          • Keywords: Automatic Dialect Classification, PMVDR, Read, Spontaneous
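The SDC features used in the papers above have a standard parameterization, often written N-d-P-k: N base cepstra per frame, delta spacing d, block shift P, and k stacked delta blocks. As a rough illustration of the stacking step only, here is a minimal NumPy sketch (the function name and edge-padding choice are mine, not from the papers):

```python
import numpy as np

def sdc(cepstra, d=1, p=3, k=7):
    """Shifted Delta Cepstral (SDC) feature stacking.

    cepstra: (T, N) array of per-frame cepstral coefficients.
    For each frame t, stacks k delta vectors computed at frame offsets
    t + i*p (i = 0..k-1), each delta spanning +/- d frames.
    Returns a (T, N*k) array; out-of-range frames are edge-padded.
    """
    T, N = cepstra.shape
    pad = d + (k - 1) * p          # widest shift any frame needs
    padded = np.pad(cepstra, ((pad, pad), (0, 0)), mode="edge")
    out = np.empty((T, N * k))
    for t in range(T):
        base = t + pad             # position of frame t in the padded array
        blocks = [padded[base + i * p + d] - padded[base + i * p - d]
                  for i in range(k)]
        out[t] = np.concatenate(blocks)
    return out
```

With the common 7-1-3-7 configuration, each frame's 7 base cepstra expand to a 49-dimensional SDC vector.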

Emotion Identification/Affective computing

  • Gang Liu and John H.L. Hansen, "Supra-Segmental Feature Based Speaker Trait Detection," in Proc. Odyssey 2014, The speaker and language recognition workshop, Joensuu, Finland, June 2014. [PDF] [bib]
  • Tauhidur Rahman, Soroosh Mariooryad, Shalini Keshavamurthy, Gang Liu, John H.L. Hansen, and Carlos Busso, "Detecting sleepiness by fusing classifiers trained with novel acoustic features", in Proc. INTERSPEECH, Florence, Italy, Aug. 2011, pp.3285-3288 [pdf]
  • Gang Liu, Yun Lei, John H.L. Hansen, "A Novel Feature Extraction Strategy for Multi-stream Robust Emotion Identification", in Proc. INTERSPEECH, Makuhari Messe, Japan, Sep. 2010, pp.482-485 [pdf]
          • Abstract: Automatic sleepiness detection is a challenging task that can lead to advances in various domains including traffic safety, medicine and human-machine interaction. This paper analyzes the discriminative power of different acoustic features to detect sleepiness. The study uses the sleepy language corpus (SLC). Along with standard acoustic features, novel features are proposed, including functionals across voiced-segment statistics in the F0 contour, likelihoods of reference models used to contrast non-neutral speech, and a set of noise-robust spectral features. These feature sets, which have performed well in other paralinguistic tasks such as emotion recognition, are used to train classifiers that are combined at the feature and decision levels. The best unweighted accuracy (UA) is obtained by combining the classifiers at the decision level under a maximum likelihood framework (UA = 70.97%). This performance is higher than the best results previously reported on this corpus.
          • Keywords: Speaker State Recognition, Paralinguistics, Affective Computing, Sleepiness
          • Abstract: We investigate an effective feature extraction front-end for speech emotion recognition, which performs well in clean and noisy conditions. First, we explore the use of perceptual minimum variance distortionless response (PMVDR) features. These features, originally proposed for accent/dialect and language identification (LID), better approximate the perceptual scales and are less sensitive to noise and speaker variation. The shifted delta cepstral (SDC) approach, also developed for LID, can be used to incorporate additional temporal information. It is known that supra-segmental speech characteristics, such as pitch and intensity, provide better discriminative information for emotion recognition when fused with other emotion-dependent features. Combining PMVDR and SDC, the system outperforms the baseline system (MFCC based) by 10.3% (absolute). Furthermore, we find both PMVDR and SDC offer much better robustness in noisy conditions, which is critical for real applications. All evaluations of the proposed features use the Berlin database of emotional speech.
          • Keywords: PMVDR, shifted delta cepstral, emotion identification, robustness
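The decision-level combination described in the sleepiness abstract can be sketched, under an independence assumption between classifiers, as a sum of per-classifier log-posteriors followed by an argmax. This is a generic maximum-likelihood-style fusion rule, not the authors' exact implementation:

```python
import numpy as np

def ml_decision_fusion(posteriors):
    """Fuse per-classifier class posteriors by summing log-probabilities
    (product rule under an independence assumption) and returning the
    index of the winning class.

    posteriors: (n_classifiers, n_classes) array, each row summing to 1.
    """
    logp = np.log(np.clip(posteriors, 1e-12, None))  # avoid log(0)
    return int(np.argmax(logp.sum(axis=0)))
```

In practice each classifier's posteriors would come from its own feature set (e.g. F0 functionals vs. spectral features), and the fused decision often beats any single classifier.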

Mobile computing:

  • Chenren Xu, Sugang Li, Gang Liu, Yanyong Zhang, Emiliano Miluzzo, Yih-Farn Chen, Jun Li, Bernhard Firner, "Crowd++: Unsupervised Speaker Count with Smartphones," The 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing (ACM UbiComp), Zurich, Switzerland, September 9-12, 2013. Acceptance Rate: 18% (71 out of 395). [PDF][bib]
          • Abstract: Smartphones are excellent mobile sensing platforms, with the microphone in particular being exercised in several audio inference applications. We take smartphone audio inference a step further and demonstrate for the first time that it is possible to accurately estimate the number of people talking in a certain place, with an average error distance of 1.5 speakers, through unsupervised machine learning analysis on audio segments captured by the smartphones. Inference occurs transparently to the user and no human intervention is needed to derive the classification model. Our results are based on the design, implementation, and evaluation of a system called Crowd++, involving 120 participants in 6 very different environments. We show that no dedicated external hardware or cumbersome supervised learning approaches are needed but only off-the-shelf smartphones used in a transparent manner. We believe our findings have profound implications in many research fields, including social sensing and personal wellbeing assessment.
          • Keywords: Audio Sensing, Smartphone Sensing, Speaker Counting
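The unsupervised counting idea in Crowd++ merges speech segments by acoustic similarity rather than training per-speaker models. A toy sketch of that idea using greedy centroid clustering with an illustrative cosine-distance threshold (the real system uses its own MFCC- and pitch-based features and distance metric, so names and the threshold here are assumptions):

```python
import numpy as np

def count_speakers(segment_embeddings, threshold=0.35):
    """Greedy unsupervised speaker counting.

    Assigns each segment embedding to the nearest existing cluster
    centroid if its cosine distance is below `threshold`, otherwise
    opens a new cluster. The cluster count is the speaker estimate.
    """
    centroids, counts = [], []
    for e in segment_embeddings:
        e = np.asarray(e, dtype=float)
        e = e / np.linalg.norm(e)
        best, best_d = -1, threshold
        for i, c in enumerate(centroids):
            d = 1.0 - float(e @ (c / np.linalg.norm(c)))  # cosine distance
            if d < best_d:
                best, best_d = i, d
        if best < 0:
            centroids.append(e.copy())
            counts.append(1)
        else:  # running-mean centroid update
            centroids[best] = (centroids[best] * counts[best] + e) / (counts[best] + 1)
            counts[best] += 1
    return len(centroids)
```

The threshold trades over- vs. under-counting; the paper tunes this kind of decision empirically across environments.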

Automatic Speech Recognition:

  • Gang Liu, Dimitrios Dimitriadis and Enrico Bocchieri, "Robust speech enhancement techniques for ASR in non-stationary noise and dynamic environments", in Proc. INTERSPEECH2013, Lyon, France, Aug. 25-29, 2013. [pdf] [bib] [Demo]
          • Abstract: In current ASR systems, the presence of competing speakers greatly degrades recognition performance. This phenomenon becomes even more prominent in hands-free, far-field ASR systems like "Smart-TV" systems, where reverberation and non-stationary noise pose additional challenges. Furthermore, speakers most often do not stand still while speaking. To address these issues, we propose a cascaded system that includes Time Differences of Arrival estimation, multi-channel Wiener Filtering, non-negative matrix factorization (NMF), multicondition training, and robust feature extraction, where each component additively improves the overall performance. The final cascaded system presents an average of 50% and 45% relative improvement in ASR word accuracy for the CHiME 2011 (non-stationary noise) and CHiME 2012 (non-stationary noise plus speaker head movement) tasks, respectively.
          • Keywords: array signal processing, ASR, robustness, acoustic noise, non-negative matrix factorization
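One component of the cascade above, non-negative matrix factorization, decomposes a magnitude spectrogram V into non-negative spectral bases W and activations H; enhancement then reconstructs only the speech-associated components. A minimal sketch of the factorization itself via Lee-Seung multiplicative updates (a generic NMF, not necessarily the paper's specific variant):

```python
import numpy as np

def nmf(V, rank, iters=200, eps=1e-9):
    """Multiplicative-update NMF minimizing ||V - W @ H||_F^2.

    V: non-negative (F, T) magnitude spectrogram.
    Returns W (F, rank) spectral bases and H (rank, T) activations,
    both non-negative by construction.
    """
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(iters):
        # multiplicative updates keep all entries non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

In an enhancement setting, part of W would typically be pre-trained on noise so that, at test time, the speech estimate is rebuilt from the remaining (speech) bases and their activations.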