Publications
For the full list, please check this website.
Mahapatra, A., Ulgen, I. R., & Sisman, B. (2025). HuLA: Prosody-Aware Anti-Spoofing with Multi-task Learning for Expressive and Emotional Synthetic Speech. IEEE Transactions on Affective Computing. (submitted; arXiv preprint available)
Ulgen, I. R., Du, Z., Lu, J., Koehn, P., & Sisman, B. (2025). Objective Evaluation of Speech Synthesis through Conditional Prediction of Discrete Tokens. IEEE Open Journal of Signal Processing. (submitted; arXiv preprint available)
Ulgen, I. R., Chandra, S. S., & Sisman, B. (2025). Text-to-Speech for Unseen Speakers via Low-Complexity Discrete Unit-Based Frame Selection. IEEE Open Journal of Signal Processing. (submitted; arXiv preprint available)
Jia, Z., Liu, R., Sisman, B., & Li, H. (2025). Multimodal Fine-grained Context Interaction Graph Modeling for Conversational Speech Synthesis. EMNLP 2025.
Lam, P., Zhang, H., Chen, N. F., Sisman, B., & Herremans, D. (2025). PRESENT: Zero-Shot Text-to-Prosody Control. IEEE Signal Processing Letters.
Mahapatra, A., Ulgen, I. R., Naini, A. R., Busso, C., & Sisman, B. (2025). Can Emotion Fool Anti-spoofing? Interspeech 2025.
Chandra, S. S., Goncalves, L., Lu, J., Busso, C., & Sisman, B. (2025). EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast. Interspeech 2025.
Liu, R., Gao, P., Xi, J., Sisman, B., Busso, C., & Li, H. (2025). Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset. Interspeech 2025.
Naini, A. R., Goncalves, L., Salman, A. N., Mote, P., Ulgen, I. R., Thebaud, T., ... & Busso, C. (2025). The Interspeech 2025 Challenge on Speech Emotion Recognition in Naturalistic Conditions. Interspeech 2025.
Rosero, K., Salman, A. N., Chandra, S., Sisman, B., van't Slot, C., Kane, A. A., ... & Busso, C. (2025). Advancing Pediatric ASR: The Role of Voice Generation in Disordered Speech. Interspeech 2025.
Liu, R., Sisman, B., Gao, G., & Li, H. (2024). Controllable accented text-to-speech synthesis with fine- and coarse-grained intensity rendering. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
Goncalves, L., Leem, S. G., Lin, W. C., Sisman, B., & Busso, C. (2024). Versatile audio-visual learning for emotion recognition. IEEE Transactions on Affective Computing.
Rajapakshe, T., Rana, R., Khalifa, S., Sisman, B., Schuller, B. W., & Busso, C. (2024). EmoDARTS: Joint optimization of CNN and sequential neural network architectures for superior speech emotion recognition. IEEE Access.
K. Zhou, B. Sisman, R. Rana, B. W. Schuller, and H. Li, “Emotion Intensity and Its Control for Emotional Voice Conversion,” IEEE Transactions on Affective Computing, pp. 1-18, 2022.
R. Liu, B. Sisman, G. Gao and H. Li, “Decoding Knowledge Transfer for Neural Text-to-Speech Training,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022.
B. Sisman, J. Yamagishi, S. King and H. Li, “An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 132-157, 2021, doi: 10.1109/TASLP.2020.3038524.
K. Zhou, B. Sisman, R. Liu and H. Li, “Emotional Voice Conversion: Theory, Databases and ESD,” Speech Communication, 2021.
R. Liu, B. Sisman, F. Bao, J. Yang, G. Gao and H. Li, “Exploiting Morphological and Phonological Features to Improve Prosodic Phrasing for Mongolian Speech Synthesis,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 274-285, 2021, doi: 10.1109/TASLP.2020.3040523.
R. Liu, B. Sisman, Y. Lin and H. Li, “FastTalker: A Neural Text-to-Speech Architecture with Shallow and Group Autoregression,” Neural Networks, vol. 141, pp. 306-314, 2021.
R. Liu, B. Sisman, G. Gao and H. Li, “Expressive TTS Training With Frame and Style Reconstruction Loss,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1806-1818, 2021, doi: 10.1109/TASLP.2021.3076369.
R. Liu, B. Sisman, F. Bao, G. Gao and H. Li, “Modeling Prosodic Phrasing With Multi-Task Learning in Tacotron-Based TTS,” in IEEE Signal Processing Letters, vol. 27, pp. 1470-1474, 2020, doi: 10.1109/LSP.2020.3016564.
M. Zhang, B. Sisman and H. Li, “DBLSTM-based Voice Conversion with WaveNet Vocoder for Limited Parallel Data,” Speech Communication, vol. 122, pp. 31-43, 2020.
B. Sisman, M. Zhang and H. Li, “Group Sparse Representation With WaveNet Vocoder Adaptation for Spectrum and Prosody Conversion,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 6, pp. 1085-1097, June 2019, doi: 10.1109/TASLP.2019.2910637.
Liu, R., Sisman, B., Schuller, B., Gao, G., Li, H. Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning. Proc. Interspeech 2022, 5493-5497.
Du, Z., Sisman, B., Zhou, K., Li, H. Disentanglement of Emotional Style and Speaker Identity for Expressive Voice Conversion. Proc. Interspeech 2022, 2603-2607.
Lam, P., Zhang, H., Chen, N., Sisman, B. EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models. Proc. Interspeech 2022, 823-827.
J. Lu, B. Sisman, R. Liu, et al., “VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over,” ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 8032-8036.
Z. Du, B. Sisman, K. Zhou and H. Li, “Expressive Voice Conversion: A Joint Framework for Speaker Identity and Emotional Style Transfer,” 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021.
S. Nikonorov, B. Sisman, M. Zhang and H. Li, “DeepA: A Deep Neural Analyzer for Speech and Singing Vocoding,” 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021.
K. Zhou, B. Sisman, R. Liu and H. Li, “Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training,” INTERSPEECH 2021.
R. Liu, B. Sisman and H. Li, “Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability,” INTERSPEECH 2021.
K. Zhou, B. Sisman, R. Liu and H. Li, “Seen and Unseen Emotional Style Transfer for Voice Conversion with A New Emotional Speech Dataset,” ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 920-924, doi: 10.1109/ICASSP39728.2021.9413391.
R. Liu, B. Sisman and H. Li, “Graphspeech: Syntax-Aware Graph Attention Network for Neural Speech Synthesis,” ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6059-6063, doi: 10.1109/ICASSP39728.2021.9413513.
K. Zhou, B. Sisman and H. Li, “Vaw-Gan For Disentanglement And Recomposition Of Emotional Elements In Speech,” 2021 IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 415-422, doi: 10.1109/SLT48900.2021.9383526.
B. Sisman, J. Li, F. Bao, G. Gao and H. Li, “Teacher-Student Training For Robust Tacotron-Based TTS,” ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6274-6278, doi: 10.1109/ICASSP40776.2020.9054681.
Z. Du, K. Zhou, B. Sisman and H. Li, “Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN,” 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2020, pp. 507-513.
R. Liu, B. Sisman, F. Bao, et al., “WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss,” Odyssey 2020: The Speaker and Language Recognition Workshop, 2020, pp. 245-251.
M. Zhang, B. Sisman, L. Zhao, et al., “DeepConversion: Voice Conversion with Limited Parallel Training Data,” Speech Communication, vol. 122, pp. 31-43, 2020.
B. Sisman and H. Li, “Generative Adversarial Networks for Singing Voice Conversion with and without Parallel Data,” Speaker Odyssey 2020, pp. 238-244.
K. Zhou, B. Sisman, M. Zhang and H. Li, “Converting Anyone’s Emotion: Towards Speaker-Independent Emotional Voice Conversion,” INTERSPEECH 2020.
K. Zhou, B. Sisman and H. Li, “Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-parallel Data,” Speaker Odyssey 2020.
B. Sisman, M. Zhang, M. Dong and H. Li, “On the Study of Generative Adversarial Networks for Cross-Lingual Voice Conversion,” 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 144-151, doi: 10.1109/ASRU46091.2019.9003939.
B. Sisman, K. Vijayan, M. Dong and H. Li, “SINGAN: Singing Voice Conversion with Generative Adversarial Networks,” 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2019, pp. 112-118, doi: 10.1109/APSIPAASC47483.2019.9023162.
A. Tjandra, B. Sisman, M. Zhang, S. Sakti, H. Li and S. Nakamura, “VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for Zerospeech Challenge 2019,” INTERSPEECH 2019.
B. Sisman, M. Zhang, S. Sakti, H. Li and S. Nakamura, “Adaptive Wavenet Vocoder for Residual Compensation in GAN-Based Voice Conversion,” 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 282-289, doi: 10.1109/SLT.2018.8639507.
B. Sisman and H. Li, “Limited Data Voice Conversion from Sparse Representation to GANs and WaveNet,” APSIPA ASC 2018 PhD Forum, Honolulu, Hawaii, United States. [Best Presentation Award]
B. Sisman, M. Zhang and H. Li, “A Voice Conversion Framework with Tandem Feature Sparse Representation and Speaker-Adapted WaveNet Vocoder,” INTERSPEECH 2018, India.
B. Sisman and H. Li, “Wavelet Analysis of Speaker Dependent and Independent Prosody for Voice Conversion,” INTERSPEECH 2018, pp. 52-56.
B. Sisman, G. Lee and H. Li, “Phonetically Aware Exemplar-Based Prosody Transformation,” Speaker Odyssey 2018, France.
M. Zhang, B. Sisman, S. S. Rallabandi, H. Li and L. Zhao, “Error Reduction Network for DBLSTM-based Voice Conversion,” APSIPA ASC 2018, Honolulu, Hawaii, United States.
J. Xiao, S. Yang, M. Zhang, B. Sisman, D. Huang, L. Xie, M. Dong and H. Li, “The I2R-NWPU-NUS Text-to-Speech System for Blizzard Challenge 2018,” INTERSPEECH Blizzard Challenge 2018 Workshop.
B. Sisman, H. Li and K. C. Tan, “Transformation of Prosody in Voice Conversion,” APSIPA ASC 2017, Kuala Lumpur, Malaysia.
B. Sisman, H. Li and K. C. Tan, “Sparse Representation of Phonetic Features for Voice Conversion with and without Parallel Data,” IEEE ASRU 2017, Okinawa, Japan.