Tools for Vietnamese NLP
ViSoLex: An Open-Source Repository for Vietnamese Social Media Lexical Normalization
Abstract: ViSoLex is an open-source system designed to address the unique challenges of lexical normalization for Vietnamese social media text. The platform provides two core services: Non-Standard Word (NSW) Lookup and Lexical Normalization, enabling users to retrieve standard forms of informal language and standardize text containing NSWs. ViSoLex’s architecture integrates pre-trained language models and weakly supervised learning techniques to ensure accurate and efficient normalization, overcoming the scarcity of labeled data in Vietnamese. This paper details the system’s design, functionality, and its applications for researchers and non-technical users. Additionally, ViSoLex offers a flexible, customizable framework that can be adapted to various datasets and research requirements. By publishing the source code, ViSoLex aims to contribute to the development of more robust Vietnamese natural language processing tools and encourage further research in lexical normalization. Future directions include expanding the system’s capabilities for additional languages and improving the handling of more complex non-standard linguistic patterns.
Video: Link.
GitHub: Link.
Paper: Anh Thi-Hoang Nguyen, Dung Ha Nguyen, and Kiet Van Nguyen. 2025. ViSoLex: An Open-Source Repository for Vietnamese Social Media Lexical Normalization. In Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations, pages 183–188, Abu Dhabi, UAE. Association for Computational Linguistics.
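The two services named in the abstract, NSW Lookup and Lexical Normalization, can be illustrated with a minimal dictionary-based sketch. This is not ViSoLex's API or method: the system described above relies on pre-trained language models and weak supervision, whereas the code below swaps in a plain lookup table. The names `NSW_LEXICON`, `lookup_nsw`, and `normalize` are hypothetical, and the three lexicon entries are illustrative examples of common Vietnamese chat abbreviations.

```python
import re
import unicodedata

# Hypothetical toy lexicon (an assumption for illustration only):
# non-standard word -> list of candidate standard forms.
NSW_LEXICON = {
    "ko": ["không"],
    "dc": ["được"],
    "j": ["gì"],
}


def lookup_nsw(token):
    """NSW Lookup: return candidate standard forms, or [] if the token is unknown."""
    return NSW_LEXICON.get(token.lower(), [])


def normalize(text):
    """Lexical Normalization: replace each known NSW with its first candidate.

    Unknown tokens, whitespace, and punctuation are left unchanged.
    Capitalization of a replaced token is not preserved.
    """
    # Compose diacritics first so that \w+ matches whole Vietnamese words.
    text = unicodedata.normalize("NFC", text)

    def replace(match):
        word = match.group(0)
        forms = lookup_nsw(word)
        return forms[0] if forms else word

    return re.sub(r"\w+", replace, text)
```

Calling `lookup_nsw("ko")` returns the candidate list for that token, while `normalize` applies the lookup to every token in a sentence. A model-based normalizer would instead choose the standard form from context, which a fixed table cannot do for ambiguous abbreviations.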
ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing
Abstract: In this paper, we present the first monolingual pre-trained language model for Vietnamese social media texts, ViSoBERT, which is pre-trained on a large-scale corpus of high-quality and diverse Vietnamese social media texts using the XLM-R architecture. Moreover, we evaluated our pre-trained model on five important downstream natural language tasks on Vietnamese social media texts: emotion recognition, hate speech detection, sentiment analysis, spam review detection, and hate speech span detection. Our experiments demonstrate that ViSoBERT, with far fewer parameters, surpasses the previous state-of-the-art models on multiple Vietnamese social media tasks. Our ViSoBERT model is available only for research purposes.
Video: Link.
Hugging Face: Link.
Paper: Nam Nguyen, Thang Phan, Duc-Vu Nguyen, and Kiet Nguyen. 2023. ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5191–5207, Singapore. Association for Computational Linguistics.
CafeBERT: A Pre-Trained Language Model for Vietnamese
Abstract: CafeBERT is a new state-of-the-art pre-trained model that achieves superior results across all tasks in the VLUE benchmark. Our model combines the proficiency of a multilingual pre-trained model with Vietnamese linguistic knowledge. CafeBERT is developed based on the XLM-RoBERTa model, with an additional pretraining step utilizing a significant amount of Vietnamese textual data to enhance its adaptation to the Vietnamese language. CafeBERT is made publicly available for research purposes.
Video: Link.
Hugging Face: Link.
Paper: Phong Nguyen-Thuan Do, Son Quoc Tran, Phu Gia Hoang, Kiet Van Nguyen, and Ngan Luu-Thuy Nguyen. 2024. VLUE: A New Benchmark and Multi-task Knowledge Transfer Learning for Vietnamese Natural Language Understanding. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 211–222, Mexico City, Mexico. Association for Computational Linguistics.