Vietnamese Benchmark Datasets For Research and Education

Keywords: Vietnamese datasets, Vietnamese corpora, Vietnamese corpus, Vietnamese resources. Introduction Slides.

UIT-ViQuAD - A Vietnamese Dataset for Evaluating Machine Reading Comprehension. Vietnamese title: Bộ Dữ liệu Đọc hiểu Tự động cho Tiếng Việt (Machine Reading Comprehension Dataset for Vietnamese).

Abstract: Over 97 million people worldwide speak Vietnamese as their native language. However, there are few research studies on machine reading comprehension (MRC) for Vietnamese, the task of understanding a text and answering questions related to it. Due to the lack of benchmark datasets for Vietnamese, we present the Vietnamese Question Answering Dataset (UIT-ViQuAD), a new dataset for evaluating MRC models on a low-resource language such as Vietnamese. This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia. In particular, we propose a new process of dataset creation for Vietnamese MRC. Our in-depth analyses illustrate that our dataset requires abilities beyond simple reasoning like word matching and demands single-sentence and multiple-sentence inferences. Besides, we evaluate state-of-the-art MRC methods developed for English and Chinese as the first experimental models on UIT-ViQuAD. We also estimate human performance on the dataset and compare it to the experimental results of powerful machine learning models. The substantial difference between human performance and the best model performance on the dataset indicates that improvements can be made on UIT-ViQuAD in future research. Our dataset is freely available on our website to encourage the research community to overcome challenges in Vietnamese MRC.

Other Machine Reading Comprehension datasets: SQuAD (for English), UIT-ViQuAD (for Vietnamese), KorQuAD (for Korean), FQuAD (for French), and SberQuAD (for Russian).

Paper: Kiet Van Nguyen, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen. A Vietnamese Dataset for Evaluating Machine Reading Comprehension. COLING 2020. Link.

To access this dataset, please complete and sign the dataset user agreement and then send it via email (kietnv@uit.edu.vn) to receive the dataset.

UIT-ViWikiQA: A Sentence Extraction-Based Machine Reading Comprehension Dataset for Vietnamese

Abstract: The development of Vietnamese language processing in general, and machine reading comprehension in particular, has attracted great attention from the research community. In recent years, a few large datasets have been released for machine reading comprehension in Vietnamese, such as UIT-ViQuAD and UIT-ViNewsQA. However, these datasets are not diverse enough in answer types to serve the research. In this paper, we introduce UIT-ViWikiQA, the first dataset for evaluating sentence extraction-based machine reading comprehension in the Vietnamese language. The UIT-ViWikiQA dataset is converted from the UIT-ViQuAD dataset and comprises 23,074 question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia. We propose a conversion algorithm to create the dataset for sentence extraction-based machine reading comprehension, and three types of approaches to sentence extraction-based machine reading comprehension for Vietnamese. Our experiments show that the best machine model is XLM-R Large, which achieves an exact match (EM) score of 85.97% and an F1-score of 88.77% on our dataset. Besides, we analyze the experimental results in terms of Vietnamese question types and the effect of context on the performance of the MRC models, thereby showing the challenges that the UIT-ViWikiQA dataset poses to the natural language processing community.
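
The abstract does not spell out the conversion algorithm. As a rough illustration only, the idea of turning a span-level answer into a sentence-level answer can be sketched as follows. The function names, the punctuation-based sentence splitter and the use of a character offset for the answer are all assumptions for this sketch, not the authors' method.

```python
import re


def split_sentences(passage):
    """Split a passage into (start_offset, sentence) pairs.

    Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
    A real pipeline would use a Vietnamese-aware sentence segmenter.
    """
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", passage):
        end = match.end()
        sentences.append((start, passage[start:end].strip()))
        start = end
    if start < len(passage):
        sentences.append((start, passage[start:].strip()))
    return sentences


def span_to_sentence(passage, answer_start):
    """Return (sentence_start, sentence) for the sentence covering answer_start.

    Hypothetical helper: answer_start is the character offset of a
    span-level answer inside the passage.
    """
    chosen = None
    for start, sentence in split_sentences(passage):
        if start <= answer_start:
            chosen = (start, sentence)
        else:
            break
    return chosen
```

Given a passage and the character offset of a span answer, `span_to_sentence` returns the whole sentence that contains the span, which then serves as the sentence-level answer.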

Paper: Do, P.N.T., Nguyen, N.D., Van Huynh, T., Van Nguyen, K., Nguyen, A.G.T. and Nguyen, N.L.T., 2021. Sentence Extraction-Based Machine Reading Comprehension for Vietnamese. arXiv preprint arXiv:2105.09043.

To access this dataset, please complete and sign the dataset user agreement and then send it via email (kietnv@uit.edu.vn) to receive the dataset.

UIT-ViNewsQA: New Vietnamese Corpus for Machine Reading Comprehension of Health News Articles

Large-scale, high-quality corpora are necessary for evaluating machine reading comprehension models on a low-resource language like Vietnamese. In addition, machine reading comprehension for the health domain offers great potential for practical applications; however, there is still very little machine reading comprehension research in this domain. In this study, we present UIT-ViNewsQA, a new corpus for the Vietnamese language to evaluate healthcare reading comprehension models. The corpus comprises 22,077 human-generated question-answer pairs. Crowd-workers create the questions and their answers based on a set of over 4,419 online Vietnamese healthcare news articles, where the answers comprise spans extracted from the corresponding articles. In particular, we develop a process for creating a corpus for Vietnamese machine reading comprehension. Comprehensive evaluations demonstrate that our corpus requires abilities beyond simple reasoning such as word matching, and demands more difficult reasoning based on single-sentence or multiple-sentence information. We conduct experiments using state-of-the-art methods for machine reading comprehension to obtain the first baseline performance measures, against which the performance of future models can be compared. We measure human performance on the corpus and compare it with several strong neural network-based models. Our experiments show that the best model is BERT, which achieves an exact match score of 57.57% and an F1-score of 76.90% on our corpus. The significant difference between humans and the best model (15.93% in F1-score) on the test set of our corpus indicates that improvements on UIT-ViNewsQA can be explored in future research. Our corpus is freely available on our website in order to encourage the research community to make these improvements.
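
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of exact match and token-level F1 in the style commonly used for span-extraction MRC. The normalization (lowercasing and whitespace tokenization) is an assumption of this sketch; the evaluation script actually used for UIT-ViNewsQA may normalize answers differently.

```python
from collections import Counter


def normalize(text):
    # Assumption: lowercase and split on whitespace. Official evaluation
    # scripts may also strip punctuation or apply word segmentation.
    return text.lower().split()


def exact_match(prediction, gold):
    """1.0 if the normalized prediction equals the normalized gold answer."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Token-level F1 between a predicted answer and a gold answer."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction that contains the gold answer plus extra words scores 0 on exact match but receives partial credit on F1, which is why F1-scores are typically higher than exact match scores.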

Paper: Kiet Van Nguyen, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen. New Vietnamese Corpus for Machine Reading Comprehension of Health News Articles. Link.

To access this dataset, please complete and sign the dataset user agreement and then send it via email: kietnv@uit.edu.vn (Mr. Kiet Nguyen) to receive the dataset.

ViMMRC (version 1.0) - Vietnamese Multiple-choice Machine Reading Comprehension Corpus

Abstract: Machine Reading Comprehension (MRC) is the natural language processing task that studies the ability to read and understand unstructured texts and then find the correct answers to questions. Until now, there has been no MRC dataset for a low-resource language such as Vietnamese. In this paper, we introduce ViMMRC, a challenging machine comprehension corpus with multiple-choice questions, intended for research on the machine comprehension of Vietnamese text. This corpus includes 2,783 multiple-choice questions and answers based on a set of 417 Vietnamese texts used for teaching reading comprehension to 1st to 5th graders. Answers may be extracted from the contents of single or multiple sentences in the corresponding reading text. A thorough analysis of the corpus and the experimental results in this paper illustrate that ViMMRC demands reasoning abilities beyond simple word matching. We propose the Boosted Sliding Window (BSW) method, which improves accuracy by 5.51% over the best baseline method. We also measure human performance on the corpus and compare it to our MRC models. The performance gap between humans and our best experimental model indicates that significant progress can be made on Vietnamese machine reading comprehension in further research. The corpus is freely available on our website for research purposes.
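
To make the lexical-matching family of baselines concrete, here is a simplified, unweighted sliding-window scorer for multiple-choice questions. It is not the Boosted Sliding Window method proposed in the paper, whose details are not given in the abstract. The function names and the whitespace tokenization are assumptions of this sketch.

```python
def sliding_window_score(passage_tokens, target_tokens):
    """Highest count of target tokens found in any passage window.

    The window size equals the number of distinct target tokens.
    """
    target = set(target_tokens)
    size = max(1, len(target))
    best = 0
    for i in range(max(1, len(passage_tokens) - size + 1)):
        window = passage_tokens[i:i + size]
        best = max(best, sum(1 for tok in window if tok in target))
    return best


def choose_option(passage, question, options):
    """Return the index of the option whose words, together with the
    question words, overlap most with some window of the passage.

    Assumption: lowercasing and whitespace tokenization; ties go to the
    first option with the highest score.
    """
    passage_tokens = passage.lower().split()
    question_tokens = question.lower().split()
    scores = [
        sliding_window_score(passage_tokens, question_tokens + opt.lower().split())
        for opt in options
    ]
    return scores.index(max(scores))
```

A scorer like this only rewards word overlap, so it tends to fail on questions whose answers require combining information from several sentences.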

Paper: Kiet Van Nguyen, Khiem Vinh Tran, Son T. Luu, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen, Enhancing lexical-based approach with external knowledge for Vietnamese multiple-choice reading comprehension. Link.

Please download this dataset/corpus here.

UIT-ViCTSD (Vietnamese Constructive and Toxic Speech Detection dataset)

Abstract: The rise of social media has led to an increasing number of comments on online forums. However, there still exist some invalid comments that are not informative for users. Moreover, those comments can also be quite toxic and harmful to people. In this paper, we create a dataset for constructive and toxic speech detection, named UIT-ViCTSD (Vietnamese Constructive and Toxic Speech Detection dataset), with 10,000 human-annotated comments. For these tasks, we propose a system for constructive and toxic speech detection based on PhoBERT, the state-of-the-art transfer learning model in Vietnamese NLP. With this system, we achieve F1-scores of 78.59% and 59.40% for identifying constructive and toxic comments, respectively. Besides, to obtain an objective assessment of the dataset, we implement a variety of baseline models, including traditional machine learning and deep neural network-based models. With these results, we can address some problems in online discussions and develop a framework for automatically identifying constructiveness and toxicity in Vietnamese social media comments.

Paper: Luan Thanh Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen. Constructive and Toxic Speech Detection for Open-domain Social Media Comments in Vietnamese. The 34th International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems (IEA/AIE 2021). Link.

Please contact us via email: 17520721@gm.uit.edu.vn (Mr. Luan Nguyen) to sign the corpus user agreement and then receive the corpus.

UIT-VSFC (version 1.0) - Vietnamese Students’ Feedback Corpus

Abstract: Students’ feedback is a vital resource for interdisciplinary research that combines two fields: sentiment analysis and education. The Vietnamese Students’ Feedback Corpus (UIT-VSFC) is a resource consisting of over 16,000 sentences that are human-annotated for two different tasks: sentiment-based and topic-based classification. To assess the quality of our corpus, we measure annotator agreement and classification performance on the UIT-VSFC corpus. As a result, we obtained inter-annotator agreement of more than 91% for sentiments and 71% for topics. In addition, we built a baseline model with the Maximum Entropy classifier and achieved approximately 88% sentiment F1-score and over 84% topic F1-score.

Paper: Kiet Van Nguyen, Vu Duc Nguyen, Phu Xuan-Vinh Nguyen, Tham Thi-Hong Truong, Ngan Luu-Thuy Nguyen, UIT-VSFC: Vietnamese Students' Feedback Corpus for Sentiment Analysis, 2018 10th International Conference on Knowledge and Systems Engineering (KSE 2018), November 1-3, 2018, Ho Chi Minh City, Vietnam. Link.

Please download this dataset/corpus here.

UIT-VSMEC (version 1.0) - Vietnamese Social Media Emotion Corpus

Emotion recognition is a finer-grained form, or special case, of sentiment analysis. In this task, the result is expressed neither as a polarity (positive or negative) nor as a rating (from 1 to 5), but at a more detailed level, using emotions such as sadness, enjoyment, anger, disgust, fear and surprise. Emotion recognition plays a critical role in measuring the brand value of a product by recognizing specific emotions in customers’ comments. In this study, we have achieved two targets. First and foremost, we built a standard Vietnamese Social Media Emotion Corpus (UIT-VSMEC) with 6,927 human-annotated sentences with six emotion labels, contributing to emotion recognition research in Vietnamese, which is a low-resource language in Natural Language Processing (NLP). Secondly, we assessed and measured machine learning and deep neural network models on our UIT-VSMEC. As a result, the Convolutional Neural Network (CNN) model achieved the highest performance, with an F1-score of 57.61%.

Paper: Vong Ho, Duong Nguyen, Danh Nguyen, Linh Pham, Kiet Nguyen and Ngan Nguyen, Emotion Recognition for Vietnamese Social Media Text, 2019 16th International Conference of the Pacific Association for Computational Linguistics (PACLING 2019), October 11-13, 2019, Ha Noi, Vietnam. Link.

Please download this dataset/corpus here.

UIT-ViIC (version 1.0) - Vietnamese Image Captioning Dataset

Automatic generation of image captions has in recent years attracted attention from researchers in various fields of computer science, such as computer vision, natural language processing and machine learning. This paper contributes to the image captioning problem by extending an image captioning dataset to a different language. In particular, we concentrate on generating Vietnamese captions for images, as no image captioning dataset exists for Vietnamese. We propose a dataset called UIT-ViIC, which was annotated manually in Vietnamese using images from the MS-COCO dataset. In addition, we built a web-based annotation tool to improve annotators' performance. UIT-ViIC in this scope consists of 19,250 captions for 3,850 images of sports played with a ball. UIT-ViIC is then evaluated on existing deep neural network models for image captioning. Our dataset in this scope will be published on our lab website for research purposes.

Paper: Quan Hoang Lam, Quang Duy Le, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen. UIT-ViIC: A Dataset for the First Evaluation on Vietnamese Image Captioning. Link.

Please download this dataset/corpus here.

UIT-ViNames (version 1.0) - Vietnamese Name Dataset

Abstract: As biological gender is one of the aspects of representing an individual human, much work has been done on gender classification based on people's names. The proposals for English and Chinese are numerous; still, there have been few works for Vietnamese so far. We propose a new dataset for gender prediction based on Vietnamese names. This dataset comprises over 26,000 full names annotated with genders. This dataset is available on our website for research purposes. In addition, this paper describes six machine learning algorithms (Support Vector Machine, Multinomial Naive Bayes, Bernoulli Naive Bayes, Decision Tree, Random Forest and Logistic Regression) and a deep learning model (LSTM) with fastText word embeddings for gender prediction on Vietnamese names. We create a dataset and investigate the impact of each name component on detecting gender. As a result, the best F1-score that we have achieved is up to 96% with the LSTM model, and we provide a web API based on our trained model.
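
The abstract mentions studying the impact of each name component but does not define the components. As a hedged illustration, the sketch below assumes the common Vietnamese name order of family name, middle name(s) and given name; the function name and this three-way split are assumptions, not taken from the paper.

```python
def split_vietnamese_name(full_name):
    """Split a full name into (family, middle, given) components.

    Assumption: the first token is the family name, the last token is the
    given name, and everything in between is the middle name. Compound
    family or given names are not handled.
    """
    tokens = full_name.split()
    if not tokens:
        return ("", "", "")
    if len(tokens) == 1:
        # A single token is treated as a given name.
        return ("", "", tokens[0])
    return (tokens[0], " ".join(tokens[1:-1]), tokens[-1])
```

Each component, or any combination of them, could then be fed to a classifier separately to compare how much it contributes to gender prediction.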

Paper: Huy Quoc To, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen, and Anh Gia-Tuan Nguyen. Gender Prediction Based on Vietnamese Names with Machine Learning Techniques. Link.

A trial API for the Vietnamese Name Dataset is available here: Link.

To access this dataset, please complete and sign the dataset user agreement and then send it via email: huytq@uit.edu.vn (Mr. Huy To) to receive the dataset.

UIT-ViOCD: Vietnamese Open-domain Complaint Detection Dataset

Customer product reviews play a role in improving the quality of products and services for organizations or brands. Complaining is an attitude that expresses dissatisfaction with an event or a product that does not meet customer expectations. In this paper, we build a Vietnamese dataset (UIT-ViOCD) of 5,485 human-annotated product reviews in four categories from e-commerce sites. After the data collection phase, we proceed to the annotation task and achieve an inter-annotator agreement of Am = 87% by Fleiss' Kappa. Then, we present an extensive methodology for research purposes and achieve an F1-score of 92.16% for identifying complaints. Building on these results, we plan to build a system for open-domain complaint detection on e-commerce websites.
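
Since the agreement figure above is reported using Fleiss' Kappa, a minimal sketch of the standard Fleiss' kappa formula may help. How the authors derive Am from it is not stated here, so the function below is only the textbook statistic, and its name and input layout are assumptions of this sketch.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of category counts.

    ratings[i][j] is the number of annotators who assigned item i to
    category j. Every item must be rated by the same number of annotators
    (at least two). Undefined (division by zero) if all ratings fall into
    a single category.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Share of all assignments that went to each category.
    p_j = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement for each item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)
```

For a binary complaint / non-complaint task with three annotators, each row would hold two counts that sum to three, for example `[3, 0]` when all annotators mark a review as a complaint.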

Paper: Nhung Thi-Hong Nguyen, Phuong Ha-Dieu Phan, Luan Thanh Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen. Vietnamese Complaint Detection on E-Commerce Websites. Link.

Please contact us via email: 18521218@gm.uit.edu.vn (Ms. Nhung) to sign the corpus user agreement and then receive the corpus.

ViHSD – Vietnamese Hate Speech Detection Dataset

Abstract: In recent years, Vietnam has witnessed rapid growth in the number of social network users on platforms such as Facebook, YouTube, Instagram, and TikTok. On social media, hate speech has become a critical problem for social network users. To solve this problem, we introduce ViHSD, a human-annotated dataset for automatically detecting hate speech on social networks. This dataset contains over 30,000 comments; each comment in the dataset has one of three labels: CLEAN, OFFENSIVE, or HATE. Besides, we introduce the data creation process for annotating and evaluating the quality of the dataset. Finally, we evaluate the dataset with deep learning models and transformer models.

Paper: Son T. Luu, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen. A Large-scale Dataset for Hate Speech Detection on Vietnamese Social Media Texts. Link.

Please contact us via email: sonlt@uit.edu.vn (Mr. Son Luu) to sign the corpus user agreement and then receive the corpus.

UIT-ViCoQA: A Conversational Question Answering Challenge for Healthcare Texts in Vietnamese

Machine reading comprehension (MRC) is a sub-field of natural language processing or computational linguistics. MRC aims to help computers understand unstructured texts and then answer questions related to them. In this paper, we present a new Vietnamese dataset for conversational machine reading comprehension, consisting of 10,000 questions with answers over 2,000 conversations about health news articles. We analyze UIT-ViCoQA in depth from different linguistic aspects. We evaluate strong dialogue and reading comprehension models on UIT-ViCoQA. In addition, we conduct the first experiments on this dataset and achieve positive results. The best system obtains an F1 score of 51.28%, which is 24.90 points behind human performance (76.18%), indicating that there is ample room for improvement. The dataset is available on our research website for research purposes.

Paper: Son T. Luu, Mao Nguyen Bui, Loi Duc Nguyen, Khiem Vinh Tran, Kiet Van Nguyen (Corresponding Author), Ngan Luu-Thuy Nguyen. Conversational Machine Reading Comprehension for Vietnamese Healthcare Texts. Link.

To access this dataset, please complete and sign the dataset user agreement and then send it via email: kietnv@uit.edu.vn (Mr. Kiet Nguyen) to receive the dataset.

UIT-ViSFD: A Vietnamese Smartphone Feedback Dataset for Aspect-Based Sentiment Analysis

In this paper, we present a process of building a social listening system based on aspect-based sentiment analysis in Vietnamese, from creating a dataset to building a real application. Firstly, we create UIT-ViSFD, a Vietnamese Smartphone Feedback Dataset, as a new benchmark corpus built on strict annotation schemes for evaluating aspect-based sentiment analysis. It consists of 11,122 human-annotated comments for mobile e-commerce and is freely available for research purposes. We also present an approach based on the Bi-LSTM architecture with fastText word embeddings for the Vietnamese aspect-based sentiment task. Our experiments show that our approach achieves the best performance, with an F1-score of 84.48% for the aspect task and 63.06% for the sentiment task, outperforming several conventional machine learning and deep learning systems. Last but not least, we build SA2SL, a social listening system based on the best-performing model on our dataset, which we hope will inspire more social listening systems in the future.

Paper: Luong Luc Phan, Phuc Huynh Pham, Kim Thi-Thanh Nguyen, Tham Thi Nguyen, Sieu Khai Huynh, Luan Thanh Nguyen, Tin Van Huynh, Kiet Van Nguyen. SA2SL: From Aspect-Based Sentiment Analysis to Social Listening System for Business Intelligence. Link.

Please download this dataset/corpus: here.

UIT-ViSD4SA: Vietnamese Span Detection for Sentiment Analysis

UIT-ViSD4SA is a benchmark Vietnamese smartphone feedback dataset for aspect-based sentiment analysis (ABSA) and span detection. UIT-ViSD4SA consists of 35,396 human-annotated spans on 11,122 feedback comments, each manually annotated with spans for ten fine-grained aspect categories and their sentiment polarities. We randomly split the dataset into a training set (7,784), a development set (1,113) and a test set (2,225).
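
The split sizes above sum to the 11,122 comments in the dataset. A random split with exactly these sizes could be reproduced along the following lines. The function name, the default seed and the shuffling procedure are assumptions of this sketch, so it will not recreate the authors' actual partition; the released split files should be used for comparable results.

```python
import random


def split_dataset(items, n_train=7784, n_dev=1113, n_test=2225, seed=42):
    """Shuffle items and cut them into train/dev/test lists of fixed sizes.

    Assumption: the seed and shuffle order are illustrative only.
    """
    items = list(items)
    if n_train + n_dev + n_test != len(items):
        raise ValueError("split sizes must sum to the number of items")
    # A dedicated Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(items)
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test
```

Calling `split_dataset(comments)` on a list of 11,122 comments returns three disjoint lists with the sizes reported above.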

Paper: Kim Thi-Thanh Nguyen, Phuc Huynh Pham, Luong Luc Phan, Sieu Khai Huynh, Duc-Vu Nguyen, Kiet Van Nguyen. Span Detection for Aspect-Based Sentiment Analysis in Vietnamese. Link.

Please download this dataset/corpus: here.