This site consists of code and resources generated from my research that I would love to share with others.
Please cite the corresponding paper when you use any code or resources here. Thank you!
Given the recent advances in Large Language Models (LLMs), the task of translating natural language prompts into different programming languages (code generation) has attracted immense attention for its wide applicability across domains. In particular, code generation for Bash (NL2Bash) is widely used to generate Bash scripts that automate tasks such as performance monitoring, compilation, system administration, and system diagnostics.
Execution-based evaluation validates predicted code by comparing the execution output of the model's prediction with the expected output on the system. However, designing and implementing such an execution-based evaluation system for NL2Bash is not a trivial task. Our paper, Tackling Execution-Based Evaluation for NL2Bash, presents the design of our NL2Bash benchmark and discusses its challenges.
@misc{vo2024tacklingexecutionbasedevaluationnl2bash,
title={Tackling Execution-Based Evaluation for NL2Bash},
author={Ngoc Phuoc An Vo and Brent Paulovicks and Vadim Sheinin},
year={2024},
eprint={2405.06807},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2405.06807},
}
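To illustrate the core idea of execution-based evaluation described above — run both the predicted and the reference Bash command and compare their outputs — here is a minimal Python sketch. It is not the benchmark's actual harness; the function names, the stdout-only comparison, and the fixed timeout are simplifying assumptions.

```python
import subprocess

def run_bash(command: str, timeout: int = 5) -> str:
    """Run a Bash command in a subprocess and capture its standard output."""
    result = subprocess.run(
        ["bash", "-c", command],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout

def execution_match(predicted: str, reference: str) -> bool:
    """Judge a prediction correct if it reproduces the reference output."""
    try:
        return run_bash(predicted) == run_bash(reference)
    except subprocess.TimeoutExpired:
        # A hanging prediction counts as a failure.
        return False

# Two syntactically different commands can still match on execution output.
print(execution_match("echo hello", "printf 'hello\\n'"))  # True
```

A real system must also handle side effects (file creation, deletion), sandbox the execution environment, and normalize nondeterministic output (timestamps, process IDs), which is part of what makes the problem non-trivial.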
The temporal aspect is one of the most challenging areas in Natural Language Interface to Databases (NLIDB). This paper examines how temporal questions are studied and supported by the research community at two levels: popular annotated datasets (e.g., Spider) and recent advanced models. We present a new dataset, with accompanying databases, that supports temporal questions in NLIDB. We experiment with two SOTA models (Picard and ValueNet) to investigate how our new dataset helps these models learn and improve performance on the temporal aspect.
@inproceedings{vo2022tackling,
title={Tackling Temporal Questions in Natural Language Interface to Databases},
author={Vo, Ngoc Phuoc An and Popescu, Octavian and Manotas, Irene and Sheinin, Vadim},
booktitle={Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track},
pages={179--187},
year={2022}
}
Motion recognition is one of the basic cognitive capabilities of many life forms, yet identifying the motion of physical entities in natural language has not been explored extensively or empirically. We present the Literal-Motion-in-Text (LiMiT) dataset, a large human-annotated collection of English sentences describing the physical occurrence of motion, with the physical entities in motion annotated. We describe the annotation process for the dataset, analyze its scale and diversity, and report results for several baseline models. We also present future research directions and applications of the LiMiT dataset and share it publicly as a new resource for the research community.
@inproceedings{DBLP:conf/emnlp/ManotasVS20,
author = {Manotas, Irene and Vo, Ngoc Phuoc An and Sheinin, Vadim},
title = {LiMiT: The Literal Motion in Text Dataset},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, {EMNLP} 2020},
year={2020}
}
Semantic Relatedness score annotations for the Recognizing Textual Entailment (RTE) datasets.
In this work we present the creation of a corpus annotated with both semantic relatedness (SR) scores and textual entailment (TE) judgments. In building this corpus we aimed at discovering the relationship, if any, between these two tasks, for the mutual benefit of resolving one of them by relying on insights gained from the other. We considered corpora already annotated with TE judgments and proceeded to annotate them manually with SR scores. The RTE 1-4 corpora used in the PASCAL competition fit our needs. The annotators worked independently of one another and did not have access to the TE judgments during annotation. The intuition that the two annotations are correlated received major support from this experiment, and this finding led to a system that uses this information to revise initial estimates of SR scores. Download the annotated datasets here.
@inproceedings{vo2016corpora,
title={Corpora for Learning the Mutual Relationship between Semantic Relatedness and Textual Entailment},
author={Vo, Ngoc Phuoc An and Popescu, Octavian},
booktitle={Proceedings of the 10th Edition of the Language Resources and Evaluation Conference (LREC 2016)},
year={2016}
}
An annotated corpus for Identifying User Issues and Request Types in Forum Question Posts Based on Discourse Analysis
In this work we propose the detection of user issues and request types in technical forum question posts with a twofold purpose: supporting up-to-date knowledge generation in organizations that provide (semi-)automated customer-care services, and enriching forum metadata in order to enhance the effectiveness of search. We present a categorization system for detecting the proposed question post types based on discourse analysis, and show the advantage of using discourse patterns over a baseline relying on standard linguistic features. As no publicly available dataset appropriate for our purposes exists, we developed an annotated corpus for our experiments. Download the annotated corpus with the annotation guideline here.
@inproceedings{sandor2016identifying,
title={Identifying User Issues and Request Types in Forum Question Posts Based on Discourse Analysis},
author={Sandor, Agnes and Lagos, Nikolaos and Vo, Ngoc-Phuoc-An and Brun, Caroline},
booktitle={Proceedings of the 25th International Conference Companion on World Wide Web},
pages={685--691},
year={2016}
}