Research Resources

This site hosts code and resources from my research that I would love to share with others.

Please cite the corresponding paper when you use any code or resources here. Thank you!


  • The Literal Motion in Text Dataset (LiMiT)

Motion recognition is one of the basic cognitive capabilities of many life forms, yet identifying the motion of physical entities in natural language has not been explored extensively and empirically. We present the Literal-Motion-in-Text (LiMiT) dataset, a large human-annotated collection of English text sentences describing the physical occurrence of motion, with annotated physical entities in motion. We describe the annotation process for the dataset, analyze its scale and diversity, and report results of several baseline models. We also present future research directions and applications of the LiMiT dataset and share it publicly as a new resource for the research community.

@inproceedings{DBLP:conf/emnlp/ManotasVS20,

author = {Manotas, Irene and Vo, Ngoc Phuoc An and Sheinin, Vadim},

title = {LiMiT: The Literal Motion in Text Dataset},

booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, {EMNLP} 2020},

year={2020}

}


  • Semantic Relatedness score annotations for the Recognizing Textual Entailment (RTE) datasets

In this work we present the creation of a corpus annotated with both semantic relatedness (SR) scores and textual entailment (TE) judgments. In building this corpus we aimed to discover the relationship, if any, between these two tasks, so that each can be resolved more effectively by relying on insights gained from the other. We took corpora already annotated with TE judgments and manually annotated them with SR scores. The RTE 1-4 corpora used in the PASCAL competition fit our needs. The annotators worked independently of one another and did not have access to the TE judgments during annotation. The intuition that the two annotations are correlated received major support from this experiment, and this finding led to a system that uses this information to revise the initial estimates of SR scores. Download the annotated datasets here.

@inproceedings{vo2016corpora,

title={Corpora for Learning the Mutual Relationship between Semantic Relatedness and Textual Entailment},

author={Vo, Ngoc Phuoc An and Popescu, Octavian},

booktitle={Proceedings of the 10th Edition of the Language Resources and Evaluation Conference (LREC 2016)},

year={2016}

}
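The correlation between SR scores and TE judgments mentioned in the description above could be checked along the following lines. This is only a minimal sketch, not the method used in the paper: the toy pairs are made up rather than taken from the annotated corpora, the 1-5 SR scale and the 1/0 encoding of TE judgments are assumptions, and `pearson` is a helper written for this illustration. With a binary variable, the Pearson coefficient is the point-biserial correlation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up toy pairs, NOT taken from the annotated corpora:
# (SR score on an assumed 1-5 scale, TE judgment: 1 = entailment, 0 = no entailment)
toy_pairs = [
    (4.8, 1), (4.1, 1), (3.9, 1), (4.4, 1),
    (3.0, 0), (2.2, 0), (1.5, 0), (2.9, 0),
]

sr_scores = [sr for sr, _ in toy_pairs]
te_labels = [te for _, te in toy_pairs]

# Pearson correlation with a binary variable is the point-biserial correlation.
r = pearson(sr_scores, te_labels)
print(f"point-biserial correlation on toy data: {r:.3f}")
```

To run this on the real annotations, the toy list would be replaced by the SR scores and TE judgments read from the downloaded files, whose format is not described here.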


  • An annotated corpus for Identifying User Issues and Request Types in Forum Question Posts Based on Discourse Analysis

In this work we propose the detection of user issues and request types in technical forum question posts with a twofold purpose: supporting up-to-date knowledge generation in organizations that provide (semi-)automated customer-care services, and enriching forum metadata in order to enhance the effectiveness of search. We present a categorization system for detecting the proposed question post types based on discourse analysis, and show the advantage of using discourse patterns compared to a baseline relying on standard linguistic features. Because no publicly available dataset appropriate for our purposes exists, we developed an annotated corpus for our experiments. Download the annotated corpus with annotation guidelines here.

@inproceedings{sandor2016identifying,

title={Identifying User Issues and Request Types in Forum Question Posts Based on Discourse Analysis},

author={Sandor, Agnes and Lagos, Nikolaos and Vo, Ngoc-Phuoc-An and Brun, Caroline},

booktitle={Proceedings of the 25th International Conference Companion on World Wide Web},

pages={685--691},

year={2016}

}