TOOLS - DATASETS - PUBLICATIONS
Color-coded topics: ANNOTATED DATA EVALUATION GENERATION
TOOLS
WebNLG+ official code on Colab: https://github.com/mille-s/WebNLG-2020_Metrics
The official WebNLG+ metrics can now be run on Colab; see the GitHub README for details on the modifications made to the original code.
Quantitative evaluation of graph-transduction rules: https://github.com/mille-s/FORGe_count-rules
Code for a quantitative assessment of the rules used in the FORGe generator: rule count by grammar and by module, proportion of language-independent rules.
Sentence similarity assessment: https://github.com/mille-s/Sentence-similarity (see Publications)
For automatic coverage extension of the rule-based generator, or for accuracy evaluation, we need to assess the similarity between an input data point (e.g. a DBpedia triple) and a text that verbalises it. This code creates a fine-tuning dataset, fine-tunes Sentence Transformers, and evaluates the resulting model(s).
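The core scoring step of such a setup can be sketched as a cosine similarity between embedding vectors. This is a minimal NumPy sketch with placeholder embeddings; in the actual pipeline, the vectors would come from a (fine-tuned) Sentence Transformers model via its encode method.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholder embeddings for a triple and a candidate verbalisation;
# real embeddings would be produced by a Sentence Transformers model.
triple_emb = np.array([0.20, 0.70, 0.10])
text_emb = np.array([0.25, 0.65, 0.05])

score = cosine_similarity(triple_emb, text_emb)
```

A threshold on the score can then decide whether a candidate text is accepted as a verbalisation of the input triple.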
Code for GEM evaluation: sampling, data preparation, table/plot creation: https://github.com/mille-s/GEM24_D2T_StratifiedSampling
This notebook contains the code used for the GEM 2024 shared task metrics and human evaluation.
Test environment for Natural Language Generation (private): https://github.com/mille-s/Did-I-break-FORGe
A series of tests that verify that updates to FORGe do not break other projects and languages. The GitHub repo is private because it directly accesses the unreleased development generation resources.
M-FleNS pipeline for Natural Language Generation: https://github.com/mille-s/M-FleNS_NLG-Pipeline
This is the first release of the M-FleNS pipeline, which allows FORGe to be called module by module and the intermediate representations to be inspected. It is used to produce the multilayered English, Irish and French datasets (papers to appear) and will eventually allow some of the RGB modules to be replaced by NLMB ones.
Irish text generator for WebNLG'23: https://github.com/mille-s/DCU_TCD-FORGe_WebNLG23 (see Publications)
This repository contains the code to produce Irish texts from the input DBpedia triple sets of the WebNLG'23 dataset. The code is based on the M-FleNS code (see above) and integrates the morphology generator of the IrishNLP tools suite.
Wikipedia short text generator: https://github.com/mille-s/WikipediaPage_Generator, Poster
This repository contains the code to produce Irish and English texts from live queries to DBpedia. The code is based on the M-FleNS code (see above) and integrates the morphology generator of the IrishNLP tools suite, as well as components that query DBpedia and create input linguistic structures for the FORGe generator.
UD Converter: https://github.com/mille-s/UD_Converter
This converter can be used to create training material (e.g. for natural language generation tools) or to extract semantics-oriented relations from UD trees.
Mod-D2T code: https://github.com/mille-s/Mod-D2T (see Publications)
This converter is based on the M-FleNS code and allows for producing a multi-layer dataset for Modular Data-to-Text generation. It produces a series of linguistic representations (semantics, syntax, morphology) in the CoNLL-U format, aligned with a structured data input and an output text.
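Since all Mod-D2T layers are released in the CoNLL-U format, each layer can be read with a generic CoNLL-U parser. This is a minimal sketch of parsing one sentence block into token dictionaries; the field names follow the CoNLL-U specification, and the sample sentence is purely illustrative.

```python
# The ten tab-separated columns defined by the CoNLL-U specification.
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats",
          "head", "deprel", "deps", "misc"]

def parse_conllu_sentence(block: str) -> list[dict]:
    """Parse the token lines of one CoNLL-U sentence block into dicts."""
    tokens = []
    for line in block.strip().splitlines():
        # Skip comment lines (e.g. "# text = ...") and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.split("\t")
        tokens.append(dict(zip(FIELDS, cols)))
    return tokens

sample = """# text = Alan Bean was born in Wheeler.
1\tAlan\tAlan\tPROPN\t_\t_\t4\tnsubj:pass\t_\t_
2\tBean\tBean\tPROPN\t_\t_\t1\tflat\t_\t_
"""
tokens = parse_conllu_sentence(sample)
```

The same reader works for any of the dataset's layers, since they share the CoNLL-U container format and differ only in their tagsets and relation labels.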
DATASETS
Mod-D2T-data-en: https://github.com/mille-s/Mod-D2T/tree/main/conllu-en_INLG23 (see Publications)
This repository contains the first version of the Mod-D2T dataset for English (en), built on top of the WebNLG'23 dataset; it is parallel to the Mod-D2T-data-ga dataset below. The dataset contains over 1.9 million nodes across 10 layers of representation and is available in the CoNLL-U format. A description of the layers and tagsets can be found in our INLG'23 paper.
Mod-D2T-data-ga: https://github.com/mille-s/Mod-D2T/tree/main/conllu-ga_Pan-DL23 (see Publications)
This repository contains the first version of the Mod-D2T dataset for Irish (ga), built on top of the WebNLG'23 dataset; it is parallel to the Mod-D2T-data-en dataset above. The dataset contains over 2 million nodes across 10 layers of representation and is available in the CoNLL-U format. A description of the layers and tagsets can be found in our INLG'23 and Pan-DL'23 papers.
Mod-D2T-data-fr: https://github.com/mille-s/Mod-D2T/tree/main/conllu-fr_v0.1
This repository contains the first version of the Mod-D2T dataset for French (fr), built on top of the WebNLG'23 dataset; it is parallel to the Mod-D2T-data-en and -ga datasets above. The dataset contains over 2 million nodes across 10 layers of representation and is available in the CoNLL-U format. A description of the layers and tagsets can be found in our INLG'23 and Pan-DL'23 papers (see above).
PUBLICATIONS
Anya Belz, João Sedoc, Craig Thomson, Simon Mille and Rudali Huidrom. 2024. The INLG 2024 Tutorial on Human Evaluation of NLP System Quality: Background, Overall Aims, and Summaries of Taught Units. In Proceedings of the 17th International Natural Language Generation Conference, Tutorials, pages 1–12, Tokyo, Japan. PDF. GitHub.
Simon Mille, Mohammed Sabry and Anya Belz. 2024. DCU-NLG-Small at the GEM'24 Data-to-Text Task: Rule-based generation and post-processing with T5-Base. In Proceedings of the 17th International Natural Language Generation Conference, Generation Challenges, pages 84–91, Tokyo, Japan. PDF.
Simon Mille, João Sedoc, Yixin Liu, Elizabeth Clark, Agnes Axelsson, Miruna Clinciu, Yufang Hou, Saad Mahamood, Ishmael Obonyo and Lining Zhang. 2024. The 2024 GEM Shared Task on Multilingual Data-to-Text Generation and Summarization: Overview and Preliminary Results. In Proceedings of the 17th International Natural Language Generation Conference, Generation Challenges, pages 17–38, Tokyo, Japan. PDF.
Anya Belz, Simon Mille, Craig Thomson and Rudali Huidrom. 2024. QCET: An Interactive Taxonomy of Quality Criteria for Comparable and Repeatable Evaluation of NLP Systems. In Proceedings of the 17th International Natural Language Generation Conference, System Demonstrations (INLG), pages 9–12, Tokyo, Japan. PDF.
Simon Mille, Massimiliano Pronesti, Craig Thomson, Michela Lorandi, Sophie Fitzpatrick, Rudali Huidrom, Mohammed Sabry, Amy O'Riordan and Anya Belz. 2024. Filling Gaps in Wikipedia: Leveraging Data-to-Text Generation to Improve Encyclopedic Coverage of Underrepresented Groups. In Proceedings of the 17th International Natural Language Generation Conference, System Demonstrations (INLG), pages 16–19, Tokyo, Japan. PDF. Best Demo Paper Award at the 2024 INLG conference!
Marcel Nawrath, Agnieszka Wiktoria Nowak, Tristan Ratz, Danilo Constantin Walenta, Juri Opitz, Leonardo F. R. Ribeiro, João Sedoc, Daniel Deutsch, Simon Mille, Yixin Liu, Sebastian Gehrmann, Lining Zhang, Saad Mahamood, Miruna Clinciu, Khyathi Chandu and Yufang Hou. 2024. On the Role of Summary Content Units in Text Summarization Evaluation. In Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 272–281, Mexico City, Mexico. PDF.
Simon Mille, Elaine Uí Dhonnchadha, Lauren Cassidy, Brian Davis, Stamatia Dasiopoulou, Anya Belz. 2023. Generating Irish Text with a Flexible Plug-and-Play Architecture. In Proceedings of the Second Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning (Pan-DL), pages 25–42, Singapore. PDF, Code, Data, Slides. Best Overall Contribution Award at the 2024 ADAPT conference!
Simon Mille, Elaine Uí Dhonnchadha, Stamatia Dasiopoulou, Lauren Cassidy, Brian Davis, Anya Belz. 2023. DCU/TCD-FORGe at WebNLG’23: Irish rules! In Proceedings of the Workshop on Multimodal, Multilingual Natural Language Generation and Multilingual WebNLG Challenge, pages 87–92, Prague, Czech Republic. PDF, Code, Slides.
Simon Mille, François Lareau, Stamatia Dasiopoulou, Anya Belz. 2023. Mod-D2T: A Multi-layer Dataset for Modular Data-to-Text Generation. In Proceedings of the 16th International Natural Language Generation Conference (INLG), pages 455–466, Prague, Czech Republic. PDF, Code, Data-en, Data-ga, Poster
Anya Belz, Craig Thomson, Ehud Reiter, Simon Mille. 2023. Non-Repeatable Experiments and Non-Reproducible Results: The Reproducibility Crisis in Human Evaluation in NLP. In Findings of the Association for Computational Linguistics: ACL 2023, pages 3676–3687, Toronto, Canada. PDF.
Lining Zhang, Simon Mille, Yufang Hou, Daniel Deutsch, Elizabeth Clark, Yixin Liu, Saad Mahamood, Sebastian Gehrmann, Miruna Clinciu, Khyathi Chandu, João Sedoc. 2023. A Needle in a Haystack: An Analysis of High-Agreement Workers on MTurk for Summarization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pages 14944–14982, Toronto, Canada. PDF.
Kaustubh D. Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, Saad Mahamood, Abinaya Mahendiran, Simon Mille, Ashish Shrivastava, Samson Tan, Tongshuang Wu et al. 2023. NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation. In The Northern European Journal of Language Technology (NEJLT), Vol. 9 No. 1. PDF, Code
Simon Mille, Josep Ricci, Alexander Shvets, and Anya Belz. 2023. A Pipeline for Extracting Abstract Dependency Templates for Data-to-Text Natural Language Generation. In Proceedings of the Seventh International Conference on Dependency Linguistics (Depling, GURT/SyntaxFest 2023), pages 91–101, Washington, D.C. Association for Computational Linguistics. PDF, Code, Poster.