Universal Discourse

Universal Discourse (UDisc) aims to provide a unified description of discourse relations (in terms of discourse markers, discourse units, and relation types) within a multilingual setting, by harmonising existing discourse corpora (annotated according to various formalisms, including PDTB, RST and SDRT) within a common representation model. This process is based on the bottom-up approach of Universal Dependencies and the syntax-aware definitions of elementary discourse units. It uses a repertoire of discourse relations based on the ISO 24617-8 standard. The computational properties of the newly created description will be verified by creating prototypes of a multilingual discourse parser fine-tuned using existing large language models.

Project Tasks

Task 1. Development of a multilingual ontology of discourse relations

The ontology will be modelled on the ISO 24617-8 standard and will include usage examples extracted from existing discourse corpora and literature. Speakers of at least 10 European languages who have a background in linguistics will be consulted to ensure sufficient coverage of constructs.

Subtask 1.1. Discourse ontology

The ontology will be modelled after the ISO 24617-8 standard, which provides an open, extensible set of core relations and an outline of their use in discourse modelling. A separate part of the ontology will be a multilingual discourse marker inventory.

Subtask 1.2. Ontology viewer

To make the ontology accessible to users who are not experts in Linguistic Linked Open Data or the Semantic Web, this subtask will propose a serialization method for discourse relations that addresses issues such as the marking of relation triggers and single- versus multi-layer representation of discourse-related properties. The result will be delivered as a browser-based viewer that presents the ontology in a user-friendly way.

Task 2. Harmonization of multilingual discourse corpora

The newly proposed discourse model will be used to re-annotate several existing discourse corpora, previously annotated with various discourse representation formalisms.

Subtask 2.1. Annotation guidelines

The guidelines will contain detailed instructions for annotating discourse relations in the context of specific markers, methods for checking consistency, and usage examples.

Subtask 2.2. Pre-annotation

To facilitate annotation, a set of converters from existing formats will be implemented and used for automated pre-annotation of the corpora. The conversion step will also harmonize the metadata of all pre-selected corpora.
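The conversion idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Relation` record, the `pre_annotate` function, and in particular the three entries of `LABEL_MAP`, which are placeholders and not the project's actual conversion table. Labels that have no mapping are flagged for manual annotation rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative placeholder mapping from (framework, source label) to an
# ISO 24617-8 style label. NOT the project's actual conversion table.
LABEL_MAP = {
    ("pdtb", "Contingency.Cause.Reason"): "Cause",
    ("pdtb", "Comparison.Contrast"): "Contrast",
    ("rst", "condition"): "Condition",
}


@dataclass
class Relation:
    """Hypothetical record for one relation read from a source corpus."""
    framework: str          # e.g. "pdtb", "rst", "sdrt"
    label: str              # label as used in the source corpus
    arg1: str
    arg2: str
    marker: Optional[str] = None


def pre_annotate(rel: Relation) -> dict:
    """Map a source relation to a target label, or flag it for review."""
    target = LABEL_MAP.get((rel.framework.lower(), rel.label))
    return {
        "relation": target,              # None when no mapping is known
        "needs_review": target is None,  # left for the manual annotation step
        "arg1": rel.arg1,
        "arg2": rel.arg2,
        "marker": rel.marker,
        "source": f"{rel.framework.lower()}:{rel.label}",
    }


mapped = pre_annotate(Relation("PDTB", "Comparison.Contrast",
                               "It rained", "we went out", "but"))
unmapped = pre_annotate(Relation("sdrt", "Narration", "He came", "he left"))
```

Keeping the original label in a `source` field means the manual annotation step can always see what the pre-annotation was derived from.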

Subtask 2.3. Manual annotation

The textual version of the corpora will be manually annotated in INCEpTION, an open-source corpus annotation tool. In a subsequent step, a second review covering both stages of the annotation will be performed. The annotated corpora will be made available in a public repository in at least two formats (UIMA CAS XMI and CoNLL-U) to facilitate their use.

Task 3. Implementation and evaluation of discourse parser prototypes

The goal of this task is the implementation and evaluation of a series of discourse parser prototypes (using various language processing techniques).

Subtask 3.1. Prototype implementation

Several discourse parser prototypes will be implemented following current trends in natural language processing (at present, fine-tuning of pre-trained large language models). This step is intended to validate the proposed model computationally in a multilingual context, using real-life data.

Subtask 3.2. Prototype evaluation

The prototypes will be quantitatively evaluated to select the best-performing variant. The evaluation will use precision, recall and F1 measures in three scoring schemes: exact, partial and overlap-based.
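The three scoring schemes can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's evaluation script: spans are hypothetical `(start, end, label)` triples with an exclusive end offset, "partial" credit is assumed to be the intersection-over-union of two spans with the same label, and each span is scored against its best-matching counterpart.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, relation_label); assumed layout


def _intersection(a: Span, b: Span) -> int:
    """Number of positions shared by two spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def exact(a: Span, b: Span) -> float:
    """Full credit only for identical boundaries and label."""
    return 1.0 if a == b else 0.0


def overlap(a: Span, b: Span) -> float:
    """Full credit for any overlap between spans with the same label."""
    return 1.0 if a[2] == b[2] and _intersection(a, b) > 0 else 0.0


def partial(a: Span, b: Span) -> float:
    """Credit proportional to intersection-over-union (an assumed definition)."""
    if a[2] != b[2]:
        return 0.0
    inter = _intersection(a, b)
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def prf(gold: List[Span], pred: List[Span],
        match: Callable[[Span, Span], float]) -> Tuple[float, float, float]:
    """Precision, recall and F1, scoring each span by its best match."""
    p_credit = sum(max((match(p, g) for g in gold), default=0.0) for p in pred)
    r_credit = sum(max((match(g, p) for p in pred), default=0.0) for g in gold)
    precision = p_credit / len(pred) if pred else 0.0
    recall = r_credit / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [(0, 10, "Cause"), (10, 20, "Contrast")]
pred = [(0, 10, "Cause"), (12, 20, "Contrast")]
scores = {fn.__name__: prf(gold, pred, fn) for fn in (exact, partial, overlap)}
```

In this toy case the second predicted span misses the gold boundary by two positions, so it earns no credit under exact matching, partial credit under the partial scheme, and full credit under the overlap-based scheme.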

Task 4. Analysis and elaboration of the results

All results of the project (the discourse model, harmonized discourse corpora and discourse parser prototypes) will be used in a comparative study and described in a monograph. They will also be used in an evaluation campaign, most likely in the form of a shared task planned to be co-located with one of the discourse- or anaphora-related workshops such as CODI.

Subtask 4.1. Comparative study

This subtask aims to demonstrate the value of the universal model developed by the project to the linguistic community. It will be achieved by carrying out translation analyses: compiling the most common equivalents of particular operators across languages and checking how discourse relations are entangled across languages. The frequency of particular relations in texts in different languages will also be investigated to track correlations between language family and relation/argument frequency.
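The frequency analysis could start from a per-language relation distribution, sketched below. The `relation_frequencies` function and the `(language, relation)` record layout are assumptions for illustration, and the records are made-up toy values, not corpus data.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def relation_frequencies(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, float]]:
    """Relative frequency of each relation label within each language."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for language, relation in records:
        counts[language][relation] += 1
    return {
        language: {rel: n / sum(c.values()) for rel, n in c.items()}
        for language, c in counts.items()
    }


# Made-up toy records: (language code, relation label)
toy_records = [
    ("pl", "Cause"), ("pl", "Cause"), ("pl", "Contrast"),
    ("pt", "Cause"),
]
freqs = relation_frequencies(toy_records)
```

Normalising within each language makes corpora of different sizes comparable before any grouping by language family.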

Subtask 4.2. The evaluation campaign

The objective of this subtask is to organise an evaluation campaign built on the project’s achievements: the multilingual dataset, the evaluation methodology, and the prototype parsers, which will serve as baseline systems. The campaign will most likely take the form of a shared task, a widely used evaluation framework in NLP. Alternatively, we may opt for a benchmarking system with a leaderboard and, preferably, a human-in-the-loop mode, or for another contemporary evaluation framework. The campaign will assess discourse parsers created by interested parties in terms of their accuracy, their practical usability in real-world scenarios, and their capacity to generalise across domains and languages. The results of the evaluation campaign will be presented at one of the discourse-related workshops (such as CODI) co-located with major natural language processing conferences.

Subtask 4.3. The monograph on Universal Discourse

The book will analyse the differences between the discourse annotation models used by the harmonized corpora, investigate the use of discourse markers across languages and present the project results – the discourse model, the dataset and evaluated parser prototypes.

Public Deliverables

D1.1a: Discourse relation ontology
D1.1b: Documentation of the ontology
D1.2: Discourse ontology viewer (see also GitHub repository)
D2.1: Discourse annotation guidelines
D2.2a: Converters from previous discourse representation formats to ISO format
D2.3: Annotated corpora
D3.1: Discourse parser prototypes
D3.2: Evaluation scripts
D4.1: Comparative study
D4.2: Evaluation campaign report
D4.3: The monograph on Universal Discourse

Publications

Anna Latusek, Maciej Ogrodniczuk, Alina Wróblewska and Bartosz Żuk (2026). Universal Discourse Relations: A Proposal. In Chloé Braud, Christian Hardmeier, Chuyuan Li, Junyi Jessy Li, Sharid Loáiciga, Vincent Ng, Michal Novák, Maciej Ogrodniczuk, Massimo Poesio, Michael Strube, and Amir Zeldes (eds.) Proceedings of the 2nd Joint Workshop on Computational Approaches to Discourse, Context and Document-Level Inferences and Computational Models of Reference, Anaphora and Coreference (CODI-CRAC 2026) at ACL 2026, pp. 65–77. San Diego, California, United States. Association for Computational Linguistics.
Purificação Moura Silvano, António Leal, Aleksandra Tomaszewska, Maciej Ogrodniczuk, Martyna Lewandowska, Anna Śliwicka, Luís Filipe Cunha, Evelin Amorim, and Joana Gomes (2026). Fables-DTR: A Corpus of Fables Annotated for Discourse and Temporal Relations. In Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, and Antonio Toral (eds.) Proceedings of the 15th Language Resources and Evaluation Conference (LREC 2026), pp. 1856–1868. Palma de Mallorca, Spain. European Language Resources Association.
Maciej Ogrodniczuk and Dariusz Czerski (2026). Towards Corpus-Based Population and Visualization of ISO 24617-8 Ontology. In Harry Bunt (ed.) Proceedings of the 2026 Joint ACL – ISO Workshop on Interoperable Semantic Annotation (ISA-22) at LREC 2026, pp. 54–61. Palma de Mallorca, Spain. European Language Resources Association.
Maciej Ogrodniczuk, Anna Latusek, Karolina Saputa, Alina Wróblewska, Daniel Ziembicki, Bartosz Żuk, Martyna Lewandowska, Adam Okrasiński, Paulina Rosalska, Anna Śliwicka, Aleksandra Tomaszewska, and Sebastian Żurowski (2025). Where frameworks (dis)agree: A study of discourse segmentation. In Michael Strube, Chloé Braud, Christian Hardmeier, Junyi Jessy Li, Sharid Loáiciga, Amir Zeldes, and Chuyuan Li (eds.) Proceedings of the 6th Workshop on Computational Approaches to Discourse, Context and Document-Level Inferences (CODI 2025), pp. 182–196. Suzhou, China. Association for Computational Linguistics.

Resources

The harmonized Universal Discourse corpus will be published in 2027. Its Polish part will be based on the Polish Discourse Corpus.

Please take a look at the public UDisc Zotero library for external discourse-related publications.

Project Team

Maciej Ogrodniczuk (Principal Investigator)
Dariusz Czerski (IT expert, researcher)
Anna Latusek (Post-doc researcher)
Martyna Lewandowska (Researcher)
Adam Okrasiński (MSc student, researcher)
Paulina Rosalska (Researcher)
Michał Rudolf (IT expert, researcher)
Karolina Saputa (IT expert, researcher; December 2024 – August 2025)
Anna Śliwicka (Researcher)
Aleksandra Tomaszewska (PhD student; April–September 2025)
Alina Wróblewska (Senior Researcher)
Daniel Ziembicki (Post-doc researcher; July 2025 – January 2026)
Bartosz Żuk (PhD student)
Sebastian Żurowski (Researcher)

International Collaborators

Christian Chiarcos, University of Augsburg
Purificação Silvano, University of Porto
