Shared task

Discourse Unit Segmentation across Formalisms

The DISRPT 2019 workshop introduces the first iteration of a cross-formalism shared task on discourse unit segmentation. Since all major discourse parsing frameworks presuppose a segmentation of texts into discourse units, learning segmentations for and from diverse resources is a promising area for converging methods and insights. We provide training, development and test datasets from all available languages and treebanks in the RST, SDRT and PDTB formalisms, using a uniform format. Because different corpora, languages and frameworks follow different segmentation guidelines, the shared task is meant to promote the design of flexible methods for dealing with this variety and to help push forward the discussion of standards for discourse units. For datasets that have syntactic treebanks, we will evaluate in two scenarios: with gold syntax, and without it, using the provided automatic parses instead.

Shared Task Data and Formats

Data for the shared task is released via GitHub together with format documentation and tools:

https://github.com/disrpt/sharedtask2019
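
As an illustration of how the token-level files might be consumed, the sketch below reads a DISRPT-style segmentation file. It assumes a tab-separated, CoNLL-like layout in which the final column carries BeginSeg=Yes on segment-initial tokens (PDTB-style connective datasets use Seg=B-Conn/I-Conn labels instead); the column positions and the example file name are assumptions for illustration, so consult the repository's format documentation for the authoritative description.

    # Minimal sketch: group tokens of a DISRPT-style .tok file into discourse
    # units. Assumes tab-separated lines with the token in the second column
    # and a label such as "BeginSeg=Yes" in the last column (an assumption,
    # not the official reader).

    def read_segments(path):
        """Yield one list of token strings per discourse unit."""
        current = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):  # skip blanks and comments
                    continue
                cols = line.split("\t")
                token, label = cols[1], cols[-1]
                if "BeginSeg=Yes" in label and current:
                    yield current  # close the previous unit
                    current = []
                current.append(token)
        if current:
            yield current

    # Hypothetical usage (file name is illustrative):
    # for seg in read_segments("eng.rst.gum_dev.tok"):
    #     print(" ".join(seg))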

Important dates

  • Fri, December 28, 2018 - shared task sample data release
  • Mon, January 21, 2019 - training data release
  • Fri, February 15, 2019 - test data release
  • Thu, March 7, 2019 (extended from February 28) - papers due (shared task & regular workshop papers)
  • Wed, March 27, 2019 - notification of acceptance
  • Fri, April 5, 2019 - camera-ready papers due
  • June 6, 2019 - workshop

Results for the DISRPT 2019 shared task

Ranks on each task are determined by macro-averaged f-score on all datasets. Individual dataset scores are micro-averaged over discourse units/connectives.
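
A minimal sketch of the two averaging levels, assuming binary per-unit boundary labels: within a dataset, the f-score is micro-averaged over all candidate units/connectives, and the task ranking then takes the unweighted (macro) mean of the per-dataset scores. The function names and inputs are illustrative, not the official scorer.

    # Minimal sketch of the scoring scheme: micro f-score within a dataset,
    # macro average across datasets. gold/pred are parallel boolean sequences
    # marking unit (or connective) boundaries.

    def micro_f1(gold, pred):
        """Micro-averaged F1 over one dataset's boundary labels."""
        tp = sum(1 for g, p in zip(gold, pred) if g and p)
        fp = sum(1 for g, p in zip(gold, pred) if not g and p)
        fn = sum(1 for g, p in zip(gold, pred) if g and not p)
        if tp == 0:
            return 0.0
        prec, rec = tp / (tp + fp), tp / (tp + fn)
        return 2 * prec * rec / (prec + rec)

    def macro_f1(datasets):
        """Unweighted mean of per-dataset micro F1 scores."""
        scores = [micro_f1(g, p) for g, p in datasets]
        return sum(scores) / len(scores)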

Main results

Notes:

  1. For teams that submitted multiple systems, the best-scoring system by macro-averaged f-score on all datasets was selected to represent the team.
  2. Scores for systems that were not deterministically seeded were collected by averaging 5 randomly initialized runs on each dataset. Macro-averages are computed from the set of 5 per-run averages, i.e. one mean f-score over all datasets per run; the final f-score is the mean of these means, as sketched below.
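
The mean-of-means aggregation in note 2 can be made concrete with a short sketch; the run and dataset scores below are illustrative values only, not actual results.

    # Minimal sketch of the mean-of-means aggregation for non-deterministic
    # systems: each of the 5 runs yields one macro f-score over all datasets,
    # and the reported score is the mean of those per-run macro averages.

    run_scores = [  # per-run f-scores: runs x datasets (illustrative values)
        [0.91, 0.84, 0.77],
        [0.90, 0.85, 0.76],
        [0.92, 0.83, 0.78],
        [0.91, 0.84, 0.77],
        [0.90, 0.85, 0.77],
    ]

    per_run_macro = [sum(run) / len(run) for run in run_scores]  # one mean per run
    final_score = sum(per_run_macro) / len(per_run_macro)        # mean of means
    print(f"final macro f-score: {final_score:.4f}")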

Precision and recall breakdown by corpus