Data
Dataset Description
The dataset for the DisCoTEX task contains texts originally extracted from two distinct sources: the Italian Wikipedia and the Italian section of speech transcripts included in the Multilingual TEDx corpus (mTEDx). These two sources can be considered representative of two very different language varieties: a "standard" written one and a "hybrid" one mixing genre types (e.g., university lectures, newspaper articles, conference presentations and TV science programmes) as well as different semiotic modes, such as written, spoken, audio and video (Caliendo, 2012). As widely acknowledged by research on genre and register variation, written and spoken language exploit different means to convey coherence within the text (Biber, Conrad and Reppen, 1998); we therefore decided to test systems on both types of data.
For each dataset, we extracted text passages comprising four consecutive sentences, which we considered as our unit of analysis for modelling the task. As regards Wikipedia, we relied on the existing paragraph segmentation to select four-sentence paragraphs. For the TEDx dataset, given that TED speeches lack such an internal structure, all the transcripts were automatically split into passages of four sentences. In particular:
for sub-task 1, starting from the collected passages, we created 8,000 prompt-target pairs for each domain, balanced between the positive and negative class. The prompt is always made up of the first three sentences of the passage, while the target is either the last sentence of the passage (positive class) or a different sentence (negative class): in the latter case, it is either a sentence extracted from a different document (20% of the negative class) or the 10th sentence occurring after the prompt in the same document (80% of the negative class). Both datasets were split into training and test sets with a 90%/10% proportion, respectively. For this task, participants are free to submit their systems for either or both datasets.
for sub-task 2, we selected a single sample of 1,000 passages, equally balanced across the two original source datasets; 50% were left unaltered and 50% were artificially modified so as to gradually undermine local coherence within the passage. Perturbations included shuffling the order of the sentences in all possible combinations (e.g. swapping the first sentence with the second, the third with the fourth, and so on) or substituting one sentence of the passage with the 10th sentence following the passage in the same document. To collect human ratings of coherence, we ran a crowdsourcing task on the Prolific platform in which we asked Italian native speakers to rate how coherent they perceived each passage to be on a 1-5 Likert scale. Each passage was rated by at least 10 annotators. The final dataset was split into training and test sets with an 80%/20% proportion, respectively.
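Both construction procedures can be sketched in a few lines of Python. This is an illustrative reconstruction, not the organisers' actual code: the function names, the block segmentation, and the exact random-sampling details are assumptions.

```python
import itertools
import random


def make_pairs(documents, rng):
    """Sub-task 1 sketch: build balanced prompt-target pairs.

    `documents` is a list of documents, each a list of sentences.
    """
    pairs = []
    for doc_id, sents in enumerate(documents):
        # carve the document into passages of four consecutive sentences
        for start in range(0, len(sents) - 3, 4):
            prompt = sents[start:start + 3]
            pairs.append((prompt, sents[start + 3], 1))  # positive pair
            # negative target: 80% the 10th sentence after the prompt,
            # 20% a sentence drawn from a different document
            if rng.random() < 0.8 and start + 12 < len(sents):
                target = sents[start + 12]
            else:
                other = rng.choice(
                    [d for i, d in enumerate(documents) if i != doc_id])
                target = rng.choice(other)
            pairs.append((prompt, target, 0))  # negative pair
    return pairs


def perturb(passage, tenth_following, rng):
    """Sub-task 2 sketch: degrade local coherence in a 4-sentence passage."""
    if rng.random() < 0.5:
        # reorder: a random non-identity permutation of the sentences
        orders = [p for p in itertools.permutations(range(4))
                  if p != (0, 1, 2, 3)]
        return [passage[i] for i in rng.choice(orders)]
    # substitution: replace one sentence with the 10th sentence
    # following the passage in the same document
    out = list(passage)
    out[rng.randrange(4)] = tenth_following
    return out
```

By construction, `make_pairs` emits one positive and one negative pair per passage, so the resulting dataset is balanced between the two classes.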
Limitations for Participants
In both sub-tasks, participants are free to use additional external resources to train their models, with the sole exception of Wikipedia and mTEDx data. External resources are permitted as long as they are fully referenced in the final report.
Data Format
Datasets for all the sub-tasks are released as tab-separated text files.
For the first sub-task (Last sentence classification), we kept the data from the two sources (i.e. Wikipedia and TEDx) separate. Both versions have the following structure:
ID: a simple identifier for the entry;
PROMPT: a short snippet of text (the first three sentences of a passage);
TARGET: the sentence for which participants must assess whether it is coherent with the PROMPT (i.e. whether it is the sentence immediately following the PROMPT);
CLASS: the class to be predicted. 1 stands for the positive class (i.e. the TARGET follows the PROMPT), 0 for the negative one (i.e. the TARGET does not follow the PROMPT).
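As a quick sanity check, the released file can be read with Python's standard `csv` module. The row below is a toy illustration of the format, and the file name in the comment is a guess, not the actual release name:

```python
import csv
import io

# Toy row in the released tab-separated format; in practice you would
# open the real file, e.g. open("task1-train.tsv") -- name is hypothetical.
sample = "ID\tPROMPT\tTARGET\tCLASS\n1\tS1 S2 S3\tS4\t1\n"

with io.StringIO(sample) as fh:
    rows = list(csv.DictReader(fh, delimiter="\t"))

print(rows[0]["CLASS"])  # prints "1" (fields are read as strings)
```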
For the second sub-task (Human score prediction), we mixed data from the two sources and released a single dataset with the following structure:
ID: a simple identifier for the entry;
TEXT: a small snippet of text (four sentences) to be evaluated;
MEAN: the coherence score of the TEXT to be predicted, computed as the mean of the collected human judgements;
STDEV: standard deviation of the coherence score.
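The sub-task 2 file can be loaded the same way; the only extra step is converting MEAN and STDEV to floats. The sample row below uses made-up values purely for illustration:

```python
import csv
import io

# Toy row in the sub-task 2 format (values are invented, not real data)
sample = "ID\tTEXT\tMEAN\tSTDEV\n1\tS1 S2 S3 S4\t4.2\t0.6\n"

rows = []
with io.StringIO(sample) as fh:
    for row in csv.DictReader(fh, delimiter="\t"):
        # scores arrive as strings; cast them for numeric use
        row["MEAN"], row["STDEV"] = float(row["MEAN"]), float(row["STDEV"])
        rows.append(row)

print(rows[0]["MEAN"])  # prints 4.2
```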
Data for the official test will be provided in the same format, with the exception of the CLASS column for Sub-task 1, and the MEAN and the STDEV columns for Sub-task 2.
Examples
The following table reports some examples for sub-task 1.
The following table reports some examples for sub-task 2.
Train data download
Fill in the form to download the training data:
Test data download
Official test data for both the DisCoTEX subtasks can be found here:
https://github.com/davidecolla/DisCoTex/tree/master/test_data