Data

Dataset Description

The dataset for the DisCoTEX task contains texts originally extracted from two distinct sources: the Italian Wikipedia and the section of Italian speech transcripts included in the Multilingual TEDx corpus (mTEDx). These two sources can be considered representative of two very different language varieties: a "standard" written variety and a "hybrid" one that mixes genre types (e.g., university lectures, newspaper articles, conference presentations and TV science programmes) as well as different semiotic modes, such as written, spoken, audio and video (Caliendo, 2012). As widely acknowledged by research on genre and register variation, written and spoken language rely on different means to convey coherence within a text (Biber, Conrad and Reppen, 1998); we therefore decided to test systems on both types of data.

For each dataset, we extracted text passages comprising four consecutive sentences, which we took as the unit of analysis for modelling the task. For Wikipedia, we relied on the existing paragraph segmentation to select four-sentence paragraphs. For the TEDx dataset, given that TED speeches lack such an internal structure, all the transcripts were automatically split into passages of four sentences. In particular:


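As a rough illustration of the four-sentence segmentation described above, the following minimal sketch groups an already sentence-split transcript into passages of four consecutive sentences. It is not the organizers' actual procedure: the function name `four_sentence_passages`, the use of non-overlapping windows, and the choice to drop a trailing group shorter than four sentences are all assumptions made for the example.

```python
from typing import List


def four_sentence_passages(sentences: List[str], size: int = 4) -> List[List[str]]:
    """Group consecutive sentences into non-overlapping passages of `size` sentences.

    Assumption (not stated in the task description): a trailing group
    with fewer than `size` sentences is discarded.
    """
    passages = []
    # Step through the transcript in strides of `size`, stopping before
    # any incomplete final group.
    for start in range(0, len(sentences) - size + 1, size):
        passages.append(sentences[start:start + size])
    return passages


# Toy transcript of ten sentences, assumed to be sentence-split already.
transcript = [f"Sentence {i}." for i in range(1, 11)]
passages = four_sentence_passages(transcript)
```

With the ten-sentence toy transcript, this yields two passages (sentences 1 to 4 and 5 to 8), and the last two sentences are left out.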
Limitations for Participants

In both sub-tasks, participants are free to use additional external resources to train their models, with the sole exception of Wikipedia and mTEDx data. The use of external resources is permitted as long as they are fully referenced in the final report.

Data Format

Datasets for all the sub-tasks are released as tab-separated text files. 

For the first sub-task (Last sentence classification) we kept the data from the two sources (i.e. Wikipedia and TED) separate. Both versions have the following structure:

For the second sub-task (Human score prediction) we mixed the data from the two sources and release a single dataset with the following structure:


Data for the official test will be provided in the same format, but without the CLASS column for Sub-task 1 and without the MEAN and STDEV columns for Sub-task 2.
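The sketch below shows one way a tab-separated file of this kind might be read in Python, and how the gold columns withheld from the test data could be stripped to obtain test-like rows. Only the column names CLASS, MEAN and STDEV come from the description above; the ID and TEXT columns, the presence of a header row, the absence of quoted fields, and the helper names `read_tsv` and `strip_gold` are assumptions made for the example.

```python
import csv
import io

# Toy stand-in for a Sub-task 2 training file. Only MEAN and STDEV are
# column names taken from the task description; ID and TEXT are hypothetical.
toy_tsv = (
    "ID\tTEXT\tMEAN\tSTDEV\n"
    "1\tFirst toy passage.\t3.5\t0.5\n"
    "2\tSecond toy passage.\t2.0\t1.0\n"
)

# Gold columns that the official test data does not include.
GOLD_COLUMNS = {"CLASS", "MEAN", "STDEV"}


def read_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts.

    Assumption: fields are not quoted, so quote characters inside the
    text are kept literally (csv.QUOTE_NONE).
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def strip_gold(rows):
    """Return copies of the rows without the gold columns."""
    return [
        {key: value for key, value in row.items() if key not in GOLD_COLUMNS}
        for row in rows
    ]


rows = read_tsv(io.StringIO(toy_tsv))
test_like = strip_gold(rows)
```

For a file on disk, the same `read_tsv` helper can be given a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever the downloaded file was saved. All values are read as strings, so scores such as MEAN need an explicit `float(...)` conversion before use.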

Examples

The following table reports some examples for Sub-task 1.

The following table reports some examples for Sub-task 2.

Train data download

Fill in the form to download the training data:

https://forms.gle/tFnvN4Fn4QYwS29Z8

Test data download

Official test data for both DisCoTEX sub-tasks can be found here:

https://github.com/davidecolla/DisCoTex/tree/master/test_data