Data

Download

Use the form below to request training data.

LangLearn Shared Task @ EVALITA 2023 - Datasets

Please fill in this form and read our license terms in order to get the data.

LangLearnEvalita2023

Corpora

For the LangLearn task, we rely on two corpora: CItA (Barbagli et al., 2016) and COWS-L2H (Davidson et al., 2020).

CItA

CItA (Corpus Italiano di Apprendenti L1) is a longitudinal corpus of essays written by the same first language (L1) students in the first and second year of lower secondary school.

The corpus was collected during the 2012-2013 and 2013-2014 school years as part of a broader ongoing study carried out in the framework of the IEA-IPS (Association for the Evaluation of Educational Achievement) activities (Lucisano, 1984). The original corpus contains a total of 1,352 essays written by 156 students. The essays belong to five textual typologies, which reflect the different writing prompts students were asked to respond to: reflexive, narrative, descriptive, expository and argumentative.

In addition, a prompt common to all schools was assigned at the end of each year.

COWS-L2H

The COWS-L2H (Corpus of Written Spanish of L2 and Heritage Speakers) corpus consists of 3,498 short essays written by second language (L2) students enrolled in one of ten lower-division Spanish courses at a single American university.

Student compositions in the corpus are written in response to one of four writing prompts, which are changed periodically. During each period (an academic quarter, which consists of ten weeks of instruction) of data collection, students are asked to submit two compositions, approximately one month apart, in response to targeted writing prompts. These composition themes are designed to be relatively broad, to allow for a wide degree of creative liberty and open-ended interpretation by the writer.

Datasets

For each corpus, LangLearn participants will be provided with Training data, structured as follows:

An "essay pairs" .tsv file containing pairs of essays, each pair written by the same student, with the following information:
1. Essay_1: ID of the first essay in the correct chronological order;
2. Essay_2: ID of the second essay in the correct chronological order;
3. Order_1: time of writing of the first essay;
4. Order_2: time of writing of the second essay.

An "essays" .xml file containing the essays, with randomly generated document IDs.

Note that the pairs in the Training set are always in the correct chronological order (i.e. Essay_1 is written before Essay_2). In the test set, instead, the pairs will be provided in random order.
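
Since the training pairs are always chronological while the test pairs are shuffled, one possible way to prepare labelled training examples is to swap each training pair at random. The sketch below is only an illustration, not an official preprocessing script: it assumes the .tsv file has a header row with the four column names listed above, and the essay IDs in the inline sample are made up.

```python
import csv
import io
import random


def load_pairs(handle):
    """Read essay-pair rows from a tab-separated file object.

    Assumption: the first row is a header with the columns
    Essay_1, Essay_2, Order_1 and Order_2.
    """
    return list(csv.DictReader(handle, delimiter="\t"))


def randomize_order(pairs, seed=0):
    """Randomly swap each pair, mimicking the shuffled test set.

    Returns (first_id, second_id, label) tuples, where label is 1 if
    the pair is still in chronological order and 0 if it was swapped.
    """
    rng = random.Random(seed)
    examples = []
    for row in pairs:
        first, second = row["Essay_1"], row["Essay_2"]
        if rng.random() < 0.5:
            examples.append((second, first, 0))  # swapped
        else:
            examples.append((first, second, 1))  # chronological
    return examples


# Made-up sample standing in for the real file; the IDs are hypothetical.
sample = (
    "Essay_1\tEssay_2\tOrder_1\tOrder_2\n"
    "d001\td002\t1_1\t1_4\n"
    "d003\td004\t1_2\t2_1\n"
)
pairs = load_pairs(io.StringIO(sample))
examples = randomize_order(pairs, seed=42)
```

With the real data, `io.StringIO(sample)` would be replaced by an opened file, for example `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded .tsv file.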

CItA

The training set consists of 2,394 pairs covering all students and time intervals, except for a held-out set of students and intervals that will be released as the Test set.

The codes in columns "Order_1" and "Order_2" have the following format: "Year + _ + Essay number".

"Year" can be 1 or 2, depending on whether the essay was written in the first or second year of lower secondary school. "Essay number" is the progressive number of the essay within that year. For example, 1_4 corresponds to the fourth essay written during the first year.

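The CItA code format described above can be decoded with a few lines of Python. This is a minimal sketch; the function name `parse_cita_order` is hypothetical and not part of any released tooling.

```python
def parse_cita_order(code):
    """Split a CItA order code such as "1_4" into (year, essay_number)."""
    year_text, essay_text = code.split("_")
    year, essay_number = int(year_text), int(essay_text)
    if year not in (1, 2):
        raise ValueError(f"unexpected school year in code: {code!r}")
    return year, essay_number


print(parse_cita_order("1_4"))  # (1, 4)
# Tuples compare element by element, so chronological order is preserved.
print(parse_cita_order("1_4") < parse_cita_order("2_1"))  # True
```

Returning a tuple keeps the year and the essay number separate while still allowing two codes to be compared directly.
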
COWS-L2H

The training set consists of 1,009 pairs of essays written by students in university-level Spanish as a second language courses. The test set will contain 320 pairs of essays. There is no overlap between the students who wrote the essays in the training set and those who wrote the essays in the test set, but both sets cover the same courses.

The codes in columns "Order_1" and "Order_2" contain time information and should be read as the academic term (quarter) followed by the year. For example, F20 is Fall 2020. The academic terms cover the following time spans: W goes from January to March, S from April to June, SU from July to September, and F from October to December.
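
The COWS-L2H term codes can be turned into comparable values in a similar way. The sketch below is an illustration under two assumptions that the description above does not state explicitly: that every code is a term abbreviation followed by exactly two digits, and that the two digits refer to a year in the 2000s. The names `TERM_START_MONTH` and `parse_cows_term` are hypothetical.

```python
import re

# First month of each academic term, as listed in the description above.
TERM_START_MONTH = {"W": 1, "S": 4, "SU": 7, "F": 10}


def parse_cows_term(code):
    """Convert a code such as "F20" into (year, start_month)."""
    match = re.fullmatch(r"(SU|W|S|F)(\d{2})", code)
    if match is None:
        raise ValueError(f"unrecognised term code: {code!r}")
    term, two_digit_year = match.groups()
    # Assumption: two-digit years belong to the 2000s.
    return 2000 + int(two_digit_year), TERM_START_MONTH[term]


print(parse_cows_term("F20"))  # (2020, 10)
print(parse_cows_term("SU19") < parse_cows_term("W20"))  # True
```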

