For the PRELEARN task, we rely on the ITA-PREREQ dataset (Miaschi et al. 2019), a dataset annotated with prerequisite relations between pairs of concepts in Italian.
The dataset was built upon the AL-CPL dataset (Liang et al. 2018), a collection of binary-labelled concept pairs extracted from textbooks in four domains: data mining, geometry, physics and precalculus. In AL-CPL, for each domain, relevant concepts were extracted from a textbook and matched with a page from English Wikipedia if the page title corresponded to the concept name. Domain experts were then asked to manually annotate whether each pair of concepts showed a prerequisite relation or not, so the dataset contains both positive and negative concept pairs. For ITA-PREREQ, we took the Italian version of the Wikipedia pages considered for AL-CPL and excluded those concepts (and the relations involving them) for which an Italian page was not available. Finally, we mapped both the positive and the negative relations between pairs of the remaining concepts from AL-CPL to ITA-PREREQ.
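The filtering and mapping step described above can be sketched as follows. This is a minimal illustration rather than the script actually used to build ITA-PREREQ: the function name `map_pairs_to_italian`, the `en_to_it` dictionary and the English page titles are assumptions made for the example.

```python
def map_pairs_to_italian(pairs, en_to_it):
    """Map labelled English concept pairs to their Italian counterparts.

    pairs:    iterable of (target, prerequisite, label) with English titles.
    en_to_it: dict from English page title to Italian page title; a concept
              with no Italian page is simply absent (hypothetical structure).
    """
    mapped = []
    for target, prerequisite, label in pairs:
        # Drop every relation that involves a concept without an Italian page.
        if target in en_to_it and prerequisite in en_to_it:
            mapped.append((en_to_it[target], en_to_it[prerequisite], label))
    return mapped


# Illustrative input: the English titles and the third concept are made up.
en_to_it = {
    "Total internal reflection": "Riflessione interna totale",
    "Light": "Luce",
}
english_pairs = [
    ("Total internal reflection", "Light", 1),
    ("Light", "Concept without Italian page", 0),
]
italian_pairs = map_pairs_to_italian(english_pairs, en_to_it)
```

Both positive and negative pairs pass through the same filter, so a relation survives only if both of its concepts have an Italian page.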
The ITA-PREREQ dataset consists of pairs of target and prerequisite concepts (A, B), labelled as follows:
1 if B is a prerequisite of A;
0 in all other cases.
As in AL-CPL, the final dataset was expanded by creating irreflexive relations (i.e. adding (B, A) as a negative sample if (A, B) is a positive sample) and transitive pairs (i.e. adding (A, C) as a positive sample if both (A, B) and (B, C) are positive samples). PRELEARN participants will be provided with a “concept pairs file” containing the labelled concept pairs (one for each domain) and a “Wikipedia pages file” containing the raw text of the Wikipedia pages referring to the concepts, extracted using WikiExtractor on a Wikipedia dump of Jan. 2020.
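The expansion rules can be illustrated with a short sketch. It is not the procedure used to build the released files: the function name `expand_pairs` is hypothetical, and the sketch assumes that both rules are applied once to the manually annotated positive pairs and that an existing positive label is never overwritten by a generated negative one.

```python
def expand_pairs(positive_pairs):
    """Expand annotated positive (target, prerequisite) pairs.

    Returns a dict mapping each pair to its label:
    1 = prerequisite pair, 0 = non-prerequisite pair.
    """
    positives = set(positive_pairs)
    labelled = {pair: 1 for pair in positives}

    # Transitive pairs: if (A, B) and (B, C) are positive, add (A, C) as positive.
    for a, b in positives:
        for b2, c in positives:
            if b == b2 and a != c:
                labelled.setdefault((a, c), 1)

    # Reversed pairs: if (A, B) is positive, add (B, A) as negative,
    # unless (B, A) already carries a label (assumption of this sketch).
    for a, b in positives:
        labelled.setdefault((b, a), 0)

    return labelled


# Toy example with placeholder concept names.
expanded = expand_pairs([("A", "B"), ("B", "C")])
```

In this toy example, (A, C) is added as a positive pair, while (B, A) and (C, B) are added as negative pairs.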
An example of the content of the “concept pairs file” is provided in the code below. Consider the first line in the example: the Prerequisite Pair (Riflessione interna totale, Luce) has "Riflessione interna totale" as the target concept and "Luce" as the prerequisite concept. The second line, on the other hand, tells us that "Durezza" is not a prerequisite concept of "Plasticità (fisica)".
Similarly, the “Wikipedia pages file” will comply with this structure:
The table below describes the content of ITA-PREREQ. It can be noted that the dataset is unbalanced: the majority of concept pairs do not show a prerequisite relation (Non-prerequisite Pairs).
Please note that we will provide balanced test sets to the participants (an equal number of Prerequisite Pairs and Non-prerequisite Pairs).
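Since the training data is unbalanced while the test sets are balanced, participants may want to experiment with a balanced training subset. The sketch below shows one common option, random downsampling of the majority class. It is an assumption of this example, not a description of how the official test sets are drawn, and the function name `balance_pairs` and the `seed` parameter are hypothetical.

```python
import random


def balance_pairs(labelled_pairs, seed=0):
    """Return a shuffled subset with equally many positive and negative pairs.

    labelled_pairs: list of (target, prerequisite, label) tuples, label in {0, 1}.
    """
    positives = [p for p in labelled_pairs if p[2] == 1]
    negatives = [p for p in labelled_pairs if p[2] == 0]
    n = min(len(positives), len(negatives))

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced


# Toy data with placeholder concept names: 2 positive and 5 negative pairs.
toy_pairs = [("A", "B", 1), ("B", "C", 1)] + [("N%d" % i, "M%d" % i, 0) for i in range(5)]
balanced_pairs = balance_pairs(toy_pairs)
```

Downsampling discards part of the majority class; class weighting in the classifier is an alternative that keeps all pairs.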