Research

Rare Diseases Corpus

The dataset of 842 clinical records can be downloaded from here (xlsx format) for research purposes, provided that the source is cited correctly.

last update 20/6/2023

CLEF 2018 Conference and Labs of the Evaluation Forum

For additional information on the "UNSL's participation at eRisk 2018 Lab" publication, see FTVT and SIC Technical Report.

last update 27/6/2018

SpanText: formal corpus for Author Profiling Task

A collection of formal texts in Spanish. Two versions of the SpanText corpus are provided. The balanced version has a similar number of documents in each category. The unbalanced version has a different number of documents in each category, proportional to the corresponding number in the Spanish corpus of PAN-2013. Gender and age were the characteristics considered for predicting the author profile of a text. Combining the gender and age of the author yields six classes: 10smal, 10sfem, 20smal, 20sfem, 30smal, 30sfem. The following tables summarize the main characteristics of each version of the SpanText corpus.

Characteristics of the SpanText balanced and unbalanced versions
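
As an illustration of how the six class labels listed above combine an age group with a gender, here is a minimal Python sketch. The split of each label into an age-group prefix ("10s", "20s", "30s") and a gender suffix ("mal", "fem") is inferred from the label names; the helper `class_label` and the constants are hypothetical and are not part of the corpus distribution.

```python
from itertools import product

# Assumed components, inferred from the six label names in the text.
AGE_GROUPS = ["10s", "20s", "30s"]
GENDERS = ["mal", "fem"]


def class_label(age_group: str, gender: str) -> str:
    """Build a class label such as '20sfem' from its two components."""
    if age_group not in AGE_GROUPS:
        raise ValueError(f"unknown age group: {age_group!r}")
    if gender not in GENDERS:
        raise ValueError(f"unknown gender: {gender!r}")
    return age_group + gender


# Every age group paired with every gender gives the six classes.
ALL_CLASSES = [class_label(a, g) for a, g in product(AGE_GROUPS, GENDERS)]
```

Under these assumptions, `ALL_CLASSES` lists the six labels in the same order as the text.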

For a more detailed explanation of the SpanText corpus, and for citation, please refer to:

"A Spanish text corpus for the author profiling task", (María Paula Villegas, María José Garciarena Ucelay, Marcelo Luis Errecalde and Leticia Cecilia Cagnina), in Proc. of XX Congreso Argentino de Ciencias de la Computación (CACIC 2014). San Justo, Buenos Aires, Argentina, 2014, ISBN 978-987-3806-05-6.

The SpanText corpus can be downloaded and used freely for research purposes, and should be cited appropriately.

Download SpanText Corpus (balanced and unbalanced versions)

last update 18/11/2014

Data sets used for short-text clustering

Eleven corpora with different levels of complexity with respect to size, document length and vocabulary overlap are available: Micro4News, EasyAbstracts, SEPLN-CICLing, CICLing-2002, R4, R6, R8B, JRC6, R8-Test, JRC-Full and R8-Train. The following table shows some general features of these corpora: corpus size in Kbytes (CS), number of categories and documents (|C| and n, respectively), total number of terms in the corpus (|T|), vocabulary size (|V|) and average number of terms per document (Td).
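
The features named above can be computed for any labelled collection along the following lines. This is a minimal Python sketch and not the procedure used to build the table. The lowercasing, the whitespace tokenization, the measurement of CS as UTF-8 bytes of raw text, and the function name `corpus_features` are all assumptions, so the resulting figures may differ from the published ones.

```python
def corpus_features(docs):
    """Compute general corpus features.

    docs: list of (category, text) pairs.
    Assumption: terms are lowercased, whitespace-separated tokens.
    """
    tokens_per_doc = [text.lower().split() for _, text in docs]
    total_terms = sum(len(tokens) for tokens in tokens_per_doc)
    vocabulary = {tok for tokens in tokens_per_doc for tok in tokens}
    n = len(docs)
    return {
        # Assumption: corpus size is the UTF-8 byte count of the texts.
        "CS": sum(len(text.encode("utf-8")) for _, text in docs) / 1024,
        "|C|": len({category for category, _ in docs}),
        "n": n,
        "|T|": total_terms,
        "|V|": len(vocabulary),
        # Average number of terms per document; 0.0 for an empty corpus.
        "Td": total_terms / n if n else 0.0,
    }


# Toy example with two invented documents (not taken from any corpus).
toy_docs = [
    ("sports", "the match ended"),
    ("politics", "the vote ended late"),
]
toy_features = corpus_features(toy_docs)
```

For the toy example, the two documents contain 3 and 4 terms, so |T| is 7, |V| is 5 and Td is 3.5.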

The first eight corpora are considered small (fewer than 1000 documents). Micro4News, EasyAbstracts, SEPLN-CICLing and CICLing-2002 contain news and abstracts of scientific papers. Micro4News is a low-complexity collection built from short documents (although longer than those of the next three corpora) on four very different topics of the popular 20Newsgroups corpus. EasyAbstracts, SEPLN-CICLing and CICLing-2002 consist of short documents (abstracts of scientific papers) and differ mainly in how close the topics of their categories are. EasyAbstracts, with scientific abstracts on well-differentiated topics, can be considered a medium-complexity corpus, whereas CICLing-2002, with narrow-domain abstracts, is a relatively high-complexity corpus. The latter was generated from abstracts of articles presented at the CICLing 2002 conference (http://www.cicling.org/2002/).

The next three small corpora are subsets of the well-known R8-Test corpus, a subcollection of the Reuters-21578 dataset. They were artificially generated to obtain corpora with different numbers of groups: four groups (R4), six groups (R6) and eight groups (R8B). JRC6 is a subcollection of JRC-Acquis [Steinberger, 2006], a popular corpus of legal documents and laws from different countries of the European Union. To allow experiments with short texts, this subcollection contains only some of the shortest documents of six groups of the original JRC-Acquis.

The last three, larger corpora are classified as medium-sized (between 1000 and 10000 documents). R8-Test and R8-Train are widely used in the literature. JRC-Full is a larger version of JRC6 that contains more short documents (in fact, all the short texts of the six categories).

The complete corpora (.rar) can be downloaded for research purposes and should be cited appropriately.
