Corpus

The following files contain distantly labelled and manually labelled data for the two UMLS Metathesaurus relations may-treat and may-prevent. First, Medline abstracts were distantly annotated using the UMLS Metathesaurus. Then 200 positive and 200 negative examples of each relation were selected and re-annotated by two biomedical experts. gs-DL is the distantly labelled data set and gs-ML is the manually labelled data set. The Louhi 2015 paper describes both data sets in more detail.

The experiments in the ACL 2015 paper use only the gs-DL data.

The corpus can be downloaded here: gs-DL (may-prevent, may-treat) and gs-ML (may-prevent, may-treat).

If you use our corpus, please cite one of our publications:

Improving distant supervision using inference learning, Roland Roller, Eneko Agirre, Aitor Soroa and Mark Stevenson, (to appear) In Proceedings of the Joint Conference of the 53rd Annual Meeting of the ACL and the 7th International Joint Conference of the Asian Federation of Natural Language Processing, Beijing, China, 2015 [bibtex] [pdf]

Held-out versus Gold Standard: Comparison of Evaluation Strategies for Distantly Supervised Relation Extraction from Medline abstracts, Roland Roller and Mark Stevenson, In Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis (Louhi), Lisbon, Portugal, 2015 [bibtex] [pdf]