Supporting Resources

The supporting resources provide task participants with annotations from state-of-the-art automated tools. They are meant to reduce the time needed to participate in the shared task and to let participants experiment with ways of using the analyses produced by existing Natural Language Processing systems.

Please note that the Shared Task organizers are not responsible for data quality: the resources are presented as provided by the tools. If you have any questions about the resources, please post them on the task forum: https://groups.google.com/forum/#!forum/bb-2019

Word Embeddings

Pre-trained word vectors (on demand at maiage-bibliome at inra.fr)

Word embeddings (various dimensions) trained on the 2.8 million PubMed abstracts about microorganisms. Please cite: Ferré A., Zweigenbaum P., Nédellec C. (2017). Representation of complex terms in a vector space structured by an ontology for a normalization task. BioNLP 2017, 99–106.
The Vecto tool comes with source corpora and pre-trained word vectors built from the August 2013 dump of English Wikipedia. Please cite: Li et al. (2017). Different Syntactic Context Types and Context Representations for Learning Word Embeddings. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2411–2421.
Pyysalo et al. propose various language resources created from the entire available biomedical scientific literature, a text corpus of over five billion words. Please cite: Sampo Pyysalo, Filip Ginter, Hans Moen, Tapio Salakoski and Sophia Ananiadou. Distributional Semantics Resources for Biomedical Text Processing. LBM 2013.
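As a minimal sketch of how such vectors might be used, the Python snippet below reads vectors from a plain-text file with one "word v1 v2 ..." entry per line and compares two words by cosine similarity. The file layout is an assumption (the resources above may be distributed in other formats), the function names load_text_vectors and cosine are hypothetical, and the three-dimensional vectors are made up for illustration only.

```python
import io
import math


def load_text_vectors(handle):
    """Read whitespace-separated 'word v1 v2 ...' lines into a dict.

    Assumes vectors have at least two dimensions, so that a two-field
    'count dimension' header line can be told apart and skipped.
    """
    vectors = {}
    for line in handle:
        parts = line.split()
        if len(parts) < 3:
            # Blank line or optional header line: nothing to store.
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy, made-up vectors standing in for a real embedding file.
TOY_FILE = io.StringIO(
    "3 3\n"
    "bacteria 0.9 0.1 0.0\n"
    "bacterium 0.8 0.2 0.1\n"
    "soil 0.0 0.1 0.9\n"
)
vectors = load_text_vectors(TOY_FILE)
similar = cosine(vectors["bacteria"], vectors["bacterium"])
dissimilar = cosine(vectors["bacteria"], vectors["soil"])
```

With a real file, the io.StringIO object would be replaced by an opened file handle; binary distributions would need the loader of the tool that produced them.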

POS Tagging

GENIA Tagger is a tool for part-of-speech tagging, shallow parsing, and named entity recognition for biomedical text. If you make use of the tagging from GENIA tagger, please cite: Tsuruoka et al. (2005). Developing a robust part-of-speech tagger for biomedical text. Advances in informatics, 382-392.

Parsing

Stanford Parser is a widely used statistical parser. If you make use of the parses from the Stanford Parser, please cite: Klein, D. and Manning, C. (2002). Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems.
Enju parser is a robust syntactic parser for English, based on a probabilistic HPSG grammar. If you make use of the Enju parses, please cite: Miyao, Y. and Tsujii, J. (2008). Feature forest models for probabilistic HPSG parsing. Computational Linguistics.
The C&C CCG Parser is a dependency parser. If you make use of the CCG parses, please cite: Clark, S., & Curran, J. R. (2007). Wide-coverage efficient statistical parsing with CCG and log-linear models. Computational Linguistics, 33(4), 493-552.

Term Extraction

BioYaTeA is an extended version of the YaTeA (Aubin and Hamon, 2006) term extractor adapted to the biomedical domain. If you make use of the BioYaTeA resources, please cite: Golik, W., Bossy, R., Ratkovic, Z., & Nédellec, C. (2013). Improving term extraction with linguistic analysis in the biomedical domain. In Proceedings of the 14th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing13), Special Issue of the journal Research in Computing Science (pp. 24-30).

Named Entity Recognition

Stanford NER is a named entity recognition tool for person, organization and location entities. If you make use of the Stanford NER annotations, please cite: Finkel, J. R., Grenager, T., & Manning, C. (2005, June). Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (pp. 363-370). Association for Computational Linguistics.
ChemSpot is a named entity recognition tool for identifying mentions of chemicals in natural language texts, including trivial names, drugs, abbreviations, molecular formulas and IUPAC entities. If you make use of the ChemSpot annotations, please cite: Rocktäschel, T., Weidlich, M., and Leser, U. (2012). ChemSpot: A Hybrid System for Chemical Named Entity Recognition. Bioinformatics 28 (12): 1633-1640.

Stemming, abbreviation

The Porter stemming algorithm is a process for removing the commoner morphological and inflexional endings from words in English. Please cite: Karen Sparck Jones and Peter Willett, 1997, Readings in Information Retrieval, San Francisco: Morgan Kaufmann, ISBN 1-55860-454-4.
Ab3P is an abbreviation definition detector. A set of rules recognizes simple patterns such as Alpha Beta (AB) as well as more involved cases. The precision of each rule is estimated by applying it to randomized data (pseudo-precision). Please cite: Sohn S, Comeau DC, Kim W, Wilbur WJ. (2008) Abbreviation definition identification based on automatic precision estimates. BMC Bioinformatics. 25;9:402. PubMed ID: 1881755
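To illustrate the simplest kind of pattern mentioned above, Alpha Beta (AB), the Python sketch below pairs a parenthesized short form with the words just before it when their initials spell the short form. This is not Ab3P's rule set or its precision estimation; the function name find_initialisms and the regular expressions are hypothetical and cover only this one pattern.

```python
import re

# A parenthesized candidate short form of 2 to 10 letters or digits, e.g. "(AB)".
SHORT_FORM = re.compile(r"\(([A-Za-z][A-Za-z0-9]{1,9})\)")


def find_initialisms(text):
    """Return (long_form, short_form) pairs where the short form's characters
    match the initials of the words immediately before the parenthesis."""
    pairs = []
    for match in SHORT_FORM.finditer(text):
        short = match.group(1)
        # Words preceding the parenthesis; punctuation and hyphens act as separators.
        words = re.findall(r"[A-Za-z0-9]+", text[:match.start()])
        candidate = words[-len(short):]
        if len(candidate) != len(short):
            # Not enough preceding words to cover every character.
            continue
        initials = "".join(word[0] for word in candidate)
        if initials.lower() == short.lower():
            pairs.append((" ".join(candidate), short))
    return pairs


simple = find_initialisms("Alpha Beta (AB) was measured.")
longer = find_initialisms("We used polymerase chain reaction (PCR).")
none_found = find_initialisms("Cells were grown in broth (LB).")
```

Short forms that skip words, reorder letters, or take several letters from one word are deliberately not handled here; those are the "more involved cases" that a dedicated tool addresses.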