PhenoMiner

The PhenoMiner project ran from October 2012 to December 2014 at the European Bioinformatics Institute (EMBL-EBI) at the Wellcome Trust Genome Campus in Hinxton, Cambridge, UK.

About

Phenotypes play a key role in inferring the complex relationships between genes and human heritable diseases. PhenoMiner is a research project that aims to capture and encode phenotypes described in the scientific literature. This should provide insights into the complex processes involved in human diseases and enable semantic interoperability with existing biomedical ontologies, such as those that describe human anatomy, genetics and behaviours.

PhenoMiner is based on text and data mining technology: natural language processing, machine learning and conceptual analysis. It builds on insights gained from semantic parsing to extract structured information about phenotypes from whole sentences, in contrast to existing techniques, which often rely on string matching. The system exploits the wealth of data locked within the scientific literature, in databases such as PubMed Central and Europe PMC, to extract the semantic vocabulary of phenotypes that scientists use. The system will provide scientists, clinicians and informaticians with the data and tools they need to gain new insights into Mendelian diseases.

Publications

  • Collier, N., Tran, M. V., Le, H. Q., Oellrich, A., Kawazoe, A., Hall-May, M. and Rebholz-Schuhmann, D. (2012), "A hybrid approach to finding phenotype candidates in genetic texts", in Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), Mumbai, India, December 10-14.
  • Collier, N., Oellrich, A. and Groza, T. (2013), "Toward knowledge support for analysis and interpretation of complex traits", Genome Biology 14(9):214. [html]
  • Groza, T., Oellrich, A. and Collier, N. (2013), "Using silver and semi-gold standard corpora to compare open named entity recognisers", in 2013 IEEE International Conference on Bioinformatics and Biomedicine, IEEE BIBM 2013, pp. 481-485.
  • Tran, M. V., Le, H. Q., Phi, V. T., Pham, T. B. and Collier, N. (2013), "Exploring a probabilistic Earley parser for event composition in biomedical texts", in Proceedings of the BioNLP workshop shared task at ACL 2013, Sofia, Bulgaria, pp. 130-134. [html]
  • Collier, N., Tran, M., Le, H., Ha, Q., Oellrich, A. and Rebholz-Schuhmann, D. (2013), "Learning to recognize phenotype candidates in the auto-immune literature using SVM re-ranking", PLoS One 8(10): e72965. [html][pdf]
  • Collier, N., Paster, F. and Tran, M. V. (2014), "The impact of near domain transfer on biomedical named entity recognition", in Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi) at EACL, pp. 11-20. [pdf]
  • Collier, N., Oellrich, A. and Groza, T. (2014), "Concept selection for phenotypes and disease-related annotations using support vector machines", in Proc. PhenoDay and Bio-Ontologies at ISMB 2014. [pdf]
  • Collier, N., Kafkas, S., Kim, J. H. and McEntyre, J. (2015), "OMIM concept annotation: Steps towards automated tagging the disease literature using PhenoMiner phenotypes", Force 2015 Research Communications and e-Scholarship Conference, Oxford, 12-13 January. [pdf poster]
  • Kafkas, S., Kim, J. H., McEntyre, J. and Collier, N. (2015), "Analysis of PhenoMiner phenotypes in the open access full text literature", Force 2015 Research Communications and e-Scholarship Conference, Oxford, 12-13 January. [pdf poster]
  • Oellrich, A., Collier, N., Smedley, D. and Groza, T. (2015), "Generation of silver standard concept annotations from biomedical texts with special relevance to phenotypes", PLoS One 10(1): e0116040. [pdf]
  • Groza, T., Kohler, S., Dolken, S., Collier, N., Oellrich, A., Smedley, D., Couto, F. M., Baynam, G., Zankl, A. and Robinson, P. N., "Automatic concept recognition in the Human Phenotype Ontology reference gold standard and test suites corpora", Database, OUP (in press).
  • Collier, N., Groza, T., Smedley, D., Robinson, P. N., Oellrich, A. and Rebholz-Schuhmann, D. (2015), "PhenoMiner: from text to a database of phenotypes associated with OMIM disorders", under submission for Database, OUP.
  • Collier, N., Oellrich, A. and Groza, T. (2015), "Concept selection for phenotypes and diseases using learn to rank", under submission for the Journal of Biomedical Semantics.

Data Sets

Collier, N., Paster, F. and Tran, M. V (2014), "The impact of near domain transfer on biomedical named entity recognition", in Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi) at EACL, pp. 11-20.[pdf]

Outreach

Collaborators

    • Dietrich Rebholz-Schuhmann (University of Zurich)
    • Damian Smedley (Wellcome Trust Sanger Institute)
    • Anika Oellrich (Wellcome Trust Sanger Institute)
    • Tudor Groza (University of Queensland)
    • Peter Robinson (Charité Universitätsmedizin Berlin)
    • Vu Tran Mai (University of Vietnam)
    • Jo McEntyre (EMBL-EBI)

Funding

PhenoMiner is funded by an FP7 Marie Curie Fellowship (grant 301806).