Scientific Literature Knowledge Bases


Knowledge bases provide a natural representation for scientific hypotheses in many domains. Manually curated ontologies and knowledge graphs are central to many scientific processes in domains such as biomedical research. However, existing knowledge bases cover only a small fraction of the findings reported in the scientific literature. Automatic and semi-automatic approaches offer a means to accelerate the development of these resources.

This half-day workshop will focus on the development and use of knowledge bases that capture important findings reported in the scientific literature. The primary goal of the workshop is to bring together researchers from diverse fields who are interested in the development of Scientific Literature KBs. The workshop features invited talks by experienced researchers actively working in this area, lightning talks on recent work, and an interactive panel discussion to explore synergies between diverse applications and identify key open challenges.

Key Information

Place: UMass Amherst, MA

Time: 8:30am - 12pm

Date: May 22, 2019

The workshop is part of the Automated Knowledge Base Construction (AKBC) conference.

Workshop Schedule

8:30-8:35 - Opening remarks

8:35-9:05 - Invited talk by Iris Shen: From data to knowledge – building the Microsoft Academic Graph for scientific knowledge exploration [slides available here]

9:05-9:20 - 2-minute lightning talks (1-6) [slides available here]

9:20-9:50 - Invited talk by Dina Demner-Fushman: NLP support for indexing of biomedical literature with Medical Subject Headings [slides available here]

10:10-10:40 - Invited talk by Lucy Lu Wang: Ontology-based integration of biological pathway data [slides available here]

10:40-10:55 - 2-minute lightning talks (7-11) [slides available here]

10:55-11:25 - Invited talk by Julia Lane: Where’s Waldo: Finding datasets in empirical research publications [slides available here]

11:25-11:55 - Panel discussion

11:55-12:00 - Closing remarks

Note to presenters: The workshop schedule is fairly tight. We appreciate your cooperation in starting and ending on time.

Invited Talks

1. From data to knowledge – building the Microsoft Academic Graph for scientific knowledge exploration

By Iris Shen, Principal Data Scientist, Microsoft Research Redmond

Abstract: In this Information Age, with massive amounts of data and computational power, machines have made great strides in exhibiting intelligent behaviors, from collecting data to acquiring and utilizing knowledge. In this talk, we will introduce Microsoft Academic, a research project that has created cognitive agents that are simultaneously proficient in more than 660,000 fields-of-study by reading more than a century’s worth of scholarly publications from the web. It produces the Microsoft Academic Graph (MAG), an academic domain knowledge base with six types of scientific entities: publications, authors, institutions, journals, conferences, and fields-of-study. This knowledge base currently has over 400 million entities and more than 3 billion relations. We will describe how we construct MAG and how it can be publicly accessed. A live demo will show how the accumulated knowledge supports analytics and scientific knowledge exploration.

Zhihong (Iris) Shen is a Principal Data Scientist at Microsoft Research and holds a Ph.D. in Operations Research from the University of Southern California and dual B.S. degrees in Electrical Engineering and Economics from Peking University. She is the data science manager for the Microsoft Academic project, which leverages the cognitive power of machines to assist humans in scientific research. Her past work includes business intelligence solutions for web services and large-scale optimization applications in the supply chain management domain. She has published papers in WWW, KDD, and ACL in the areas of data mining, recommender systems, and natural language processing. Her work in supply chain management has been published in Networks, Journal of the Operational Research Society, Computers and Industrial Engineering, and several book chapters.

2. NLP support for indexing of biomedical literature with Medical Subject Headings

By Dina Demner-Fushman, Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health (NIH)

Abstract: The talk will cover the goals of manual indexing of the biomedical literature with controlled vocabulary terms (Medical Subject Headings), the tool that supports manual indexing (Medical Text Indexer), and recent research on deep learning and on using the full text of articles and tables to improve automated indexing.

Dina Demner-Fushman, Investigator, leads research in information retrieval and natural language processing at the National Library of Medicine. Dina earned a doctor of medicine degree from Kazan State Medical Institute, a clinical research Doctorate degree (PhD) in Medical Sciences from Moscow Medical and Stomatological Institute, and MS and PhD degrees in Computer Science from the University of Maryland. She is the author of more than 190 articles and book chapters in the fields of information retrieval, natural language processing, and biomedical and clinical informatics. She is a Fellow of the American College of Medical Informatics (ACMI), an Associate Editor of the Journal of the American Medical Informatics Association, and one of the founding members of the Association for Computational Linguistics Special Interest Group on biomedical natural language processing (SIGBioMed).

3. Ontology-based integration of biological pathway data

By Lucy Lu Wang, Young Investigator at the Allen Institute for Artificial Intelligence

Abstract: Biological pathways are useful tools for understanding human physiology and disease pathogenesis. Pathway analysis can be used to detect genes and functions associated with complex disease phenotypes. Pathways from different databases do not easily inter-operate due to differences in content and knowledge representation, which introduces redundancy into combined pathway datasets.

Ontologies have been used to organize biomedical data and eliminate redundancy between datasets. I mapped pathways from seven pathway databases to classes of one such ontology, the Pathway Ontology. I then generated a normalized pathway dataset by optimizing a model to align and merge pathways associated with each ontology class. These normalized pathways were evaluated against baseline pathways in pathway analysis using four public gene expression datasets. Results suggest that normalized pathways can help to reduce redundancy in enrichment outputs. The normalized pathways also retain an ontological hierarchy, which can be used to visualize enrichment results and provide hints for interpretation. Ontology-based organization of biological pathways can play a vital role in improving data quality and interoperability, and the resulting normalized pathways may have broad applications in genomic analysis.

Lucy Lu Wang is a Young Investigator at the Allen Institute for Artificial Intelligence. She recently completed her PhD in Biomedical and Health Informatics from the University of Washington. Her research interests include knowledge representation, biomedical ontology, bioNLP, and data interoperability and reuse.

4. Where’s Waldo: Finding datasets in empirical research publications

By Julia Lane, Professor, Wagner School and Center for Urban Science and Progress, New York University (NYU)

Abstract: There is a massive change in the need for an evidence basis for policy. Change at the federal level – the Federal Data Strategy as well as the Foundations of Evidence Based Policy Act – is matched by new developments in the access and use of administrative data by state and government agencies. But the new need faces a fundamental challenge. Researchers and analysts who want to use data for evidence and policy cannot easily find out who else worked with the data, on what topics and with what results. As a result, good research is underused, great data go undiscovered and are undervalued, and time and resources are wasted redoing empirical work. This paper describes the results of a new project that uses text analysis and machine learning techniques to discover the “rich context” linking data sets, researchers, publications, research methods, and fields. The approach is inspired by – and is intended to enable both researchers and government analysts in data search and discovery.

Julia Lane is a Professor at the NYU Wagner Graduate School of Public Service and at the NYU Center for Urban Science and Progress, and an NYU Provostial Fellow for Innovation Analytics. She cofounded the Coleridge Initiative, whose goal is to use data to transform the way governments access and use data for the social good through training programs, research projects and a secure data facility. The approach is attracting national attention, including from the Commission on Evidence Based Policy and the Federal Data Strategy. For more information, please see Julia's homepage.

Accepted Abstracts

by Timo Sztyler, Carolin Lawrence, Brandon Malone

by Justin Payan, Michael Spector, Nicholas Monath, Haw-Shiuan Chang, Andrew McCallum

by Shruthi Chari, Miao Qi, Nkechinyere N. Agu, Oshani Seneviratne, James P. McCusker, Kristin P. Bennett, Amar K. Das, Deborah L. McGuinness

by Sheshera Mysore, Zach Jensen, Edward Kim, Kevin Huang, Haw-Shiuan Chang, Emma Strubell, Jeffrey Flanigan, Andrew McCallum, Elsa Olivetti

by Sree Harsha Ramesh, Dung Thai, Boris Veytsman, Andrew McCallum

by Eric Czech

by Osnat Hakimi, Fabio Curi, Josep Lluis Gelpi, Dmitry Repchevski, María Pau Ginebra

by Dustin Wright, Yannis Katsis, Raghav Mehta, Chun-Nan Hsu

by Gabor Melli, Olga Moreira

by Linh Hoang, Richard D. Boyce, Nigel Bosch, Mathias Brochhausen, Joseph Utecht, Jodi Schneider

by Amar Viswanathan, Ioannis Akrotirianakis, Aditi Roy

Important Dates

Abstract submission: April 07, 2019 (extended to April 17, 2019), 11:59pm EST

Workshop: May 22, 2019, 8:30am - 12pm

Call for Abstracts

We welcome submissions of short abstracts (1 page) related to knowledge bases in the scientific literature. Submissions may include previously published results, late-breaking results, work in progress, datasets, among other types of scholarly work. All relevant abstracts will be accepted for a short oral presentation at the workshop in spotlight format. The submitted abstracts will be available for download on the workshop website upon request.

To submit an abstract, please send an email to with the subject line "SLKB submission: [TITLE]". Please include:

  • One-page abstract in PDF format as an attachment. Since the abstracts will not be reviewed, there is no need to anonymize the submissions.
  • Indicate whether any of the authors will be able to present the work in person at the workshop.
  • Indicate whether you need an invitation letter for visa purposes.
  • Indicate whether you'd like us to provide a link to the abstract on the workshop website.

Organizers


Waleed Ammar, Allen Institute for Artificial Intelligence

Keith Hall, Google Research

Zach Ives, University of Pennsylvania

Hoifung Poon, Microsoft Research

Karin Verspoor, University of Melbourne
