A digital mini-encyclopedia for the Vickaryous & Hall 2006 cell catalog based on Wikidata
Tiago Lubiana, University of São Paulo
With the rise of single-cell transcriptomics and large-scale cell characterization projects, creating a catalog of cells in discrete, hierarchically organized types is crucial for developing annotation systems. The Cell Ontology plays a key role in this by organizing diverse cell types within the OBO Foundry community, collecting and describing species-neutral cell types in a computational ontology. However, outside the biomedical ontology community, there are also significant efforts to organize cell diversity. A notable example is the comprehensive catalog of human cell types created by Brian Hall and Matthew Vickaryous in 2006, which includes 411 cell types divided into 34 assemblages. Here, I present a digital version of the Vickaryous & Hall cell catalog, constructed using Wikidata. Over 411 cell types were manually mapped to their Wikidata equivalents and linked to the 34 assemblages on the platform orthogonally to the subClassOf (P279) hierarchy. Using SPARQL queries, the mini-encyclopedia allows users to explore the taxonomy enriched with information from English Wikipedia and Cell Ontology (via Wikidata-based mappings). This project demonstrates how Wikidata can be leveraged for mini-encyclopedias through biocuration of literature-derived, semi-structured catalogs. The mini-encyclopedia enables life scientists and ontologists to navigate this opinionated cell type categorization system, understand the micro-decisions made by the authors, and compare information cataloged on Wikipedia, Wikidata, the Cell Ontology, and the CZ CELLxGENE Cell Guide. The web portal, based on live data, is accessible at https://cellcatalog.toolforge.org.
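As a rough illustration of the kind of SPARQL exploration described above, the following Python sketch builds a query that lists items under a given Wikidata class via subClassOf (P279). The abstract does not state which property links cell types to the 34 assemblages, so that link is not modeled here. The function name, the example QID (Q7868, assumed to be the Wikidata item for "cell"), and the result limit are assumptions made for illustration, not the portal's actual queries.

```python
# Public Wikidata SPARQL endpoint; the query below could be sent there
# (sending is omitted so the sketch runs without network access).
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_subclass_query(root_qid: str, limit: int = 50) -> str:
    """Return a SPARQL query listing items that are (transitive)
    subclasses of `root_qid` through subClassOf (P279).

    `root_qid` is a hypothetical input chosen by the caller.
    """
    if not (root_qid.startswith("Q") and root_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata QID: {root_qid!r}")
    return f"""
SELECT ?cellType ?cellTypeLabel WHERE {{
  ?cellType wdt:P279* wd:{root_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}
""".strip()


if __name__ == "__main__":
    # Q7868 is used only as an example root (assumption: the item for "cell").
    print(build_subclass_query("Q7868", limit=10))
```

The resulting string can be pasted into the Wikidata Query Service or sent to the endpoint with any HTTP client.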
CellCards: Development of a dynamic ontology-derived ETL pipeline for automatic cell information extraction and analysis
Mary Czelusniak1, Emily Tran1, Fatima Oudeif1, Jie Zheng1, William D. Duncan2, Alexander D. Diehl3, Yongqun He1, 1 University of Michigan, Ann Arbor, MI, USA, 2 University of Florida, Gainesville, FL, USA, 3 University at Buffalo, Buffalo, NY, USA.
The CellCards knowledgebase aims to systematically gather and represent individual cell types. This study presents our development of a dynamic extraction, transformation, and loading (ETL) pipeline designed to automatically populate the CellCards database with a vast array of cells from ontologies, including the Cell Ontology (CL) and Cell Line Ontology (CLO). The CellCards database schema includes five tables, with a key feature being the use of one table to encompass all necessary terms from the ontologies and another table to outline the relationships among these terms. The ETL process is powered by a Python script that embeds SPARQL queries directed at the Ontobee SPARQL endpoint. The final ETL program successfully extracted and loaded over 3,500 cell types from CL and 40,000 cell line entries from CLO into the new CellCards database, including the cell type name, parent cell type, synonyms, anatomical locations, etc. The gene biomarkers of cells were automatically extracted from the Common Coordinate Framework Ontology (CCFO). This enhanced database will be used to update the website and query program, with these updates scheduled for summer 2024.
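The two-table idea described above (one table for terms, one for relationships among them) might look roughly like the following sketch. It uses an in-memory SQLite database; the table names, column names, and example rows are hypothetical and are not taken from the CellCards schema, which has five tables. The extraction step (SPARQL queries against Ontobee) is omitted, and the rows are assumed to be already parsed.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one table of ontology terms, one table of
    # (subject, predicate, object) relationships between those terms.
    conn.executescript(
        """
        CREATE TABLE term (
            term_id  TEXT PRIMARY KEY,
            label    TEXT NOT NULL,
            ontology TEXT NOT NULL
        );
        CREATE TABLE term_relationship (
            subject_id TEXT NOT NULL REFERENCES term(term_id),
            predicate  TEXT NOT NULL,
            object_id  TEXT NOT NULL REFERENCES term(term_id),
            PRIMARY KEY (subject_id, predicate, object_id)
        );
        """
    )


def load(conn, terms, relationships) -> None:
    # Loading step: re-running with the same rows does not duplicate them.
    conn.executemany("INSERT OR REPLACE INTO term VALUES (?, ?, ?)", terms)
    conn.executemany(
        "INSERT OR IGNORE INTO term_relationship VALUES (?, ?, ?)",
        relationships,
    )
    conn.commit()


def parents_of(conn, term_id: str) -> list:
    # Look up the labels of a term's direct parents.
    rows = conn.execute(
        """
        SELECT t.label
        FROM term_relationship AS r
        JOIN term AS t ON t.term_id = r.object_id
        WHERE r.subject_id = ? AND r.predicate = 'subClassOf'
        ORDER BY t.label
        """,
        (term_id,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    # Made-up identifiers and labels, for illustration only.
    load(
        conn,
        terms=[("EX:0001", "example parent cell", "EX"),
               ("EX:0002", "example child cell", "EX")],
        relationships=[("EX:0002", "subClassOf", "EX:0001")],
    )
    print(parents_of(conn, "EX:0002"))
```

Keeping relationships in their own table means new relation types (for example, anatomical location) can be added as rows rather than as schema changes.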
Marker genes as sufficient cell type classifiers
Richard H. Scheuermann1, Beverly Peng2, Angela Liu2, Ajith V. Pankajam1, Matthew Diller1, Yun Zhang2, 1Intramural Research Program, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States of America, 2Department of Informatics, J. Craig Venter Institute, La Jolla, CA, United States of America.
Single-cell technologies that quantitatively describe cellular phenotypes using single cell/nucleus RNA sequencing (scRNA-seq) are revealing the cellular complexity of healthy, diseased, and perturbed tissues at unprecedented granularity and scale. After cells with similar transcriptional profiles have been grouped together into cell types/states by unsupervised clustering, marker genes that represent each of these cell types are typically identified to characterize the distinctions reflected in the corresponding data clusters. This raises the question: What is a cell type marker gene? In some cases, marker genes are manually defined by domain experts to reflect known biology and facilitate integration with prior knowledge. In other cases, marker genes are defined by differential expression analysis, an approach that is data driven and based on statistical inference. Here we propose a more rigorous definition: a cell type marker gene is a gene whose product (transcript or protein) is selectively expressed in cells of a given type and can be reliably used, alone or in combination, as a canonical characteristic sufficient to classify cells of that type. Using the machine learning decision tree approach, we find that single marker genes are often not sufficient to unambiguously define a cell type and that combinations of genes are required. Further, marker gene combinations defined by manual curation or differential expression analysis can suffer from high frequencies of false negative classification, and therefore low recall. By using the F-beta metric, which balances precision and recall, the NS-Forest algorithm selects marker gene combinations that optimally classify cell types from scRNA-seq analysis. NS-Forest produces marker gene combinations that are definitional, providing canonical characteristics sufficient to classify cell types in a quantitative, testable manner that can be scaled to handle the volume of data that transcriptomic technologies produce.
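For readers unfamiliar with the F-beta metric mentioned above, a minimal sketch of the standard formula follows. It is the generic definition computed from true positive, false positive, and false negative counts, not the NS-Forest implementation; the function name, the example counts, and the beta values are assumptions chosen for illustration.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Standard F-beta score from classification counts.

    beta < 1 weights precision more heavily; beta > 1 weights recall
    more heavily; beta = 1 gives the usual F1 score.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Hypothetical counts for one marker combination on one cell type:
    # 8 cells correctly called, 2 wrongly called, 8 missed
    # (precision 0.8, recall 0.5).
    print(round(f_beta(8, 2, 8, beta=1.0), 4))
    print(round(f_beta(8, 2, 8, beta=0.5), 4))
```

With these counts the low recall pulls the score down more at beta = 1 than at beta = 0.5, which is the kind of trade-off the choice of beta controls.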
Cell Ontologies, Markers, and Knowledge Graphs
David Osumi-Sutherland, Wellcome Trust Sanger Institute
Linking cell ontology terms to markers fulfils important community use cases, but the contextual nature of markers means that Cell Ontologies are not well suited to storing information about them.
Markers are practical tools for identifying cell types. A gene (or set of genes) is a marker if its expression can be used to unambiguously identify (‘mark’) a cell type in some anatomical and experimental context. Expression of a gene may be unique to a cell type in one anatomical context, but not in a broader anatomical context that includes other cell types that express it. In some experimental contexts, relative levels of expression can be used as a marker of cell types, whereas in others this is not practical. Marker lists can come from historical knowledge about cell types or be derived from a variety of data sources via a variety of algorithmic approaches. Marker lists vary in reliability: those derived from data analysis typically depend on the reliability and accuracy of cell type annotation.
Knowledge graphs are much better suited to representing this complexity and contextual information in queryable form than ontologies.
I will present our work on a knowledge graph for the Cell Ontology, integrating: multiple curated sources of markers; evidence and provenance; thousands of consistently annotated single-cell datasets, together with bioinformatic analysis of these datasets; and the broader ontological context of CL, including links to Uberon and the Gene Ontology along with genes annotated with GO terms that are linked to CL terms. I will also present a draft framework for querying the knowledge graph to create marker reports for cell types that rank results by the strength of evidence and fold in contextual information.