Institute for Systems Biology
(AIM 1) We will analyze and extract knowledge from rich real-world biomedical data sets (listed in the Resources page) in the domains of wellness, cancer, and large-scale clinical records. (AIM 2) We will formalize methods from Aim 1 to develop DOCKET, a novel tool for onboarding and integrating data from multiple domains. (AIM 3) We will work with other teams to adapt DOCKET to additional knowledge domains. ■ The DOCKET tool will offer 3 modules: (1) DOCKET Overview: Analysis of, and knowledge extraction from, an individual data set. (2) DOCKET Compare: Comparing versions of the same data set to compute confidence values, and comparing different data sets to find commonalities. (3) DOCKET Integrate: Deriving knowledge through integrating different data sets. ■ Researchers will be able to parameterize these functions, resolve inconsistencies, and derive knowledge through the command line, Jupyter notebooks, or other interfaces as specified by Translator Standards. ■ The outcome will be a collection of nodes and edges, richly annotated with context, provenance and confidence levels, ready for incorporation into the Translator Knowledge Graph (TKG). ■ All analyses and derived knowledge will be stored in standardized formats, enabling querying through the Reasoner Std API and ingestion into downstream AI assisted machine learning. ■ Example questions this will allow us to address include: (Wellness) Which clinical analytes, metabolites, proteins, microbiome taxa, etc. are significantly correlated, and which changing analytes predict transition to which disease? (Cancer) Which gene mutations in any of X pathways are associated with sensitivity or resistance to any of Y drugs, in cell lines from Z tumor types? (All data sets) Which data set entities are similar to this one? Are there significant clusters? What distinguishes between the clusters? What significant correlations of attributes can be observed?How can this set of entities be expanded by adding similar ones? How do these N versions of this data set differ, and how stable is each knowledge edge as the data set changes over time?