Research

Collaborative Research: Knowledge Guided Machine Learning: A Framework for Accelerating Scientific Discovery


Motivation and Goals

Machine learning (ML) models, which have found tremendous success in several commercial applications where large-scale data is available, e.g., computer vision and natural language processing, are beginning to play an important role in advancing scientific discovery. Indeed, the role of data science in scientific disciplines is beginning to shift from providing simple analysis tools (e.g., detecting particles in Large Hadron Collider experiments) to providing full-fledged knowledge discovery frameworks (e.g., in bioinformatics and climate science). The use of data science is particularly promising in scientific problems involving processes that are not completely understood by our current body of knowledge because of the inherent complexity of the underlying phenomenon. However, the notion of black-box application of data science has met with limited success in scientific domains.

There are two primary characteristics of knowledge discovery in scientific disciplines that have prevented data science models from reaching the level of success achieved in commercial domains. First, scientific problems are often under-constrained in nature as they suffer from a paucity of representative training samples while involving a large number of variables. Further, variables in scientific data commonly show complex and non-stationary patterns that can dynamically change over time. For this reason, the limited number of labeled instances available for training or cross-validation can often fail to represent the true nature of relationships in scientific problems. Hence, standard methods for assessing and ensuring the generalizability of data science models may break down and lead to misleading conclusions. In particular, it is easy to learn spurious relationships that look deceptively good on training and test sets (even after using methods such as cross-validation), but do not generalize well outside the available labeled data. The paucity of representative samples is one of the prime challenges that differentiates scientific problems from mainstream problems involving Internet-scale data such as language translation or object recognition, where large volumes of labeled or unlabeled data have been critical in the success of recent advancements in data science such as deep learning.

The second primary characteristic of scientific domains that has limited the success of black-box data science methods is the basic nature of scientific discovery. While a common end-goal of traditional data science is the generation of actionable models, the process of knowledge discovery in scientific domains does not end there. Rather, it is the translation of learned patterns and relationships into interpretable theories and hypotheses that advances scientific knowledge, e.g., by explaining or discovering the cause-effect mechanisms between variables. Hence, a black-box model that produces physically inconsistent results, even if it is more accurate, cannot deliver a mechanistic understanding of the underlying processes and cannot be used as a basis for subsequent scientific developments. Further, a machine learning model that is grounded in explainable theories stands a better chance of safeguarding against learning spurious patterns from the data that lead to non-generalizable performance. This is especially important when dealing with problems that are critical in nature and associated with high risks (e.g., extreme weather or collapse of an ecosystem). Hence, neither an ML-only nor a scientific knowledge-only approach can be considered sufficient for knowledge discovery in complex scientific and engineering applications. Instead, there is a need to explore the continuum between knowledge-based and ML models, where both scientific knowledge and data are integrated in a synergistic manner.

This research intends to develop a framework that uses the unique capability of data science models to automatically learn patterns and models from data, without ignoring the treasure of accumulated scientific knowledge. As indicated in Fig. 1 (above), the proposed effort builds the foundations of knowledge-guided machine learning by exploring several ways of bringing scientific knowledge and machine learning models together using pilot applications from four domains: aquatic ecodynamics, climate and weather, hydrology, and translational biology. These pilot applications were selected because they are at tipping points where knowledge-guided machine learning can have a transformative effect.

A major goal of this proposal is to formally conceptualize the paradigm of “knowledge-guided machine learning (KGML)”, where scientific theories are systematically integrated with machine learning models in the process of knowledge discovery. This paradigm will be broadly applicable for improving the modeling of physical and biological systems where mechanistic (also known as process-based) models are used, and thus, KGML has the potential for accelerating discovery in a range of scientific and engineering disciplines.

Science of Team Science

Realization of the KGML framework as envisioned in this project will require sustained collaboration between scientific communities and machine learning researchers who are willing to cross disciplinary boundaries and work in a tightly integrated fashion. Our interdisciplinary team is well-positioned to take up this challenge: its members have a long track record of collaborating closely on related problems, both among themselves and with a network of collaborators in state and federal organizations. Furthermore, the team will utilize the newest tools and insights from the field of science of team science to facilitate interdisciplinary collaboration and deep knowledge integration, under the guidance of the Institute for Research In the Social Sciences (IRISS) at CSU. Additionally, five workshops will be held to engage researchers from the wider scientific and engineering community in the ongoing development of KGML.

The KGML Paradigm

The Knowledge-Guided Machine Learning (KGML) paradigm aims to bring about a transformative change in the role of machine learning (ML) for accelerating scientific discovery, going far beyond the black-box application of ML. KGML is an overarching paradigm that encompasses any approach combining ML methods with scientific knowledge of varying forms in different disciplines. The development of this paradigm will require innovative new machine learning approaches and architectures that can incorporate scientific principles.

This HDR framework project is pursuing a roadmap for research in KGML, opening several new possibilities for research in this emerging paradigm. We specifically focus on building the foundations of the KGML framework for applications where domain knowledge is available in the form of mechanistic or process-based models that capture the relationships between input and output variables using known scientific principles or mechanisms.
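One common way to combine a data-driven model with scientific knowledge is to add a penalty for physically inconsistent predictions to the training loss. The following is a minimal sketch of that idea, not the project's implementation. The function names, the weight parameter, and the example constraint (water density should not decrease with depth) are assumptions chosen for illustration.

```python
def density_violations(density_by_depth):
    """Hypothetical constraint: density should not decrease with depth.

    Returns one residual per adjacent depth pair; a positive value means
    the shallower layer is denser than the deeper one (a violation).
    """
    return [upper - lower
            for upper, lower in zip(density_by_depth, density_by_depth[1:])]


def kgml_loss(pred, obs, physics_residuals, weight=1.0):
    """Mean squared data error plus a weighted penalty on violations.

    `weight` is an assumed hyperparameter that trades off fit to the
    observations against physical consistency.
    """
    data_term = sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)
    if not physics_residuals:
        return data_term
    # Only positive residuals (violations) are penalized.
    physics_term = sum(max(0.0, r) ** 2 for r in physics_residuals)
    physics_term /= len(physics_residuals)
    return data_term + weight * physics_term


# A consistent profile adds no penalty; an inverted one does.
consistent = density_violations([998.0, 999.0, 1000.0])
inverted = density_violations([1000.0, 998.0])
loss_ok = kgml_loss([1.0, 2.0], [1.0, 2.0], consistent)
loss_bad = kgml_loss([1.0, 2.0], [1.0, 2.0], inverted, weight=0.5)
```

Because the penalty depends only on the predictions, a term of this kind can in principle also be evaluated on unlabeled inputs, which matters when labeled samples are scarce.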

Sample Publications:

Anuj Karpatne, Gowtham Atluri, James H. Faghmous, Michael Steinbach, Arindam Banerjee, Auroop Ganguly, Shashi Shekhar, Nagiza Samatova, Vipin Kumar. Theory-Guided Data Science: A New Paradigm for Scientific Discovery from Data. IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 10, pp. 2318-2331, 1 October 2017. https://ieeexplore.ieee.org/document/7959606

Jared Willard, Xiaowei Jia, Shaoming Xu, Michael Steinbach, Vipin Kumar. Integrating Physics-Based Modeling with Machine Learning: A Survey. April 2020. https://arxiv.org/abs/2003.04919

Xiaowei Jia, Jared Willard, Anuj Karpatne, Jordan Read, Jacob Zwart, Michael Steinbach, Vipin Kumar. Physics Guided RNNs for Modeling Dynamical Systems: A Case Study in Simulating Lake Temperature Profiles. Proceedings of the 2019 SIAM International Conference on Data Mining, May 2019. doi: 10.1137/1.9781611975673.63. Updated January 2020. https://arxiv.org/pdf/2001.11086.pdf

Jordan S. Read, Xiaowei Jia, Jared Willard, Alison P. Appling, Jacob A. Zwart, Samantha K. Oliver, Anuj Karpatne, Gretchen J.A. Hansen, Paul C. Hanson, William Watkins, Michael Steinbach, Vipin Kumar. Process-Guided Deep Learning Predictions of Lake Water Temperature. 2019. Water Resources Research (55). https://doi.org/10.1029/2019WR024922

Targeted Applications

The proposed effort builds the foundations of knowledge-guided machine learning by exploring several ways of bringing scientific knowledge and machine learning models together using pilot applications from four domains: aquatic ecodynamics, climate and weather, hydrology, and translational biology. The problems chosen are an ideal testbed for this HDR Frameworks proposal, not just because they are of great societal relevance but also because of the richness and interconnectedness of the problems they entail and the diversity of modeling approaches needed to address them. Only by working on problems and modeling approaches with such diversity and complexity can one hope to build the foundation of the novel framework of KGML; pursuing such a framework for isolated scenarios is likely to lead to ad-hoc solutions. The success of this project will be measured both by the ability of the KGML framework to enable scientific advances in the targeted application areas and by the extent to which it lays the groundwork for a broader effort to create a comprehensive HDR institute that brings science, engineering, and data science communities together to leverage the KGML framework.

Hydrology

Primary topic:

Improve hydrologic and water quality simulation and forecasting across scales

Key ideas:

  • Understand and integrate multi-state environmental models and space-time distributed data for improving forecasting across scales ranging from hillslopes (~100 m) to river basins (hundreds of km).

  • Extend streamflow prediction for floods and drought in large river basins.

  • Improve hydrologic models for lake-reservoir-catchment modeling.

  • Advance integration of hydrologic and water quality data and models where upstream local land-use practices affect downstream ecosystem services and river basin resources.

Pilot applications:

  • Apply ML models (ANN, RNN, LSTM) to emulate SWAT model output of increasing complexity (heterogeneity, process inclusion, state variable dependence) for watersheds in southeastern Minnesota. Test the developed ML models on field data.

  • Exploit multidisciplinary community datasets with physics-guided machine learning (PGML) techniques to improve forecasting. The initial focus is on datasets from EPA Upper Mississippi and Chesapeake Bay, and NSF Critical Zone Observatory Shale Hills.
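The emulation idea in the first pilot application can be illustrated at toy scale. The sketch below is an assumption-laden stand-in: a single linear reservoir plays the role of the process-based model (it is not SWAT), and a two-parameter least-squares fit plays the role of the ML emulator (it is not an ANN, RNN, or LSTM). All names and the rainfall series are hypothetical.

```python
def toy_reservoir(rain, k=0.3):
    """Hypothetical process-based model: a single linear reservoir."""
    storage, flows = 0.0, []
    for r in rain:
        storage += r       # rainfall enters storage
        q = k * storage    # outflow is a fixed fraction of storage
        storage -= q
        flows.append(q)
    return flows


def fit_emulator(rain, flows):
    """Least-squares fit of q[t] ~ a * q[t-1] + b * rain[t], no intercept.

    Solves the 2x2 normal equations in closed form (Cramer's rule).
    """
    sqq = sqr = srr = sqy = sry = 0.0
    for t in range(1, len(flows)):
        q_prev, r, y = flows[t - 1], rain[t], flows[t]
        sqq += q_prev * q_prev
        sqr += q_prev * r
        srr += r * r
        sqy += q_prev * y
        sry += r * y
    det = sqq * srr - sqr * sqr
    a = (sqy * srr - sry * sqr) / det
    b = (sqq * sry - sqr * sqy) / det
    return a, b


# Made-up rainfall forcing; the simulator output is the training target.
rain = [0.0, 5.0, 0.0, 0.0, 12.0, 3.0, 0.0, 0.0, 8.0, 0.0, 1.0, 0.0]
flows = toy_reservoir(rain)
a, b = fit_emulator(rain, flows)
```

For this toy simulator the outflow satisfies q[t] = (1 - k) * q[t-1] + k * rain[t] exactly, so the fit recovers a = 1 - k and b = k up to floating-point error. A real process-based model has no such closed form, which is where more expressive emulators and validation against field data come in.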

Sample applications/presentations:

C. Duffy, G. l. Bhatt, L. Shu and A. Kemanian, 2019, Increasing the Value of Mechanistic Watershed Models Through Automation, Emulation and Machine Learning, CERF 2019 25th Biennial Conference, 3-7 November 2019, Mobile, AL (Coastal and Estuarine Research Foundation)

For more information, see the group web page: https://sites.google.com/umn.edu/hdr-hydrology-group

Climate and Weather

Primary topic:

Improve Subseasonal-to-Seasonal (S2S) predictions

Key ideas:

  • Extending weather forecasts beyond 2 weeks is hard (butterfly effect).

  • But there are windows of opportunity, i.e., conditions under which extreme weather can be predicted up to 5 weeks out.

  • We will use machine learning to identify and extract patterns of such windows of opportunity.

  • The machine learning techniques must be transparent / interpretable, so that we can also learn new physics.

Pilot applications:

  • Improve prediction skill of extreme weather events at 2 weeks to 2 months.

  • Increase our understanding of underlying mechanisms that lead to that predictability.

Sample publication:

Toms, B. A., Barnes, E. A., & Ebert-Uphoff, I. (2019). Physically Interpretable Neural Networks for the Geosciences: Applications to Earth System Variability, JAMES, in press.

Aquatic Ecodynamics

Primary topic: Improve lake water quality predictions and understanding

Key ideas:

  • Water quality, such as oxygen availability for cold water fish species, derives from physical, chemical, and biological interactions in lakes. Ecosystem-scale predictions of emergent water quality characteristics, such as deep water anoxia, are challenging because of system complexities.

  • Water managers need accurate predictions of water quality across regions, even in the absence of traditional direct measurements.

  • Methods that combine the knowledge of aquatic ecologists with the power of machine learning show promise for exploiting information in a broad range of data sources to improve water quality predictions.

Pilot applications:

  • Build a process-guided machine learning model to predict oxygen dynamics over three decades in nine well-studied lakes in Wisconsin.

  • Apply the model within the framework of transfer learning to expand predictions to lakes of the upper Midwest region of the US.

Sample publication:

Read, J. S., X. Jia, J. Willard, and others. 2019. Process‐guided deep learning predictions of lake water temperature. Water Resour. Res. 55: 9173–9190.

Translational Biology

Primary topic: Integrate biological knowledge and biochemical mechanisms into the machine learning models to speed up the discovery of novel genomic relationships and gene-environment interactions.

Key ideas:

  • Design graph neural networks for modeling and analyzing complex relationships.

  • Develop graph-based few-shot learning that can learn from a limited number of training examples and generalize well.

  • Develop a multiview graph neural network that employs deep representation learning to learn from heterogeneous information.

Pilot applications:

  • Evaluate the KGML framework on predicting the drug responses of hundreds of human cancer cell lines.

  • Combine -omics data with mechanistic models and KGML to predict tumor-specific relationships and emergent behavior.

Sample publication:

Tianle Ma and Aidong Zhang. Integrate Multi-omics Data with Biological Interaction Networks Using Multi-view Factorization AutoEncoder (MAE), BMC Genomics, December 2019.