Opening Session (ML1) Details
Monday, August 9, 9:30-12:25
(All times are listed in Central Time, UTC -5)
Quicklinks to session details: Opening Session (ML1) Weather and Climate Aquatic Sciences Hydrology Translational Biology Closing Session (ML2)
Other workshop links: Workshop Home Workshop Logistics Poster Sessions Workshop Booklet
YOUTUBE LINKS: Please go to the KGML YouTube Channel for all available recorded presentations.
Session Organizers: Arindam Banerjee, Imme Ebert-Uphoff, Xiaowei Jia, Vipin Kumar, Michael Steinbach
SPEAKERS:
9:30-9:55 Vipin Kumar, University of Minnesota and NSF HDR PI: KGML2021 Workshop Introduction and Overview (Presentation Slides) (Presentation Video)
9:55-10:25 Elizabeth Barnes, Colorado State University: Controlled Abstention Networks: neural networks that say "I Don't Know" to learn better (Presentation Slides) (Presentation Video)
10:25-10:55 Jordan Read, United States Geological Survey: Advancing Water Prediction With Knowledge-Guided Machine Learning Partnerships: Perspectives from the U.S. Geological Survey (Presentation Slides) (Presentation Video)
10:55-11:10 BREAK
11:10-11:35 Xiaowei Jia, University of Pittsburgh: Physics-Guided Machine Learning for Model Initialization Using Physical Simulations (Presentation Video)
11:35-12:00 Zhenong Jin, University of Minnesota: Knowledge guided machine learning for agroecosystem sustainability: applications to modeling N2O emission and ecohydrology (Presentation Slides) (Presentation Video)
12:00-12:25 Keynote: Animashree Anandkumar, California Institute of Technology: Enabling Zero-Shot Generalization in AI4Science (Presentation Video)
Abstract: Knowledge-guided machine learning (KGML) is an emerging paradigm that uses the unique capability of data science models to automatically learn patterns and models from data, without ignoring the treasure of accumulated scientific knowledge. This paradigm is particularly applicable for improving the modeling of physical and biological systems where mechanistic (also known as process-based) models are used, and thus, KGML has the potential for accelerating discovery in a wide range of scientific and engineering disciplines. This introductory talk explains why it’s critical to enhance machine learning with scientific knowledge to accelerate scientific discovery, discusses challenges and opportunities, and presents key KGML objectives that cut across a diverse set of scientific applications. While the workshop specifically focuses on four applications (hydrology, weather/climate, aquatic science, and translational biology), there are KGML challenges that are shared by these and many other scientific disciplines. The aim of this workshop is to discuss early progress in bringing scientific knowledge into machine learning, and to foster interdisciplinary collaborations and interactions among diverse scientific communities.
Bio: Vipin Kumar is a Regents Professor at the University of Minnesota, where he holds the William Norris Endowed Chair in the Department of Computer Science and Engineering. He has authored over 400 research articles, and has coedited or coauthored 10 books, including two textbooks, "Introduction to Parallel Computing" and "Introduction to Data Mining", that are used worldwide and have been translated into many languages. Kumar's current major research focus is on bringing the power of big data and machine learning to understand the impact of human-induced changes on the Earth and its environment. Kumar has been elected a Fellow of the American Association for the Advancement of Science (AAAS), the Association for Computing Machinery (ACM), the Institute of Electrical and Electronics Engineers (IEEE), and the Society for Industrial and Applied Mathematics (SIAM). Kumar's foundational research in data mining and high performance computing has been honored by the ACM SIGKDD 2012 Innovation Award, which is the highest award for technical excellence in the field of Knowledge Discovery and Data Mining (KDD), and the 2016 IEEE Computer Society Sidney Fernbach Award, one of IEEE Computer Society's highest awards in high-performance computing.
Abstract: The earth system is exceedingly complex and often chaotic in nature, making prediction incredibly challenging: we cannot expect to make perfect predictions all of the time. Instead, we look for specific states of the system that lead to more predictable behavior than others, often termed "forecasts of opportunity". When these opportunities are not present, scientists need prediction systems that are capable of saying "I don't know." We introduce a novel loss function, termed "abstention loss", that allows neural networks to identify forecasts of opportunity for regression and classification tasks. The abstention loss works by incorporating uncertainty in the network's prediction to identify the more confident samples and abstain (say "I don't know") on the less confident samples. This approach thus includes a simple method for predicting uncertainty for any neural network regression task. Unlike many methods for attaching uncertainty to neural network predictions post-training, the abstention loss is applied during training to preferentially learn from the more confident samples and is shown to outperform more standard methods for the synthetic climate use cases explored here. The implementation of the proposed loss function is straightforward in most network architectures, as it only requires modification of the output layer and loss function.
Bio: Dr. Elizabeth (Libby) Barnes is an associate professor of Atmospheric Science at Colorado State University. She joined the CSU faculty in 2013 after obtaining dual B.S. degrees (Honors) in Physics and Mathematics from the University of Minnesota, obtaining her Ph.D. in Atmospheric Science from the University of Washington, and spending a year as a NOAA Climate & Global Change Fellow at the Lamont-Doherty Earth Observatory. Professor Barnes' research is largely focused on climate variability and change and the data analysis tools used to understand it. Topics of interest include earth system predictability, jet-stream dynamics, Arctic-midlatitude connections, subseasonal-to-seasonal (S2S) prediction, and data science methods for earth system research (e.g., machine learning, causal discovery). She teaches graduate courses on fundamental atmospheric dynamics and data science and statistical analysis methods. Professor Barnes is involved in a number of research community activities. In addition to being a lead of the new US CLIVAR Working Group: Emerging Data Science Tools for Climate Variability and Predictability, she serves on the CESM Science Steering Committee and recently finished being the lead of the NOAA MAPP S2S Prediction Task Force (2016-2020).
Abstract: The U.S. Geological Survey (USGS) is the primary monitoring and science agency for the nation’s interior. The pillars of “data” and “science” are pervasive throughout the historical establishment of the USGS and are woven throughout the observing and research priorities of the 21st century. The alignment of knowledge-guided machine learning (KGML) – a new field of study innovating the way scientific knowledge is integrated into data-driven modeling – with the twin strengths of the USGS generates unique opportunities for partnership to advance KGML and improve earth system prediction. Collaborations between USGS water scientists and academic leaders in KGML have led to updated modeling strategies for water prediction, new workforce pathways for KGML researchers, and measurable improvements to decision-relevant water forecasts. This talk will share perspectives from successful KGML partnerships that spanned federal and academic research objectives and explore future opportunities for continued innovation.
Bio: Jordan S. Read is chief of the U.S. Geological Survey’s Data Science Branch in the Water Resources Mission Area. Jordan built a data science team in 2016 to advance the USGS’s ability to develop novel, accurate, timely and data-intensive modeling techniques that inform scientists and the public of changes to the quantity and quality of the nation’s water. Jordan's primary training is in physical limnology and numerical modeling, and current research draws on skills in aquatic ecology, time series analysis, advanced environmental sensing, data mining, ecoinformatics, modeling, and synthesis. Jordan’s research themes are: 1) understanding and predicting the impacts of changing climate and land use on lake and stream ecosystems, and 2) using lakes or stream networks as model systems to test and develop predictive architectures that integrate process knowledge into a new class of advanced machine learning methods called “knowledge-guided machine learning”.
Abstract: Physics-based models are widely used to study dynamical systems in a variety of scientific and engineering problems. Although they are built based on general physical laws that govern the relationships from input to output variables, these models often produce biased simulations due to inaccurate parameterizations or approximations used to represent the true physics. In this work, we aim to build a new data-driven framework to monitor dynamical systems by extracting general scientific knowledge embodied in simulation data generated by the physics-based model. To handle the bias in simulation data caused by imperfect parameterization, we propose to extract general physical relationships jointly from multiple sets of simulations generated by a physics-based model under different physical parameters. In particular, we develop a spatio-temporal network architecture which uses its gating variables to capture the variation of physical parameters. We initialize this model using a pre-training strategy that helps discover common physical patterns shared by different sets of simulation data. Then we fine-tune it using limited observation data via a contrastive learning process. By leveraging the complementary strengths of machine learning and domain knowledge, our method has been shown to produce accurate predictions, use fewer training samples, and generalize to out-of-sample scenarios. We further show that the method can provide insights about the variation of physical parameters over space and time.
Bio: Xiaowei Jia is an Assistant Professor in the Department of Computer Science at the University of Pittsburgh. He obtained his Ph.D. degree at the University of Minnesota, under the supervision of Prof. Vipin Kumar. Prior to that, he received his M.S. degree from the State University of New York at Buffalo and his B.S. degree from the University of Science and Technology of China. His research interests include physics-guided data science, spatio-temporal data mining, and deep learning. His research has been published in major journals in data mining (e.g., TKDE) and scientific journals, as well as top-tier conferences (e.g., SIGKDD, ICDM, SDM, and CIKM). Jia was the recipient of the UMN Doctoral Dissertation Fellowship (2019), the Best Applied Data Science Paper Award in SDM 21, the Best Conference Paper Award in ASONAM 16, and the Best Student Paper Award in BIBE 14.
Abstract: Accurate and rapid quantification of carbon, nitrogen and water cycles throughout the agroecosystem is critical to ensure the co-sustainability of food production and environmental protection. For example, a credible way to quantify the amount of greenhouse gases (GHGs) that are sequestered or avoided in cropland is the basis for building the agricultural carbon market, which has the potential to incentivize farmers to adopt regenerative agriculture practices and generate multifaceted benefits. Cropping system models are widely used to simulate these processes. But such models have well-known limitations, such as insufficient representations of the physical and biogeochemical processes and uncertainties in a large number of model parameters. These limitations can be especially serious when applying such models across a heterogeneous landscape with very limited observations (i.e., training data), which is often the case with GHGs. Here, we present the first application of Knowledge Guided Machine Learning (KGML) to modeling N2O fluxes and runoff, both of which are characterized by a “hot-spot, hot-moment” pattern. With a comprehensive dataset from a mesocosm N2O experiment and an advanced process-based model, Ecosys, we investigated a range of KGML models with different initial inputs, hierarchical structures, pretraining, and multitask learning strategies. Results show that KGML models (R2 = 0.81 and RMSE = 3.6 mg N m-2 day-1) outperform the process-based model and pure machine learning models in accuracy, especially for the period with high peak N2O flux and complex dynamics. The proposed KGML method uses initial values generated from Ecosys to provide a solid initial state of the system. This not only decreases the amount of data required but also increases the overall performance of the model. Similar experiments were designed for a synthetic streamflow dataset available from the CAMELS dataset. The results show that state-aware multi-task formulations outperform traditional multi-task formulations. Overall, our findings demonstrate the high potential of KGML for complex agroecosystem modeling. Importantly, our KGML model structures are flexible enough to assimilate real-time observations (e.g., satellite data) and are thus scalable to large-scale applications.
Bio: Zhenong Jin is a broadly trained agroecologist whose research integrates process-based models, remote sensing, and machine learning approaches to advance the science that guides and supports agricultural sustainability. His current research mainly focuses on crop mapping and yield forecasting, nitrogen (N) and phosphorus (P) cycle management, the system-of-systems solution for greenhouse gas (GHG) quantification, and climate change adaptation. Dr. Jin received a Ph.D. degree from Purdue University, completed his postdoctoral training at Stanford University, and was the Lead Crop Scientist at AtlasAI, where he directed the development of high-resolution crop yield maps in Sub-Saharan African countries.
Abstract: AI holds immense promise in enabling scientific breakthroughs and discoveries in diverse areas. However, in most scenarios this is not a standard supervised learning framework. AI4science often requires zero-shot generalization to entirely new scenarios not seen during training. For instance, drug discovery requires predicting properties of new molecules that can vastly differ from training data, and AI-based PDE solvers require solving any instance of the PDE family. Such zero-shot generalization requires infusing domain knowledge and structure. I will present recent success stories in using AI to obtain 1000x speedups in solving PDEs and quantum chemistry calculations.
Bio: Anima received her B.Tech in Electrical Engineering from IIT Madras in 2004 and her PhD from Cornell University in 2009. She was a postdoctoral researcher at MIT from 2009 to 2010, a visiting researcher at Microsoft Research New England in 2012 and 2014, an assistant professor at U.C. Irvine between 2010 and 2016, an associate professor at U.C. Irvine between 2016 and 2017, and a principal scientist at Amazon Web Services between 2016 and 2018. She holds dual positions in academia and industry. She is a Bren Professor in the Caltech CMS department and a director of machine learning research at NVIDIA. At NVIDIA, she leads the research group that develops next-generation AI algorithms. At Caltech, she is the co-director of DOLCIT and co-leads the AI4science initiative, along with Yisong Yue. She has spearheaded the development of tensor algorithms, first proposed in her seminal paper. They are central to effectively processing multidimensional and multimodal data, and to achieving massive parallelism in large-scale AI applications.