Data-Driven Exploration of Interconnected Risks in Complex Human–Natural Systems
Risk identification & quantification in complex human-natural systems via convergent, data-intensive research.
About
Many key areas of social and scientific importance, such as climate, finance, energy, transportation, and ecology, can be viewed as complex networks of interdependent processes. These connections mean that small events in one area may accumulate through the network and wreak havoc on the entire system. Currently, no single discipline is equipped to identify broader signs of systemic risk and mitigation targets. In this workshop, we will study how risks in different domains are connected. For instance, what risks in agriculture, ecology, energy, finance, and hydrology are heightened by climate variability and change? How might risks in, for example, space weather be connected with energy, water, and finance? Recent advances in computing and data science, together with the data revolution in each of these domains, have provided a means to address these questions.
The workshop will focus on techniques and tools that enable the exploration of large-scale, complex systems (i.e., systems composed of many highly interconnected components that produce nonlinear, adaptive, emergent behavior) via multi-resolution dynamic datasets. We seek ideas and applications that will allow scientists to collaborate across disparate domains and integrate analyses of datasets on different time scales and resolutions in order to improve the prediction of risks (potentials for extreme outcomes and system failures). Topics may include, but are not limited to, new approaches in data representation and integration that facilitate cross-disciplinary collaboration, case studies demonstrating collaboration of multiple domains via data science, data mining and data science techniques to quantify systemic risk, and cutting-edge applications of complex systems analyses.
We invite submissions of novel research on real-world data systems and applications, industrial experiences, and identification of challenges in deploying research ideas in practical applications, with a focus on interdisciplinary risks in human-natural systems.
The central goal of this workshop at KDD 2021 is to identify interrelationships between domains and propose novel approaches to understanding and quantifying systemic risk in our human-natural systems.
Topics of Interest
Defining & quantifying risk (assessment) from data
Combining multiresolution data
Knowledge discovery for human-natural systems
Complex systems/network analysis
Systemic risk measures
Data science methods for facilitating collaboration across domains
Cross-disciplinary research
Linking data and domains via data science
Tentative Schedule
4-8 PM ET, August 15, 2021 (4-8 AM SG, August 16, 2021)
4:00-4:05 PRISM team kick off
4:05-4:40 Megan Konar
4:40-5:15 Kate Calder
5:15-5:50 Bob Cruickshank
5:50-6:25 Upmanu Lall
6:25-6:45 PRISM team
6:45-7:55 poster session
7:55-8:00 Wrap-up
Invited Speakers
Abstract: Research on ‘neighborhood effects’ focuses on linking features of social contexts or exposures to health, educational, and criminological outcomes. Traditionally in the literature on neighborhood effects, individuals are assigned to a specific neighborhood, frequently operationalized as the census tract of residence, which may or may not contain the locations of their routine activities. In order to better characterize the many social contexts to which individuals are exposed as a result of the spatially- and temporally-distributed locations of their routine activities and to understand the consequences of these socio-spatial exposures, we have developed the concept of ecological networks. Ecological networks are two-mode networks that indirectly link individuals through the spatial overlap in their routine activities. This presentation focuses on statistical and machine learning methodology for identifying communities from ecological networks that capture individuals who have similar regular activity patterns. We apply these methods to activity-pattern data collected using GPS-enabled cell phones as part of a large-scale study of youth and their caregivers in Columbus, OH.
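As a rough illustration of the two-mode structure described above (not the speaker's method), individuals and activity locations can be stored as an incidence matrix, and individuals can then be linked indirectly by counting the locations they share. The toy matrix and the function name shared_location_counts are hypothetical.

```python
import numpy as np

def shared_location_counts(incidence: np.ndarray) -> np.ndarray:
    """One-mode projection of a two-mode (individual x location) network.

    Entry (i, k) of the result counts the locations visited by both
    individual i and individual k. The diagonal is zeroed out.
    """
    shared = incidence @ incidence.T
    np.fill_diagonal(shared, 0)  # ignore each individual's overlap with itself
    return shared

# Hypothetical toy data: rows are individuals, columns are activity locations.
# incidence[i, j] = 1 if individual i routinely visits location j.
incidence = np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
])

shared = shared_location_counts(incidence)
print(shared)
```

Here individuals 0 and 1 are linked through one shared location, while individual 2 shares no location with anyone. Community detection would then operate on a structure of this kind.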
Bio: Catherine (Kate) Calder is a professor in the Department of Statistics and Data Sciences at the University of Texas at Austin and currently serves as department chair. Prior to moving to UT Austin in 2019, she spent 16 years on the faculty of The Ohio State University. She served as an associate director (2015–2018) and co-director (2018–2019) of the Mathematical Biosciences Institute, an NSF Division of Mathematical Sciences Research Institute located on the Ohio State campus. At UT Austin, she is the Scientific and Technical Core Director of the UT Population Research Center. She currently serves as an associate editor for the Annals of Applied Statistics and Bayesian Analysis and has served the profession through various elected roles in sections of the American Statistical Association (ASA) and in the International Society for Bayesian Analysis. Her research has been funded by the NIH, NSF, NASA, and other agencies and foundations. She received the ASA Section on Statistics and the Environment’s 2013 Young Investigator Award and was elected Fellow of the ASA in 2014. Dr. Calder's current research focuses on spatio-temporal statistics, Bayesian methods, and network analysis. Her work is motivated by applications in the environmental, social, and health sciences.
Abstract: Food consumption and production are separated in space through flows of food along complex supply chains. These food supply chains are critical to our food security, making it important to evaluate them. However, detailed spatial information on food flows within countries is rare. The goal of this paper is to estimate food flows between all county pairs within the United States. To do this, we develop the Food Flow Model, a data-driven methodology to estimate spatially explicit food flows. The Food Flow Model integrates machine learning, network properties, production and consumption statistics, mass balance constraints, and linear programming. Specifically, we downscale empirical information on food flows between 132 Freight Analysis Framework locations (17,292 potential links) to the 3,142 counties and county-equivalents of the United States (9,869,022 potential links). Subnational food flow estimates can be used in future work to improve our understanding of vulnerabilities within a national food supply chain, determine critical infrastructures, and enable spatially detailed footprint assessments.
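The link counts quoted above match the number of ordered pairs of distinct locations, n(n-1). A quick check, where the function name directed_links is a hypothetical helper:

```python
def directed_links(n_nodes: int) -> int:
    """Number of ordered origin-destination pairs, excluding self-links."""
    return n_nodes * (n_nodes - 1)

faf_links = directed_links(132)      # Freight Analysis Framework locations
county_links = directed_links(3142)  # counties and county-equivalents

print(faf_links, county_links)  # 17292 9869022
```

Downscaling therefore expands the set of candidate links by a factor of roughly 570.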
Bio: Megan Konar is an associate professor in the Department of Civil and Environmental Engineering at the University of Illinois at Urbana-Champaign. Prof. Konar's research focuses on the intersection of water, food, and trade. Her research is interdisciplinary and draws from hydrology, environmental science, and economics. Dr. Konar received a PhD in Civil and Environmental Engineering from Princeton University in 2012, an MS in Water Science, Policy and Management from Oxford University in 2005, and a BS in Conservation and Resource Studies from UC Berkeley in 2002. She was recently awarded the NSF CAREER award and the Early Career Award from AGU Hydrologic Sciences.
Abstract: The broadband industry recently delivered two innovations that leverage global Hybrid Fiber-Coax (HFC) networks to help monitor the electric power grid. The first is the Gridmetrics™ Power Event Notification System, PENS™, which provides new insights into the performance and availability of secondary distribution grids. The PENS product fills an immediate need for improved situational awareness of grid failures and is available as an Esri data layer, via Twitter, and online at www.gridmetrics.io. Motivated by the successes of PENS and recognizing that existing sensing capabilities are out of date, the second innovation is the new ANSI/SCTE XX U.S. National Standard, Requirements for Power Sensing in Cable and Utility Networks, which includes observing and communicating voltage and current at up to 10 kHz with a precision of 0.002 per-unit, a timestamp resolution <= 1 microsecond, and clock accuracy <= 1/2 microsecond. The new ability to backhaul uncompressed continuous point on wave (CPOW) power quality observations is a quantum leap beyond traditional phasor measurement units that creates a plethora of greenfield opportunities in research and network operations. Together, these two innovations pave the way for rapid development and global adoption of open-standards-based smart grid monitoring.
Bio: Robert Cruickshank is a technology strategist, inventor, and implementer of optimal load shaping that maximizes renewable energy and minimizes fuel costs and emissions. A 40-year researcher in telecommunications, Dr. Cruickshank developed smart home devices at AT&T Bell Laboratories and currently develops electric grid monitoring and control applications with Cable Television Laboratories and several U.S. DOE National Labs.
Abstract: Nonparametric and machine learning methods have significantly advanced our ability to model non-Gaussian and non-stationary processes. Deep Learning methods, Bayesian multi-level, non-homogeneous Hidden Markov/Semi-Markov Models, as well as Wavelets in different combinations have emerged as useful constructs for space-time models of physical phenomena and temporal sequences of images. Many of these methods are computationally intensive, and may require the specification and choice of specific parametric functional forms and inference as to their parameters. Often, the candidate function space for these models can be quite large, and only a finite sub-set is explored. By contrast, fully non-parametric methods, such as kernel methods or k-nearest neighbor (k-nn) methods have been difficult to apply with high dimensional multivariate or spatio-temporal data due to the curse of dimensionality in the presence of a finite data set. Thus, while ideas from nonlinear dynamical systems have been applied extensively to empirically reconstruct low dimensional attractors from time series in one (with time lag embedding) or a few variables, there has been rather limited progress in extending these methods to high dimensional spatio-temporal fields.
We present an extension to a k-nearest neighbor time series simulation and forecasting algorithm that we originally developed (Lall and Sharma, 1996) for modeling nonlinear, non-Gaussian time series to consider spatio-temporal dynamics of multiple fields. The example application presented considers daily or hourly wind and solar energy spatial fields at each of 216 locations in Texas covering 50 years of re-analysis data, i.e., a total dimension of 432 variables with time series structure. Both variables are bounded and non-Gaussian. Spatial dependence is local as well as regional. The dynamics vary by time of day and by season, and an examination of individual time series reveals non-linearity in the recurrence functions. Our goal is to model the joint spatio-temporal variation of these two fields, preserving the marginal density of each variable at each location, the spatial and temporal correlation, the temporal spectra, and the statistics of the run lengths of persistent shortages and surpluses in the aggregate energy produced from wind and solar across Texas. This covers local statistics as well as aggregate statistics, and the expectation is that a model that reproduces these statistics under space-time simulation may also be effective for near-term forecasting of the spatial fields, preserving the spatial structure and the aggregate energy output from the domain. This example illustrates the kinds of applications we envisage, and we expect that the model would also be useful for simulating other hydroclimatic variables (including their joint space-time dependence), pandemics, financial time series, etc.
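One of the target statistics above, the run lengths of persistent shortages, can be computed along the following lines. This is a minimal sketch: the aggregate energy values and the shortage threshold are made up for illustration, and the abstract does not say how a shortage is defined.

```python
def run_lengths(flags):
    """Lengths of consecutive runs of True values in a sequence of booleans."""
    lengths, current = [], 0
    for flag in flags:
        if flag:
            current += 1
        elif current:
            lengths.append(current)
            current = 0
    if current:  # close a run that reaches the end of the sequence
        lengths.append(current)
    return lengths

# Hypothetical aggregate energy series and shortage threshold.
energy = [5.0, 2.0, 1.0, 6.0, 3.0, 7.0, 2.0, 2.0, 1.0]
threshold = 4.0

shortage_runs = run_lengths([e < threshold for e in energy])
print(shortage_runs)  # [2, 1, 3]
```

Surplus runs follow by flipping the comparison. Comparing the distribution of these lengths between simulated and historical series is one way to check whether persistence is preserved.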
The algorithm is briefly sketched as follows. First, standard tools from nonlinear time series embedding are used to select an appropriate time delay and embedding dimension for each time series. This is consistent with the approach in Lall and Sharma (1996), in that the k-nn of the embedded state vector at a given time are then identified from the historical data as "analogs" of the current state, and the next time series value is then simulated probabilistically or forecast using the successors of these k-nn. Next, we consider similarity in the dynamics of each of the time series. If two series are perfectly correlated, then the time indices of the k-nn identified at a certain time for those two time series will be identical. Thus, a basis for dynamic similarity across time series is established. Let the probability assigned to the k-th neighbor for series j at time t be p_jk(t). Now identify all the unique time indices of the k-nn across all time series at that time, and let the resulting matrix (rows are unique k-nn time indices, columns are sites/variables, and entries are the corresponding p_jk) at time t be P(t). From the perspective of the spatial fields being modeled, we then seek the "best" k-nearest neighbors that account for the similarity in the dynamics of the individual time series. We consider two cases. The first is that the underlying dynamics are homogeneous, i.e., the space-time fields represent a realization from a common underlying dynamical model with some variation in the parameters across the sites. In this case, the probability assigned to each of the candidate unique k-nn time indices is obtained simply by combining the p_jk across sites (columns of P), reflecting the joint probability of a historical time index as representative of the dynamics at the current time across all series. The second case is non-homogeneous, where we consider multiple dynamical mechanisms that may operate at different sites.
In this case we consider a cluster analysis of P across sites, to group the sites into sub-groups that have the maximum similarity within group and dissimilarity across groups based on the p_jk values assigned to the unique k-nn time indices identified. Then the simulation proceeds in the same way for each cluster separately. This procedure is applied recursively to generate spatial fields for as many time steps as desired. The process effectively leads to a bootstrap, since we resample historical values of the spatial field conditionally given the current state of the fields. A trivial extension is to allow new values to be sampled by drawing from a smooth, random perturbation of the empirical cumulative distribution function of each series (or from a parametric marginal distribution) fit to each individual series. The process developed can be thought of as a nonparametric spatio-temporal kernel applied to the data, under the assumption that the temporal dynamics can be identified independently first and then spatial similarity in the temporal dynamics used to complete the space-time conditioning.
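The homogeneous case of the procedure sketched above might look roughly as follows. This is an illustrative sketch, not the authors' implementation, and it rests on several assumptions the abstract leaves open: a fixed embedding dimension and lag shared by all series, rank-based neighbor weights proportional to 1/rank, and averaging across sites as the rule for combining the p_jk. All function names and the synthetic data are hypothetical, and the cluster-based non-homogeneous case is not covered.

```python
import numpy as np

def embed(x, dim, lag):
    """Time-delay embedding; row i is the state at time i + (dim - 1) * lag."""
    start = (dim - 1) * lag
    return np.column_stack(
        [x[start - d * lag: len(x) - d * lag] for d in range(dim)]
    )

def knn_probabilities(x, k, dim, lag):
    """Time indices of the k nearest past analogs of the latest state,
    with assumed rank-based probabilities proportional to 1/rank."""
    states = embed(x, dim, lag)
    current, history = states[-1], states[:-1]  # every history state has a successor
    dist = np.linalg.norm(history - current, axis=1)
    nearest = np.argsort(dist, kind="stable")[:k]
    weights = 1.0 / np.arange(1, k + 1)
    return nearest + (dim - 1) * lag, weights / weights.sum()

def build_P(series, k, dim=2, lag=1):
    """Matrix P: rows are unique k-nn time indices, columns are sites."""
    n_sites = series.shape[1]
    per_site = [knn_probabilities(series[:, j], k, dim, lag) for j in range(n_sites)]
    unique_t = np.unique(np.concatenate([idx for idx, _ in per_site]))
    P = np.zeros((len(unique_t), n_sites))
    for j, (idx, p) in enumerate(per_site):
        P[np.searchsorted(unique_t, idx), j] = p
    return unique_t, P

def simulate_step(series, k, rng, dim=2, lag=1):
    """Resample the successor of one historical time index for all sites."""
    unique_t, P = build_P(series, k, dim, lag)
    joint = P.mean(axis=1)  # assumed combination rule across sites
    joint = joint / joint.sum()
    t_star = rng.choice(unique_t, p=joint)
    return series[t_star + 1]

# Synthetic two-site example (hypothetical data).
rng = np.random.default_rng(0)
t = np.arange(200)
history = np.column_stack([np.sin(0.2 * t), np.sin(0.2 * t + 0.5)])
history = history + 0.05 * rng.standard_normal(history.shape)

series = history.copy()
for _ in range(10):  # apply the step recursively
    series = np.vstack([series, simulate_step(series, k=5, rng=rng)])
```

Because whole historical fields are resampled, every simulated row is a copy of some observed row, which is the bootstrap property noted above. Cross-site structure within each time step is preserved by construction.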
Poster Session
"Quantifying systemic risk of disruptions and calculating impacts of mitigations in interconnected systems interfacing between human, environment, economy, and technology using agent-based Synthetic Modelling Environments." Myrna Bittner, Dean Bittner and Sama Ahmed
"Uncertainty Quantification through Dual Maxima and Minima Autoregressive Conditional Fréchet Models for High-dimensional Financial Time Series." Yu Chen, Tiantian Mao and Zhengjun Zhang
"A Data Library to Archive, Analyze, Visualize and Serve Online Datasets from Multiple Domains in an Interoperable Framework" Rémi Cousin, John del Corral and Drew Resnick
"Novel biodiversity indicators based on finance metrics." Mei-Ling Feng, Mila Getmansky Sherman, Toryn L. J. Schafer, Christian Che-Castaldo, David S. Matteson and Judy Che-Castaldo
"Population Health Effects of Temperature Change due to Carbon Emissions: Probabilistic Forecasts." Kai Fukutaki, Bronte Dalton, Stein Emil Vollset, Katrin Burkart, Joseph Dieleman and Jeffrey Stanaway
"Role of Variable Renewable Energy Penetration on Electricity Price and its Volatility across Independent System Operators in the United States" Olukunle O Owolabi, Toryn L.J. Schafer, Georgia E. Smits, Sanhita Sengupta, Sean E. Ryan, Lan Wang, David S. Matteson, Mila Getmansky Sherman and Deborah A. Sunter
"Data sharing facilities integrating species biology into species distribution models developed for biodiversity conservation." Tyler Schartel, Yong Cao, Bridget Henning-Randa, Mei-Ling Feng and Leon Hinz Jr.
Submission
Deadlines and Dates
Paper submission: May 20, 2021 (extended deadline: June 10, 2021, 23:59 Anywhere on Earth)
Acceptance notifications: July 10, 2021
Workshop date: 4-8 PM ET, August 15, 2021 (4-8 AM SG, August 16, 2021)
For an oral presentation, full-length papers or extended abstracts of up to 5 pages including figures and references are welcome. Work-in-progress papers are encouraged, including papers being considered for publication elsewhere. Submissions must be in PDF format and formatted according to the new Standard ACM Conference Proceedings Template (https://www.acm.org/publications/proceedings-template). For a poster submission we ask for an abstract of up to half a page (about 300 words) in PDF format.
All papers will be peer-reviewed. Reviews are not double-blind, and author names and affiliations should be listed. Please use the ACM guidelines to format your paper. If accepted, at least one of the authors must attend the workshop to present the work. We can only accommodate a limited number of oral presentations and some paper submissions may be converted to poster presentations.
All submissions can be made through EasyChair using the following link: https://easychair.org/conferences/?conf=prismkdd1
For poster submissions, please indicate the intent for a poster presentation at the beginning of the abstract description.
All presenters will be required to register for the KDD 2021 conference (https://kdd.org/kdd2021/attending).
Program Committee
David S. Matteson, Cornell University
Ryan McGranaghan, ASTRA
Mila Getmansky Sherman, UMass Amherst
Marie Düker, Cornell University
Michael Jauch, Cornell University
Sean Ryan, Cornell University
Toryn Schafer, Cornell University
Mei-Ling (Emily) Feng, Lincoln Park Zoo
Olukunle Owolabi, Tufts University