Platform for Explainable Distributed Infrastructure
PosEiDon aims to advance the knowledge of how simulation and machine learning (ML) methodologies can be harnessed and amplified to improve DOE’s computational and data science.
DOE science workflows are increasingly executed on federated services infrastructures that are complex and managed by different organizations, domains, and communities. As a result, the operators of these infrastructures and the scientists who use them have limited global visibility and, consequently, an incomplete understanding of the behavior of the full set of resources that science workflows span. This limited visibility makes it extremely difficult to predict performance, to detect and diagnose anomalies in the infrastructure (e.g., network congestion, I/O bottlenecks), and to understand their impact on the scientists’ workflows. PosEiDon will provide an integrated platform consisting of algorithms, methods, tools, and services that help facility operators and scientists improve the overall end-to-end science workflow by (1) predicting the performance of complex workflows; (2) detecting and classifying infrastructure and workflow anomalies and “explaining” the sources of these anomalies; and (3) suggesting performance optimizations.
PosEiDon will explore the use of simulation, ML, and hybrid methods to predict, understand, and optimize the behavior of complex DOE science workflows (simulation, instrument data analysis, ML, and superfacility) on production DOE computational and data infrastructure (CDI). The solutions will be developed using data collected from DOE and NSF testbeds, then validated and refined in production CDI.
For performance prediction, PosEiDon will develop domain-informed spatiotemporal graph neural network (GNN) methods with uncertainty quantification, boosted with neural architecture search and hyperparameter optimization to automatically generate GNN models that can predict workflow execution on CDI with high accuracy. ML-based prediction methods will be compared with simulation-based performance predictions in terms of accuracy, time to prediction, and development cost (the amount of data and resources required). PosEiDon will also explore hybrid solutions in which ML training is fed targeted data from a testbed and an ML-tuned simulator.
For anomaly detection, PosEiDon will explore real-time streaming ML models that detect and classify anomalies by leveraging underlying spatial and temporal correlations and expert knowledge, combine heterogeneous information sources, and generate real-time predictions. Furthermore, PosEiDon will develop deep reinforcement learning methods that can self-learn corrective behaviors and optimize workflow performance.
In all methods, PosEiDon will focus on explainability, adapting and customizing eXplainable AI (XAI) methods. For example, to show how ML models predict workflow performance, PosEiDon explanations will identify the workflow and platform parameter values that have a significant impact on performance. These insights will be cross-validated with CDI experts. Successful solutions will be incorporated into a prototype system with a dashboard that DOE scientists and CDI operators will use for evaluation.