Platform for Explainable Distributed Infrastructure
PosEiDon aims to advance the knowledge of how simulation and machine learning (ML) methodologies can be harnessed and amplified to improve DOE’s computational and data science.
DOE science workflows are increasingly being executed on federated services infrastructures.
PosEiDon will explore the use of simulation, ML, and hybrid methods to predict, understand, and optimize the behavior of complex DOE science workflows (simulation, instrument data analysis, ML, and superfacility) on production DOE computational and data infrastructure (CDI). The solutions will be developed based on data collected from DOE and NSF testbeds and then validated and refined in production CDI. PosEiDon will develop domain-informed spatiotemporal graph neural network (GNN) methods with uncertainty quantification, boosted with neural architecture search and hyperparameter optimization to automatically generate GNN models that can predict workflow execution on CDI with high accuracy. ML-based prediction methods will be compared with simulation-based performance predictions in terms of accuracy, time to prediction, and development cost (amount of data and resources). PosEiDon will also explore hybrid solutions where ML training is fed data in a targeted way from a testbed and an ML-tuned simulator.
For anomaly detection, PosEiDon will explore real-time streaming ML models that detect and classify anomalies by leveraging underlying spatial/temporal correlations and expert knowledge, combine heterogeneous information sources, and generate real-time predictions. Furthermore, PosEiDon will develop deep reinforcement learning methods that can self-learn corrective behaviors and optimize workflow performance.
In all methods, PosEiDon will focus on aspects of explainability and adapt and customize eXplainable AI (XAI) methods. For example, to understand how ML models predict workflow performance, PosEiDon explanations will provide insights into the workflow and platform parameter values that have a significant impact on performance. These insights will be cross-validated with CDI experts. Successful solutions will be incorporated into a prototype system with a dashboard that will be used for evaluation by DOE scientists and CDI operators.
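As a rough illustration of the kind of explanation described above (which workflow and platform parameters most affect a predicted performance), the sketch below applies permutation importance, a generic model-agnostic XAI technique. It is not PosEiDon's method: the predictor `predict_runtime`, the feature names, and the data are all hypothetical and synthetic, chosen only to make the example self-contained.

```python
import random

# Hypothetical feature names; none of these come from the PosEiDon project.
FEATURES = ["input_gb", "cores", "latency_ms", "node_id"]


def predict_runtime(row):
    """Hypothetical stand-in for a trained workflow runtime predictor."""
    input_gb, cores, latency_ms, _node_id = row  # node_id is deliberately ignored
    return 10.0 * input_gb / cores + 0.05 * latency_ms


def mean_squared_error(rows, targets, model):
    """Average squared difference between model predictions and targets."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(rows, targets, model, n_features, rng, repeats=5):
    """Score each feature by the average error increase when its column is shuffled."""
    baseline = mean_squared_error(rows, targets, model)
    scores = []
    for j in range(n_features):
        increase = 0.0
        for _ in range(repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            increase += mean_squared_error(shuffled, targets, model) - baseline
        scores.append(increase / repeats)
    return scores


rng = random.Random(0)  # fixed seed so the run is repeatable
rows = [
    (rng.uniform(1, 100), rng.choice([4, 8, 16, 32]),
     rng.uniform(1, 50), rng.randrange(100))
    for _ in range(200)
]
# Synthetic, noise-free targets: the baseline error is therefore zero.
targets = [predict_runtime(r) for r in rows]

scores = permutation_importance(rows, targets, predict_runtime, len(FEATURES), rng)
ranking = sorted(zip(FEATURES, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A feature the predictor ignores (here `node_id`) receives a score of zero, while parameters that drive the prediction receive larger scores. A ranking of this kind is the sort of output that domain experts could then check against their own understanding of the platform.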