Everything, Everywhere, All the Time: End-to-end Software Infrastructure for Real-time Coupling of HPC, AI, and IoT
AI is powering a new generation of Internet-accessible services using large language models. However, the Internet of Things (IoT) consists of devices that must interact with AI models without the use of human language. Physical AI uses mathematical models from physics and chemistry to enable AI reasoning about, and control of, IoT devices that interact with the physical world. These "hard science" models and simulations typically require traditional high-performance computing (HPC) techniques and are rarely applicable to the real-time or near-real-time data analysis and control applications that are common in IoT settings.
In this talk, we will present xGFabric, an end-to-end system for coupling IoT devices and actuators with HPC systems to implement "modeling-in-the-loop" applications. The xGFabric prototype is a full-stack system that uses parallel computing to execute simulations that generate training data for Physical AI "surrogate" models (Physics-Informed Neural Networks and Fourier Neural Operator models) that can make inferences on demand. The resulting models are then evaluated at the edge of the network to make real-time inferences.
The prototype uses a large-scale Computational Fluid Dynamics (CFD) simulation to train AI models for real-time decision support in a digital agriculture application that is in production use at a commercial citrus farm located in California's Central Valley. The talk will describe the xGFabric architecture and discuss experiences with deploying a production real-time HPC-Physical AI application in an agricultural setting where digital infrastructure is minimal or nonexistent.
Bio
Dr. Rich Wolski is a Professor of Computer Science at the University of California, Santa Barbara (UCSB) and co-founder of Eucalyptus Systems Inc. He received his M.S. and Ph.D. degrees from the University of California at Davis (while a research scientist at Lawrence Livermore National Laboratory) and has also held positions at the University of California, San Diego; the University of Tennessee; the San Diego Supercomputer Center; and Lawrence Berkeley National Laboratory. Rich has led several national-scale research efforts in the area of distributed systems and is the progenitor of the Eucalyptus open-source cloud project.
Research Data Storage in the RI Continuum
Over the past decade, Research Infrastructure (RI) and associated testbeds have largely developed organically. They have been designed to meet end-user and application requirements, leveraging existing techniques, architectures and solutions as required. There have been many transitions in the architecture of such systems, from stand-alone platforms (originally high-end servers, then compute clusters), to Grids of interconnected machines, to Cloud-based solutions. These transitions have oscillated between centralised and distributed designs multiple times. Recent developments address widely distributed systems in the continuum, spanning centralised systems to computing at the edge. Importantly, RI solutions require computing, storage and networking as core components. While there are well-developed formalisms and architectures for computation and networking, there has been less effort on formalising the structure of storage subsystems.
In this talk I will discuss a Research Data Reference Architecture (RDRA) that formalises storage architectures. The RDRA identifies 10 guiding principles, or attributes, that specify the goals of a research data storage system. Importantly, the RDRA does not mandate any particular solution, and leaves implementers free to choose appropriate solutions. In addition, we have developed a capability maturity model that evaluates the maturity of any particular implementation. The talk will show how the RDRA can be applied to RI in the continuum, and how systems can be engineered to meet application goals without dictating particular technologies or solutions. I will provide examples of how the RDRA has guided RI at the University of Queensland, including the merging of scratch and archival storage, which will be presented in a companion workshop paper.
Bio
David is an Emeritus Professor of Computer Science in the School of Electrical Engineering and Computer Science, at the University of Queensland. Between 2013 and 2024 he was the Director of the University of Queensland Research Computing Centre. He has held appointments at Griffith University, CSIRO, RMIT and Monash University. Prior to joining UQ, he was the Director of the Monash e-Education Centre, Science Director of the Monash e-Research Centre, and a Professor of Computer Science in the Faculty of Information Technology at Monash. From 2007 to 2011 he was an Australian Research Council Professorial Fellow. David has expertise in High Performance Computing, distributed and parallel computing, computer architecture and software engineering. He has produced in excess of 230 research publications, and some of his work has also been integrated in commercial products. One of these, Nimrod, has been used widely in research and academia globally, and is also available as a commercial product, called EnFuzion, from Axceleon. His world-leading work in parallel debugging is sold and marketed by Cray Inc, one of the world’s leading supercomputing vendors, as a product called ccdb.
Failure Risks and Mitigations for a Wide-Area Digital Nervous System
"Listen to the Land" (Whakarongo ki te whenua) is a design for a nation-wide sensor and analytics platform for fine-scale, continuous weather monitoring to support agriculture and climate resilience. Its scale requires supercomputer-level computation and communication. However, geographical dispersal and exposure to weather (winds up to 250 km/h, temperatures from -26 °C to 42 °C, rainfall up to 200 mm/hour) expose it to failures that typical supercomputers never encounter. This presentation describes the proposed system, surveys some of its failure modes, and proposes mitigations for them.
Bio
Nicolás Erdödy is the Founder and Director of Open Parallel Ltd., a globally distributed strategy and R&D consultancy specialized in next-generation high-tech ecosystems. Since 2010, Open Parallel has delivered bespoke technology projects and market development strategies for international clients, most notably contributing to the computing platform for the Square Kilometre Array (SKA) radio telescope project from 2012 to 2019 under official selection by the New Zealand Government. The core knowledge developed for the SKA now powers Nicolás’s current initiative, “Whakarongo ki te Whenua” (Listen to the Land), a massive platform concept designed for New Zealand’s Agritech and primary sectors. Furthering this intersection of data and earth sciences, Nicolás created and leads the Birds of a Feather (BoF) series “Agriculture Empowered by Supercomputing,” featured at SC23, SC24, SC25, ISC25 (Germany), and SCAsia26 (Japan). Beyond his consultancy work, Nicolás is a Director and the Chief Commercial Officer (CCO) of SADRAM, Inc., a pioneering memory architecture company. He also serves as the Conference Director of Multicore World, a globally recognized annual summit held in New Zealand since 2012.
Nicolás holds a Master of Entrepreneurship from the University of Otago (Dunedin, New Zealand) and studied Hydraulics and Fortran at the School of Engineering, Universidad de la República (Montevideo, Uruguay).