The workshop will be held in Cleveland, OH, USA on Monday, July 13, 2026, in Classroom B (Tinkham Veale University Center, Case Western Reserve University).
Note: All presenters and attendees must register via the HPDC 2026 website.
Agenda: (all times are in Eastern Daylight Time / UTC-4)
09:00 - 09:10 – Welcome Message & Speed Introduction
09:10 - 10:00 – Keynote: Why Is My I/O Slow? Rethinking Storage Systems for Explainability
Speaker: Hariharan Devarajan, Lawrence Livermore National Laboratory (USA)
10:00 - 10:15 – Paper Talk I: Eliminating Python Overhead in Predictive Neural Compression: A Native C++/LibTorch Implementation of TEZip, Mina Yousef (University of Hyogo, Japan), Amarjit Singh and Kento Sato (RIKEN CCS, Japan).
10:15 - 10:30 – Paper Talk II: I/O Optimisation at the Compiler Level: IOOpt, Adrian Jackson (The University of Edinburgh, UK).
10:30 - 11:00 – Coffee Break
11:00 - 11:15 – Paper Talk III: Lustre Query: Periodic Offline Metadata Monitoring from MDT Backups, Sohei Koyama and Osamu Tatebe (University of Tsukuba, Japan).
11:15 - 11:30 – Paper Talk IV: Analysis of I/O Subsystem Utilization on the Theta Supercomputer, Hari Teja Jajula and Purushotham Bangalore (The University of Alabama, USA).
11:30 - 11:45 – Paper Talk V: Robust I/O Characterization of Machine Learning Workloads Across Performance Analysis Tools, Zoya Masih (Georg-August Universität Göttingen / GWDG, Germany), Radita Liem (Johannes Gutenberg University Mainz, Germany) and Julian Kunkel (Georg-August Universität Göttingen / GWDG, Germany).
11:45 - 12:00 – Paper Talk VI: CoDL: A Framework for Studying Cross-Component Interference in Deep Learning Training Pipelines, Druva Dhakshinamoorthy (BITS Pilani Goa, India), Ray Sinurat (University of Chicago, USA), Nikoli Dryden (Lawrence Livermore National Laboratory, USA), Arnab K. Paul (BITS Pilani Goa, India) and Hariharan Devarajan (Lawrence Livermore National Laboratory, USA).
12:00 - 12:15 – Paper Talk VII: Metis: Agentic Knowledge Synthesis for Explainable I/O Performance in HPC Systems, Karim Youssef (Lawrence Livermore National Laboratory, USA), Sarah Neuwirth (Johannes Gutenberg University Mainz, Germany), Neeraj Rajesh (Illinois Institute of Technology, USA) and Hariharan Devarajan (Lawrence Livermore National Laboratory, USA).
12:15 - 12:30 – Closing Remarks and Discussion
12:30 – End of REX-IO Workshop Day & Lunch Break
Keynote Abstract
Modern scientific workflows, AI pipelines, and data-intensive applications depend on increasingly complex storage ecosystems that span applications, runtimes, middleware, file systems, operating systems, networks, and heterogeneous storage devices. While the HPC community has invested heavily in monitoring infrastructure, today’s tools largely answer what happened—bandwidth achieved, I/O operations performed, or time spent in a layer—but rarely explain why performance behaved the way it did. As a result, performance variability is often attributed to system noise or complexity, leaving users and administrators to rely on intuition, manual experimentation, and expert knowledge to diagnose bottlenecks.
This keynote argues that future storage systems must be designed not only to store and move data efficiently, but also to continuously explain their behavior. Through a study of modern HPC storage architectures and monitoring ecosystems, we examine the growing gap between observability and explainability. Using a Master Architectural Plan (MAP) of contemporary storage stacks, we highlight the missing links that prevent current monitoring approaches from connecting application behavior, middleware transformations, operating-system activity, and hardware interactions into coherent explanations.
Building explainable storage systems requires a fundamental shift in both monitoring and system design. Monitoring infrastructures must provide cross-layer visibility, causal event association, workload-aware context, low-overhead continuous collection, scalable analysis, and optimization-aware telemetry. At the same time, storage systems themselves must be architected with explainability as a first-class design goal: monitoring must be native rather than bolted on, systems must capture both application and system context, telemetry must span all layers of the software and hardware stack, collected data must support causal reasoning, and monitoring must ultimately target optimization rather than merely reporting metrics.
Finally, the talk presents a vision for self-explaining storage systems—systems capable of automatically detecting anomalies, explaining bottlenecks, recommending optimizations, adapting dynamically to workload behavior, exposing provenance across the storage stack, and integrating AI-driven reasoning into performance analysis. As storage architectures become increasingly software-defined, heterogeneous, and autonomous, explainability will become as critical a design objective as performance, scalability, and reliability. The next generation of storage systems should not only move data—they should explain themselves.
Speaker Bio: Hariharan Devarajan (Lawrence Livermore National Laboratory, USA)
Hariharan Devarajan is a computer scientist in the Parallel Systems Group in the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory. His research on large-scale distributed systems focuses on data management for extreme-scale systems. His other research interests include scalable performance analysis, bottleneck detection and tuning, and data management in distributed deep-learning workloads. Hariharan has worked at LLNL since 2021 and has received best paper awards at HPDC and CCGrid conferences.
Hariharan's current research focuses on workload-aware storage for HPC systems. He contributes to several projects within the laboratory, such as UnifyFS, the Scalable Checkpoint/Restart Library (SCR) project, FRACTALE, and GOTCHA. He is the task lead for I/O in the FRACTALE project, where he guides research directions in data management for MD workflows, describes complex I/O patterns within the Flux scheduler, and works on improving utilization through data-aware job scheduling. These projects work towards a fundamental understanding of current and emerging needs for data management in HPC systems.