Hermes: A Heterogeneous-Aware Multi-Tiered Distributed I/O Buffering System

Hermes Architecture

  • Each compute node has access to
    • Local NVMe or SSD device
    • Shared Burst Buffers
    • Remote disk-based PFS
  • Hierarchy based on speed and capacity (numbered in figure)
  • Two data paths:
    • Vertical (within node)
    • Horizontal (across nodes)


  • Native support for selective buffering in:
    • Main memory (local RAM)
    • Remote main memory via RDMA
    • Burst buffers (NVRAM, SSDs)
  • Buffering optimizations:
    • Locality-aware
    • Topology-aware
    • Active buffering (by offloading computations on buffers)
    • User-defined
  • Supports intelligent caching for faster read operations
    • Hot data
    • Prefetching

Modern High-Performance Computing (HPC) systems are adding extra layers to the memory and storage hierarchy, named the deep memory and storage hierarchy (DMSH), to increase I/O performance. New hardware technologies, such as NVMe and SSDs, have been introduced in burst buffer installations to reduce the pressure on external storage and better absorb the burstiness of modern I/O workloads. The DMSH has demonstrated its strength and potential in practice. However, each layer of the DMSH is an independent heterogeneous system, and data movement across multiple layers is significantly more complex even before heterogeneity is considered. How to efficiently utilize the DMSH is an open research question facing the HPC community. In this paper, we present the design and implementation of Hermes: a new, heterogeneous-aware, multi-tiered, dynamic, and distributed I/O buffering system. Hermes enables, manages, supervises, and, in some sense, extends I/O buffering to fully integrate into the DMSH. We introduce three novel data placement policies to efficiently utilize all layers, and we present three novel techniques to perform memory, metadata, and communication management in hierarchical buffering systems. Our evaluation shows that, in addition to automatic data movement through the hierarchy, Hermes can significantly accelerate I/O and outperforms state-of-the-art buffering platforms by more than 2x.

Design and Architecture

Hermes Software stack

  • Middleware library, written in C++
  • Links with applications (i.e., re-compile or LD_PRELOAD)
  • Wraps around I/O calls
  • Modular, extensible, performance-oriented
  • Supports:
    • POSIX
    • HDF5
    • MPI-IO (ongoing)
  • Hinting mechanism to pass operations

Node design

  • Node Manager
    • Dedicated multithreaded core per node
    • MDM
    • Data Organizer
    • Messaging Service
    • Memory management
    • Prefetcher
    • Cache manager
  • RDMA-capable communication
  • Can be deployed in the I/O forwarding layer (FL)

Hermes Buffering Modes


  • Persistent
    • Synchronous: write-through cache, stage-in
    • Asynchronous: write-back cache, stage-out
  • Non-persistent (scratch)
    • Temporary scratch space
    • Intermediate results
    • In-situ analysis and visualization
  • Bypass
    • Write-around cache

Hermes Data placement policies

Max Bandwidth

  • Maximize the I/O performance applications experience
  • Top-down approach: place data higher up and trickle down
  • Balance between bandwidth, latency, and capacity of layer
  • Default in Hermes

Maximum Data Locality

  • Maximize buffer utilization
  • Sideways approach: place data across all layers based on a dispersion unit
  • Balance between capacity and data’s spatial locality
  • Ideal for workflows that encapsulate partitioned I/O

Hot Data

  • Offer applications a fast cache for frequently accessed data
  • Hotness score based on file access frequency
  • Place hot data higher up in the hierarchy
  • Ideal for workflows with a spectrum of hot-cold data


User-Defined

  • Supports user-defined buffering schemas
  • Users submit an XML with requirements
  • Parsed during initialization
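A user-submitted schema might look like the fragment below. The element and attribute names are hypothetical, chosen only to illustrate the kind of requirements a user could express; they are not Hermes' actual XML vocabulary:

```xml
<!-- Hypothetical shape of a user-defined buffering schema;
     element names are illustrative, not Hermes' actual format. -->
<buffering_schema>
  <file pattern="checkpoint_*.h5">
    <tier name="nvme" percentage="70"/>
    <tier name="burst_buffer" percentage="30"/>
    <mode>asynchronous</mode>
  </file>
</buffering_schema>
```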

Evaluation results

Hermes library evaluation

RAM management

1 million fwrite() calls of various sizes; measured memory operations per second

Metadata management

1 million metadata operations; measured MDM throughput in ops/sec


Communication management

1 million queue operations; measured messaging rate in msg/sec

MPI shared dynamic memory window exposed in all nodes

MPI_Put(), MPI_Get() (if RDMA is present, MPI uses it)

No need for dedicated server

Indexing of windows for fast querying

Complex data structures

Update operations acquire MPI_LOCK_EXCLUSIVE, which ensures FIFO consistency

The entire window, with its index, is mmap'ed for fault tolerance

Workload evaluation

Alternating Compute-I/O

8x higher write performance on average

Repetitive read operations

11x higher read performance for repetitive patterns


5x higher write performance on average


7.5x higher read performance for repetitive patterns


  • Hermes hides flushing behind compute (similar to Data Elevator)
  • Hermes uses concurrent flushing
  • Hermes also hides data movement between layers behind compute
  • Hermes leverages the extra layers of the DMSH to offer higher BW


Q: The DPE policies rely on the fact that users know the behavior of their application in advance which can be a bold assumption.

A: That is true. We suggest using profiling tools beforehand to learn the application's behavior and tune Hermes accordingly. The default policy works well.

Q: How does Hermes integrate to modern HPC environments?

A: As of now, applications link to Hermes (re-compile or dynamic linking). We envision a system scheduler that also incorporates buffering resources.

Q: How are Hermes’ policies applied in multi-user environments?

A: Hermes’ Application Orchestrator was designed for multi-tenant environments. This work is described in Vidya: Performing Code-Block I/O Characterization for Data Access Optimization.

Q: What is the impact of the asynchronous data reorganization?

A: It can be severe but in scenarios where there is some computation in between I/O then it can work nicely to our advantage.

Q: What is the metadata size?

A: In our evaluation, for 1 million user files, the metadata created were 1.1GB.

Q: How to balance the data distribution across different compute nodes especially when the I/O load is imbalanced across nodes?

A: Hermes’ System Profiler provides the current status of the system (i.e., remaining capacity, etc) and DPE is aware of this before it places data in the DMSH.

Q: How to minimize extra network traffic caused by horizontal data movement?

A: Horizontal data movement can interfere with normal compute traffic. RDMA-capable machines can help. We also suggest using the "service class" feature of the InfiniBand network to apply priorities in the network.

Q: How is the limited RAM space partitioned between applications and Hermes?

A: Configurable by the user; it is a typical trade-off. Giving more RAM to Hermes can lead to higher performance; giving it none means the RAM layer is skipped.

Q: What is Hermes API?

A: Hermes captures existing I/O calls. Our own API is very simple, consisting of hermes::read(…, flags) and hermes::write(…, flags). The flag system implements active buffering semantics (currently only for the burst buffer nodes).

Q: How difficult is it to tune Hermes' configuration parameters?

A: We expose a configuration_manager class which is used to pass several Hermes’ configuration parameters.


This work was supported by the National Science Foundation under grants no.



and CNS-0751200.