Research

Research Interests

My research sits at the intersection of distributed systems and high-performance computing (HPC). I study how multiple HPC facilities can collaborate through event-driven architectures to solve complex problems with greater resilience and efficiency. To that end, we are developing a hierarchical event fabric that supports both scientific applications (AI-guided simulation campaigns, time-sensitive data analysis pipelines, and distributed data integration) and resilience-enabling solutions (a policy engine, resilient compute pools, and resilient data views). For more details on this DOE-funded Diaspora project, see the project description and publications c17 and c18 in my CV.

Publications

Available on Google Scholar and in my CV.

Repositories, Artifacts, and Technical Reports

2024

[FTXS'24] Octopus: Experiences with a Hybrid Event-Driven Architecture for Distributed Scientific Computing. pre-print
paper | project descriptions [1, 2] | diaspora SDK | diaspora service repo | docs and demos | SDK walkthrough | evaluation methodology

[NRDPISI-1] Diaspora: Resilience-Enabling Services for Real-Time Distributed Workflows.
paper | project descriptions [1, 2] | diaspora SDK | diaspora service repo | docs and demos

[eScience'24] An Empirical Investigation of Container Building Strategies and Warm Times to Reduce Cold Starts in Scientific Computing Serverless Functions. paper | Globus compute dataset | Binder dataset

[eScience'24] TaPS: A Performance Evaluation Suite for Task-based Execution Frameworks. paper | repo | docs

[FGCS Vol. 153] The Globus Compute Dataset: An Open Function-as-a-Service Dataset From the Edge to the Cloud. paper | dataset


2022

[OSDI'22] Cancellation in Systems: An Empirical Study of Task Cancellation Patterns and Failures. paper | poster | codebase | video

[ICC'22] Reliable Broadcast in Critical Applications: Asset Transfer and Smart Home. paper


2021

[SOSP'21] Rabia: Simplifying State-Machine Replication Through Randomization. paper | poster | video | codebase | tech. report

[ICDCN'21] Practical Experience Report: Cassandra+: Trading-Off Consistency, Latency, and Fault-tolerance in Cassandra. paper | tech. report | codebases


2020

[Computer Networks Vol. 182] Reliable broadcast with trusted nodes: Energy reduction, resilience, and speed. paper | codebase

[GLOBECOM'20] BBB: A Lightweight Approach to Evaluate Private Blockchains in Clouds. paper | video | codebase

[NCA'20] CassandrEAS: Highly Available and Storage-Efficient Distributed Key-Value Store with Erasure Coding. paper | codebases

[Manuscript] Reliable Broadcast in Practical Networks: Algorithm and Evaluation. paper

[PerVehicle'20] Make Multi-hop Broadcast in VANET Fast by Selecting a Better Route for Source Vehicle. paper | slides | codebase

[DUCSAN'20] Tutorial: Google Cloud for Beginners: Architecture, Storage, and Computation. paper | video | slides | instruction

[DUCSAN'20] Tutorial: Deep Dive into Apache Cassandra: Theory, Design, and Application. paper | slides

[DUCSAN'20] LiteDoc: Make Collaborative Editing Fast, Scalable, and Robust. paper | codebases


2019

[GLOBECOM'19] Reliable Broadcast in Networks with Trusted Nodes. paper | codebase

[PRDC'19] BBB: Make Benchmarking Blockchains Configurable and Extensible. paper | codebase

[NCA'19] Distributed Causal Memory in the Presence of Byzantine Servers. paper | audio | slides

[Sarnoff'19] A First Step Towards Production-Ready Network Function Storage: Benchmarking with NFSB. paper | codebase