I am an Associate Professor at the University of Delaware in the Department of Computer and Information Sciences (CIS), where I work on data-intensive and high-performance systems. My research interests are in optimizing and designing intelligent infrastructure for high-performance data-intensive systems, such as parallel file systems, metadata management, graph storage, and resource management. I direct the Data Intelligence Research Lab (DIRLab).
The complete paper list can be seen in Google Scholar or my CV. Below is a list of representative publications. (Note: * marks Ph.D., Master's, or undergraduate students mentored by me.)
Publications
ION: Navigating the HPC I/O Optimization Journey using Large Language Models
Chris Egersdoerfer*, Arnav Sareen*, Jean Luca Bez, Suren Byna, Dong Dai,
HotStorage, 2024
GitHub Repo, Talk (TBA)
In this study, we explored how large language models can help understand complex I/O logs and diagnose potential I/O issues.
Cross-System Analysis of Job Characterization and Scheduling in Large-Scale Computing Clusters
Di Zhang*, Monish Soundar Raj*, Bing Xie, Sheng Di, Dong Dai,
IPDPS, 2024
We conduct a comparative study of multiple workloads in HPC and AI clusters. Based on the analysis, we derive eight takeaways that can be used to design better schedulers.
DGAP: Efficient Dynamic Graph Analysis on Persistent Memory
Abdullah Al Raqibul Islam*, Dong Dai,
SC, 2023
We propose a novel graph analysis framework, DGAP, to efficiently support dynamic graph analysis on Optane persistent memory. DGAP leverages a PMA-based array and PMEM-specific optimizations to deliver better performance than state-of-the-art solutions.
A Reinforcement Learning Based Backfilling Strategy for HPC Batch Jobs
The original paper has been updated with minor corrections.
Elliot Kolker-Hicks*, Di Zhang*, Dong Dai,
PMBS@SC, 2023
GitHub Repo (code has been updated)
We show that better job runtime prediction does not always lead to better backfilling, and propose using reinforcement learning to learn an optimized backfilling strategy.
Early Exploration of Using ChatGPT for Log-based Anomaly Detection on Parallel File Systems Logs
Chris Egersdoerfer*, Di Zhang*, Dong Dai,
HPDC (poster), 2023
We explore using ChatGPT for log-based anomaly detection on parallel file system logs. It shows promising accuracy and understanding of the logs.
Drill: Log-based Anomaly Detection for Large-scale Storage Systems Using Source Code Analysis
Di Zhang*, Chris Egersdoerfer*, Tabassum Mahmud, Mai Zheng, Dong Dai,
IPDPS, 2023
Drill is a state-of-the-art log-based anomaly detection system for large-scale storage systems that uses both the content and the context of the logs.
FaultyRank: A Graph-based Parallel File System Checker
Saisha Kamat*, Abdullah Al Raqibul Islam*, Mai Zheng, Dong Dai,
IPDPS, 2023
FaultyRank is the first graph-based parallel file system checker that can detect and fix metadata inconsistencies and corruptions. It runs faster and is more accurate than the state-of-the-art checker.
VCSR: Mutable CSR Graph Format Using Vertex-Centric Packed Memory Array
Abdullah Al Raqibul Islam*, Dazhao Cheng, Dong Dai,
CCGRID, 2022
VCSR is a new mutable CSR graph format using a packed memory array. Its new vertex-centric design enables fast graph updates and efficient graph traversal.
SchedInspector: A Batch Job Scheduling Inspector Using Reinforcement Learning
Di Zhang*, Dong Dai, Bing Xie,
HPDC, 2022
SchedInspector opportunistically delays ready jobs to improve the overall performance of the existing job scheduling policies via reinforcement learning.
A Study of Failure Recovery and Logging of High-Performance Parallel File Systems
Runzhou Han, Om Rameshwar Gatla, Mai Zheng, Jinrui Cao, Di Zhang*, Dong Dai, Yong Chen, Jonathan Cook,
TOS, 2022
A comprehensive review of PFault, our ICS'18 paper.
SentiLog: Anomaly Detecting on Parallel File Systems via Log-based Sentiment Analysis
Di Zhang*, Dong Dai, Runzhou Han, Mai Zheng,
HotStorage, 2021, Best Paper Nominee
SentiLog proposes to use sentiment analysis to detect anomalies on parallel file systems.
RLScheduler: An Automated HPC Batch Job Scheduler Using Reinforcement Learning
Di Zhang*, Dong Dai, Youbiao He, Forrest Sheng Bao, Bing Xie,
SC, 2020
RLScheduler uses reinforcement learning (PPO) to automatically learn a batch job scheduler. It achieves the best flexibility, performance, and adaptability among all the schedulers.
A Performance Study of Optane Persistent Memory: From Indexing Data Structures’ Perspective
Abdullah Al Raqibul Islam*, Anirudh Narayanan*, Christopher York*, Dong Dai,
MSST, 2020
In this paper, we systematically evaluated the performance of indexing data structures on Intel Optane persistent memory and obtained interesting observations.
Services
...(Full List in my CV)
Teaching
TBA
ITCS 5145 Parallel Computing, Graduate Course, Fall 2023, Fall 2022, Fall 2021, Spring 2020, Spring 2019
ITCS 6050/8050 Machine Learning for Efficient Computing Systems, Graduate Course, Spring 2023
ITCS 6144/8144 Operating Systems Design, Graduate Course, Spring 2019, Fall 2018
ITCS 3181 Intro to Comp Architecture, Undergraduate Course, Spring 2022, Fall 2021, Spring 2021, Fall 2020
ITSC 3050 Undergraduate Research Initiative, Undergraduate Course, Spring 2023, Fall 2023