Dong Dai, 代栋

Email: dai @ udel
Address: Fintech 416B
Phone: 302-831-0890

I am an Associate Professor in the Department of Computer and Information Sciences (CIS) at the University of Delaware, where I work on data-intensive and high-performance systems. My research focuses on designing and optimizing intelligent infrastructure for high-performance data-intensive systems, such as parallel file systems, metadata management, graph storage, and resource management. I direct the Data Intelligence Research Lab (DIRLab).

The complete paper list is available on Google Scholar or in my CV. Below is a list of representative publications. (Note: * marks Ph.D., Master's, or undergraduate students mentored by me.)

Publications

ION: Navigating the HPC I/O Optimization Journey using Large Language Models

Chris Egersdoerfer*, Arnav Sareen*, Jean Luca Bez, Suren Byna, Dong Dai,

HotStorage, 2024

Github Repo, Talk (TBA)

In this study, we explored how large language models can help understand complex I/O logs and diagnose potential I/O issues.

Cross-System Analysis of Job Characterization and Scheduling in Large-Scale Computing Clusters

Di Zhang*, Monish Soundar Raj*, Bing Xie, Sheng Di, Dong Dai,

IPDPS, 2024

Github Repo, Web App, Talk

We conduct a comparative study of multiple workloads in HPC and AI clusters. Based on the analysis, we derive eight takeaways that can be used to design better schedulers.

DGAP: Efficient Dynamic Graph Analysis on Persistent Memory

Abdullah Al Raqibul Islam*, Dong Dai,

SC, 2023

Github Repo

We propose a novel graph analysis framework, DGAP, to efficiently support dynamic graph analysis on Optane persistent memory. DGAP leverages a PMA-based array and PMEM-specific optimizations to deliver better performance than state-of-the-art solutions.

A Reinforcement Learning Based Backfilling Strategy for HPC Batch Jobs

The original paper has been updated with minor corrections.

Elliot Kolker-Hicks*, Di Zhang*, Dong Dai,

PMBS@SC, 2023

Github Repo (code has been updated)

We show that better job runtime prediction does not always lead to better backfilling, and propose using reinforcement learning to learn an optimized backfilling strategy.

Early Exploration of Using ChatGPT for Log-based Anomaly Detection on Parallel File Systems Logs

Chris Egersdoerfer*, Di Zhang*, Dong Dai,

HPDC (poster), 2023

We explore using ChatGPT for log-based anomaly detection on parallel file system logs. It shows promising accuracy and understanding of the logs.

Drill: Log-based Anomaly Detection for Large-scale Storage Systems Using Source Code Analysis

Di Zhang*, Chris Egersdoerfer*, Tabassum Mahmud, Mai Zheng, Dong Dai,

IPDPS, 2023

Github Repo / talk

Drill is a state-of-the-art log-based anomaly detection system for large-scale storage systems that uses both the content and the context of the logs.

FaultyRank: A Graph-based Parallel File System Checker

Saisha Kamat*, Abdullah Al Raqibul Islam*, Mai Zheng, Dong Dai,

IPDPS, 2023

Github Repo / talk

FaultyRank is the first graph-based parallel file system checker that can detect and fix metadata inconsistencies and corruptions. It runs faster and is more accurate than the state-of-the-art checker.

VCSR: Mutable CSR Graph Format Using Vertex-Centric Packed Memory Array

Abdullah Al Raqibul Islam*, Dazhao Cheng, Dong Dai,

CCGRID, 2022

Github Repo / talk

VCSR is a new mutable CSR graph format based on a packed memory array. Its new vertex-centric design enables fast graph updates and efficient graph traversal.

SchedInspector: A Batch Job Scheduling Inspector Using Reinforcement Learning

Di Zhang*, Dong Dai, Bing Xie,

HPDC, 2022

Github Repo / talk

SchedInspector uses reinforcement learning to opportunistically delay ready jobs, improving the overall performance of existing job scheduling policies.

A Study of Failure Recovery and Logging of High-Performance Parallel File Systems

Runzhou Han, Om Rameshwar Gatla, Mai Zheng, Jinrui Cao, Di Zhang*, Dong Dai, Yong Chen, Jonathan Cook,

TOS, 2022

A comprehensive review of PFault, our ICS'18 paper.

SentiLog: Anomaly Detecting on Parallel File Systems via Log-based Sentiment Analysis

Di Zhang*, Dong Dai, Runzhou Han, Mai Zheng,

HotStorage, 2021, Best Paper Nominee

Github Repo / talk

SentiLog proposes to use sentiment analysis to detect anomalies on parallel file systems.

RLScheduler: An Automated HPC Batch Job Scheduler Using Reinforcement Learning

Di Zhang*, Dong Dai, Youbiao He, Forrest Sheng Bao, Bing Xie,

SC, 2020

Github Repo / talk

RLScheduler uses reinforcement learning (PPO) to automatically learn a batch job scheduler. It achieves the best flexibility, performance, and adaptability among all the schedulers.

A Performance Study of Optane Persistent Memory: From Indexing Data Structures’ Perspective

Abdullah Al Raqibul Islam*, Anirudh Narayanan*, Christopher York*, Dong Dai,

MSST, 2020

Github Repo

In this paper, we systematically evaluated the performance of indexing data structures on Intel Optane persistent memory and obtained interesting observations.

Services

...(Full List in my CV)

Teaching

TBA

ITCS 5145 Parallel Computing, Graduate Course, Fall 2023, Fall 2022, Fall 2021, Spring 2020, Spring 2019
ITCS 6050/8050 Machine Learning for Efficient Computing Systems, Graduate Course, Spring 2023
ITCS 6144/8144 Operating Systems Design, Graduate Course, Spring 2019, Fall 2018
ITCS 3181 Intro to Comp Architecture, Undergraduate Course, Spring 2022, Fall 2021, Spring 2021, Fall 2020
ITSC 3050 Undergraduate Research Initiative, Undergraduate Course, Spring 2023, Fall 2023
