
Dong Dai (代栋)

I am an Associate Professor in the Department of Computer and Information Sciences (CIS) at the University of Delaware, where I work on data-intensive and high-performance systems. My research interests are in designing and optimizing intelligent infrastructure for high-performance data-intensive systems, such as parallel file systems, metadata management, graph storage, and resource management. I direct the Data Intelligence Research Lab (DIRLab).

The complete paper list can be found on Google Scholar or in my CV. Below is a list of representative publications. (Note: * marks Ph.D., Master's, or undergraduate students mentored by me.)

Publications

ION: Navigating the HPC I/O Optimization Journey using Large Language Models

Chris Egersdoerfer*, Arnav Sareen*, Jean Luca Bez, Suren Byna, Dong Dai

HotStorage, 2024

Github Repo, Talk (TBA)

In this study, we explore how large language models can help interpret complex I/O logs and diagnose potential I/O issues.

Cross-System Analysis of Job Characterization and Scheduling in Large-Scale Computing Clusters

Di Zhang*, Monish Soundar Raj*, Bing Xie, Sheng Di, Dong Dai

IPDPS, 2024

Github Repo, Web App, Talk

We conduct a comparative study of multiple workloads in HPC and AI clusters. From this analysis, we derive eight takeaways that can guide the design of better schedulers.

DGAP: Efficient Dynamic Graph Analysis on Persistent Memory

Abdullah Al Raqibul Islam*, Dong Dai

SC, 2023

Github Repo

We propose a novel graph analysis framework, DGAP, to efficiently support dynamic graph analysis on Optane persistent memory. DGAP leverages a PMA-based array and PMEM-specific optimizations to deliver better performance than state-of-the-art solutions.

A Reinforcement Learning Based Backfilling Strategy for HPC Batch Jobs

The original paper has been updated with minor corrections.

Elliot Kolker-Hicks*, Di Zhang*, Dong Dai

PMBS@SC, 2023

Github Repo (code has been updated)

We show that better job runtime prediction does not always lead to better backfilling, and propose using reinforcement learning to learn an optimized backfilling strategy.

Early Exploration of Using ChatGPT for Log-based Anomaly Detection on Parallel File Systems Logs

Chris Egersdoerfer*, Di Zhang*, Dong Dai

HPDC (poster), 2023

We explore using ChatGPT for log-based anomaly detection on parallel file system logs. It shows promising accuracy and understanding of the logs.


Drill: Log-based Anomaly Detection for Large-scale Storage Systems Using Source Code Analysis

Di Zhang*, Chris Egersdoerfer*, Tabassum Mahmud, Mai Zheng, Dong Dai

IPDPS, 2023

Github Repo / talk

Drill is a state-of-the-art log-based anomaly detection system for large-scale storage systems that uses both the content and the context of the logs.


FaultyRank: A Graph-based Parallel File System Checker

Saisha Kamat*, Abdullah Al Raqibul Islam*, Mai Zheng, Dong Dai

IPDPS, 2023

Github Repo / talk

FaultyRank is the first graph-based parallel file system checker that can detect and fix metadata inconsistencies and corruptions. It runs faster and is more accurate than the state-of-the-art checker.

VCSR: Mutable CSR Graph Format Using Vertex-Centric Packed Memory Array

Abdullah Al Raqibul Islam*, Dazhao Cheng, Dong Dai

CCGRID, 2022

Github Repo / talk

VCSR is a new mutable CSR graph format built on a packed memory array. Its vertex-centric design enables fast graph updates and efficient graph traversal.


SchedInspector: A Batch Job Scheduling Inspector Using Reinforcement Learning

Di Zhang*, Dong Dai, Bing Xie

HPDC, 2022

Github Repo / talk

SchedInspector uses reinforcement learning to opportunistically delay ready jobs, improving the overall performance of existing job scheduling policies.

SentiLog: Anomaly Detecting on Parallel File Systems via Log-based Sentiment Analysis

Di Zhang*, Dong Dai, Runzhou Han, Mai Zheng

HotStorage, 2021, Best Paper Nominee

Github Repo / talk

SentiLog uses sentiment analysis of logs to detect anomalies in parallel file systems.

RLScheduler: An Automated HPC Batch Job Scheduler Using Reinforcement Learning

Di Zhang*, Dong Dai, Youbiao He, Forrest Sheng Bao, Bing Xie

SC, 2020

Github Repo / talk


RLScheduler uses reinforcement learning (PPO) to automatically learn a batch job scheduler. It achieves the best flexibility, performance, and adaptability among all the schedulers.


A Performance Study of Optane Persistent Memory: From Indexing Data Structures’ Perspective

Abdullah Al Raqibul Islam*, Anirudh Narayanan*, Christopher York*, Dong Dai

MSST, 2020

Github Repo 

In this paper, we systematically evaluated the performance of indexing data structures on Intel Optane persistent memory and obtained interesting observations.

Services

...(Full List in my CV)

Teaching


Funding