Catamaran Resort, San Diego, CA
2:00pm - 2:10pm
Jean Luca Bez, LBNL
2:10pm - 3:00pm
Christine Kirkpatrick, San Diego Supercomputer Center
The rapid adoption of LLMs and Gen-AI content creators has driven interest in accelerating the use of deep learning in research. One of the gaps has been the availability of models built on high-quality, contextualized data. The idea of “AI-ready” data has become a common aspiration across research domains. The research data management community has been steadily creating and applying data standards and interoperability concepts in service of science; over the last decade, these efforts have coalesced around the FAIR (Findable, Accessible, Interoperable, Reusable) Principles. Is FAIR synonymous with fully AI-ready, or is that equivalence more myth than yet-to-be-unearthed truth? This talk examines the promises and limitations of applying the FAIR Principles to data, definitions of AI readiness, and emerging approaches to asserting suitability in deep learning contexts. It will also share signals and updates from the NSF-funded research coordination network FARR (FAIR in Machine Learning, AI Readiness, AI Reproducibility), exploring related areas and challenges at the intersection of AI and Open Science.
3:00pm - 3:30pm
Fernanda Foertter, University of Alabama
Data often exist in formats or locations that make them difficult to leverage for advanced analytics or AI applications. And because they lie hidden, it is difficult to tell whether certain data hold value. The first part of this talk will walk through research done for the NIH defining the value of data. The second part will cover tools and frameworks that can be used to convert raw, inaccessible, or unstructured data into high-quality, AI-ready datasets. Leveraging techniques such as schema inference, automated feature extraction, and knowledge graph construction, this talk will demonstrate how ML/AI models can be trained not only to process but also to understand hidden datasets. The research will highlight the importance of data context, interoperability, and metadata generation in opening up and revealing untapped value from dark data.
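Schema inference, one of the techniques mentioned in the abstract, can be illustrated with a minimal sketch. The function name and the int-to-float-to-str promotion rule below are illustrative assumptions for string-valued records (e.g., parsed CSV rows), not part of any tool discussed in the talk:

```python
def infer_column_types(rows):
    """Infer a simple per-column schema ("int", "float", or "str")
    for a list of dicts whose values are strings, as in parsed CSV."""
    def narrowest_type(value):
        # Try the strictest cast first, then fall back.
        for caster, name in ((int, "int"), (float, "float")):
            try:
                caster(value)
                return name
            except ValueError:
                continue
        return "str"

    # Promote each column to the loosest type seen: int < float < str.
    order = {"int": 0, "float": 1, "str": 2}
    schema = {}
    for row in rows:
        for column, value in row.items():
            seen = narrowest_type(value)
            current = schema.get(column, "int")
            if order[seen] >= order[current]:
                schema[column] = seen
    return schema
```

Real dark-data pipelines would also have to infer dates, units, missing-value markers, and nested structure, but the promotion pattern is the same.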
3:30pm - 4:00pm
4:00pm - 4:30pm
Wesley Brewer, Patrick Widener, Valentine Anantharaj, Feiyi Wang, Tom Beck, Arjun Shankar, Sarp Oral
This paper examines how Data Readiness for AI (DRAI) principles apply to leadership-scale scientific datasets used to train foundation models. We analyze archetypal workflows across four representative domains—climate, nuclear fusion, bio/health, and materials—to identify common preprocessing patterns and domain-specific constraints. We introduce a two-dimensional readiness framework composed of Data Readiness Levels (raw to AI-ready) and Data Processing Stages (ingest to shard), both tailored to high-performance computing (HPC) environments. This framework helps outline key challenges in transforming large-scale scientific data into formats suitable for scalable AI training. Together, these dimensions form a conceptual maturity matrix that characterizes scientific data readiness and guides infrastructure development toward standardized, cross-domain support for scalable and reproducible AI for science.
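A two-dimensional maturity matrix of this kind can be pictured as a pair of ordered axes. The intermediate level and stage names below are placeholders guessed from the abstract's "raw to AI-ready" and "ingest to shard" phrasing, not the paper's actual taxonomy:

```python
from enum import IntEnum

class ReadinessLevel(IntEnum):
    """Rows of the matrix: "raw to AI-ready" (intermediate names assumed)."""
    RAW = 0
    CLEANED = 1
    CURATED = 2
    AI_READY = 3

class ProcessingStage(IntEnum):
    """Columns of the matrix: "ingest to shard" (intermediate names assumed)."""
    INGEST = 0
    TRANSFORM = 1
    VALIDATE = 2
    SHARD = 3

def maturity_cell(level, stage):
    """Locate a dataset in the (level, stage) maturity matrix."""
    return (ReadinessLevel(level).name, ProcessingStage(stage).name)
```

Because both axes are ordered, a workflow's progress can be tracked as a monotone path from (RAW, INGEST) toward (AI_READY, SHARD).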
4:30pm - 5:00pm
Olivera Kotevska, ORNL
The ability to analyze data responsibly while preserving privacy is a foundational requirement for trustworthy AI. This talk explores PETINA and PRESTO, two ORNL-developed tools designed to make differential privacy practical and adaptive. PETINA provides the core capabilities for performing private data analysis, while PRESTO serves as an intelligent recommendation engine, guiding users toward optimal privacy-preserving configurations based on dataset features and privacy-utility trade-offs. By integrating automated optimization into privacy workflows, these tools provide a practical pathway for embedding differential privacy into AI systems without compromising usability or performance.
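As a hedged illustration of the kind of primitive such tools build on (not PETINA's or PRESTO's actual API), the classic Laplace mechanism releases a numeric query result with epsilon-differential privacy by adding noise scaled to sensitivity/epsilon:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value under epsilon-differential privacy by adding
    Laplace noise with scale = sensitivity / epsilon."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)
```

The privacy-utility trade-off is visible in the scale: a smaller epsilon (stronger privacy) means more noise, which is exactly the kind of configuration choice a recommendation engine like PRESTO is described as helping users navigate.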
5:00pm - 5:30pm
Shubhabrata Mukherjee, Jack Lang, Obeen Kwon, Iryna Zenyuk, Valerie Brogden, Adam Weber, Daniela Ushizima
Zero-shot and prompt-based models have excelled at visual reasoning tasks by leveraging large-scale natural image corpora, but they often fail on sparse and domain-specific scientific image data. We introduce Zenesis, a no-code interactive computer vision platform designed to reduce data readiness bottlenecks in scientific imaging workflows. Zenesis integrates lightweight multimodal adaptation for zero-shot inference on raw scientific data, human-in-the-loop refinement, and heuristic-based temporal enhancement. We validate our approach on Focused Ion Beam Scanning Electron Microscopy (FIB-SEM) datasets of catalyst-loaded membranes. Zenesis outperforms baselines, achieving an average accuracy of 0.947, Intersection over Union (IoU) of 0.858, and Dice score of 0.923 on amorphous catalyst samples; and 0.987 accuracy, 0.857 IoU, and 0.923 Dice on crystalline samples. These results represent a significant performance gain over conventional methods such as Otsu thresholding and standalone models like the Segment Anything Model (SAM). Zenesis enables effective image segmentation in domains where annotated datasets are limited, offering a scalable solution for scientific discovery.
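The IoU and Dice scores reported above have standard definitions for binary segmentation masks; a minimal sketch (function and array names are illustrative, not Zenesis code):

```python
import numpy as np

def iou_and_dice(pred, target):
    """Compute Intersection-over-Union and Dice score for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    # Empty masks on both sides count as a perfect match.
    iou = intersection / union if union else 1.0
    dice = 2 * intersection / total if total else 1.0
    return float(iou), float(dice)
```

Dice weights the overlap region more heavily than IoU (Dice = 2·IoU/(1+IoU)), which is why the reported Dice scores sit above the corresponding IoU values.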
5:30pm - 5:35pm
Jean Luca Bez, LBNL
DRAI 2025 | Data Readiness for AI (DRAI) Workshop | Bez, Byna