Catamaran Resort, San Diego, CA
2:00pm - 2:10pm
Jean Luca Bez, LBNL
2:10pm - 3:00pm
Christine Kirkpatrick, San Diego Supercomputer Center
The rapid adoption of LLMs and Gen-AI content creators has driven interest in accelerating the use of deep learning in research. One of the gaps has been the availability of models built on high-quality, contextualized data. The idea of “AI-ready” data has become a common aspiration across research domains. The research data management community has been steadily creating and applying data standards and interoperability concepts in service of science; over the last decade, these efforts have coalesced around the FAIR (Findable, Accessible, Interoperable, Reusable) Principles. Is FAIR synonymous with fully AI-ready, or is that equivalence more myth than yet-to-be-unearthed truth? This talk examines the promises and limitations of applying the FAIR Principles to data, definitions of AI readiness, and emerging approaches to asserting suitability in deep learning contexts. It will also share signals and updates from the NSF-funded research coordination network FARR (FAIR in Machine Learning, AI Readiness, AI Reproducibility), exploring related areas and challenges at the intersection of AI and Open Science.
3:00pm - 3:30pm
Fernanda Foertter, University of Alabama
Data often exist in formats or locations that make them difficult to leverage for advanced analytics or AI applications. And because they lie hidden, it is difficult to tell whether certain data hold value. The first part of this talk will walk through research done for the NIH defining the value of data. The second part will cover tools and frameworks that can be used to convert raw, inaccessible, or unstructured data into high-quality, AI-ready datasets. Leveraging techniques such as schema inference, automated feature extraction, and knowledge graph construction, this talk will demonstrate how ML/AI models can be trained not only to process but also to understand hidden datasets. The research will highlight the importance of data context, interoperability, and metadata generation in opening up and revealing untapped value from dark data.
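Schema inference, one of the techniques mentioned in the abstract, can be illustrated with a minimal sketch. The function name and the int-to-float-to-str promotion rule below are illustrative assumptions for string-valued records (e.g., parsed CSV rows), not part of any tool discussed in the talk:

```python
def infer_column_types(rows):
    """Infer a simple per-column schema ("int", "float", or "str")
    for a list of dicts whose values are strings, as in parsed CSV."""
    def narrowest_type(value):
        # Try the strictest cast first, then fall back.
        for caster, name in ((int, "int"), (float, "float")):
            try:
                caster(value)
                return name
            except ValueError:
                continue
        return "str"

    # Promote each column to the loosest type seen: int < float < str.
    order = {"int": 0, "float": 1, "str": 2}
    schema = {}
    for row in rows:
        for column, value in row.items():
            seen = narrowest_type(value)
            current = schema.get(column, "int")
            if order[seen] >= order[current]:
                schema[column] = seen
    return schema
```

Real dark-data pipelines would also have to infer dates, units, missing-value markers, and nested structure, but the promotion pattern is the same.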
3:30pm - 4:00pm
4:00pm - 4:30pm
Wesley Brewer, Patrick Widener, Valentine Anantharaj, Feiyi Wang, Tom Beck, Arjun Shankar, Sarp Oral
This paper examines how Data Readiness for AI (DRAI) principles apply to leadership-scale scientific datasets used to train foundation models. We analyze archetypal workflows across four representative domains—climate, nuclear fusion, bio/health, and materials—to identify common preprocessing patterns and domain-specific constraints. We introduce a two-dimensional readiness framework composed of Data Readiness Levels (raw to AI-ready) and Data Processing Stages (ingest to shard), both tailored to high-performance computing (HPC) environments. This framework helps outline key challenges in transforming large-scale scientific data into formats suitable for scalable AI training. Together, these dimensions form a conceptual maturity matrix that characterizes scientific data readiness and guides infrastructure development toward standardized, cross-domain support for scalable and reproducible AI for science.
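A two-dimensional maturity matrix of this kind can be pictured as a pair of ordered axes. The intermediate level and stage names below are placeholders guessed from the abstract's "raw to AI-ready" and "ingest to shard" phrasing, not the paper's actual taxonomy:

```python
from enum import IntEnum

class ReadinessLevel(IntEnum):
    """Rows of the matrix: "raw to AI-ready" (intermediate names assumed)."""
    RAW = 0
    CLEANED = 1
    CURATED = 2
    AI_READY = 3

class ProcessingStage(IntEnum):
    """Columns of the matrix: "ingest to shard" (intermediate names assumed)."""
    INGEST = 0
    TRANSFORM = 1
    VALIDATE = 2
    SHARD = 3

def maturity_cell(level, stage):
    """Locate a dataset in the (level, stage) maturity matrix."""
    return (ReadinessLevel(level).name, ProcessingStage(stage).name)
```

Because both axes are ordered, a workflow's progress can be tracked as a monotone path from (RAW, INGEST) toward (AI_READY, SHARD).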
4:30pm - 5:00pm
Olivera Kotevska, ORNL
The ability to analyze data responsibly while preserving privacy is a foundational requirement for trustworthy AI. This talk explores PETINA and PRESTO, two ORNL-developed tools designed to make differential privacy practical and adaptive. PETINA provides the core capabilities for performing private data analysis, while PRESTO serves as an intelligent recommendation engine, guiding users toward optimal privacy-preserving configurations based on dataset features and privacy-utility trade-offs. By integrating automated optimization into privacy workflows, these tools provide a practical pathway for embedding differential privacy into AI systems without compromising usability or performance.
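As a hedged illustration of the kind of primitive such tools build on (not PETINA's or PRESTO's actual API), the classic Laplace mechanism releases a numeric query result with epsilon-differential privacy by adding noise scaled to sensitivity/epsilon:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value under epsilon-differential privacy by adding
    Laplace noise with scale = sensitivity / epsilon."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)
```

The privacy-utility trade-off is visible in the scale: a smaller epsilon (stronger privacy) means more noise, which is exactly the kind of configuration choice a recommendation engine like PRESTO is described as helping users navigate.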
5:00pm - 5:30pm
Shubhabrata Mukherjee, Jack Lang, Obeen Kwon, Iryna Zenyuk, Valerie Brogden, Adam Weber, Daniela Ushizima
Zero-shot and prompt-based models have excelled at visual reasoning tasks by leveraging large-scale natural image corpora, but they often fail on sparse and domain-specific scientific image data. We introduce Zenesis, a no-code interactive computer vision platform designed to reduce data readiness bottlenecks in scientific imaging workflows. Zenesis integrates lightweight multimodal adaptation for zero-shot inference on raw scientific data, human-in-the-loop refinement, and heuristic-based temporal enhancement. We validate our approach on Focused Ion Beam Scanning Electron Microscopy (FIB-SEM) datasets of catalyst-loaded membranes. Zenesis outperforms baselines, achieving an average accuracy of 0.947, Intersection over Union (IoU) of 0.858, and Dice score of 0.923 on amorphous catalyst samples; and 0.987 accuracy, 0.857 IoU, and 0.923 Dice on crystalline samples. These results represent a significant performance gain over conventional methods such as Otsu thresholding and standalone models like the Segment Anything Model (SAM). Zenesis enables effective image segmentation in domains where annotated datasets are limited, offering a scalable solution for scientific discovery.
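The IoU and Dice scores reported above have standard definitions for binary segmentation masks; a minimal sketch (function and array names are illustrative, not Zenesis code):

```python
import numpy as np

def iou_and_dice(pred, target):
    """Compute Intersection-over-Union and Dice score for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    # Empty masks on both sides count as a perfect match.
    iou = intersection / union if union else 1.0
    dice = 2 * intersection / total if total else 1.0
    return float(iou), float(dice)
```

Dice weights the overlap region more heavily than IoU (Dice = 2·IoU/(1+IoU)), which is why the reported Dice scores sit above the corresponding IoU values.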
5:30pm - 5:35pm
Jean Luca Bez, LBNL
DRAI 2025 | Data Readiness for AI (DRAI) Workshop | Bez, Byna