3rd Workshop on Reproducible Workflows, Data Management, and Security

During eScience'23 in Limassol, Cyprus

Tuesday, October 10, 2023

Emerging and future computational workloads combine traditional HPC applications with tools and techniques from the scale-out data analytics and machine learning communities. Getting these technologies to co-exist and interoperate to advance scientific discovery is a daunting task with few known good solutions, and constructing such workflows can create pitfalls and incompatibilities that limit adoption.

Formalizing the steps of an application or data-processing pipeline is increasingly common, and increasingly necessary: reproducibility-artifact requirements at publishing venues are one driver of this formalization. The processes and infrastructure used to meet these requirements are frequently bespoke to a particular research area. All of these formalization activities can be described as workflow systems. Existing off-the-shelf tools handle distributed environments fairly well, but they are not complete solutions and barely address the scale-up community, if at all.

Workflow management is further complicated by the need to manage data both during workflow execution and afterward, and to provide authentication and data security for shared data sets. With some data, such as climate simulation output, subject to intense scrutiny, it becomes crucial to offer open data that can be verified as authentic by means of encrypted creator identities, yet is accessible only to those with a need to know. Sharing and analyzing data that is known to be authentic, while protecting the identities of the scientists who produced it, is essential for reliable open science.
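As one concrete building block for data authenticity, consider a creator-held signing key. The sketch below is illustrative only: the data and key handling are assumptions, and privacy-preserving schemes (for example, group signatures) would be needed to also conceal the creator's identity.

# Minimal sketch of verifying a data set's creator via a digital signature;
# illustrative only, not a specific system discussed at the workshop.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

data = b"climate-simulation-output-v1"   # stand-in for a published data set

private_key = Ed25519PrivateKey.generate()  # held by the data creator
public_key = private_key.public_key()       # published alongside the data

signature = private_key.sign(data)

try:
    public_key.verify(signature, data)   # raises if data or signature changed
    print("data set is authentic")
except InvalidSignature:
    print("data set failed verification")

Here the creator publishes the public key alongside the data set, and any consumer can check that the bytes they downloaded are exactly what the creator signed.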

This workshop seeks to explore ideas and experiences on the kinds of infrastructure developments that can improve upon the state of the art. Component packaging via containers and virtual machines, automation scripting, deployment, portability builds, and system support for these and other relevant activities are key infrastructure. Provenance collection, exploration, and tracking are key to well-documented scientific output. Experience reports on using existing systems to achieve these goals are important for developing best practices that span application domains. Data privacy techniques, such as multi-party encryption and differential privacy, are important as well. Submissions on managing large data sets and intermediate workflow data, particularly publicly accessible data intended for use and reuse, are encouraged, as are new techniques and technologies that address reproducibility requirements. We seek work on all of these and related topics, as well as position and experience papers looking to drive the conversation for practitioners and researchers in these spaces.

This workshop contributes by sharing experiences and exploring the technological infrastructure needed to support effective, convenient workflow systems and application-composition approaches across a broad spectrum of HPC environments, from clusters to supercomputers to cloud systems.


Abstract: In this talk, I will explore the concept of semantic code exploration, where AI, knowledge graphs, and deep learning blend to revolutionize how we discover and utilize scientific software repositories. The talk will then delve into the world of serverless frameworks powered by AI, introducing techniques such as semantic workflow code search, code summarization, and workflow code completion. These advancements simplify data-intensive computations, providing researchers and practitioners with powerful tools to accelerate their work. I will finish by presenting techniques such as auto-scaling that close efficiency gaps and support stateful streaming applications within scientific workflows. Together, these innovations reshape the landscape of scientific research, fostering collaboration and ultimately driving efficiency.
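To give a flavor of semantic code search over workflow repositories, here is a minimal sketch under assumed tooling; the model choice, snippet corpus, and search function are illustrative, not the system described in the talk.

# Minimal sketch of semantic code search over workflow snippets using
# sentence embeddings; corpus and model name are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

snippets = [
    "def stage_inputs(bucket): ...  # copy raw data from object storage",
    "def align_reads(fastq): ...   # align sequencing reads to a reference",
    "def plot_results(df): ...     # render summary figures",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = model.encode(snippets, normalize_embeddings=True)

def search(query: str, k: int = 2) -> list[str]:
    """Return the k snippets whose embeddings are closest to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = corpus @ q  # cosine similarity, since embeddings are normalized
    return [snippets[i] for i in np.argsort(-scores)[:k]]

print(search("download input files from object storage"))

The point is that a natural-language query retrieves the staging snippet even when it shares few or no keywords with the code.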

Abstract: For science to support new discoveries reliably, its results must be reproducible. Reproducibility means getting consistent results when using the same or similar input data, computational steps, conditions of analysis, and so on. Common obstacles to reproducibility include the use of randomization and asynchronous distributed computation, as well as evolving libraries and packages on which a program depends. Provenance is a record of the data-creation process and of dependencies between data; it is crucial for tracing errors and ensuring repeatability in data science and workflows. In current systems, however, provenance is often audited but rarely used to establish reproducibility. This talk will describe reproducible containers, which use provenance to guarantee computational reproducibility. First, we will explain how to create reproducible containers and why it is essential to combine their file-isolation feature with provenance-based record and replay. Next, we will demonstrate how these containers can use provenance to establish reproducibility despite varying notebook platforms and kernels. Finally, we will show how provenance can be used within the containers to accomplish a variety of reproducible analyses, such as differencing, multi-version replay, and data debloating. Through these goals, we will demonstrate the significance of provenance in the various phases of computational reproducibility.
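As a rough illustration of the record-and-replay idea, here is a minimal file-level sketch; the manifest format and the record/replay helpers are hypothetical, not the containers described in the talk.

# Hypothetical file-level provenance record/replay, sketching the idea
# behind provenance-based reproducibility checks.
import hashlib
import json
import subprocess
import sys
from pathlib import Path

MANIFEST = Path("provenance.json")

def digest(path: Path) -> str:
    """Return the SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def record(cmd: list[str], inputs: list[Path]) -> None:
    """Run cmd and record a manifest of input digests for later replay."""
    manifest = {"cmd": cmd, "inputs": {str(p): digest(p) for p in inputs}}
    subprocess.run(cmd, check=True)
    MANIFEST.write_text(json.dumps(manifest, indent=2))

def replay() -> None:
    """Re-run the recorded command, failing fast if any input drifted."""
    manifest = json.loads(MANIFEST.read_text())
    for path, expected in manifest["inputs"].items():
        if digest(Path(path)) != expected:
            sys.exit(f"input drifted since recording: {path}")
    subprocess.run(manifest["cmd"], check=True)

# Hypothetical usage: record(["python", "simulate.py"], [Path("config.yaml")])

A real reproducible container would additionally capture the executable, libraries, and environment, but the principle is the same: record provenance at run time, then use it to detect drift before replaying.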


Agenda: ReWorDS 23

Schedule

Photo of our previous workshop during eScience'22 in Salt Lake City, Utah, USA. 

Topics of Interest:

Photos from Limassol, Cyprus

Submissions accepted in EasyChair:

https://easychair.org/my/conference?conf=rewords23 

Papers should be formatted in the IEEE format, following the eScience formatting rules, and may be up to 6 pages, not including references.

Important Dates:

Proposed Program Committee:

Organizing Committee: