Shared Provenance Representations:
Cross-Organization Reproducibility with Semantic Workflows
Information Sciences Institute, University of Southern California
December 9, 2010
At a recent NSF workshop, attendees expressed great concern that reproducibility is today virtually impossible for complex scientific applications, and recommended viewing workflows as a key component of science infrastructure that facilitates reproducibility and validation [Gil et al 07]. A variety of workflow systems have been developed and are being used in many areas of science.
Our work in Wings focuses on semantic workflows [Gil et al 10; Kim et al 08], which are descriptions of workflows at a higher abstraction level that capture what workflow steps do rather than how they are implemented or executed. We have developed semantic workflow representations that support automatic constraint propagation and reasoning algorithms to manage constraints among the individual workflow steps. In recent work (currently in submission), we reproduced results published in the literature by reusing workflows from a library that captured a wide range of methods that are common in population genomics. Some observations from these studies include:
· A library of carefully crafted workflows of select state-of-the-art methods could cover a very large range of analyses in many scientific areas. The workflows that we used to replicate the results were independently developed and were used unchanged. They were designed with no knowledge of the replicated studies, which were chosen long afterwards.
· Workflow systems enable efficient setup of analyses. The collection of workflows was available through an interface that described the workflow components and the kinds of data required for each workflow, and it was backed by a workflow system for automatic execution management on a set of grid resources. The replication studies took seconds to set up. There was no overhead incurred in downloading or setting up software tools, reading documentation, or typing commands to execute each of the steps of the analysis.
· It is important to create semantic abstractions of the conceptual analysis being carried out that factor out details of the execution environment. The software components used in the original studies were not the same as those in our workflows. Our workflows are described in an abstract fashion, independent of the specific software components executed. In one of the original studies, the authors used proprietary software, while our workflow used a collection of open source software tools that implemented the same method. In another study, an older algorithm was used, while our workflow used a state-of-the-art method that combines evidence from three newer algorithms. Our workflows thus contain more up-to-date methods that can be readily applied.
· Semantic constraints can be added to workflows to avoid analysis errors in results replication. One of the workflows that we set up failed to execute. Examining the trace, we saw that the failure occurred while executing one of the last components. Reading the documentation of the software, we realized that the method does not work if the data contains duplicate individuals. Upon manual examination, we discovered that there were three duplicated individuals in the dataset. We removed them by hand and the workflow executed with no problems. The workflow now includes a semantic constraint that the input data for the association test cannot contain duplicate individuals, which results in the prior step in the workflow having a parameter set to remove duplicates. Equipped with such semantic constraints, the workflow system can automatically determine the correct setup of the analysis for a given dataset. The time savings to future users could be significant.
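The duplicate-individuals constraint in the last observation can be illustrated with a minimal sketch. This is not the Wings representation or API: the Step class, the configure function, the property name "no_duplicate_individuals", and the parameter name "remove_duplicates" are all hypothetical, and the sketch assumes a simple linear workflow in which no step destroys a property that the data already has.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One workflow step. All names are hypothetical, not the Wings API."""
    name: str
    # Properties the step's input data must have.
    requires: frozenset = frozenset()
    # Property -> (parameter, value) that makes this step's output have it.
    can_ensure: dict = field(default_factory=dict)
    # Parameter settings chosen for this run.
    params: dict = field(default_factory=dict)


def configure(steps, dataset_properties):
    """Set upstream parameters so every step's input constraints hold.

    Assumes a linear workflow: steps[i] feeds steps[i + 1].
    """
    for index, step in enumerate(steps):
        for prop in sorted(step.requires):
            if prop in dataset_properties:
                continue  # The dataset already satisfies the constraint.
            # Search the nearest upstream step that can guarantee the property.
            for upstream in reversed(steps[:index]):
                if prop in upstream.can_ensure:
                    param, value = upstream.can_ensure[prop]
                    upstream.params[param] = value
                    break
            else:
                raise ValueError(
                    f"{step.name} requires {prop!r}, "
                    "which no upstream step can ensure"
                )
    return steps


def build_workflow():
    """A two-step workflow loosely modeled on the association test example."""
    return [
        Step(
            "filter_individuals",
            can_ensure={"no_duplicate_individuals": ("remove_duplicates", True)},
        ),
        Step(
            "association_test",
            requires=frozenset({"no_duplicate_individuals"}),
        ),
    ]


# Dataset with possible duplicates: the filtering step gets reconfigured.
with_duplicates = configure(build_workflow(), dataset_properties=set())
print(with_duplicates[0].params)

# Dataset known to be duplicate-free: no parameter needs to change.
already_clean = configure(
    build_workflow(), dataset_properties={"no_duplicate_individuals"}
)
print(already_clean[0].params)
```

In this sketch the constraint is declared once on the association test, and the setting of the earlier step is derived from it for each dataset, so a user would not need to read the tool documentation or inspect the data by hand.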
A workflow can be seen as a “digital instrument” that enables scientists to analyze data through the lens of the method that the workflow represents. Shared workflow repositories would give scientists access to such digital instruments at very low cost. Semantic workflow representations can capture semantic constraints that must be respected in order to use the suite of digital instruments properly. Semantic workflow representations can also be used to establish mappings across different experiments, facilitating the organization of valuable provenance knowledge about scientific results.
[Cheney et al 10] “Requirements for Provenance on the Web.” James Cheney, Yolanda Gil, Paul Groth (Editor), and Simon Miles. Report from the W3C Provenance Incubator Group, first release: April 9, 2010. Available from http://www.w3.org/2005/Incubator/prov/wiki/User_Requirements
[Gil et al 07] “Examining the Challenges of Scientific Workflows.” Yolanda Gil, Ewa Deelman, Mark Ellisman, Thomas Fahringer, Geoffrey Fox, Dennis Gannon, Carole Goble, Miron Livny, Luc Moreau, and Jim Myers. IEEE Computer, vol. 40, no. 12, December, 2007. Available from http://www.isi.edu/~gil/papers/computer-NSFworkflows07.pdf
[Gil et al 11] “Wings: Intelligent Workflow-Based Design of Computational Experiments.” Yolanda Gil, Varun Ratnakar, Jihie Kim, Pedro Antonio Gonzalez-Calero, Paul Groth, Joshua Moody, and Ewa Deelman. To appear in IEEE Intelligent Systems, 2011. Available from http://www.isi.edu/~gil/papers/gil-etal-ieee-is-11.pdf
[Gil et al 10] “Final Report of the W3C Provenance Incubator Group.” Yolanda Gil, James Cheney, Paul Groth, Olaf Hartig, Simon Miles, Luc Moreau, and Paolo Pinheiro da Silva. Report from the W3C Provenance Incubator Group, first release: November 30, 2010. Available from http://www.w3.org/2005/Incubator/prov/wiki/Final_Report_Draft
[Kim et al 08] “Provenance Trails in the Wings/Pegasus Workflow System.” Jihie Kim, Ewa Deelman, Yolanda Gil, Gaurang Mehta, Varun Ratnakar. Concurrency and Computation: Practice and Experience, Vol 20, Issue 5, April 2008. Available from http://www.isi.edu/~gil/papers/CCPE07-Provenance.pdf
[Sahoo et al 10] “Provenance Vocabulary Mappings.” Satya Sahoo, Paul Groth, Olaf Hartig, Simon Miles, Sam Coppens, James Myers, Yolanda Gil, Luc Moreau, Jun Zhao, Michael Panzer, and Daniel Garijo. Report from the W3C Provenance Incubator Group, first release: August 6, 2010. Available from http://www.w3.org/2005/Incubator/prov/wiki/Provenance_Vocabulary_Mappings