ProvBench: A Provenance Repository for Benchmarking


Provenance metadata, that is, metadata describing the origins of data, is now widely regarded as a key ingredient for numerous applications, both traditional and novel. For example, provenance can be used to inspect the quality of data provided by third parties, to identify active members in social network analytics, and to ensure correct attribution, citation and licensing.

The increasing number of provenance-related proposals and systems has created the need for a well-documented and impartial provenance corpus that researchers and system developers can use to test and validate their provenance management systems, including storage techniques for large provenance graphs, query models, and analysis algorithms. These systems are currently tested and assessed on proprietary provenance datasets, which makes it difficult to benchmark and compare different implementations.

General Objective of ProvBench

The objective of ProvBench is to bootstrap the publication of provenance information in an open and accessible way. It aims to collect a corpus of reference provenance traces from multiple contributors and multiple domains, with different sizes and structures, and to make it available as a large and ever-growing community resource.

Such a corpus can be used for a variety of purposes: to understand current uses of provenance, and to assess systems in terms of storage, processing time and interoperability, as well as their expressiveness, querying, inference and constraint enforcement capabilities.

At this early stage, we plan to organise submissions to the benchmark repository by co-locating them with a series of relevant events. This format may change as interest grows and the community develops.

Target Audience

Our target audience is broad: it includes anyone who will contribute to the creation of the provenance corpus or use the corpus for benchmarking. Anyone is invited to contribute provenance information to the corpus, whether a content provider or a technologist. A content provider can be the owner of a web site or a blog, a scientist, or a provider of a dataset in any format, be it relational, structured, or unstructured. A technologist can be a developer of a provenance publication application or plug-in, an administrator of a web site, a computer scientist interested in provenance-related research or applications, or anyone generally enthusiastic about provenance.