
ProvBench @ Provenance Week 2014

ProvBench: Benchmarking Provenance Management Systems

2nd edition: Call for benchmarking datasets

Workshop Discussion

We had a discussion at the end of the workshop during which we sought feedback from the audience. You can find the minutes here.


Program

The program of ProvBench, which will be held jointly with the Provenance Reconstruction workshop, can be found here.


Background


Provenance metadata, or metadata that describes the origins of data, is now widely regarded as a key ingredient for numerous (traditional and novel) applications. For example, provenance can be used to inspect the quality of data provided by third parties, to identify active members in social network analytics, and to ensure correct attribution, citation, and licensing.


The increasing number of provenance-related proposals and systems creates the need for a well-documented and impartial provenance corpus that researchers and systems developers can use to test and validate their provenance management systems (ProvMS), including storage techniques for large provenance graphs, query models, and analysis algorithms. These systems are currently tested and assessed on proprietary provenance datasets, which makes it difficult to benchmark and compare different implementations.


On the other hand, benchmark datasets are already available for a wide variety of generic DBMS, upon which many implementations of ProvMS are based. These generic systems include RDF triple stores, native graph DBMS, relational DBMS, and more. Thus, the questions we aim to answer include:

  1. Is there in fact a need for new benchmark datasets that are specific to provenance data and that reflect its usage? For instance: system-level provenance, provenance of web pages (MediaWiki), provenance of a software project, provenance of scientific workflows, provenance of human processes, etc.

  2. Does provenance exhibit typical data or query patterns that may suggest ways to optimize either storage or query processing?

  3. To what realistic sizes and at what rate does provenance data accumulate in different settings, and when does size begin to pose a problem for storage and query processing?

Objective

With these questions in mind, ProvBench aims to build upon the tradition of database benchmarks (e.g. relational, RDF). Its purpose is to collect a corpus of provenance datasets, along with associated query workloads, that are at the same time:

  • broad: representative of a variety of provenance usage scenarios

  • specific to provenance data (as opposed to general RDF, graph, or relational benchmarking datasets)

  • challenging to provenance management systems (scalable storage, query performance)

Why do this?

You will not get a formal paper publication out of this, as we cannot include your documentation in the TAPP/IPAW proceedings. However, you will get a data publication with an official DOI.


The datasets will be cited by members of the community who make use of them in their publications. To encourage this practice, DOIs will be minted for the datasets accepted in ProvBench, allocated with the help of FigShare. Furthermore, authors are encouraged to submit a companion paper to TAPP/IPAW.


Submissions

Submissions can be entirely new or they can be new versions, or refinements, of submissions to the first edition of ProvBench.


Each submission shall consist of:


  1. A dataset (provenance trace).

    1. Multiple distinct datasets can be submitted. These should, however, be “similar” provenance traces at differing scales, derived from the same original data source.

    2. Traces can be serialized in any of the W3C PROV encodings [1], either official (PROV Notation, PROV-O) or unofficial (PROV-XML, PROV-JSON); a small illustrative sketch follows after this list.

  2. A query workload. Lacking a standard query language for provenance, queries are to be expressed in natural language and must be sufficiently precise to allow for unambiguous implementation.

  3. Metadata: size (number of entities, activities, and relationships), format, authorship, etc.

  4. Rationale and documentation for the submission, including:

    1. the type of scenario that the submission is representative of, along with any background information useful for understanding the domain

    2. what the dataset and its accompanying queries can be used to test

    3. what makes the dataset distinct from generic DBMS benchmarks

    4. what makes the submission challenging

    5. how the dataset has been used to test specific properties of a ProvMS
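
To make these items concrete, here is a minimal sketch of how a small trace could be built, serialized (item 1.2), and summarized for the size metadata (item 3). It assumes the Python prov package, which is in no way required for submissions, and all identifiers such as ex:dataset1 are purely illustrative.

```python
# Minimal sketch using the Python "prov" package (pip install prov).
# Identifiers (ex:dataset1, ex:workflowRun1, ex:alice) are illustrative only.
from prov.model import ProvDocument, ProvElement, ProvRelation

doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/')

# A tiny trace: one activity generates one entity, attributed to one agent.
doc.agent('ex:alice')
doc.activity('ex:workflowRun1')
doc.entity('ex:dataset1')
doc.wasGeneratedBy('ex:dataset1', 'ex:workflowRun1')
doc.wasAttributedTo('ex:dataset1', 'ex:alice')
doc.wasAssociatedWith('ex:workflowRun1', 'ex:alice')

# Serialize the trace as PROV-JSON and PROV-N.
print(doc.serialize(format='json'))
print(doc.get_provn())

# Basic size metadata: counts of elements and relationships.
elements = [r for r in doc.get_records() if isinstance(r, ProvElement)]
relations = [r for r in doc.get_records() if isinstance(r, ProvRelation)]
print(len(elements), 'elements,', len(relations), 'relationships')
```

Any other PROV-compliant tooling can, of course, be used to produce equivalent serializations and statistics.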


Submission process
  • First, email Khalid Belhajjame (kbelhajj@googlemail.com) with an expression of interest and your GitHub account name. A project repository with write access will then be set up for you in the ProvBench GitHub area [4].

  • Then, submit your dataset to that GitHub repository.

  • The rationale document does not constitute a paper and will not be published in the proceedings. Companion papers, if desired, should be submitted to TAPP [2] or IPAW [3].



The Event


The event is expected to combine presentations, a mini-hackathon, and panel sessions, depending on the number of submissions and participants. A detailed agenda will be announced a few weeks prior to the event.


ProvBench will be co-located with Provenance Week 2014, taking place in Cologne, Germany, on June 13th, 2014.

Note that you must register for Provenance Week 2014 in order to attend this event.


Important Dates


  • Expression of interest: May 2nd, 2014. (extended)

  • Submission deadline: May 9th, 2014.

  • Notification: May 19th, 2014.

  • Workshop day: June 13th, 2014.

Organisers

References

[1]: http://www.w3.org/TR/prov-overview/

[2]: http://provenanceweek.dlr.de/tapp/call-participation/

[3]: http://provenanceweek.dlr.de/ipaw/call-participation/

[4]: https://github.com/provbench