First International Workshop on 

Managing and Querying Provenance Data at Scale

Held in conjunction with EDBT/ICDT 2013

March 22nd, 2013, Genova, Italy

Workshop program

Session I: 9-10:30

Keynote I - Use of Real-Life EHR Data for Clinical Research and Personalized Medicine, Dr. Joerg Kraenzlein, CSC

Electronic Health Record (EHR) systems are a rich repository of medical knowledge that is largely untapped, especially in Clinical Research. These vast data stores have the potential to improve patient recruitment, support feasibility studies, refine inclusion/exclusion criteria, enhance safety data and, in general, inform clinical research. The main challenges in mining EHR data revolve around data privacy and security, and around the variety of data structures, both standard and non-standard, which can create inconsistencies. Many tools are now being developed and several important industry initiatives are under way. Dr. Kraenzlein will review some of these initiatives, present case study information demonstrating the value of EHR data in Clinical Research, and highlight the importance of collecting and analyzing provenance in support of EHR management.

Session II: 11-12:30

ProvBench short presentations:
  • Provenance Traces of the Swift Parallel Scripting System, Luiz M. R. Gadelha Jr., Michael Wilde, Marta Mattoso, and Ian Foster
  • A Workflow PROV-Corpus based on Taverna and Wings, Khalid Belhajjame, Jun Zhao, Daniel Garijo, Aleix Garrido, Stian Soiland-Reyes, Pinar Alper, and Oscar Corcho

Session III: 14-15:30

    Keynote II - Foundations and applications of Data Provenance, Dr. Grigoris Karvounarakis, LogicBlox, USA

    In recent years we have witnessed a boost in the publication and sharing of massive amounts of scientific, corporate, government and crowd-sourced data sets. In such settings data is often freely exchanged, integrated, and materialized through database queries, and knowing the sources and query operators involved in the derivation of data is crucial, in order to assess data quality and strengthen data accountability. This functionality essentially calls for representing and reasoning on the provenance of data derived through database queries. In this talk we are going to discuss how abstract provenance models can be employed to record information about source data and query operators during query evaluation, and later be used to assess various dimensions of data quality, such as trustworthiness, reputation and reliability, of query results. In particular, we are going to present such provenance models for positive relational algebra queries, and outline extensions for capturing a variety of other data models (XML, RDF) and query operators (recursion, negation, aggregation). We are also going to provide an overview of applications and systems that build upon these provenance models, and identify challenges involved in capturing and managing data provenance in Big Data settings.

    Session IV: 16-17:30

    • Using Provenance to Analyse Agent-based Simulations, Edoardo Pignotti, Gary Polhill, and Peter Edwards (short paper)
      • PROV-O Provenance Traces From Agent-based Social Simulation, Edoardo Pignotti, Gary Polhill, and Peter Edwards (ProvBench presentation)
    • ProvBench short presentations:

      • Provenance Traces from Chiron Parallel Workflow Engine, Felipe Horta, Vítor Silva, Flavio Costa, Daniel de Oliveira, Kary Ocaña, Eduardo Ogasawara, Jonas Dias, and Marta Mattoso
      • Extracting PROV provenance traces from Wikipedia history pages, Paolo Missier and Ziyu Chen

          (open-ended discussion)

      Please see also the companion ProvBench site with the provenance corpora.

      Workshop Motivation and Focus

      Provenance data is poised to become pervasive in key areas of information management, ranging from traditional areas of science (e.g., life sciences, earth sciences, astronomy) to new applications enabled by the Web (e.g., social sciences, social network analysis, quality and trust in Web publishing).

      As the volume of provenance metadata increases with the volume of the underlying data whose history it describes, new challenges for managing and querying provenance at scale emerge: provenance data is growing in both "count" and "complexity". It is growing in count because of the very large number of provenance traces (one for each Twitter message, for example), and in complexity in the case of provenance graphs that are generated from provenance-enabled programming environments (e.g., scientific workflow systems) and middleware. Data-intensive science is bound to produce provenance that scores high on both counts.

      At the same time, emerging standards such as PROV, the W3C recommendation for provenance modelling and Web-based access, suggest that provenance data will increasingly be encoded using Semantic Web technology. This in turn suggests that provenance data will soon form a natural extension of, and seamlessly blend with, the growing Linked Data Cloud.  

      The new Managing and Querying Provenance Data at Scale workshop (BIGProv) stems from these premises. We are interested in exploring the system and modelling challenges associated with collecting, storing, querying, and exploiting large volumes of possibly complex provenance data. We seek to map the state of the art, elicit new research problems, and learn about existing systems. More specifically, the workshop scope includes the following topics: 

      • Automated capture of provenance at multiple layers (system, middleware, applications)
      • Database models, languages, and systems for storing and querying large-scale provenance 
      • Provenance and Linked Open Data (LOD): seamless representation and query models
      • Comparison and performance benchmarking of different data architectures and query models for provenance
      • Analysis of existing graph query models and systems for provenance graphs
      • Reference datasets for provenance benchmarking
      • System descriptions and demonstrations of large-scale provenance and graph data
      • Uniform querying over heterogeneous provenance traces
      • Abstraction models for provenance and their applications to user presentation, visualization, and privacy preservation

      Workshop Organizers and Contacts


      Bertram Ludaescher, UC Davis, CA (ludaesch@ucdavis.edu)
      Paolo Missier, Newcastle University, UK (pmissier@acm.org)

      Proceedings chair:  Victor Cuevas, University of New Mexico and UC Davis, USA

      Contact: bigprov13@easychair.org

      Program Committee members

      • Roger Barga, Microsoft Research, USA
      • Khalid Belhajjame, University of Manchester, UK
      • Edoardo Pignotti, University of Aberdeen, UK
      • Marta Mattoso, COPPE - Federal Univ. Rio de Janeiro, BR
      • Shawn Bowers, Gonzaga University, USA
      • Paul Groth, VU University Amsterdam, NL
      • Irini Fundulaki, ICS-FORTH, Greece
      • Paul Watson, Newcastle University, UK
      • Daniel Garijo, UPM, Spain
      • Ewa Deelman, USC Information Sciences Institute, USA
      • Luc Moreau, University of Southampton, UK