A Tutorial on Provenance Analysis and RDF Query Processing (PARC)


Satya S. Sahoo {satya.sahoo@case.edu}, Praveen Rao {raopr@umkc.edu}
Case Western Reserve University, Cleveland, OH, USA
University of Missouri-Kansas City, Kansas City, MO, USA

Abstract | Presenters | Motivation | Detailed Description | Relevance to ISWC Research Tracks and Expected Audience | TimeLine | Tutorial Material


Provenance metadata describes the origin or history of data. It is central to ensuring data quality and supporting scientific reproducibility, and it has a growing role in emerging domains such as healthcare informatics and the Internet of Things (IoT). Provenance also underpins audit trails in financial transactions, compliance with privacy laws, and the secondary use of health data in research studies. In addition, provenance analytics can support Web commerce, which generated $1 trillion worth of business in 2012, as well as trust computations in social media and sensor networks. To address the growing interest in integrating provenance into information and knowledge management systems, this tutorial will weave together three related themes: (1) the role of provenance in supporting "data-driven" research and the Web of Data (the Linked Open Data (LOD) cloud, with 31 billion RDF triples); (2) the role of the World Wide Web Consortium (W3C) PROV specifications in modeling provenance information, together with scalable query processing techniques to support provenance analysis; and (3) real-world applications of provenance in the emerging discipline of healthcare informatics.

The tutorial will be of interest to: (a) academic researchers who are incorporating provenance metadata in their research to ensure data quality; and (b) developers working on scalable platforms for emerging domain applications, such as IoT, LOD, and healthcare and life sciences. In addition to this breadth, the tutorial will present key technical topics that have seen significant research, including provenance modeling, indexing and querying techniques for W3C RDF datasets in support of provenance queries, and building complex provenance-enabled healthcare informatics platforms. The tutorial will cover the W3C PROV specifications, which are being used to integrate provenance into information systems, including the PROV Data Model (PROV-DM), the PROV Ontology (PROV-O), and the PROV constraints. The tutorial developers have research experience in both provenance management and RDF database indexing and querying.


Presenters

Satya S. Sahoo is Assistant Professor in the Division of Medical Informatics and the EECS department at Case Western Reserve University (CWRU), Cleveland, OH, USA. His research focuses on the Semantic Web, including: (1) provenance metadata management, (2) ontology engineering (from upper-level reference ontologies to application/domain-specific ontologies), and (3) ontology-driven data integration and query optimization. Satya served as a member of the W3C Provenance Working Group that developed the PROV specifications, and he was a co-editor of the PROV-O specification. He has received the Glennan Fellowship and Nord grant awards for teaching and education, and has presented a tutorial at the International Conference on Health Informatics (ICHI) on the use of Semantic Web technologies for biomedical research. His current research projects include the development of a cloud-based "Big data" application for neuroscience clinical research (Cloudwave), an Ontology-driven Patient Information Capture (OPIC) system, and a clinical text processing system called EpiDEA. Additional information about his research projects is available at: http://cci.case.edu/cci/index.php/satya_sahoo.

Praveen Rao is Associate Professor of Computer Science and Electrical Engineering at the University of Missouri-Kansas City (UMKC). He joined UMKC as an assistant professor in 2007 and is a collaborating faculty member with the Center for Health Insights at UMKC. His research interests are in the areas of data management and health informatics. His research, teaching, and outreach activities have been supported by the National Science Foundation (NSF), the University of Missouri Research Board, Intel Labs, Amazon Web Services, IBM, Headquarters Counseling Center (Kansas), and Kansas City Power and Light (KCP&L). He received the IBM Smarter Planet Faculty Innovation Award in 2010, and in 2013 he was one of 14 professors worldwide to receive the IBM Big Data and Analytics Faculty Award. He is a senior member of the IEEE. More information about Praveen is available at: http://r.web.umkc.edu/raopr


Motivation

The growing role of big data in information systems requires efficient integration and analysis of large volumes of data generated at high velocity, in a variety of formats, and from different sources. In this context, the metadata of the datasets plays a critical role, and the database research community has extensively used metadata for data integration [1, 2], data curation, and data quality. Provenance is a specific category of metadata that focuses on the history of data; hence, effective management of provenance has become critical in developing context-aware information systems and knowledge management platforms. A variety of approaches have been used to address the challenges in provenance management, for example database provenance [3, 4], workflow provenance [5, 6], and the use of Semantic Web technologies to model and query provenance information [7-9]. In many rapidly growing application domains, such as healthcare informatics, social media, and sensor networks, the analysis of provenance supports not only data quality but also new features, such as the ranking of query results based on provenance metrics. However, effective provenance management in information systems requires standardized representation techniques and efficient approaches for indexing and querying provenance metadata.

The W3C PROV Specifications:  To address the issues of provenance modeling, the W3C constituted the Provenance Working Group (PWG) in 2011 to define a standard for provenance interoperability across computing applications, with a focus on distributed information systems. The PWG defined the PROV family of specifications, consisting of the data model (PROV-DM) [10], constraints on the data model [11], and the PROV ontology (PROV-O) [12] (a member of the tutorial development team served as co-editor of PROV-O and as a contributing author of PROV-DM). The W3C released the PROV specifications in April 2013. The PROV specifications are expected to significantly increase the adoption of provenance-enabled information systems in: (1) "data-driven" research platforms, (2) trust computation in Web applications, and (3) data quality verification in the rapidly growing LOD cloud. Provenance information modeled using the PROV specifications can be represented as W3C Resource Description Framework (RDF) graphs; therefore, we explore the current state of the art in SPARQL query processing techniques over RDF graphs.

Indexing and Query Processing RDF:  There has been a flurry of interest within the database community in developing scalable techniques for indexing and query processing of large RDF datasets. In recent years, several techniques have been proposed for local RDF datasets containing triples (e.g., RDF-3X [20], Hexastore [27], BitMat [26], DB2RDF [22], TripleBit [23]). More recently, distributed RDF data stores running in a cluster environment have emerged as a potential solution for managing large-scale RDF datasets (e.g., Trinity.RDF [25], H2RDF+ [29], TriAD [30], DREAM [28]). For provenance applications, RDF triples extended with context information, i.e., RDF quadruples, will play an important role: the context of a quadruple can describe the origin of the triple. Efficient query processing methods for RDF quadruples are therefore highly desirable [31]. This tutorial will provide an overview of existing RDF query processing approaches for datasets containing triples and quadruples, their strengths and weaknesses for provenance analytics, and open problems in this area.
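To make the quadruple idea concrete, the following toy sketch (all names and data values are invented for illustration) stores (subject, predicate, object, context) quadruples in memory and indexes them by context, so that all triples asserted in a given named graph, and hence their shared origin, can be retrieved directly:

```python
from collections import defaultdict

class QuadStore:
    """A toy in-memory store for RDF quadruples (subject, predicate,
    object, context). The context names the graph a triple belongs to
    and can be the anchor for that triple's provenance description."""

    def __init__(self):
        self.quads = []
        self.by_context = defaultdict(list)  # context -> list of triples

    def add(self, s, p, o, c):
        self.quads.append((s, p, o, c))
        self.by_context[c].append((s, p, o))

    def triples_in(self, context):
        """All triples asserted in the given named graph/context."""
        return self.by_context[context]

store = QuadStore()
store.add(":patient1", ":hasDiagnosis", ":epilepsy", ":study42")
store.add(":study42", "prov:wasAttributedTo", ":labA", ":metadata")

print(store.triples_in(":study42"))
# [(':patient1', ':hasDiagnosis', ':epilepsy')]
```

Production systems such as those surveyed in the tutorial use disk-based or distributed index permutations over all four columns rather than a single context index; this sketch only illustrates why the fourth column matters for provenance.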

Detailed Description

PART I: Provenance, W3C PROV, and Data Quality

Provenance is derived from the French word "provenir", which means "to come from", and is often described by the W7 model involving What, When, Where, How, Who, Which, and Why [13]. In the first part of this tutorial, we will describe the existing work in database provenance and workflow provenance, and the three aspects of provenance modeling: (a) the PROV Data Model (PROV-DM), (b) the PROV Ontology (PROV-O), and (c) the PROV constraints that support inference over PROV graphs. The tutorial will review the core set of PROV terms and the constructs defined in PROV-O using OWL [14]. In addition, the PROV Constraints will be reviewed to discuss rule-based techniques for validating the quality of provenance information modeled using the PROV specifications.
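The core PROV terms can be illustrated with a tiny provenance graph built from the three core PROV classes (Entity, Activity, Agent) and the relations that link them. The term names below follow PROV-O; the resources and values themselves are invented for this sketch:

```python
# A tiny provenance graph as a set of (subject, predicate, object)
# triples, using PROV-O relation names. Data values are illustrative.
prov_graph = {
    (":chart_v2", "rdf:type", "prov:Entity"),
    (":revision", "rdf:type", "prov:Activity"),
    (":alice",    "rdf:type", "prov:Agent"),
    (":chart_v2", "prov:wasGeneratedBy",  ":revision"),
    (":revision", "prov:used",            ":chart_v1"),
    (":chart_v2", "prov:wasDerivedFrom",  ":chart_v1"),
    (":chart_v2", "prov:wasAttributedTo", ":alice"),
}

def objects(graph, subject, predicate):
    """All objects of (subject, predicate, ?) triples in the graph."""
    return {o for s, p, o in graph if s == subject and p == predicate}

print(objects(prov_graph, ":chart_v2", "prov:wasDerivedFrom"))  # {':chart_v1'}
print(objects(prov_graph, ":chart_v2", "prov:wasAttributedTo"))  # {':alice'}
```

The PROV constraints then act as integrity rules over exactly this kind of graph, e.g., that generation of an entity cannot precede the use of the entities it was derived from.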

A key component of provenance management is querying "provenance graphs" to support data quality and trust computations. Provenance queries can be divided into three major categories: (1) retrieve the provenance of an information entity, (2) retrieve information entities satisfying given provenance properties, and (3) operations that identify differences between the provenance of two information entities or merge the provenance trails of two entities [15]. Implementing these provenance queries over large-scale graph data requires complex graph operations, including multi-hop graph traversals, subgraph identification and retrieval, and graph isomorphism operations. We briefly cover the challenges associated with supporting these graph operations for provenance analysis.
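The first query category, retrieving the provenance of an entity, can be sketched as a multi-hop traversal: starting from an entity, we follow derivation edges transitively to collect everything it came from. The graph and property names below are an invented illustration, not a specific system's API:

```python
from collections import deque

def provenance_closure(graph, entity, predicate="prov:wasDerivedFrom"):
    """Multi-hop traversal (BFS) collecting everything `entity` is
    transitively derived from -- the first category of provenance query."""
    seen, queue = set(), deque([entity])
    while queue:
        node = queue.popleft()
        for s, p, o in graph:
            if s == node and p == predicate and o not in seen:
                seen.add(o)
                queue.append(o)
    return seen

graph = {
    (":v3", "prov:wasDerivedFrom", ":v2"),
    (":v2", "prov:wasDerivedFrom", ":v1"),
    (":v1", "prov:wasDerivedFrom", ":raw"),
}
print(provenance_closure(graph, ":v3"))  # {':v2', ':v1', ':raw'}
```

Note that this naive loop scans the whole graph at every hop; the indexing techniques covered in Part II exist precisely to make such traversals efficient at scale.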

PART II: Scalable RDF Query Processing: Current Practices for Provenance Analytics

In the second part of the tutorial, we present an overview of existing query processing techniques for large RDF datasets available on the Web. Early approaches employed an RDBMS to store and query RDF data (e.g., Sesame, Oracle). Unfortunately, the cost of self-joins on a single (triples) table became a serious bottleneck. The tutorial will review the various approaches proposed to address these issues, including vertical partitioning into property tables and the use of a column-oriented DBMS to achieve an order-of-magnitude performance improvement over previous techniques [19]. Recent efforts such as DB2RDF [22] use an RDBMS to store and query RDF data. A few approaches exploit the graph properties of RDF data for indexing and query processing. These techniques, however, have been tested only on small RDF datasets containing fewer than 50 million triples (e.g., gStore [21]).
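The contrast between a single triples table and vertical partitioning can be shown in a few lines. In this simplified sketch (predicates and data are invented), each predicate gets its own two-column table, so a query touching k predicates joins k small tables instead of self-joining one large triples table k times:

```python
from collections import defaultdict

triples = [
    (":a", ":author", ":alice"),
    (":a", ":year",   "2013"),
    (":b", ":author", ":alice"),
    (":b", ":year",   "2015"),
]

# Vertical partitioning: one (subject, object) table per predicate.
partitions = defaultdict(list)
for s, p, o in triples:
    partitions[p].append((s, o))

# Evaluate the pattern: ?s :author :alice . ?s :year "2015"
# as a join of the two small per-predicate tables.
authors = {s for s, o in partitions[":author"] if o == ":alice"}
years   = {s for s, o in partitions[":year"]   if o == "2015"}
print(authors & years)  # {':b'}
```

A column-oriented DBMS stores each such partition column-wise, which is what yields the order-of-magnitude speedups reported for this design.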

New approaches propose distributed and parallel RDF query processing. One parallel SPARQL query processing approach partitions the RDF graph on its vertices and places triples on different machines [24]; using n-hop replication of triples across partitions, it avoids communication between partitions during query processing. In Trinity.RDF [25], RDF graphs are stored natively using Trinity, a distributed in-memory key-value store; graph exploration and novel optimization techniques reduce the size of intermediate results, leading to faster query execution. More recently, H2RDF+ [29] was proposed; it builds eight indexes using HBase and uses Hadoop to perform sort-merge joins during query processing. TriAD [30] uses asynchronous inter-node communication for scalable SPARQL query processing and outperforms distributed RDF query engines that rely on Hadoop to perform joins. The tutorial will explore the outstanding issues in large-scale RDF query processing to address the growing need for provenance in many domain-specific applications [16-18].
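The placement idea behind these distributed engines can be sketched in miniature. The snippet below uses a simplified subject-hash scheme (not the exact partitioner of any of the cited systems): each triple is sent to the worker that owns its subject, so a subject-rooted star pattern is answered by one worker without inter-node communication. All names and the worker count are illustrative:

```python
import zlib
from collections import defaultdict

NUM_WORKERS = 3
triples = [
    (":v3", "prov:wasDerivedFrom",  ":v2"),
    (":v3", "prov:wasAttributedTo", ":alice"),
    (":v2", "prov:wasDerivedFrom",  ":v1"),
]

def worker_of(subject):
    # Deterministic hash (crc32) so placement is stable across runs.
    return zlib.crc32(subject.encode()) % NUM_WORKERS

# Place every triple on the worker owning its subject.
workers = defaultdict(list)
for s, p, o in triples:
    workers[worker_of(s)].append((s, p, o))

# Both triples about :v3 land on the same worker, so the star query
# { :v3 ?p ?o } needs no communication between workers.
home = worker_of(":v3")
print([t for t in workers[home] if t[0] == ":v3"])
```

Multi-hop patterns, by contrast, cross partition boundaries; n-hop replication trades storage for locality by copying each partition's k-hop neighborhood onto the same worker.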

PART III: Example Platforms and Applications: Healthcare Informatics

The final part of the tutorial will be divided into two sections: (1) existing tools and platforms for provenance management, including tools supporting the W3C PROV specifications that can be used by application developers and researchers; and (2) examples of applying provenance in the emerging domain of healthcare informatics. Since the release of the W3C PROV specifications in 2013, there has been extensive work on software and platforms that support provenance analysis using PROV (http://www.w3.org/2001/sw/wiki/PROV). There are also multiple tools, supporting different programming languages (e.g., Java, Python), that can be used in information systems and data integration tools to support provenance management.

Using ongoing large healthcare informatics projects as real-world examples, the tutorial will explore the use of provenance in ensuring data quality in large healthcare data repositories. The National Sleep Research Resource (NSRR) is a large multi-institute healthcare informatics project funded by the US National Institutes of Health (NIH) to create one of the largest repositories of multi-modal sleep research data, with the aim of providing unique data-driven opportunities to healthcare researchers using new cloud computing and Big Data techniques. NSRR proposes to use provenance information associated with research studies to ensure data quality and support complex data analysis operations. The tutorial will describe the use of provenance in NSRR, with practical scenarios illustrating the role of provenance in real-world data analytics projects.

Relevance to ISWC Research Tracks and Expected Audience

The themes covered in the tutorial are directly related to the focus areas of this conference in Knowledge Modeling and Databases, especially semantic techniques; knowledge management in specific domains (e.g., healthcare informatics); and query processing, optimization, and performance. The tutorial will be of interest to the database, Semantic Web, and knowledge modeling research communities in general, and in particular to researchers in various application domains who are working to integrate provenance as a core component of their applications. The tutorial will be accessible both to participants who are new to the Semantic Web and to researchers who are already familiar with Semantic Web technologies but are new to provenance. The tutorial will allow the audience to gain first-hand knowledge of the W3C PROV specifications, query processing over large-scale provenance RDF graphs, and the use of provenance in real-world application domains.


Timeline

July 1, 2015: Tutorial website will be live

August 15, 2015: Details of the tutorial schedule

September 15, 2015: Tutorial material, including slides, will be posted on the tutorial website

October 11-12, 2015: Conference and tutorial

Tutorial Material

Tutorial presentation is available here.


Praveen Rao was supported by the National Science Foundation under Grant No. 1115871. Satya Sahoo is funded by the National Institute of Biomedical Imaging and Bioengineering (NIBIB) grant 1U01EB020955.


References

[1] A. P. Sheth, J. A. Larson, "Federated database systems for managing distributed, heterogeneous and autonomous databases," ACM Computing Surveys, vol. 22, pp. 183-236, 1990.

[2] A. Halevy, Rajaraman, A., Ordille, J.J., "Data Integration: The Teenage Years.," in Proceedings of the 32nd international conference on Very Large Data Bases (VLDB), Seoul, Korea, 2006, pp. 9-16.

[3] P. Buneman, W.-C. Tan, "Provenance in databases," in ACM SIGMOD International Conference on Management of Data, Beijing, China, 2007, pp. 1171-1173.

[4] J. Widom, "Trio: A System for Data, Uncertainty, and Lineage," in Managing and Mining Uncertain Data, C. Aggarwal, Ed., ed: Springer, 2008.

[5] P. Missier, Sahoo, S.S., Zhao, J., Goble, C., Sheth, A., "Janus: from Workflows to Semantic Provenance and Linked Open Data," presented at the IPAW 2010, Troy, NY, 2010.

[6] J. Zhao, Goble, C., Stevens, R., Turi, D., "Mining Taverna's semantic web of provenance," Concurrency and Computation: Practice and Experience, 2007.

[7] S. S. Sahoo, Sheth, A., Henson, C., "Semantic Provenance for eScience: Managing the Deluge of Scientific Data," IEEE Internet Computing, vol. 12, pp. 46-54, 2008.

[8] P. Missier, Soiland-Reyes, S., Owen, S., Tan, W., Nenadic, A., Dunlop, I., Williams, A., Oinn, T., Goble, C., "Taverna, reloaded," in 22nd international conference on Scientific and statistical database management (SSDBM'10), Heidelberg, 2010, pp. 471-481.

[9] H. Patni, Sahoo, S.S., Henson, C., Sheth, A., "Provenance Aware Linked Sensor Data," presented at the 2nd Workshop on Trust and Privacy on the Social and Semantic Web, Co-located with ESWC2010, Heraklion Greece, 2010.

[10] L. Moreau, Missier, P., "PROV Data Model (PROV-DM)," World Wide Web Consortium (W3C), 2013.

[11] J. Cheney, Missier, P., Moreau, L., "Constraints of the PROV Data Model," World Wide Web Consortium (W3C), 2013.

[12] T. Lebo, Sahoo, S.S., McGuinness, D., "PROV-O: The PROV Ontology," World Wide Web Consortium (W3C), 2013.

[13] C. Goble, "Position Statement: Musings on Provenance, Workflow and (Semantic Web) Annotations for Bioinformatics," in Workshop on Data Derivation and Provenance, Chicago, 2002.

[14] P. Hitzler, Krötzsch, M., Parsia, B., Patel-Schneider, P.F., Rudolph, S., "OWL 2 Web Ontology Language Primer," World Wide Web Consortium (W3C), 2009.

[15] S. S. Sahoo, Nguyen, V., Bodenreider, O., Parikh, P., Minning, T., Sheth, A.P., "A unified framework for managing provenance information in translational research.," BMC Bioinformatics, vol. 12, 2011.

[16] P. Rao, Moon, B., "Locating XML Documents in a Peer-to-Peer Network Using Distributed Hash Tables," IEEE Transactions on Knowledge and Data Engineering, (TKDE), vol. 21, pp. 1737-1752, 2009.

[17] D. Pal, Rao, P., "A Tool For Fast Indexing and Querying of Graphs," in Proceedings of 20th International World Wide Web Conference (WWW 2011), Hyderabad, India, 2011, pp. 241-244.

[18] V. Slavov, Katib, A., Rao, P., "A Tool for Internet-Scale Cardinality Estimation of XPath Queries over Distributed Semistructured Data," presented at the Proceedings of the 30th IEEE International Conference on Data Engineering (ICDE 2014), Chicago, 2014.

[19] D. J. Abadi, A. Marcus, S. R. Madden, K. Hollenbach, “Scalable Semantic Web Data Management Using Vertical Partitioning,” in Proc. of the 33rd VLDB Conference, 2007, pp. 411-422.

[20] T. Neumann, G. Weikum, “RDF-3X: a RISC-style engine for RDF,” in Proceedings of the VLDB Endowment 1 (1) (2008) 647-659.

[21] L. Zou, J. Mo, L. Chen, M. T. Ozsu, D. Zhao, “gStore: Answering SPARQL queries via subgraph matching,” Proc. VLDB Endow. 4 (2011) 482-493.

[22] M. A. Bornea, J. Dolby, A. Kementsietsidis, K. Srinivas, P. Dantressangle, O. Udrea, B. Bhattacharjee, “Building an efficient RDF store over a relational database,” in Proc. of 2013 SIGMOD Conference, 2013, pp. 121-132.

[23] P. Yuan, P. Liu, B. Wu, H. Jin, W. Zhang, L. Liu, “TripleBit: A fast and compact system for large scale RDF data,” in Proc. VLDB Endow. 6 (7) (2013) 517-528.

[24] J. Huang, D. J. Abadi, K. Ren, “Scalable SPARQL querying of large RDF graphs,” in Proc. of VLDB Endow. 4 (11) (2011) 1123-1134.

[25] K. Zeng, J. Yang, H. Wang, B. Shao, Z. Wang, “A distributed graph engine for Web Scale RDF data,” Proc. VLDB Endow. 6 (4) (2013) 265-276.

[26] M. Atre, V. Chaoji, M. J. Zaki, J. A. Hendler, “Matrix "Bit" loaded: A scalable lightweight join query processor for RDF data,” in Proc. of the 19th WWW Conference, 2010, pp. 41-50.

[27] C. Weiss, P. Karras, A. Bernstein, “Hexastore: Sextuple indexing for Semantic Web data management,” in Proc. VLDB Endow. 1 (1) (2008) 1008-1019.

[28] M. Hammoud, D. A. Rabbou, R. Nouri, S.M.R. Beheshti, S. Sakr, “DREAM: Distributed RDF Engine with Adaptive Query Planner and Minimal Communication,” Proc. VLDB Endow. 8 (6) (2015) 654-665.

[29] N. Papailiou, D. Tsoumakos, I. Konstantinou, P. Karras, N. Koziris, “H2RDF+: An Efficient Data Management System for Big RDF Graphs,” in Proc. of the 2014 SIGMOD Conference, Snowbird, Utah, 2014, pp. 909-912.

[30] S. Gurajada, S. Seufert, I. Miliaraki, M. Theobald, "TriAD: A Distributed Shared-nothing RDF Engine Based on Asynchronous Message Passing," in Proc. of the 2014 SIGMOD Conference, Snowbird, Utah, 2014, pp. 289-300.

[31] V. Slavov, A. Katib, P. Rao, S. Paturi, D. Barenkala, "Fast Processing of SPARQL Queries on RDF Quadruples," in Proceedings of the 17th International Workshop on the Web and Databases (WebDB 2014), Snowbird, UT, 2014.