2nd Provenance-based Security Workshop


Today, stakeholders are typically unable to understand the chain of events that cause certain outcomes in complex systems. For example, in computer systems, it is typically difficult to understand why certain computations are run, or why certain artifacts such as files are modified, deleted, or exfiltrated. This lack of transparency often results in a lack of situational awareness on the part of stakeholders responsible for assuring confidentiality, integrity, or availability of such systems. Such lack of awareness makes it easy for cyber-criminals to establish and persist presence in such systems, and leverage that persistence to perform actions detrimental to system goals. Current tools for securing complex systems, such as event logs and audit trails, can offer some partial information on temporally and spatially localized events as seen from the viewpoint of individual applications, but generally do not help with detection of such persistent threats. From temporally or spatially local views, actions of such persistent adversaries typically do not appear suspicious.

One emerging way to detect persistent threats in complex systems is to construct graphs of provenance for system data and control flows, and then explore those graphs to identify global or long-term patterns of causality that differ from behaviors commonly expected or observed. Divergence from expectations may be identified by first formulating expectations based on human domain expertise, and then comparing observations against such patterns. Divergence from common observations may be identified by formulating automated detectors for anomalous causality patterns. Other techniques for identifying divergence of observed behavior from "normal" behavior may apply as well.

The 2nd Provenance-based Security Workshop aims to bring together researchers in the provenance and security domains to

  • share recent results in practical or emerging theoretical provenance-based detection of suspicious behavior patterns in complex systems
  • identify open problems in provenance analysis that relate to shortfalls currently found in those results
  • share resources that may foster further research in addressing these open problems
  • and build collaborative relationships for future work on those problems


  • Dr. David Archer, Galois, Inc. (dwa@galois.com)
  • Dr. Thomas Pasquier, University of Cambridge (tfjmp2@cam.ac.uk)
  • Dr. Thomas Moyer, UNC Charlotte (tom.moyer@uncc.edu)

Workshop Format

  • up to 3 hours: a series of report-outs on recent results, emphasizing (1) novel approaches in identifying unexpected or unusual patterns in causality that correspond to suspicious intent or behavior, and (2) experimental successes and failures in finding such patterns, measured against realistic or real behavior / sensor data
  • 1 hour: an ideation session to surface and characterize open problems that are recurring themes in the above report-outs
  • 1 hour: one or more working sessions to develop research agendas and identify useful data resources to address those problems
  • 45 minutes: a group report-out and identification of next steps and means to address those agendas

Submission Procedures

Please submit plain text abstracts of about half of one page to the workshop organizers at prov-based-sec18@galois.com. Multiple submissions for different experiences and/or requirements are welcome.


  • 30 April 2018 : Deadline for abstract submissions
  • 14 May 2018 : Program published and speaker notifications
  • 29 May 2018 : Expression of interest to attend
  • 1 July 2018 : Slides due to organizers
  • 13 July 2018 : Workshop


In order to register, please fill out the Google form here. Please note that you must register for ProvenanceWeek before you can complete your registration for the workshop.


10:00 Introductions

10:00-10:30 Securing Bioinformatic Pipeline Frameworks in the Cloud for Personalized Medicine, Isabelle Perseil, Inserm, Paris

Abstract: Personalized medicine requires high throughput sequencing data processing pipelines/workflows in a care setting. In the cloud, these pipelines process shared sensitive genomic data, which are primary data used in the analysis. In order to ensure reproducibility of the pipelines and to provide repeatable results, input data annotations and access to primary data are necessary.

Because genomic data require security and privacy properties to be satisfied, secure, controlled access to these data must be provided.

We will discuss the current means for sharing and processing genomic data along the common pipelines used for establishing personalized medicine, focusing in particular on variant calling pipelines.

We will also discuss how these pipelines can be reused and how we can model the provenance of sensitive data. Finally, we will define a new approach for securing a common data provenance model, from acquisition of biological material, through its processing and storage, to data generation and analysis.

10:30-11:00 Using Provenance for Detecting Database Tampering, Alexander Rasin, DePaul University

Abstract: Database Management Systems (DBMSes) are tasked with keeping significant amounts of valuable and proprietary data; as a result, they are often targeted by cyber-criminals for the purposes of data tampering and data exfiltration. Audit logs, sometimes further combined with forensic artifact analysis, can be used to detect malicious activity. Some types of tampering, however, leave no usable audit or forensic artifacts. For example, an administrator may maliciously "back-date" the OS system time and then insert a false row with the backdated timestamp to alter the database state. Since the administrator has changed the system time outside the scope of the database, there are no forensic artifacts within the database system to detect this form of tampering. Audit logs will show the record inserted with the backdated time and also offer no forensic evidence. One way to detect this inconsistency is by considering the lineage of the database, as represented by an initial database state and the queries in the audit log that modify that state. If the observed state differs from the state that can be derived from the database's lineage, then there is evidence of tampering. In this talk, we will present use cases of database tampering and show when the lineage of a database state can efficiently detect such forms of tampering. We will discuss heuristic methods that can approximate the full (but expensive) replay of the database audit log and how forensic artifact analysis can further improve tampering detection in our approach.
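The lineage check described in this abstract can be illustrated with a minimal sketch. It is not the speakers' method: it performs the full (expensive) replay rather than any heuristic approximation, it does not model the back-dating scenario specifically, and the `payments` table, the audit-log format of `(statement, parameters)` pairs, and all function names are assumptions made for illustration. It uses Python's standard `sqlite3` module.

```python
import sqlite3

def replay_lineage(initial_sql, audit_log):
    """Rebuild the expected state: initial snapshot plus every logged statement."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(initial_sql)          # hypothetical initial database state
    for statement, params in audit_log:      # hypothetical audit-log format
        conn.execute(statement, params)
    conn.commit()
    return conn

def table_rows(conn, table):
    """Return the rows of a table as a set (table name is trusted in this sketch)."""
    return set(conn.execute(f"SELECT * FROM {table}").fetchall())

def find_tampering(expected_conn, observed_conn, table):
    """Compare the observed state with the state derivable from the lineage."""
    expected = table_rows(expected_conn, table)
    observed = table_rows(observed_conn, table)
    return {
        "unexplained": observed - expected,  # rows the lineage cannot account for
        "missing": expected - observed,      # rows the lineage says should exist
    }

INITIAL = """
CREATE TABLE payments (id INTEGER PRIMARY KEY, amount INTEGER, paid_on TEXT);
INSERT INTO payments VALUES (1, 100, '2018-01-05');
"""
AUDIT_LOG = [
    ("INSERT INTO payments VALUES (?, ?, ?)", (2, 250, "2018-02-01")),
]

expected = replay_lineage(INITIAL, AUDIT_LOG)
observed = replay_lineage(INITIAL, AUDIT_LOG)
# Simulate a change that never reached the audit log.
observed.execute("INSERT INTO payments VALUES (3, 900, '2017-12-31')")

report = find_tampering(expected, observed, "payments")
print(report)
```

Any non-empty set in the report is evidence that the observed state was not produced by the logged statements alone.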

11:00-11:30 Break

11:30-12:00 Keynote: Secure and Trustworthy Collection Architectures for Data Provenance

Professor Kevin Butler, University of Florida

Abstract: Data provenance, which traces the origin and subsequent modification of information as it is generated and processed, has been a long-standing research challenge. Ascertaining the provenance of data generated in cyber-physical and simulation environments has been seen as key to securing critical infrastructure and ensuring the integrity of data from areas such as climate science and drug trials. The key to ensuring that provenance is correctly managed is to ensure the security and fidelity of its initial collection. This talk will focus on research we have done to bring the promise of high-fidelity provenance collection to fruition, and to develop ways of applying provenance to cloud environments, across workflows to prevent data loss, and in industrial control environments.

Bio: Kevin Butler is an associate professor of Computer and Information Science and Engineering at the University of Florida, where he leads research in computer systems security within the Florida Institute for Cybersecurity Research. His work focuses on the security of systems and data, with a concentration on storage and embedded systems, mobile security and privacy, and cloud security. He also has interests in Internet security and applied cryptography.

Kevin received his Ph.D. in Computer Science and Engineering from the Pennsylvania State University in 2010, an M.S. in electrical engineering from Columbia University in 2004, and a B.Sc. in electrical engineering from Queen’s University at Kingston in 1999. He received the National Science Foundation CAREER award in 2013 and the Symantec Research Labs Graduate Fellowship in 2009. He was the conference chair for the 2017 IEEE Symposium on Security and Privacy and is TPC co-chair of ACM WiSec 2018. He is a Vice Chairman of the International Telecommunications Union’s Focus Group on Digital Financial Services, and co-founder and COO of CryptoDrop.

12:00-12:30 "It was the one-armed man, in the library, with the red herring" - or, what is the baseline effectiveness of process-centric attack detection?, Sidahmed Benabderrahmane, Ghita Berrada, James Cheney and Himan Mookherjee

Abstract: The hypothesis motivating provenance-based security analysis is that we can obtain greater insight into attack or intrusion behavior by analyzing a provenance graph capturing detailed information about process actions and causal relationships among processes and artifacts. To test this hypothesis, we should compare provenance-based techniques with what is possible based on simpler abstractions, for example by process-centric anomaly detection. Such a comparison is important for two reasons: first, it can help us to understand the cost-benefit tradeoff of detailed provenance recording compared with simpler and lighter-weight approaches; and second, it can help us to understand what types of behavior are easily detected via local analysis and what types of behavior require a more global (and costly) exploration of the provenance graph. To the best of our knowledge, no published results on provenance-based security compare with such a baseline.

In this talk, we present initial experiences with anomaly detection over provenance data using simple techniques for categorical or Boolean feature vectors. We first abstract the behavior of a process in a provenance trace as a Boolean feature vector. We can then apply a variety of existing unsupervised anomaly detectors. Such techniques include Attribute Value Frequency (AVF), Frequent Pattern Outlier Factor (FPOF), Outlier Degree (OD) and a number of others; we focus on the first three because they are easy to implement.
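As a rough illustration of the first of these detectors, the sketch below computes Attribute Value Frequency (AVF) scores over Boolean feature vectors: each vector's score is the average, over its attributes, of how many vectors in the data set share that attribute value, so low scores indicate unusual processes. The feature names and the five example processes are hypothetical and are not taken from the speakers' data.

```python
from collections import Counter

def avf_scores(vectors):
    """Return one AVF score per vector; lower scores are more anomalous."""
    if not vectors:
        return []
    m = len(vectors[0])
    # For each attribute position, count how often each value occurs.
    counts = [Counter(v[j] for v in vectors) for j in range(m)]
    return [sum(counts[j][v[j]] for j in range(m)) / m for v in vectors]

# Hypothetical features: (reads_file, writes_file, opens_socket, forks_child)
processes = [
    (1, 1, 0, 0),
    (1, 1, 0, 0),
    (1, 0, 0, 0),
    (1, 1, 0, 0),
    (0, 0, 1, 1),  # behaves unlike the others
]

scores = avf_scores(processes)
most_anomalous = min(range(len(processes)), key=lambda i: scores[i])
print(scores, most_anomalous)
```

AVF needs only one counting pass and one scoring pass over the data, which is part of why it suits a lightweight baseline.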

Based on our experiences so far, we believe that simple process-centric anomaly detection should be considered as a baseline with which to compare any more sophisticated method.

12:30-13:00 Dependence-Preserving Data Compaction for Scalable Forensic Analysis, Md Nahid Hossain, Junao Wang, R. Sekar, Scott D. Stoller

Abstract: We are developing a provenance-based approach and system for real-time detection and reconstruction of attacks on enterprise-scale IT systems. The foundation of our approach is a dependence graph abstraction of audit data, with nodes representing versions of entities (processes, files, network connections, etc.) and edges representing events. This graph captures the provenance of data associated with each entity. We use provenance to assess the trustworthiness and sensitivity of that data, and use these assessments for attack detection and forensic analysis, including source identification and impact analysis.

To enable rapid real-time attack detection and forensic analysis, we store the dependence graph for each host in main memory. This presents a significant scalability challenge. We meet this challenge using two techniques. First, we develop novel optimizations that eliminate redundant nodes and edges, while preserving dependence information needed for attack detection and forensic analysis. Specifically, we propose three notions of dependence preservation, which vary in which dependencies are preserved and consequently in how much reduction is possible. For each notion of dependence preservation, we present an efficient optimization algorithm that eliminates redundant nodes and edges as the graph is constructed. Second, we develop an extremely space-efficient main-memory representation for dependence graphs, and an associated space-efficient on-disk representation.

In experiments with audit data from a DARPA-led red team evaluation and from servers in our research lab, our source-dependence-preserving optimization achieved an average reduction of 9x in the number of events. The reduced graphs, stored in our space-efficient on-disk representation, are more than 50x smaller than the original audit logs, on average.
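The idea of dropping redundant events while preserving dependence can be sketched with one simplified rule. This is not one of the three notions of dependence preservation from the talk, only an assumed illustration: an event from `src` to `dst` is dropped when an earlier kept event already carried the same version of `src` to `dst`, where a node's version changes only when it receives new information. The event format and node names are hypothetical.

```python
def compact_events(events):
    """Drop events that add no new dependence.

    events: iterable of (src, dst) pairs in time order, each meaning
    that information flowed from src to dst. Returns the kept events.
    """
    version = {}    # node -> number of kept events that flowed into it
    last_seen = {}  # (src, dst) -> src version at the last kept src->dst event
    kept = []
    for src, dst in events:
        src_version = version.get(src, 0)
        if last_seen.get((src, dst)) == src_version:
            continue  # dst already depends on this version of src
        last_seen[(src, dst)] = src_version
        version[dst] = version.get(dst, 0) + 1  # dst now holds new information
        kept.append((src, dst))
    return kept

events = [
    ("sshd", "bash"),
    ("/etc/passwd", "bash"),
    ("/etc/passwd", "bash"),   # repeated read of an unchanged file
    ("bash", "/tmp/out"),
    ("bash", "/tmp/out"),      # repeated write with nothing new in bash
    ("evil.com:443", "bash"),  # bash receives new information
    ("bash", "/tmp/out"),      # this write carries a new dependence
]

kept = compact_events(events)
print(len(events), len(kept))
```

Because a dropped event never changes what its destination depends on, the set of ancestors of every node is the same in the compacted event list as in the original.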

13:00-14:00 Lunch

14:00-14:30 Advanced Persistent Threat Conversational Analysis, Jianqiao Zhu, Gabriela F. Ciocarlie, Ashish Gehani, Vinod Yegneswaran, Dongyan Xu, Xiangyu Zhang, Somesh Jha, Kyu Hyung Lee, Jignesh Patel

Abstract: In this talk, we focus on carrying out advanced persistent threat (APT) analyses using a conversation-based approach to data analytics. To achieve this goal, we use Ava, a chatbot developed by DataChat Inc, which provides a natural way of reusing and sharing pipelines for forensic analysis. The underlying data is extracted using our TRACE system, a framework that combines host-level tracking techniques with an enterprise-wide tracking system. The directional goal of our project is to make systems more transparent through fine-grained and pervasive tracking of provenance information across enterprise networks.

14:30-15:30 Group Discussion - DARPA TC

15:30-16:00 Break

16:00-17:00 Group Discussion - Transition to Practice, Research Gaps, Next Steps

17:00 Conclusion