Talk abstracts

Invited talks:

James Cheney, University of Edinburgh, Language-based incremental computation

In this talk I'll outline work (mostly by others) on incremental, self-adjusting and bidirectional computation. By incremental computation I mean techniques for propagating changes to the input through a program to obtain (efficient) changes to the output. Self-adjusting computation (in this talk) means efficiently recomputing the output after changes to the input, using caching to avoid recomputing subexpressions whose results have not changed. Finally, bidirectional computation means updating the input to a computation to be consistent with a proposed new output: a generalization of the view update problem. All three techniques have been studied extensively in the programming languages community and have potentially interesting connections to provenance. I'll finish the talk by outlining a recent contribution to incrementalizing view updates to relational databases, which may be of interest for computing "missing answer" or hypothetical explanations in database settings.
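
To make the incremental idea concrete, the following minimal Python sketch (an illustration only, not the contribution mentioned in the talk) propagates an insertion into one input relation of a join view as a delta instead of recomputing the view from scratch; the relations R and S and the helper functions are made-up examples.

    # Minimal sketch of incremental view maintenance for the join view
    # V = R JOIN S (on the first attribute). Rather than recomputing V
    # from scratch, an insertion into R is propagated as a delta:
    #   delta(V) = delta(R) JOIN S.

    def join(r_tuples, s_tuples):
        """Full (re)computation: join on the first attribute."""
        return {(a, b, c) for (a, b) in r_tuples for (a2, c) in s_tuples if a == a2}

    def insert_into_r(r_tuples, s_tuples, view, new_r_tuple):
        """Incremental update: only the new tuple is joined against S."""
        r_tuples.add(new_r_tuple)
        a, b = new_r_tuple
        delta = {(a, b, c) for (a2, c) in s_tuples if a == a2}
        view |= delta              # propagate the change to the output
        return delta

    R = {(1, "x"), (2, "y")}
    S = {(1, "p"), (2, "q")}
    V = join(R, S)
    insert_into_r(R, S, V, (1, "z"))
    assert V == join(R, S)         # incremental result matches full recomputation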

Frank McSherry, ETH Zurich, CH, Differential dataflows

Differential dataflow is a recent framework for expressing scalable data-parallel computations, and for incrementally updating these computations as their inputs change. Notably, differential dataflow supports iteration without monotonicity requirements, opening up the ability to express and incrementally maintain a substantially larger class of algorithms than one can with directed acyclic dataflow graphs. In this talk I'll develop the mathematics underlying differential dataflow, and give an overview of the implementation techniques used to make it all work out. We will tour some non-trivial algorithms, and tie everything back to provenance as we see how differential dataflow can be used to explain its own computations, interactively. This talk reflects joint work with a great many people, including what was once the Naiad team at MSR-SV, the Systems Group at ETH Zürich, and many other collaborators.
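
The following minimal Python sketch (an illustration, not the differential dataflow implementation) shows the difference-based bookkeeping such systems build on: a collection is described by batches of (record, multiplicity-change) differences, and an aggregate over it is maintained by folding in each batch; the partially ordered timestamps that make iteration possible are deliberately omitted here.

    # Minimal sketch of difference-based maintenance: a per-key count is
    # kept up to date by applying each new batch of differences, rather
    # than recomputing over the full collection.
    from collections import Counter

    def update_counts(counts, diffs):
        """Apply a batch of (key, multiplicity-change) differences to a
        maintained per-key count; returns the keys whose count changed."""
        changed = set()
        for key, delta in diffs:
            counts[key] += delta
            changed.add(key)
            if counts[key] == 0:
                del counts[key]    # drop keys that no longer appear
        return changed

    counts = Counter()
    update_counts(counts, [("a", +1), ("b", +1), ("a", +1)])   # version 0
    update_counts(counts, [("a", -1), ("c", +1)])              # version 1
    assert counts == Counter({"a": 1, "b": 1, "c": 1})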

Participants' contributions (short talks):

Melanie Herschel, Universität Stuttgart, Germany, Incremental Recomputation in Data Integration

Data integration generally aims at combining data coming from various, heterogeneous sources into some unified representation. Performing data integration requires various data engineering tasks, e.g., extracting relevant data from the sources, data scrubbing, entity resolution, data fusion, data restructuring, and data filtering. This results in complex data integration processing pipelines. In many scenarios, these are developed and maintained incrementally.

In this context, we argue that incremental recomputation may be beneficial to speed up the incremental development process, by computing and observing the results of a change to the pipeline more quickly and by performing what-if analysis before actually executing a new version of the pipeline. After introducing provenance capture for data integration, we discuss whether and how provenance plays a role in supporting such incremental recomputation.

Marta Mattoso & Renan Souza, COPPE-UFRJ, Brazil, Collecting Provenance of Steering Actions

In long-lasting scientific workflow executions in HPC machines, computational scientists (or simply users) often need to fine-tune several workflow parameters. These tunings are done through user steering actions that may significantly improve performance (e.g., reduce execution time) or improve the overall results. However, if the tunings are not properly registered, users can lose track of what has been adapted. We discuss using provenance data management to address the problem of tracking online parameter fine-tuning in dynamic workflows steered by users. We developed a lightweight approach to capture and manage provenance of the steering actions online with negligible overhead. The resulting provenance database relates user steering data with domain, provenance, execution, and performance data, and is available for analysis at runtime. We discuss challenges and show how users may get a global view of the execution behaviour, providing insights to determine when and how to tune. We present the applicability of our solution in the Oil and Gas domain. With this steering action provenance database, the user can determine which tuned parameters influenced simulation accuracy and performance.

Paul Groth, Elsevier Labs, NL, Progressive Provenance Capture Through Re-computation

Provenance capture relies upon instrumentation of processes (e.g. probes or extensive logging). The more instrumentation we can add to processes, the richer our provenance traces can be, for example, through the addition of comprehensive descriptions of steps performed, mapping to higher levels of abstraction through ontologies, or distinguishing between automated and user actions. However, this instrumentation has costs in terms of capture time and overhead, and it can be difficult to ascertain upfront what should be instrumented. In this talk, I'll discuss our research on using record-replay technology within virtual machines to incrementally add provenance instrumentation by replaying computations after the fact.

Paolo Missier, Newcastle University, The ReComp project: an overview

Complex and computationally expensive processes are common in science and engineering, as well as in Data Science. The outcomes of such processes are often time-sensitive, as they depend on algorithms, tools, and reference databases that evolve over time, often independently of one another. This suggests that some of the outcomes may become invalid and require refreshing. However, these computations can be expensive, consuming tens of CPU hours each, and not all past instances are equally affected by all changes. The ReComp project investigates general methods for optimising re-computation of workflow-based processes in response to changes in the underlying reference data. We address this problem by systematically collecting workflow-level provenance about the process execution, its inputs, outputs, and dependencies, as well as performance metrics. Over time, these records form an ever-growing history database that we can then use to assess the future impact of observed changes and to optimise the re-computation. ReComp is demonstrated on two case studies, namely in Genomics (human variant calling and interpretation) and resource-intensive simulations of urban flood events. The short talk will provide an overview. Further details can be found on the project site.

Folker Meyer, Argonne National Labs, Provenance and recomputing in the realm of large scale environmental sequence analysis

I will talk about the need for provenance and recomputing in the realm of large-scale environmental sequence analysis, taking the perspective of a large-scale analysis provider (MG-RAST, https://mg-rast.org). MG-RAST caters to tens of thousands of scientists from many domains. Buoyed by dramatically falling DNA sequencing costs, the entire field of environmental genomics is undergoing rapid expansion in data set sizes and numbers, and in the resulting demand for analysis. The current state of the field is characterized by uncertainty about in-silico analysis results as well as the need to update results based on updated knowledge in the underlying knowledge databases.

Khalid Belhajjame, University Paris-Dauphine: Answering Why-Not queries Against Scientific Workflow Provenance.

Why-not queries help scientists understand why a given data item was not returned by the executions of a given workflow. While answering such queries has been investigated for relational databases, there is only one proposal in this area for workflow provenance, viz. the Why-Not algorithm. This algorithm makes the assumption that the modules implementing the steps of the workflow preserve the attributes of the input datasets. This is, however, not the case for all workflow modules. We drop this assumption, and show how the Web can be harvested to answer why-not queries against workflow provenance. The approach we propose requires the recomputation of some parts of the workflow, or, more specifically, the invocation of some of the workflow modules.

Paweł Gora, University of Warsaw, Poland: TensorCell - approximating outcomes of computer simulations using machine learning algorithms

In the talk, I will present the TensorCell project, which aims to approximate the outcomes of computer simulations using machine learning algorithms, in particular deep neural networks. The first successful application of this technique is to microscopic traffic simulations computing the total time spent waiting at red signals as a function of traffic signal settings. Experiments showed that it is possible to approximate the outcomes of such simulations using neural networks with an average error at the level of 1.2%, while obtaining results a few orders of magnitude faster than running the simulations, which may have important applications in real-time traffic management. A similar technique may be applied, e.g., in medicine, to compute the evolution of cancer cells with the aim of optimizing the cancer treatment process. It may also find applications in bioinformatics and many other domains.
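
The general surrogate-modelling technique can be sketched as follows in Python; the simulate function below is a toy stand-in for the expensive traffic simulator, and the use of scikit-learn's MLPRegressor is an assumption made for illustration, not a description of the TensorCell setup.

    # Minimal sketch of surrogate modelling: run the expensive simulator
    # offline to build a training set of (settings, outcome) pairs, fit a
    # neural-network regressor, then evaluate the cheap regressor instead
    # of the simulator for new parameter settings.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def simulate(signal_settings):
        """Toy stand-in for an expensive traffic simulation: total waiting
        time as an arbitrary nonlinear function of the signal settings."""
        return np.sum(np.sin(signal_settings) ** 2 + 0.1 * signal_settings)

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 90, size=(2000, 10))        # candidate signal settings
    y = np.array([simulate(x) for x in X])         # expensive offline runs

    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
    model.fit(X, y)

    x_new = rng.uniform(0, 90, size=(1, 10))
    print("surrogate:", model.predict(x_new)[0], "simulator:", simulate(x_new[0]))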

Roly Perera, University of Glasgow: Self-Explaining Computation with Explicit Change

Interactive notebooks such as Jupyter and Observable interleave data, computation and visualisations with human-authored narrative. Related developments include tools for data-driven journalism like The Gamma, new forms of digital dissemination like the online journal Distill, and “explorable explanations”, computational documents which allow end users to manipulate the parameters of an underlying model.

These new tools and proofs-of-concept reflect an emerging new role for computation: as a literate medium for expressing and communicating technical ideas, experimental results, methods, arguments and scenarios.

I will briefly present some of the recent efforts in this area, discuss some limitations of existing approaches, and then try to motivate a research project which addresses these limitations using ideas from self-explaining computation and incremental computation. The working hypothesis is that for computations to fulfil their potential as human-readable explanations, we need to make their granular structure explicit, so that users can explore how parts of a data set or model relate to parts of the outcome. The guiding principle is captured by the slogan “the explanation of a change is a change in an explanation”.

Bertram Ludascher, UIUC, USA, Incremental Recomputation: Those who cannot remember the past are condemned to recompute it

Incremental recomputation has applications, e.g., in databases and workflow systems. Methods and algorithms for recomputation depend on the underlying model of computation (MoC) and model of provenance (MoP). This relation is explored with some examples from databases and workflow systems.

Tomasz Pawlowski, Jacek Sroka, University of Warsaw, Poland: Handling late data in process mining algorithms

There are many cases where data is produced constantly and processed as a stream, including web page click streams, stock market events, and transaction systems. Much research interest has been devoted to processing such streams, and interesting algorithms have been developed to compute or approximate statistics, such as frequency moments, over values observed in large windows that do not fit in memory.

At the same time, many process mining techniques have been developed to analyze and visualize processes that repeat at high scale. For instance, in many hospitals, information about treatments is registered (date, time, performed action, medical staff involved) for financial reasons, as well as for analysis of treatment outcomes.

Yet, most process mining tools operate offline, on static event logs. There exist many use cases where online analysis of streams of logs may lead to interesting results. An example would be an online game log showing the order in which users complete assignments, e.g., that after A they always do B and C, with the order of B and C arbitrary, while C is always subsequently followed by D. Such process models can be analyzed for inefficiencies and anomalies.

Many interesting problems arise from attempts to apply process mining algorithms to streams of event logs. In this presentation we focus on late data arrival, which occurs when the processing time of some events differs significantly from their occurrence time, e.g., when a user plays the game while offline and connects to the server afterwards. Adapting process mining algorithms to handle such late data can result in models of increased quality. We investigate how late data can be handled in various process mining algorithms without repeating the whole process discovery from scratch. We also investigate how to keep track of the evolution of the process over time.
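
As an illustration (not the algorithms investigated in the talk), the following Python sketch absorbs a late event into an incrementally maintained directly-follows relation by adjusting only the counts around its insertion point, instead of re-mining all traces; the event data and helper names are made up.

    # Minimal sketch of absorbing a late event into an incrementally
    # maintained directly-follows relation, a basic structure behind many
    # process-discovery algorithms. Only the counts around the late
    # event's insertion point are adjusted.
    from bisect import insort
    from collections import Counter

    dfg = Counter()      # (activity, next activity) -> observed frequency
    traces = {}          # case id -> list of (timestamp, activity), kept sorted

    def add_event(case, timestamp, activity):
        trace = traces.setdefault(case, [])
        insort(trace, (timestamp, activity))
        i = trace.index((timestamp, activity))
        # a late event splits an existing directly-follows pair: retract it
        if 0 < i < len(trace) - 1:
            dfg[(trace[i - 1][1], trace[i + 1][1])] -= 1
        # add the pairs the (possibly late) event introduces
        if i > 0:
            dfg[(trace[i - 1][1], activity)] += 1
        if i < len(trace) - 1:
            dfg[(activity, trace[i + 1][1])] += 1

    add_event("case1", 1, "A")
    add_event("case1", 3, "C")
    add_event("case1", 4, "D")
    add_event("case1", 2, "B")    # late event, arriving after C and D
    assert dfg[("A", "B")] == 1 and dfg[("B", "C")] == 1 and dfg[("A", "C")] == 0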

Ashish Gehani, SRI: Supporting Incremental Re-Computation with Whole System Provenance: Issues and Approaches

Incremental re-computation has been supported in a range of systems, from database query engines to software build systems. Typically, results are cached and reused, avoiding the need to repeat sub-executions when the inputs remain the same. Incorporating provenance has the potential to further optimize this process, by reducing the fraction of computation that needs to be performed again.
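
The cache-and-reuse pattern can be sketched as follows in Python; the step names and digest scheme are illustrative assumptions, not a description of any particular build or query system.

    # Minimal sketch of cache-and-reuse: each step is keyed by a digest of
    # its name and inputs, and a re-run skips any step whose key is
    # unchanged since the previous run.
    import hashlib, json

    cache = {}   # digest of (step name, inputs) -> previously computed output

    def run_step(name, inputs, compute):
        key = hashlib.sha256(json.dumps([name, inputs], sort_keys=True).encode()).hexdigest()
        if key in cache:
            return cache[key]        # inputs unchanged: reuse the cached result
        result = compute(inputs)     # inputs changed (or first run): recompute
        cache[key] = result
        return result

    print(run_step("normalize", [3, 1, 2], sorted))      # computed
    print(run_step("normalize", [3, 1, 2], sorted))      # reused from cache
    print(run_step("normalize", [3, 1, 2, 4], sorted))   # input changed: recomputed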

Whole system provenance promises to facilitate re-computation of arbitrary application workloads. A variety of techniques can be utilized to collect such provenance, ranging from invasive but performant techniques, such as in-kernel logging, to just-in-time approaches, such as dynamic binary instrumentation. Typically, such systems provide sound inferences about provenance, but emit a limited record of activity. The gaps in coverage can introduce challenges when attempting to ensure that every past computation is completely reproducible.

We consider a number of practical issues that arise in this context, and outline approaches to address them. The challenges include ephemeral intermediate artifacts, conflated causality, dynamic runtime environments, and external dependencies.

Pramod Bhatotia, University of Edinburgh: The Marriage of Incremental and Approximate Computing

Incremental and approximate computations are increasingly being adopted for data analytics to achieve low-latency execution and efficient utilization of computing resources. Incremental computation updates the output incrementally instead of re-computing everything from scratch for successive runs of a job with input changes. Approximate computation returns an approximate output for a job instead of the exact output. Both paradigms rely on computing over a subset of data items instead of computing over the entire dataset, but they differ in their means for skipping parts of the computation. Incremental computing relies on the memoization of intermediate results of sub-computations, and reusing these memoized results across jobs. Approximate computing relies on representative sampling of the entire dataset to compute over a subset of data items.

In this work, we observe that these two paradigms are complementary, and can be married together! Our idea is quite simple: design a sampling algorithm that biases the sample selection to the memoized data items from previous runs. To realize this idea, we designed an online stratified sampling algorithm that uses self-adjusting computation to produce an incrementally updated approximate output with bounded error. We implemented our algorithm in a data analytics system called IncApprox based on Apache Spark Streaming. Our evaluation shows that IncApprox achieves the benefits of both incremental and approximate computing.
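
The core idea can be sketched as follows in Python; this is an illustration of memoization-biased stratified sampling over made-up strata, not the IncApprox implementation.

    # Minimal sketch of stratified sampling biased toward memoized items:
    # within each stratum, items whose sub-results were cached in a
    # previous run are preferred, and the sample is topped up with fresh
    # items, so the approximate answer reuses as much prior work as possible.
    import random

    def biased_stratified_sample(strata, memoized, per_stratum):
        """strata: {stratum -> list of items}; memoized: set of items with
        cached sub-results; returns per_stratum items from each stratum."""
        sample = []
        for items in strata.values():
            cached = [x for x in items if x in memoized]
            fresh = [x for x in items if x not in memoized]
            take_cached = cached[:per_stratum]
            need = per_stratum - len(take_cached)
            sample += take_cached + random.sample(fresh, min(need, len(fresh)))
        return sample

    strata = {"mobile": list(range(0, 100)), "desktop": list(range(100, 200))}
    memoized = set(range(0, 10)) | set(range(100, 105))   # cached in the last run
    print(biased_stratified_sample(strata, memoized, 8))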