Data provenance tracks the sources and history of empirical data, including details about its original collection and measurement uncertainty, any subsequent transformations or selections (cleaning, outlier removal, etc.), and any computations performed to make the data suitable for comparison or further analysis. It also records how and by whom the data were collected, who funded the collection, the subsequent chain of custody, who owned and now owns the data, and any licensing information. Data provenance is fundamental to understanding the import and utility of empirical data, and its careful cultivation and curation are essential to the reproducibility of analyses and the validation of scientific conclusions. It is a necessary component, perhaps along with blockchain methods, of any strategy to check, validate, authenticate or otherwise establish the trustworthiness of data and of results based on them. Ellison et al. (2020) describe research on a machine-readable version of data provenance, including software tools that should be helpful for this purpose.
In principle, every datum should be traceable for review purposes via its provenance. Likewise, any calculation, inference or conclusion that employs a datum as evidence should be updatable by revisiting the datum's value and retracing the steps of the calculation. Thus, any collation or unification of disparate data or datasets implies a merger of their respective provenances. In some cases the merger is straightforward and the details can be unified directly; for instance, if the data were collected in the same measurement units with the same uncertainty, these characteristics are simply shared by the unified data set. If one value was measured in metres and another in centimetres, the unification must select a common unit and convert one or both of the values. Such a change should of course be recorded in the provenance of the unified data set, in part via references to the provenances of the constituent data, as in the sketch below.
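This rough Python sketch of such a merger assumes hypothetical DatumRecord and conversion-table structures (not from any existing library); the point is only that each unit conversion is appended to the constituent provenance.

    from dataclasses import dataclass

    @dataclass
    class DatumRecord:               # hypothetical record type, for illustration
        value: float
        units: str
        provenance: list             # free-text provenance entries

    TO_METRES = {"m": 1.0, "cm": 0.01}    # assumed conversion factors into metres

    def unify(a: DatumRecord, b: DatumRecord, target: str = "m") -> list:
        """Convert both records to a common unit, recording each step so the
        unified data set references the provenances of its constituents."""
        unified = []
        for rec in (a, b):
            factor = TO_METRES[rec.units] / TO_METRES[target]
            converted = rec.value * factor
            unified.append(DatumRecord(
                value=converted,
                units=target,
                provenance=rec.provenance + [
                    f"converted {rec.value} {rec.units} to {converted} {target}"],
            ))
        return unified

    a = DatumRecord(1.85, "m", ["survey A, 2021 (illustrative)"])
    b = DatumRecord(172.0, "cm", ["survey B, 2019 (illustrative)"])
    print(unify(a, b))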
Several of our projects have flirted with tracking data provenance, including Constructor, which elaborates the notion of input justification (q.v.), and the Statistics Show, which adds the wrinkle of automatically testing data for adherence to the assumptions needed for a test or analysis. The ideas of justifying input parameters and of automatically testing assumptions against data address provenance from the perspective of a software system that needs to access and marshal data from available data sets for use in a particular analysis. The more basic elements of a scheme that tracks data from its origins, focusing directly on data-collection issues, are being addressed in our most recent work under the DigiTwin programme grant, which considers data provenance as it originates in data collection from sensors.
The material below should probably be moved to and integrated with the Input justification page. It should be replaced by the ideas of Marco De Angelis and Mattia <<>> on recording and tracking sensor features along with their reported values through a calculation stream.
Input Parameter class
More useful information about inputs, embodied by the rust-coloured fields below and the dialogue boxes further down, might either be added to the Number class or used to define an Input Parameter class that inherits from the Number class (or is it the other way around?). Possible other information includes the following (a code sketch follows this list):
Original: where the value came from (a URL, or the value as originally entered by a user)
Interpreted: the value(s) that will be used in the analysis
Uncertainty
Ensemble
Nature
Informant
Justification
References
History
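This minimal sketch renders the fields above as a Python class; all names are illustrative assumptions, and whether it inherits from the Number class (or vice versa) is left open, so no inheritance is shown.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class InputParameter:                  # might instead inherit from Number
        original: str                      # URL, or the value as originally entered
        interpreted: float                 # value(s) actually used in the analysis
        uncertainty: Optional[str] = None
        ensemble: Optional[str] = None     # statistical population or ensemble
        nature: Optional[str] = None       # nature of the estimate or observation
        informant: Optional[str] = None    # who characterised the uncertain number
        justification: Optional[str] = None
        references: List[str] = field(default_factory=list)
        history: List[str] = field(default_factory=list)   # prior values and edits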
These fields were very useful in the Constructor software (see attached document), and users considered them valuable. It might be reasonable to support their entry and storage as part of input files, if not for every number internally. They are extremely helpful for documenting and maintaining the provenance of data and parameter inputs. They do not need to be used in calculations, except perhaps History, if that field is kept. The infrastructure should create self-documenting input files that keep tabs on
– Description of the variable, including units
– Who characterised the uncertain number
– The nature of the estimate or observation
– What the statistical population or ensemble is
– What the reason or argument was
– What the relevant references are
– What or where the supporting data are
and other relevant documentary or licensing information needed by engineers, reviewers, data or process auditors, etc., as in the sketch below.
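This rough sketch shows how a self-documenting input-file entry might be serialised as JSON; all keys and values are invented placeholders, not a settled schema.

    import json

    entry = {
        "description": "pipe wall thickness [mm]",      # illustrative only
        "informant": "J. Smith (inspection team)",      # who characterised it
        "nature": "measured, ultrasonic gauge",
        "ensemble": "repeated measurements",
        "justification": "minimum of 12 readings at corrosion-prone welds",
        "references": ["Inspection report 2023-117 (placeholder)"],
        "data": "thickness_readings.csv",               # supporting data
        "license": "internal use only",
        "original": "8.2 mm",                           # as entered by the user
        "interpreted": 8.2,                             # value used in the analysis
    }
    print(json.dumps({"wall_thickness": entry}, indent=2))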
The Justification editor dialogue (below) appears when a user double-clicks on any input parameter specified in the DigiTwin GUI to enter or view a justification for that input. The dialogue's fields and their labels could be designed, or augmented, entirely flexibly, e.g., via JSON data structures. The input fields on the main GUI might show a user's entries on a pale yellow background whenever an entry has been made without any justification; double-clicking on the input field would open the Justification editor dialogue, and once an entry is justified the pale yellow highlighting reverts to white. This lets a user see at a glance whether inputs are well documented. In principle, all of the fields in the Justification editor dialogue should permit arbitrary strings, so the user can give natural-language descriptions rather than merely selecting from listed options (which are only suggested possible entries), and can even leave a field empty, even if that induces ambiguity. The entries use RichText formatting to support italics, boldfacing, colours, OEM characters, superscripts, etc. Maybe it needs spell checking too. A sketch of the highlighting rule and a flexible field specification follows.
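This sketch assumes a hypothetical per-parameter justification attribute and a JSON-style field specification; GUI-toolkit details are omitted.

    from typing import Optional

    PALE_YELLOW = "#FFFFCC"    # entry made without any justification
    WHITE = "#FFFFFF"          # entry has been justified

    def field_background(justification: Optional[str]) -> str:
        """Pale yellow until the input is justified, then white."""
        return WHITE if justification else PALE_YELLOW

    # Dialogue fields defined flexibly via a JSON-style structure (an assumed
    # layout, not an actual DigiTwin schema). Every field permits arbitrary
    # strings; 'suggestions' are only suggested possible entries.
    dialogue_spec = [
        {"label": "Nature", "suggestions": ["measured", "modelled", "expert opinion"]},
        {"label": "Informant", "suggestions": []},
        {"label": "Justification", "suggestions": []},
        {"label": "References", "suggestions": []},
    ]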
The Ensemble field is not depicted in the dialogues shown below. Its optional values could perhaps include
Temporal steps
Spatial sites
Manufactured components
Repeated measurements
Spatial and temporal variation
etc.
The ensemble would be an important field for serious uncertainty analysis, and checking these entries could be used to validate an analysis in the same way that units conformance can. Adding spatial variation to temporal variation may make no more sense than adding metres to seconds. Larry Barnthouse has an argument/presentation about this, and maybe a paper. A sketch of such a conformance check follows.
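This minimal sketch assumes the ensemble categories listed above and an illustrative rule that quantities from different ensembles cannot be combined; the rule itself is an assumption, not a settled design.

    from enum import Enum

    class Ensemble(Enum):
        TEMPORAL_STEPS = "temporal steps"
        SPATIAL_SITES = "spatial sites"
        MANUFACTURED_COMPONENTS = "manufactured components"
        REPEATED_MEASUREMENTS = "repeated measurements"
        SPATIOTEMPORAL = "spatial and temporal variation"

    def check_conformance(a: Ensemble, b: Ensemble) -> None:
        """Refuse to combine quantities whose ensembles differ, much as a
        units checker refuses to add metres to seconds."""
        if a is not b:
            raise ValueError(f"ensemble mismatch: {a.value} vs {b.value}")

    check_conformance(Ensemble.SPATIAL_SITES, Ensemble.SPATIAL_SITES)    # passes
    # check_conformance(Ensemble.SPATIAL_SITES, Ensemble.TEMPORAL_STEPS) # raises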