ETL Provenance Vocabulary

Latest version:



Andre Freitas, DERI
Benedikt Kämpgen, KIT

Ed Curry, DERI
Sean O'Riain, DERI
J.G. Oliveira, Amtera

In a Nutshell

Cogs is an ETL Provenance Vocabulary which extends the workflow semantics provided by OPM and Prov-O, allowing the description of ETL processes and objects. The Cogs vocabulary can also be used to describe data transformations in general, outside the scope of ETL tools and practices. The core objective of the vocabulary is to improve the level of semantic interoperability of data transformation provenance descriptors, building upon the OPM and Prov-O standardization efforts.

Introduction & Motivation

The growing availability of data on the Web, provided by Web 2.0 applications and more recently through Linked Data, has caused the computational pattern known as ETL (extract, transform, load) to reemerge in a scenario with additional complexity, where the number of data sources and the data heterogeneity that ETL must support increase drastically. In this scenario, issues with data quality and trustworthiness may strongly impact the utility of the data for end users. The barriers involved in building an ETL infrastructure at the complexity and scale of Web-based data supply demand strategies that can provide data quality guarantees while minimizing the effort associated with data management.

In this context, provenance, the representation of the artifacts, processes and agents behind a piece of information, becomes a fundamental element of the data infrastructure. Provenance has a wide range of applications, including documentation and reproducibility, data quality assessment and trustworthiness, and consistency checking and semantic reconciliation. However, in an environment where data is produced and consumed by different systems, the representation of provenance must be interoperable across those systems.

Standardization efforts towards a common provenance model produced the Open Provenance Model (OPM). OPM provides a basic description of provenance which allows interoperability at the level of workflow structure. Defining this common ground allows systems with different provenance representations to share at least workflow-level semantics (the causal dependencies between artifacts and processes, and the intervention of agents). OPM, however, is not intended to be a complete provenance model; it requires the complementary use of additional provenance models to enable uses of provenance that require a higher level of semantic interoperability.

Cogs is an ETL Provenance Vocabulary which extends the workflow semantics provided by OPM and Prov-O, allowing the description of ETL processes and objects. The Cogs vocabulary can be used to describe data transformations in general, outside the scope of ETL tools and practices. The core objective of the vocabulary is to improve the level of semantic interoperability of data transformation provenance descriptors, building upon the OPM and Prov-O standardization efforts.  

An ETL Provenance Model

Cogs extends the workflow structure of OPMV (the Open Provenance Model Vocabulary) with a rich type structure. The ETL provenance model behind Cogs assumes three layers: the bottom layer is defined by the OPMV workflow structure, the middle layer consists of the elements of the Cogs vocabulary, and the top layer is domain-specific.

Cogs Provenance Model
Figure 1: The three-layered ETL provenance model.

The Cogs vocabulary is mainly defined by a taxonomy of around 150 classes. The large number of classes allows a rich description of ETL elements supporting an expressive ETL representation. Cogs also extends the workflow structure of OPMV with additional object properties targeting the creation and navigation of hierarchical workflow structures.
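
As a minimal sketch of how the hierarchical workflow properties might be navigated, the following Python example groups steps under their parent workflow. It models triples as plain tuples instead of using an RDF library, relies on the Cogs properties cogs:isPartOf, cogs:hasStartPoint and cogs:hasEndPoint, and uses shortened ex: identifiers and a helper name (describe_workflows) that are hypothetical and not part of the vocabulary.

```python
from collections import defaultdict

# Triples as (subject, predicate, object). The ex: names are shortened,
# hypothetical identifiers, not URIs defined by the vocabulary.
TRIPLES = [
    ("ex:selectRecord", "cogs:isPartOf", "ex:GenerateReport"),
    ("ex:convertCSVToRDF", "cogs:isPartOf", "ex:GenerateReport"),
    ("ex:insertTriples", "cogs:isPartOf", "ex:GenerateReport"),
    ("ex:aggregate", "cogs:isPartOf", "ex:GenerateReport"),
    ("ex:GenerateReport", "cogs:hasStartPoint", "ex:selectRecord"),
    ("ex:GenerateReport", "cogs:hasEndPoint", "ex:aggregate"),
]


def describe_workflows(triples):
    """Group steps under their parent workflow and record start/end points."""
    workflows = defaultdict(lambda: {"steps": [], "start": None, "end": None})
    for subject, predicate, obj in triples:
        if predicate == "cogs:isPartOf":
            # The object is the enclosing workflow, the subject one of its steps.
            workflows[obj]["steps"].append(subject)
        elif predicate == "cogs:hasStartPoint":
            workflows[subject]["start"] = obj
        elif predicate == "cogs:hasEndPoint":
            workflows[subject]["end"] = obj
    return dict(workflows)


summary = describe_workflows(TRIPLES)
print(summary["ex:GenerateReport"])
```

Keeping the parent workflow as the grouping key means nested workflows could be handled the same way, since a workflow that is itself part of a larger one simply appears both as a key and as a step.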

The vocabulary taxonomy is structured with high-level classes which are described below:

Execution: Represents the execution job (instance) of an ETL workflow. Examples of subclasses include AutomatedAdHocProcess and ScheduledJob.
State: Represents an observation of an indicator or status of one particular execution of an ETL process. These can range from execution states such as Running or Success to execution statistics, captured by the subclasses of the PerformanceIndicator class.
Extraction: Represents operations of the first phase of the ETL process, which involves extracting data from different types of sources. Parsing is an example subclass. cogs:Extraction is an opmv:Process.
Transformation: Represents operations in the transformation phase. This phase typically carries most of the semantics of the workflow, which is reflected in its large number of subclasses. Example classes are RegexFilter, DeleteColumn, SplitColumn, MergeRow, Trim and Round. cogs:Transformation is an opmv:Process.
Loading: Represents the operations of the last phase of the ETL process, when the data is loaded into the end target. Example classes are ConstructiveMerge and IncrementalLoad. cogs:Loading is an opmv:Process.
Object: Represents the sources and the results of the operations on the ETL workflow. These classes, such as ObjectReference, Cube or File, aim to give a more precise definition of opmv:Artifact (every cogs:Object is an opmv:Artifact) and, together with the types of the operations that are generating and consuming them, capture the semantics of the workflow steps.
Layer: Represents the different layers where the data can reside during the ETL process. PresentationArea and StagingArea are some of the subclasses.
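
The relation between the high-level classes and the OPMV workflow layer can be sketched in Python as a chain of subclass links. This is only an illustration: the links from Cogs classes to opmv:Process and opmv:Artifact follow the descriptions above, the example subclasses are treated as direct children for simplicity, and ex:PrintLogParsing is a hypothetical domain-layer class, as are the helper names.

```python
# Subclass links (child -> parent). The cogs-to-opmv links follow the class
# descriptions; example subclasses are assumed to be direct children, and
# "ex:PrintLogParsing" is a hypothetical domain-layer class.
SUBCLASS_OF = {
    "ex:PrintLogParsing": "cogs:Parsing",
    "cogs:Parsing": "cogs:Extraction",
    "cogs:RegexFilter": "cogs:Transformation",
    "cogs:IncrementalLoad": "cogs:Loading",
    "cogs:File": "cogs:Object",
    "cogs:Extraction": "opmv:Process",
    "cogs:Transformation": "opmv:Process",
    "cogs:Loading": "opmv:Process",
    "cogs:Object": "opmv:Artifact",
}


def superclass_chain(cls):
    """Return the class followed by all its ancestors, most specific first."""
    chain = [cls]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain


def workflow_type(cls):
    """Return the OPMV type a class ultimately maps to, or None if there is none."""
    top = superclass_chain(cls)[-1]
    return top if top.startswith("opmv:") else None


print(superclass_chain("ex:PrintLogParsing"))
print(workflow_type("cogs:File"))
```

A consumer that only understands OPMV can fall back to the result of workflow_type, while a Cogs-aware consumer can use the more specific classes earlier in the chain.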

Cogs at a Glance

Classes: AdHocProcess | AggregateRows | Append | ApplyFormula | ApplyScript | Assignment | AutomatedAdHocProcess | AutomatedMatching | AutomatedValidation | CalculatedValue | Ceil | CharacterSetConversion | Class | Column | ColumnOperation | ConstructiveMerge | Copy | Cube | DSN | DataAccessLayer | DataManagementLayer | DataMart | DataStream | Database | Dataset | DatetimeConversion | Deduplication | DeleteColumn | DeleteQuery | DeleteRow | DeleteTriple | DestructiveMerge | Device | DimensionTable | Endpoint | Event | Exception | Execution | ExecutionStatus | Extraction | FactTable | Fail | FieldDecoding | File | FileLookup | FillDown | Filter | FormatRevision | Formula | FullRefresh | GraphOperation | HumanMatching | HumanValidation | IncrementalLoad | InitialLoad | Input | InsertColumn | InsertQuery | InsertRow | InsertTriple | InstanceMapping | Job | JoinRows | KeyGeneration | KeyRestructuring | LastError | Layer | Loading | LoadingProcess | LoadingType | Log | Lookup | Lowercase | ManualAdHocProcess | ManuallyStartedJob | Mapping | MappingFile | MappingProcess | MergeRow | Metadata | Method | Move | NumericCast | NumericOperation | ObjectReference | ObjectRepresentation | Objects | Operations | Operator | Order | Output | Parsing | Paste | PerformanceIndicators | PredefinedMatching | PresentationArea | Program | Publication | Query | RDFGraph | RDFNamedGraph | RESTLookup | RegexFilter | RejectedData | RenameColumn | Replace | Round | Row | RowOperation | Rule | RulesBasedMatching | Running | ScheduledJob | Schema | Script | SelectQuery | SemanticSimilarity | Sensor | Server | Service | SimilarityMatching | SortRow | Source | Split | SplitColumn | StaggingArea | StagingAreaArtifact | StoredProcedure | StringFilter | StringOperation | StringSimilarity | Success | Table | TableLookup | TerminologicalMapping | Transformation | TransformationProcess | Trigger | Trim | Triple | UnitConversion | UpdateQuery | Uppercase | Validation | ValueOperation | View | WebServiceLookup | XSL

Properties: artifactUsage | associatedDSN | associatedEndpoint | associatedGraph | associatedTable | dependsOn | elaspedTime | hasLastError | hasLoadingType | hasQuery | loadInto | orderBy | programUsed | representedBy | representsArtifact | rowsPerSecond | scheduleDatetime | isPartOf | hasStartPoint | hasEndPoint


The following provenance descriptor provides an excerpt of the provenance trail behind a data resource ("[ final value ]") which was generated by an ETL workflow. The example ETL workflow extracts information from printer logs to calculate the CO2 emissions due to paper consumption. The descriptor was simplified to make it easier to read (timestamps and serial numbers associated with URIs were removed).

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix opmv:  <http://purl.org/net/opmv/ns#> .
@prefix cogs:  <http://vocab.deri.ie/cogs#> .

[ final value ]    <http://purl.org/dc/terms/provenance>   <http://example.com/provenance/artifact/TotalGreenhouseGasEmissionsByWeightResultingFromPrinting> .

<http://example.com/provenance/process/GenerateSustainabilityReport>  cogs:hasStartPoint   <http://example.com/provenance/process/selectRecordFromPrintFile>   .
<http://example.com/provenance/process/GenerateSustainabilityReport>  cogs:hasEndPoint   <http://example.com/provenance/process/AggregateTotalGreenhouseGasEmissionsByWeightResultingFromPrinting>   .
<http://example.com/provenance/process/AggregateTotalGreenhouseGasEmissionsByWeightResultingFromPrinting>  cogs:isPartOf   <http://example.com/provenance/process/GenerateSustainabilityReport> .
<http://example.com/provenance/process/InsertTriplesIntoDataCube>  cogs:isPartOf   <http://example.com/provenance/process/GenerateSustainabilityReport> .
<http://example.com/provenance/process/convertCSVToRDF> cogs:isPartOf  <http://example.com/provenance/process/GenerateSustainabilityReport> .
<http://example.com/provenance/process/selectRecordFromPrintFile> cogs:isPartOf  <http://example.com/provenance/process/GenerateSustainabilityReport> .

<http://example.com/provenance/process/InsertTriplesIntoDataCube>  cogs:precededBy   <http://example.com/provenance/process/AggregateTotalGreenhouseGasEmissionsByWeightResultingFromPrinting> .
<http://example.com/provenance/process/convertCSVToRDF>  cogs:precededBy   <http://example.com/provenance/process/InsertTriplesIntoDataCube> .
<http://example.com/provenance/process/selectRecordFromPrintFile> cogs:precededBy   <http://example.com/provenance/process/convertCSVToRDF> .

<http://example.com/provenance/artifact/TotalGreenhouseGasEmissionsByWeightResultingFromPrinting>
      a       cogs:RDFGraph , opmv:Artifact ;
      opmv:wasGeneratedBy <http://example.com/provenance/process/AggregateTotalGreenhouseGasEmissionsByWeightResultingFromPrinting> .

<http://example.com/provenance/process/AggregateTotalGreenhouseGasEmissionsByWeightResultingFromPrinting>
      a       cogs:Aggregation , opmv:Process ;
      opmv:used <http://example.com/provenance/artifact/cubeValue2670022> , <http://example.com/provenance/artifact/cubeValue2676570> , ...

<http://example.com/provenance/process/InsertTriplesIntoDataCube>
      a       cogs:InsertTriple , cogs:ApplyScript , opmv:Process ;
      opmv:used <http://example.com/provenance/artifact/singlePrintingEmission20101019134351> ;
      opmv:used <http://example.com/provenance/artifact/printFile/19102010> .

[ RDF graph artifact ]
      a       cogs:RDFGraph , opmv:Artifact ;
      opmv:wasGeneratedBy   <http://example.com/provenance/process/convertCSVToRDF> .

<http://example.com/provenance/process/convertCSVToRDF>
      a       cogs:TransformationProcess , cogs:ConstantFactorApply , cogs:Parsing , opmv:Process ;
      opmv:used <http://example.com/provenance/artifact/singlePrintingEmission20101019134351> ;
      cogs:factor "0.0165"^^<http://www.w3.org/2001/XMLSchema#double> ;
      cogs:programUsed "https://example.com/ETL/trunk/PrintTracking/PrintTrackingToRDF/src/main/java/ie/deri/printing/papercut/CSVLogToETLModel.java#convertToRDF()" .

[ row artifact ]
      a       cogs:Row , opmv:Artifact ;
      opmv:wasGeneratedBy   <http://example.com/provenance/process/selectRecordFromPrintFile> .

<http://example.com/provenance/process/selectRecordFromPrintFile>
      a       cogs:Lookup , opmv:Process ;
      opmv:used <http://example.com/provenance/artifact/printFile/19102010> .

<http://example.com/provenance/artifact/printFile/19102010>
      a       cogs:File , opmv:Artifact .
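
One way to follow such a provenance trail is to walk the opmv:wasGeneratedBy and opmv:used links backwards from the resource of interest. The Python sketch below does this over triples modeled as plain tuples. All ex: identifiers are shortened and hypothetical, the pairing of artifacts with processes is assumed for illustration only, and provenance_trail is a made-up helper name.

```python
from collections import defaultdict

# Shortened, hypothetical identifiers. Which artifact each step generates
# or uses is an assumption made for this illustration.
TRIPLES = [
    ("ex:totalEmissions", "opmv:wasGeneratedBy", "ex:aggregate"),
    ("ex:aggregate", "opmv:used", "ex:cubeValue"),
    ("ex:cubeValue", "opmv:wasGeneratedBy", "ex:insertTriples"),
    ("ex:insertTriples", "opmv:used", "ex:emissionGraph"),
    ("ex:emissionGraph", "opmv:wasGeneratedBy", "ex:convertCSVToRDF"),
    ("ex:convertCSVToRDF", "opmv:used", "ex:printRecord"),
    ("ex:printRecord", "opmv:wasGeneratedBy", "ex:selectRecord"),
    ("ex:selectRecord", "opmv:used", "ex:printFile"),
]

# Both properties point from a later element to an earlier one.
TRAIL_PROPERTIES = {"opmv:wasGeneratedBy", "opmv:used"}


def provenance_trail(start, triples):
    """Return every process and artifact that `start` depends on."""
    edges = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate in TRAIL_PROPERTIES:
            edges[subject].append(obj)

    trail, seen, stack = [], {start}, [start]
    while stack:
        node = stack.pop()
        for dependency in edges[node]:
            if dependency not in seen:  # guard against revisiting shared inputs
                seen.add(dependency)
                trail.append(dependency)
                stack.append(dependency)
    return trail


trail = provenance_trail("ex:totalEmissions", TRIPLES)
print(trail)
```

The Cogs types attached to each element of the trail (for example cogs:Lookup or cogs:File) can then be used to explain what each step did, beyond the bare causal dependencies.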


Please forward any suggestions and corrections to andre (dot) freitas -at- deri (dot) org.


André Freitas, Benedikt Kämpgen, João Gabriel Oliveira, Sean O'Riain, Edward Curry, Representing Interoperable Provenance Descriptions for Web-based ETL Workflows. In Proceedings of the 3rd International Workshop on Role of Semantic Web in Provenance Management (SWPM 2012), Extended Semantic Web Conference (ESWC), Heraklion, Crete, 2012.

Open Provenance Model Vocabulary Specification,

W3C Provenance Ontology, http://www.w3.org/TR/prov-o/

Change Log

Vocabulary refinement and corrections - v. 0.2, 10/05/2012
Vocabulary creation - v. 0.1, 03/03/2012