ETL Provenance Vocabulary
Andre Freitas, DERI
Benedikt Kämpgen, KIT
Ed Curry, DERI
Sean O'Riain, DERI
J.G. Oliveira, Amtera
Cogs is an ETL Provenance Vocabulary which extends the workflow semantics provided by OPM and Prov-O, allowing the description of ETL processes and objects. The Cogs vocabulary can also be used to describe data transformations in general, outside the scope of ETL tools and practices. The core objective of the vocabulary is to improve the level of semantic interoperability of data transformation provenance descriptors, building upon the OPM and Prov-O standardization efforts.
The growing availability of data on the Web, provided by Web 2.0 applications and, more recently, through Linked Data, has caused the computational pattern expressed as ETL to reemerge in a setting of additional complexity, where the number of data sources and the degree of data heterogeneity that ETL must support increase drastically. In this scenario, issues with data quality and trustworthiness may strongly affect the utility of the data for end users. The barriers involved in building an ETL infrastructure at the complexity and scale of Web-based data supply demand strategies that can provide data quality guarantees and minimize the effort associated with data management.
In this context, provenance, the representation of the artifacts, processes and agents behind a piece of information, becomes a fundamental element of the data infrastructure. Provenance has a wide spectrum of applications, including documentation and reproducibility, data quality assessment and trustworthiness, and consistency checking and semantic reconciliation. However, in an environment where data is produced and consumed by different systems, the representation of provenance should be interoperable across systems.
Standardization efforts towards a common provenance model produced the Open Provenance Model (OPM). OPM provides a basic description of provenance that allows interoperability at the level of workflow structure. This common ground allows systems with different provenance representations to share at least workflow-level semantics (the causal dependencies between artifacts and processes, and the intervention of agents). OPM, however, is not intended to be a complete provenance model; it requires the complementary use of additional provenance models to enable uses of provenance that need a higher level of semantic interoperability.
Cogs extends the workflow structure of OPMV with a rich type structure. The ETL provenance model behind Cogs assumes three layers: the bottom layer is defined by the OPMV workflow structure, the middle layer consists of the elements of the Cogs vocabulary, and the top layer is a domain-specific layer.
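The layering can be illustrated with a minimal Python sketch. The subclass links from cogs:Transformation, cogs:Extraction, cogs:Loading and cogs:Object to OPMV are taken from the class descriptions in this document; the two ex: classes are invented here to stand in for a domain-specific layer, and the direct link from cogs:Cube to cogs:Object is an assumption made for brevity.

```python
# Subclass links, child -> parent, written as prefixed names.
# The ex: classes are hypothetical and exist only for this illustration.
SUBCLASS_OF = {
    "ex:CurrencyNormalization": "cogs:Transformation",  # hypothetical domain class
    "ex:QuarterlySalesCube": "cogs:Cube",               # hypothetical domain class
    "cogs:Cube": "cogs:Object",                         # assumed to be a direct link
    "cogs:Transformation": "opmv:Process",
    "cogs:Extraction": "opmv:Process",
    "cogs:Loading": "opmv:Process",
    "cogs:Object": "opmv:Artifact",
}

# Which of the three layers a term belongs to, decided by its prefix.
LAYER_BY_PREFIX = {
    "opmv": "workflow structure",
    "cogs": "ETL vocabulary",
    "ex": "domain-specific",
}


def layer_of(term):
    """Return the layer name for a prefixed term such as 'cogs:Cube'."""
    prefix = term.split(":", 1)[0]
    return LAYER_BY_PREFIX[prefix]


def type_chain(term):
    """Follow subclass links upward until a term has no recorded parent."""
    chain = [term]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain


if __name__ == "__main__":
    for step in type_chain("ex:CurrencyNormalization"):
        print(step, "->", layer_of(step))
```

A system that only understands OPMV can stop at the last element of the chain, while an ETL-aware consumer can use the Cogs type in the middle.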
The Cogs vocabulary is mainly defined by a taxonomy of around 150 classes. The large number of classes allows a rich description of ETL elements, supporting an expressive ETL representation. Cogs also extends the workflow structure of OPMV with additional object properties targeting the creation and navigation of hierarchical workflow structures.
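As a rough sketch of hierarchical navigation, the Python below stores a workflow as plain triples and walks it with the isPartOf property from the property index. It assumes that isPartOf points from a step to the enclosing step or execution and that the hierarchy has no cycles; the domain and range of the property are not stated in this document, and all ex: resources are made up.

```python
PART_OF = "cogs:isPartOf"  # assumed direction: child step -> enclosing parent

# A hypothetical nightly job with one nested transformation stage.
TRIPLES = [
    ("ex:nightlyRun", "rdf:type", "cogs:ScheduledJob"),
    ("ex:cleanStage", "rdf:type", "cogs:Transformation"),
    ("ex:cleanStage", PART_OF, "ex:nightlyRun"),
    ("ex:trimNames", "rdf:type", "cogs:Trim"),
    ("ex:trimNames", PART_OF, "ex:cleanStage"),
    ("ex:roundPrices", "rdf:type", "cogs:Round"),
    ("ex:roundPrices", PART_OF, "ex:cleanStage"),
    ("ex:loadMart", "rdf:type", "cogs:IncrementalLoad"),
    ("ex:loadMart", PART_OF, "ex:nightlyRun"),
]


def children(triples, parent):
    """Steps that are directly part of the given parent."""
    return [s for s, p, o in triples if p == PART_OF and o == parent]


def descendants(triples, parent):
    """All steps nested under the parent, depth first (assumes no cycles)."""
    found = []
    for child in children(triples, parent):
        found.append(child)
        found.extend(descendants(triples, child))
    return found


if __name__ == "__main__":
    print(descendants(TRIPLES, "ex:nightlyRun"))
```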
Figure 1: The three-layered ETL provenance model.
The vocabulary taxonomy is organized under the following high-level classes:
Execution: Represents the execution job (instance) of an ETL workflow. Examples of subclasses include AutomatedAdHocProcess and ScheduledJob.
State: Represents an observation of an indicator or status of one particular execution of an ETL process. These can range from execution states such as Running or Success to execution statistics, captured by the subclasses of the PerformanceIndicator class.
Extraction: Represents operations of the first phase of the ETL process, which involves extracting data from different types of sources. Parsing is a subclass example. cogs:Extraction is an opmv:Process.
Transformation: Represents operations in the transformation phase. Typically this is the phase which encompasses most of the semantics of the workflow, which is reflected in its number of subclasses. Examples of classes are RegexFilter, DeleteColumn, SplitColumn, MergeRow, Trim and Round. cogs:Transformation is an opmv:Process.
Loading: Represents the operations of the last phase of the ETL process, when the data is loaded into the end target. Example classes are ConstructiveMerge and IncrementalLoad. cogs:Loading is an opmv:Process.
Object: Represents the sources and the results of the operations on the ETL workflow. These classes, such as ObjectReference, Cube or File, aim to give a more precise definition of opmv:Artifact (every cogs:Object is an opmv:Artifact) and, together with the types of the operations that are generating and consuming them, capture the semantics of the workflow steps.
Layer: Represents the different layers where the data can reside during the ETL process. PresentationArea and StagingArea are some of the subclasses.
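To show how these classes combine with the OPMV workflow structure, the following Python sketch describes a small extract, transform and load run as triples and traces a result back to its source. The properties opmv:used and opmv:wasGeneratedBy come from OPMV; the ex: resources are hypothetical, and treating cogs:Table as an artifact type is an assumption based on its presence in the class index.

```python
# A hypothetical three-step run: parse a file, trim a column, load a cube.
TRIPLES = [
    ("ex:ordersCsv", "rdf:type", "cogs:File"),
    ("ex:parseOrders", "rdf:type", "cogs:Parsing"),           # an Extraction
    ("ex:parseOrders", "opmv:used", "ex:ordersCsv"),
    ("ex:rawOrders", "rdf:type", "cogs:Table"),
    ("ex:rawOrders", "opmv:wasGeneratedBy", "ex:parseOrders"),
    ("ex:trimCustomer", "rdf:type", "cogs:Trim"),              # a Transformation
    ("ex:trimCustomer", "opmv:used", "ex:rawOrders"),
    ("ex:cleanOrders", "rdf:type", "cogs:Table"),
    ("ex:cleanOrders", "opmv:wasGeneratedBy", "ex:trimCustomer"),
    ("ex:loadOrders", "rdf:type", "cogs:IncrementalLoad"),     # a Loading
    ("ex:loadOrders", "opmv:used", "ex:cleanOrders"),
    ("ex:ordersCube", "rdf:type", "cogs:Cube"),
    ("ex:ordersCube", "opmv:wasGeneratedBy", "ex:loadOrders"),
]


def objects(triples, subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def lineage(triples, artifact):
    """Walk back from an artifact through generating processes and their inputs."""
    path = [artifact]
    frontier = [artifact]
    while frontier:
        current = frontier.pop()
        for process in objects(triples, current, "opmv:wasGeneratedBy"):
            path.append(process)
            for source in objects(triples, process, "opmv:used"):
                path.append(source)
                frontier.append(source)
    return path


if __name__ == "__main__":
    for node in lineage(TRIPLES, "ex:ordersCube"):
        print(node, objects(TRIPLES, node, "rdf:type"))
```

The causal links alone give the shape of the workflow; the Cogs types attached to each node are what tell a consumer that the middle step was a trim rather than, say, a filter.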
Classes: AdHocProcess | AggregateRows | Append | ApplyFormula | ApplyScript | Assignment | AutomatedAdHocProcess | AutomatedMatching | AutomatedValidation | CalculatedValue | Ceil | CharacterSetConversion | Class | Column | ColumnOperation | ConstructiveMerge | Copy | Cube | DSN | DataAccessLayer | DataManagementLayer | DataMart | DataStream | Database | Dataset | DatetimeConversion | Deduplication | DeleteColumn | DeleteQuery | DeleteRow | DeleteTriple | DestructiveMerge | Device | DimensionTable | Endpoint | Event | Exception | Execution | ExecutionStatus | Extraction | FactTable | Fail | FieldDecoding | File | FileLookup | FillDown | Filter | FormatRevision | Formula | FullRefresh | GraphOperation | HumanMatching | HumanValidation | IncrementalLoad | InitialLoad | Input | InsertColumn | InsertQuery | InsertRow | InsertTriple | InstanceMapping | Job | JoinRows | KeyGeneration | KeyRestructuring | LastError | Layer | Loading | LoadingProcess | LoadingType | Log | Lookup | Lowercase | ManualAdHocProcess | ManuallyStartedJob | Mapping | MappingFile | MappingProcess | MergeRow | Metadata | Method | Move | NumericCast | NumericOperation | ObjectReference | ObjectRepresentation | Objects | Operations | Operator | Order | Output | Parsing | Paste | PerformanceIndicators | PredefinedMatching | PresentationArea | Program | Publication | Query | RDFGraph | RDFNamedGraph | RESTLookup | RegexFilter | RejectedData | RenameColumn | Replace | Round | Row | RowOperation | Rule | RulesBasedMatching | Running | ScheduledJob | Schema | Script | SelectQuery | SemanticSimilarity | Sensor | Server | Service | SimilarityMatching | SortRow | Source | Split | SplitColumn | StaggingArea | StagingAreaArtifact | StoredProcedure | StringFilter | StringOperation | StringSimilarity | Success | Table | TableLookup | TerminologicalMapping | Transformation | TransformationProcess | Trigger | Trim | Triple | UnitConversion | UpdateQuery | Uppercase | Validation | ValueOperation | View | WebServiceLookup | XSL
Properties: artifactUsage | associatedDSN | associatedEndpoint | associatedGraph | associatedTable | dependsOn | elaspedTime | hasLastError | hasLoadingType | hasQuery | loadInto | orderBy | programUsed | representedBy | representsArtifact | rowsPerSecond | scheduleDatetime | programUsed | isPartOf | hasStartPoint | hasEndPoint
Please forward any suggestions and corrections to andre (dot) freitas -at- deri (dot) org.
W3C Provenance Ontology, http://www.w3.org/TR/prov-o/