This section outlines the Semantic ETL Framework (SETL for short). To design an SDW, SETL follows a demand-driven approach, which consists of two steps: 1) requirements engineering, i.e., identifying and analyzing the requirements of business users and decision makers, and 2) data integration, i.e., building the target TBox and the ETL process based on the gathered requirements. The first step is beyond the scope of this thesis; SETL focuses on the second. In short, SETL supports the following integration steps: defining a target TBox for the SDW based on the given requirements, extracting data from multiple heterogeneous data sources, transforming the source data into RDF triples following the semantics encoded in the target TBox, linking the data internally and externally, and loading the data into a triplestore and/or publishing it on the Web as Linked Data.
As shown in Figure 2, SETL is divided into three layers (separated by red dotted lines): the Definition Layer, the ETL Layer, and the Data Warehouse Layer. In the Definition Layer, an ETL designer defines the target TBox, the data sources, and the mappings between the source and target TBoxes. Using the SDW TBox Definition component, the designer defines the target TBox based on the requirements; the QB4OLAP vocabulary is used to annotate the target TBox with MD semantics. The Define Mapping component is used to define the mappings between a source TBox and the target TBox. To create a semantic layer on top of a non-semantic data source, the TBox Extraction component extracts a TBox from that source. The R2RML component generates RDB to RDF Mapping Language (R2RML) mappings for a non-semantic source; an R2RML engine later uses these mappings to generate RDF triples for the source.
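To make the idea of a QB4OLAP-annotated target TBox concrete, the following is a minimal sketch, not SETL's actual SDW TBox Definition component. It builds a tiny cube structure as plain (subject, predicate, object) tuples using only the Python standard library. The `http://example.org/sdw#` namespace, the names `SalesCube`, `day`, and `amount`, and the helper functions are hypothetical; only the QB and QB4OLAP vocabulary terms are taken from the published vocabularies.

```python
# Vocabulary namespaces (RDF Data Cube and QB4OLAP).
QB = "http://purl.org/linked-data/cube#"
QB4O = "http://purl.org/qb4olap/cubes#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Hypothetical namespace for the illustrative SDW.
EX = "http://example.org/sdw#"


def build_target_tbox():
    """Return a set of (s, p, o) triples describing a one-level, one-measure cube."""
    cube = EX + "SalesCube"   # hypothetical cube structure
    day = EX + "day"          # hypothetical dimension level
    amount = EX + "amount"    # hypothetical measure
    level_component = "_:levelComponent"      # blank node label
    measure_component = "_:measureComponent"  # blank node label
    return {
        (day, RDF_TYPE, QB4O + "LevelProperty"),
        (amount, RDF_TYPE, QB + "MeasureProperty"),
        (cube, RDF_TYPE, QB + "DataStructureDefinition"),
        (cube, QB + "component", level_component),
        (level_component, QB4O + "level", day),
        (cube, QB + "component", measure_component),
        (measure_component, QB + "measure", amount),
        (measure_component, QB4O + "aggregateFunction", QB4O + "sum"),
    }


def to_ntriples(triples):
    """Serialize IRI and blank-node triples as N-Triples-style lines, sorted."""
    def term(value):
        return value if value.startswith("_:") else f"<{value}>"
    return "\n".join(
        f"{term(s)} {term(p)} {term(o)} ." for s, p, o in sorted(triples)
    )


if __name__ == "__main__":
    print(to_ntriples(build_target_tbox()))
```

A real implementation would more likely rely on an RDF library and a full dimension hierarchy; the tuples here only illustrate how MD semantics (levels, measures, aggregate functions) are attached to the cube structure.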
In the ETL Layer, the designer designs an ETL process that creates the ABox of the SDW from the available sources. The Extraction component retrieves data from the sources; the data are then cleansed and formatted by the Traditional Transformation component. Next, the Semantic Transformation component transforms the data into RDF triples according to the semantics encoded in the target TBox. As a sub-task, Semantic Transformation stores the IRI information of each resource in the Provenance graph to preserve the uniqueness of each resource. The External Linking component can link internal data with external knowledge bases. The SaveToFile component can dump the RDF triples to a local file. Finally, the Load component loads the RDF triples, either directly from the Semantic Transformation component or from the dumped file, into a triplestore that end users can query using OLAP or SPARQL queries. The intermediate results of each phase are stored in a data staging area. Because the ETL block is iterative and is repeated for each flow, it is marked with a curved arrow in Figure 2. The Data Warehouse Layer stores the RDF triples.
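The role of the Provenance graph in Semantic Transformation can be illustrated with a small sketch. This is an assumption-laden simplification, not SETL's implementation: the provenance store is modeled as an in-memory dictionary rather than an RDF graph, literals are kept as plain strings, and the class `ProvenanceGraph`, the function `semantic_transform`, the IRI pattern, and the `http://example.org/sdw/` namespace are all hypothetical.

```python
from urllib.parse import quote

# Hypothetical base namespace for minted resource IRIs.
EX = "http://example.org/sdw/"


class ProvenanceGraph:
    """Remembers the IRI minted for each (concept, source key) pair."""

    def __init__(self):
        self._iris = {}

    def __len__(self):
        return len(self._iris)

    def iri_for(self, concept, key):
        """Return the existing IRI for this resource, minting one only if needed."""
        pair = (concept, str(key))
        if pair not in self._iris:
            # Percent-encode the key so the result is a valid IRI path segment.
            self._iris[pair] = f"{EX}{concept}/{quote(str(key), safe='')}"
        return self._iris[pair]


def semantic_transform(rows, concept, key_field, mapping, provenance):
    """Turn cleansed rows into (subject, property, literal) triples.

    `mapping` maps a source field name to a target TBox property IRI.
    """
    triples = []
    for row in rows:
        subject = provenance.iri_for(concept, row[key_field])
        for field, prop in mapping.items():
            if row.get(field) is not None:
                triples.append((subject, prop, str(row[field])))
    return triples


if __name__ == "__main__":
    provenance = ProvenanceGraph()
    crm_rows = [{"id": "C 1", "name": "Ada"}]
    billing_rows = [{"cust": "C 1", "city": "Aalborg"}]
    triples = semantic_transform(
        crm_rows, "Customer", "id", {"name": EX + "name"}, provenance
    )
    triples += semantic_transform(
        billing_rows, "Customer", "cust", {"city": EX + "city"}, provenance
    )
    for triple in triples:
        print(triple)
```

Because both calls share one `ProvenanceGraph`, the customer with key `C 1` receives the same subject IRI in both flows, which is the uniqueness property the paragraph above describes.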