One of the main obstacles facing researchers in the oceanography community is compiling data coherently from existing sources that were created by different researchers. Because these sources come in a wide variety of formats, such research often requires manual data integration work by an expert.
Metadata is descriptive information about data. Ideally, a data description contains all the information needed to use the data correctly. Metadata can carry the measured attributes, their names, units, and accuracy, the data layout, and the data lineage (which describes how the data was acquired).
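As an illustration of the kind of description meant here, the following minimal Python sketch models a metadata record that carries attribute names, units, accuracy, layout, and lineage. The class names, field names, and the sample values are all hypothetical and chosen only for this example; they are not part of any existing framework or standard.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeDescription:
    """Describes one measured attribute (field names are illustrative)."""
    name: str        # attribute name as it appears in the source
    unit: str        # unit of measurement, e.g. "degC"
    accuracy: float  # measurement accuracy, expressed in the same unit


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record for a single dataset."""
    attributes: list[AttributeDescription]
    layout: str                                       # how the data is laid out
    lineage: list[str] = field(default_factory=list)  # how the data was acquired

    def unit_of(self, attribute_name: str) -> str:
        """Return the unit recorded for the named attribute."""
        for attribute in self.attributes:
            if attribute.name == attribute_name:
                return attribute.unit
        raise KeyError(attribute_name)


# Made-up sample record, for illustration only.
ctd_cast = DatasetMetadata(
    attributes=[
        AttributeDescription(name="temp", unit="degC", accuracy=0.01),
        AttributeDescription(name="depth", unit="m", accuracy=0.5),
    ],
    layout="csv, one row per sample",
    lineage=["CTD cast", "manual quality control"],
)
```

With such a record, a consumer can look up a unit before using a value, for example `ctd_cast.unit_of("temp")`, instead of guessing it from the column name.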
When integrating heterogeneous data sources, scientific data descriptions are essential to the correctness of the integration process, because the integrated data must share a common language (e.g., the same units for corresponding fields, the same measurement techniques, and a conversion dictionary).
For this reason we are developing the Oceanbase framework, which will have two major capabilities:
Search for relevant datasets across existing dataset collections.
Integrate selected data sources. The integration process will apply techniques from ontology-based data integration and text analysis to tackle the problem of incorporating scientific data descriptions into the integration. Ontology-based data integration (OBDI) is a method that uses ontologies to consolidate several heterogeneous sources into a single source. Adopting ontologies makes it possible to create semantic integrability between data sources.
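The idea behind OBDI can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the Oceanbase design: the "ontology" is reduced to a dictionary of shared concepts with canonical units, and the concept names, source field names, conversion table, and sample values are all hypothetical.

```python
# Toy "ontology": shared concepts, each with a canonical unit (hypothetical).
CANONICAL_UNIT = {"sea_water_temperature": "degC", "depth": "m"}

# Conversion dictionary: (from_unit, to_unit) -> conversion function.
CONVERSIONS = {
    ("K", "degC"): lambda value: value - 273.15,
    ("ft", "m"): lambda value: value * 0.3048,
}


def to_canonical(value, unit, concept):
    """Convert a value to the canonical unit of its concept."""
    target = CANONICAL_UNIT[concept]
    if unit == target:
        return value
    return CONVERSIONS[(unit, target)](value)


def integrate(sources):
    """Merge sources into rows keyed by shared concepts.

    Each source is a dict with a 'mapping' from its own field names to
    (concept, unit) pairs, and 'rows' holding the raw records.
    """
    merged = []
    for source in sources:
        for row in source["rows"]:
            merged.append({
                concept: to_canonical(row[field_name], unit, concept)
                for field_name, (concept, unit) in source["mapping"].items()
            })
    return merged


# Two made-up sources that name and measure the same quantities differently.
source_a = {
    "mapping": {"temp": ("sea_water_temperature", "degC"),
                "depth_m": ("depth", "m")},
    "rows": [{"temp": 12.5, "depth_m": 10.0}],
}
source_b = {
    "mapping": {"T": ("sea_water_temperature", "K"),
                "z_ft": ("depth", "ft")},
    "rows": [{"T": 285.65, "z_ft": 100.0}],
}

integrated = integrate([source_a, source_b])
```

After integration, every row uses the same concept names and units regardless of its source. In a real system, the per-source mappings would come from the data descriptions and an actual ontology rather than being written by hand.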