Expanded Project Suggestion: Building Linked Open Data

Introduction (for Task 6 ~ Task 7)

The Mapping Performance Experiments are planned to be carried out through conversions in XML form. They measure mapping rates, such as perfect and partial mapping rates at the lexical level, as well as the transfer rate. Measuring mapping rates at the semantic level would likely yield higher rates, because the Common Terms were chosen with both lexical and semantic interoperability in mind. However, defining identical terms at the semantic level is somewhat more complex, because we must decide, objectively, what degree of semantic overlap we will accept as identical meaning.
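
The exact measures will be fixed in the experiment design; purely as an illustration of how perfect and partial mapping rates could be tallied over a set of converted records, here is a minimal Python sketch (the field names, match rules, and sample values are hypothetical placeholders, not the experiment itself):

    # Hypothetical sketch: tally perfect, partial, and failed field mappings for a set
    # of (source record, converted record) pairs and report the corresponding rates.
    from collections import Counter

    def classify_field(source_value, converted_value):
        """Classify one field mapping at the lexical level (illustrative rules only)."""
        if converted_value is None:
            return "unmapped"
        if source_value.strip().lower() == converted_value.strip().lower():
            return "perfect"
        return "partial"

    def mapping_rates(record_pairs):
        """record_pairs: iterable of (source_fields, converted_fields) dictionaries."""
        counts = Counter()
        for source, converted in record_pairs:
            for name, value in source.items():
                counts[classify_field(value, converted.get(name))] += 1
        total = sum(counts.values()) or 1
        return {category: count / total for category, count in counts.items()}

    # One sample record: the title maps perfectly, the creator only partially.
    pairs = [({"title": "Moby Dick", "creator": "Melville, Herman"},
              {"title": "Moby Dick", "creator": "Herman Melville"})]
    print(mapping_rates(pairs))   # {'perfect': 0.5, 'partial': 0.5}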

In addition, Professor Dubin suggests building Linked Open Data (LOD) with the CT for the future, because the IOPDL with the CT is intended as a future international digital library. Based on preliminary research, I can say that LOD is a very advanced research area in the library and museum fields these days, because few research projects have been conducted worldwide, and not many in America. However, I do not fully agree with exposing both metadata and objects on the Web without protection, as if leaving them naked. Doing so would also contradict the goal of giving autonomy and authority to the cooperating Well-Designed Digital Libraries that will make up the proposed IOPDL, in managing their metadata, objects, and copyrights. Nevertheless, I acknowledge that building LOD with the developed CT will point to future directions for where metadata and objects should reside: in each library's own database as before, on the Web, or both. The IOPDL proposal and the CT project propose building an integrated search engine on a union catalog with the CT. That approach can be compared with an integrated search engine built on the resulting LOD, and we can then determine which approach is more secure and effective for the future library world. Thus, I strongly recommend building LOD with the developed CT as part of constructing an integrated search engine for Harvard, MIT, and UIUC.

Research Review about Linked Open Data

Linked Data is defined as a method of using the Semantic Web to expose, share, and connect related data through URIs and RDF (Linked Data). Szekely and others state that “LOD provides an approach to publishing data in a standard format (called RDF) using a shared terminology (called a domain ontology) and linked to other data sources. The linking is particularly important because it relates information across sources breaks down data silos and enables applications that provide rich context” (Szekely, Knoblock, & Wan, 2014). Through a preliminary review of LOD, I found that only a handful of pioneering research projects have implemented LOD on the Web.

Europeana implemented Linked Open Data as a way of publishing structured data. Europeana indicates that LOD connects and enriches metadata across related resources and makes it easily accessible on the Web (Europeana). The goal of releasing Europeana metadata as Linked Open Data is to ‘Make their heritage available to users wherever they are, whenever they want it’ and to enhance the engagement of different communities by allowing re-use and connectivity (Europeana).

The Amsterdam Museum in the Netherlands made its world-class collections available as Linked Open Data. Marijke Oosterbroek, the Museum’s manager of E-culture, says: “we opened the digital depot in March 2011 and launched the whole collection of around 70,000 objects online. It contains sub-collections which have international historical and art historical value, so it’s important that everybody in the world can find these objects and use them. Our policy: here's the data, here are the images, use them and reuse them!”

The British Library released “3 million British National Bibliography records to the public via the CC0 public domain dedication” (Case Studies British Library). This data has been reused in the JISC OpenBibliography project. The data has been “loaded into a Virtuoso store that is queriable through the SPARQL Endpoint and the URIs that we have assigned each record use the ORDF software to make them dereferencable, supporting perform content auto-negotiation as well as embedding RDFa in the HTML representation” (Case Studies British Library).

The LODAC Museum project in Japan built an LOD of museum collection information to advance data sharing. LODAC Museum is an LOD of museum information consisting of 40 million triples from 114 museums and institutes in Japan, created by the LODAC project in 2012 (LODAC Museum: Linked Open Data for Academia).

In America, the Smithsonian American Art Museum was connected to the Linked Data Cloud in 2013. The project team highlights that the database-to-RDF mapping process was very complex. They linked the dataset, covering 41,000 objects and 8,000 artists, to the hub datasets DBpedia and the Getty vocabularies.

In 2012, Harvard tried to expose MARC bibliographic information for the Countway Library’s digital collections as Linked Open Data and to enhance searches by utilizing new data points. A chosen set of 67 records was converted into a tab-separated file after selecting a subset of all available MARC fields and subfields to map to RDF. They grouped the fields into date, international serial number, linking, name, subject, and title. They identified and linked corresponding term URIs from external data sources such as MeSH and LC. They also developed a simple interface to demonstrate the utility of searching records through Linked Open Data using SPARQL queries. However, they report challenges in developing code to parse the name field because of its many variations (I assume due to subfield codes) and the different locations of the data (Cheng, 2012). The project was stopped in 2013 because the developers had to leave it for other responsibilities.

UIUC explored adding links to MODS metadata transformed from MARCXML, transforming the records into non-library-specific, LOD-friendly semantics such as VIAF and LCSH URIs, and deploying them as RDF to maximize the utility of these records (Cole, Han, Weathers, & Joyner, 2013). However, the records do not appear to have been fully mapped into RDF and LOD from the MODS transformed from MARCXML. One of the challenges they report is that “there are too many semantics options available for creating RDF representations of bibliographic records. Since the traditional library bibliographic records carrier, MARC, is not suitable for LOD and the Semantic Web environment, early experimenters of library LOD often have developed their own namespaces and semantics when publishing their catalog records as LOD data sets... As a result, there are too many semantic sets used for library LOD data sets. No single semantic set seems sufficient for describing library bibliographic catalog records” (Cole, Han, Weathers, & Joyner, 2013).

“OCLC has led major library LOD developments in recent years and now makes their bibliographic LOD-friendly records accessible on the Web. This means more than 290 million records in WorldCat can be retrieved as embedded RDFa or in other formats. OCLC LOD records are also available via content negotiations in four different formats: RDF/XML, JSON, text/turtle, and plain text” (Cole, Han, Weathers, & Joyner, 2013).

Research Objectives, Approach, Object, and Work Plan (Task 4 ~ Task 7)

The objectives of Task 4 to Task 7 are:

  • to conduct Mapping Performance Experiments with conversions between the newly designed CT and the Harvard (MARC), MIT (QDC), and UIUC (MARCXML) metadata records, improving metadata interoperability at the record level (March 2014 ~ August 2015)
    • Conducting conversions (e.g., MARC to CT, QDC to CT) in XML with Python (a sketch of one such conversion follows this list);
    • Converting local metadata (e.g., Harvard, MIT and UIUC) into the CT;
    • Evaluating performance of indirect mappings through the CT;
    • Comparing performance of indirect mappings with direct mappings; and
    • Developing meaningful, actionable guidance and implementation strategies of mappings with the Common Terminology in order to improve metadata interoperability.
  • to conduct a prototype of an Integrated Search Engine for Harvard, MIT, and UIUC, improving metadata interoperability at the repository level (September 2015 ~ August 2017)
    • Conceptualizing the CT in SKOS with URIs;
    • Mapping the CT of three universities to RDF with Python;
    • Building Linked Open Data (LOD) with Harvard, MIT, and UIUC metadata;
    • Building an integrated search engine for the LOD of Harvard, MIT, and UIUC;
    • Generating a union catalog from the pulled metadata using the conversion programs;
    • Building an integrated search engine with the union catalog; and
    • Comparing the performance and effectiveness of the two search engines, one built on the LOD and one built on the union catalog.
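
A minimal sketch of one such conversion (MARC to CT), as referenced in the first sub-task above. It assumes the pymarc library for MARC parsing and a made-up CT element name ("Title"); the actual CT element set and crosswalk rules come from the earlier design tasks.

    # Hypothetical sketch: read MARC records and write a simple CT XML file,
    # mapping only the MARC 245 $a title statement to a placeholder CT element.
    import xml.etree.ElementTree as ET
    from pymarc import MARCReader

    def marc_to_ct(marc_path, ct_path):
        root = ET.Element("ctRecords")
        with open(marc_path, "rb") as fh:
            for record in MARCReader(fh):
                ct_record = ET.SubElement(root, "ctRecord")
                for field in record.get_fields("245"):       # MARC title statement
                    for value in field.get_subfields("a"):
                        ET.SubElement(ct_record, "Title").text = value
        ET.ElementTree(root).write(ct_path, encoding="utf-8", xml_declaration=True)

    # marc_to_ct("harvard_sample.mrc", "harvard_ct.xml")   # file names are placeholders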

Objects of the research

The objects of the research are everyone involved in or interested in metadata, mapping, Linked Open Data, and interoperability. In particular, the research focuses on three university libraries (Harvard, MIT, and UIUC) and their metadata schemas (MARC, MODS, and DC/QDC).

Work Plan

Task 6. Mapping metadata of three universities to Linked Open Data

Objective

The objective is to map the metadata of the three universities to Linked Open Data. Based on a rough review of the projects above, I can present general steps for mapping metadata to Linked Open Data. These steps draw on the three steps that Szekely and others describe for using the Karma system to map museum data to the Linked Data Cloud.

Work Steps

Prerequisite: Converting local metadata into the CT

Europeana uses a common dataset to achieve data interoperability among participating providers, so that they can map to a reasonably useful set of metadata. That dataset was a Dublin Core application profile with a subset of DC elements (Isaac, Clayphan, & Haslhofer, 2012). However, we have the Common Terminology, which shows better mapping performance for the commonly used MARC, MODS, and DC & QDC while minimizing loss of information. Thus, we can use the CT records converted from the original QDC (MIT), MARC (Harvard), and MARCXML (UIUC) metadata with the Python conversion programs, which will have been completed in Task 4.

1. Defining the CT in SKOS

Prerequisites

1) Metadata records pertaining to digital objects provided by the three universities should include embedded links to contextualization resources. These can be links to Linked Open Data (LOD), as Europeana does. This can be verified by designing a checking program that tests whether the URLs in the Harvard and UIUC metadata records (in 856 $u (MARC)) and the MIT records (in dc.identifier) exist and whether they are broken (a sketch of such a checker follows these prerequisites). Then, select a sample set of metadata records of various data types and formats from each university.

2) Request permission and cooperation from the three universities, so that we may use their metadata and objects to build Linked Open Data.

3) Secure sufficient storage space on the GSLIS web server.
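
As an illustration of the checking program described in prerequisite 1), here is a minimal sketch assuming the pymarc and requests libraries; reading dc.identifier values from the MIT records would follow the same pattern with an XML parser.

    # Hypothetical sketch: collect URLs from MARC 856 $u and report the ones that
    # no longer resolve (HTTP errors or network failures).
    import requests
    from pymarc import MARCReader

    def urls_from_marc(marc_path):
        with open(marc_path, "rb") as fh:
            for record in MARCReader(fh):
                for field in record.get_fields("856"):
                    yield from field.get_subfields("u")

    def check_urls(urls):
        broken = []
        for url in urls:
            try:
                response = requests.head(url, allow_redirects=True, timeout=10)
                if response.status_code >= 400:
                    broken.append((url, response.status_code))
            except requests.RequestException as error:
                broken.append((url, str(error)))
        return broken

    # for url, reason in check_urls(urls_from_marc("uiuc_sample.mrc")):
    #     print("broken:", url, reason)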

Define the CT in SKOS to establish standard Common Terminology vocabularies on the Web. This process conceptualizes the CT on the Web as a standard. The CT defined in SKOS (properties, subproperties, and classes) will have URIs and can be used in XML and RDF. Work step 1 and the converted CT from the three universities make this a unique approach to building LOD, unlike Europeana and other projects. Having standard semantic vocabularies/terms on the Web will be a great advantage in building LOD, reducing time and effort.
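
As a minimal sketch of this step, assuming the rdflib library and a placeholder CT namespace; the actual Common Terms, their labels, definitions, and URIs would come from the SKOS design decided here.

    # Hypothetical sketch: publish one Common Term as a SKOS concept.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, SKOS

    CT = Namespace("http://example.org/ct/2015/")   # placeholder namespace, not final

    graph = Graph()
    graph.bind("skos", SKOS)
    graph.bind("ct", CT)

    graph.add((CT.Title, RDF.type, SKOS.Concept))
    graph.add((CT.Title, SKOS.prefLabel, Literal("Title", lang="en")))
    graph.add((CT.Title, SKOS.definition,
               Literal("A name given to the resource.", lang="en")))

    print(graph.serialize(format="turtle"))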

2. Creating links (URIs) to existing resources

In RFC 3986, a URI is defined as “a compact sequence of characters that identifies an abstract or physical resource” (RFC 3986, 2005).

URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

For example:

    foo://example.com:8042/over/there?name=ferret#nose
    \_/   \______________/\_________/ \_________/ \__/
     |           |            |            |        |
  scheme     authority       path        query   fragment      (RFC 3986, 2005)

A scheme name in a URI refers to the “specification for assigning identifiers within that scheme” (RFC 3986, 2005) and defines the URI's layout and (certain) semantics.

When creating URIs, the W3C recommends considering the following:

“Simplicity.

Short, mnemonic URIs will not break as easily when sent in emails and are in general easier to remember, e.g. when debugging your Semantic Web server.

Stability.

Once you set up a URI to identify a certain resource, it should remain this way as long as possible. Think about the next ten years. Maybe twenty. Keep implementation-specific bits and pieces such as .php and .asp out of your URIs, you may want to change technologies later.

Manageability.

Issue your URIs in a way that you can manage. One good practice is to include the current year in the URI path, so that you can change the URI-schema each year without breaking older URIs. Keeping all 303 URIs on a dedicated subdomain, e.g. http://id.example.com/alice, eases later migration of the URI-handling subsystem” (W3C, 2008).
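
Purely as an illustration of the W3C advice above (short, stable URIs, the year in the path, a dedicated id subdomain), here is a minimal minting sketch; the domain and path pattern are placeholders, and the project's actual URI scheme remains open for discussion (see below).

    # Hypothetical sketch: mint stable, year-scoped URIs on a dedicated id subdomain.
    from datetime import date

    def mint_uri(local_name, base="http://id.example.org", year=None):
        year = year or date.today().year
        return f"{base}/{year}/{local_name}"

    print(mint_uri("title"))          # e.g. http://id.example.org/<current year>/title
    print(mint_uri("record/0001"))    # e.g. http://id.example.org/<current year>/record/0001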

Examples of URIs for objects in the CT project: **needs more discussion

3. Mapping the CT to RDF (including normalizing data and creating and publishing RDF)

The next step is to map the converted CT metadata of the three universities into RDF, using the CT conceptualized in SKOS with URIs. Szekely and others, citing De Boer et al. (Boer, 2012), describe this step as follows: “the process is complicated because many museums have richly-structured data including attributes that are unique to a particular museum, and the data is often inconsistent and noisy” (Szekely, Knoblock, & Wan, 2014). However, this step may not be as complex as Boer describes, because the Common Terminology converted from the different university metadata (Harvard, MIT, and UIUC) gives the data a uniform structure for mapping into RDF. Mapping the data to RDF will be done with Python, as indicated below. In particular, the CT conceptualized in SKOS will provide a standard, uniform vocabulary on the Web. The process includes normalizing the data, creating and publishing RDF, and loading the RDF into a triple store to make it available to the world.

“Mapping the data to RDF is typically done by writing rules in specialized languages such as R2RML (http://www.w3.org/TR/r2rml/) or D2RQ (http://d2rq.org/), or by writing scripts in languages such as XSLT (http://www.w3.org/TR/xslt), Python or Java. Writing mappings using these technologies is labor intensive and requires significant technical expertise.” (Szekely, Knoblock, & Wan, 2014)
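
In keeping with the quoted options, here is a minimal Python sketch of such a mapping, assuming the rdflib library, the placeholder namespaces from the SKOS sketch above, and converted CT records represented as simple dictionaries; the real mapping rules are a design decision for this task.

    # Hypothetical sketch: turn one converted CT record into RDF triples.
    from rdflib import Graph, Literal, Namespace

    CT = Namespace("http://example.org/ct/2015/")         # placeholder CT terms
    RECORD = Namespace("http://example.org/record/")      # placeholder record URIs

    def ct_record_to_rdf(record_id, ct_fields, graph=None):
        """ct_fields: dict mapping CT term names (e.g. 'Title') to string values."""
        graph = graph if graph is not None else Graph()
        subject = RECORD[record_id]
        for term_name, value in ct_fields.items():
            graph.add((subject, CT[term_name], Literal(value)))
        return graph

    g = ct_record_to_rdf("harvard-0001",
                         {"Title": "Moby Dick", "Creator": "Melville, Herman"})
    print(g.serialize(format="turtle"))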

4. Linking to External Sources and Curating the Linked Data

Which data hubs to use should be decided through discussion. “Once the data is in RDF, the next step is to find the links from the metadata to other repositories from other museums or data hubs, such as DBpedia or GeoNames. (publish it on a website with HTML)” (Szekely, Knoblock, & Wan, 2014). We may choose one of the following linking algorithms to find and link related sources:

  • String matching e.g. lexical distance between labels
  • Common key matching e.g. ISBN
  • Property-based matching
  • Aim for reciprocal links (Davis, 2009)

Curating the data means verifying that both the published information and its links to other sources within the LOD are accurate (Szekely, Knoblock, & Wan, 2014).
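
As an illustration of the first linking option above (string matching), here is a minimal sketch using a simple lexical-similarity measure; the candidate URI and label are placeholders standing in for entries that would be retrieved from a hub such as DBpedia or the Getty vocabularies.

    # Hypothetical sketch: pick the closest candidate label for a local name heading.
    from difflib import SequenceMatcher

    def best_match(label, candidates, threshold=0.5):
        """candidates: iterable of (uri, candidate_label); returns the best URI or None."""
        best_uri, best_score = None, 0.0
        for uri, candidate_label in candidates:
            score = SequenceMatcher(None, label.lower(), candidate_label.lower()).ratio()
            if score > best_score:
                best_uri, best_score = uri, score
        return best_uri if best_score >= threshold else None

    candidates = [("http://dbpedia.org/resource/Herman_Melville", "Herman Melville")]
    print(best_match("Melville, Herman", candidates))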

5. Building a search engine for the built LOD

By building a search engine, we can measure the performance and efficiency of the built LOD. We can use SPARQL to query and extract data from the LOD. We may find data through the following methods:

  • Browsing: Tabulator, VisiNav, DBpedia Mobile, etc.
  • Searching: Sindice, SWSE, Falcons, etc.
  • Mashups, e.g., Revyu, BBC Music, DERI Pipes (Davis, 2009).

For this project, however, we will build an integrated search engine over the LOD of UIUC, MIT, and Harvard.
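
Purely as an illustration of how that integrated search engine could query the built LOD, here is a minimal sketch assuming the SPARQLWrapper library, the placeholder CT namespace above, and a hypothetical SPARQL endpoint on the project server.

    # Hypothetical sketch: keyword search over ct:Title values in the LOD triple store.
    from SPARQLWrapper import SPARQLWrapper, JSON

    def search_titles(keyword, endpoint="http://example.org/sparql"):
        sparql = SPARQLWrapper(endpoint)
        sparql.setQuery(f"""
            PREFIX ct: <http://example.org/ct/2015/>
            SELECT ?record ?title WHERE {{
                ?record ct:Title ?title .
                FILTER(CONTAINS(LCASE(STR(?title)), LCASE("{keyword}")))
            }} LIMIT 25
        """)
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        return [(row["record"]["value"], row["title"]["value"])
                for row in results["results"]["bindings"]]

    # for uri, title in search_titles("whale"):
    #     print(uri, title)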

Task 7. Building an integrated search engine by generating a union catalog with the CT

Objective

The objective is to generate a union catalog with the CT records converted from Harvard (MARC), MIT (QDC), and UIUC (MARCXML) metadata, and to build an integrated search engine that retrieves related items through the generated union catalog. Lastly, it is to compare the performance of the search engines built over the LOD and over the generated union catalog in achieving interoperability at the repository level.

Work Steps

1. Generating a union catalog with the CT

The CT records converted from the pulled metadata of Harvard, MIT, and UIUC will be merged to create a union catalog. A union catalog generator program will be written in Python. The generator will create a relational database for the converted CT metadata from all sources (e.g., Harvard), organized according to the Common Terms. The generator will detect redundancy so that redundant records can be placed together. If the redundancy comes from a single source, the redundant records will be merged into one; if it comes from different sources, all record information will be preserved, so that users can choose among the various sources according to their preference. Generation of the union catalog will be repeated periodically so that it stays up to date.
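
A minimal sketch of such a generator, assuming the converted CT records arrive as Python dictionaries; the table layout and the deduplication key (source, title, creator) are simplifying assumptions, not the final design.

    # Hypothetical sketch: merge converted CT records into one SQLite union catalog,
    # collapsing exact duplicates within a source while keeping each source's copy.
    import sqlite3

    def build_union_catalog(records, db_path="union_catalog.db"):
        """records: iterable of dicts with hypothetical keys 'source', 'title', 'creator'."""
        conn = sqlite3.connect(db_path)
        conn.execute("""CREATE TABLE IF NOT EXISTS ct_records (
                            source  TEXT,
                            title   TEXT,
                            creator TEXT,
                            UNIQUE(source, title, creator))""")
        for record in records:
            conn.execute("INSERT OR IGNORE INTO ct_records VALUES (?, ?, ?)",
                         (record["source"], record["title"], record["creator"]))
        conn.commit()
        return conn

    conn = build_union_catalog([
        {"source": "Harvard", "title": "Moby Dick", "creator": "Melville, Herman"},
        {"source": "MIT",     "title": "Moby Dick", "creator": "Melville, Herman"},
    ])
    print(conn.execute("SELECT source, title FROM ct_records").fetchall())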

2. Building an integrated search engine with the generated union catalog

Lastly, building an integrated search engine with the generated union catalog will be the highlighted process. Through this search engine, we can assure the autonomy and authority of all source providers: they retain the rights and priority to manage and preserve their metadata and resources, and the responsibility for maintaining copyrights. The performance and advantages of this approach will be measured and investigated.
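
As a companion to the generator sketch above, here is a hypothetical keyword search over the merged CT records that reports every contributing source, so that each provider remains visible as the authoritative holder of its own records.

    # Hypothetical sketch: simple keyword search against the union catalog database.
    import sqlite3

    def search_union_catalog(conn, keyword):
        cursor = conn.execute(
            "SELECT source, title, creator FROM ct_records "
            "WHERE title LIKE ? OR creator LIKE ?",
            (f"%{keyword}%", f"%{keyword}%"))
        return cursor.fetchall()

    # conn = sqlite3.connect("union_catalog.db")
    # for source, title, creator in search_union_catalog(conn, "Melville"):
    #     print(source, "|", title, "|", creator)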

3. Evaluating performance and developing the final paper

We will finally evaluate the performance of the two integrated search engines built for Harvard, MIT, and UIUC: one over the built LOD and the other over the generated union catalog. I cannot predict the final result, but I expect that the approach that preserves autonomy and authority will prove secure and effective for the people involved and, ultimately, for improving interoperability.

Deliverables

  • The CT conceptualized in SKOS with URIs on the Web.
  • The CT mapped to RDF with the created URIs.
  • Linked Open Data built on the Web.
  • An integrated search engine for the developed LOD.
  • An integrated search engine built on the union catalog generated with the CT.
  • The final paper, which compares and analyzes the performance, advantages, and weaknesses of the LOD-based and union-catalog-based search engines in improving metadata interoperability at the repository level.

Last Modified April 15, 2014