Improving Metadata Interoperability at the Record Metadata Model Level

Conducting the Mapping Experiments with the Developed CT

To improve metadata interoperability at the record level, a conversion program was written in Python to convert MIT (QDC) records into Common Terminology (CT) 1.1. It was developed in March 2014 and modified in July 2014. The program also measures the transfer, non-transfer, lexical match, and semantic match rates. In the MIT (QDC) to CT mapping experiment with 20,000 QDC records, the total transfer rate is very high at 99.99537%. The non-transfer rate of 0.00463% means that the loss of information in mapping MIT (QDC) to CT is extremely low. The total lexical match rate is 98.7%, a significant improvement in lexical interoperability. The total semantic match rate is 100%.

Another conversion program, for UIUC (MARCXML) to CT 1.1, was developed during September 2014. Its results show that the developed CT performs much better than expected. Although CT was developed as a bridge among MARC, MODS, DC, and QDC, I did not expect good mapping results with MARC records, because the mappings go from about 1,000 MARC tags and many subcodes to CT, which has only 12 common terms (fewer than Dublin Core) and 58 qualifiers (far fewer than the MARC tags). However, the MARC to CT mapping experiment with 400,000 MARCXML records from the UIUC Library shows a very high transfer rate of 95.27%, a non-transfer rate (loss of information rate) of 4.729%, and a semantic match rate of 100% (exactMatch: 54.2%; broadMatch: 45.79%) in terms of SKOS concepts. The perfect match rate, where the MARC tag, indicators, and subcode all match, is 16.1447%; the subcode match rate, where the tag and subcode match, is 42.33%; and the general match rate, where only the MARC tag number matches, is 41.525%.

In light of the results of the MIT to CT mapping experiment (transfer rate 99.9%, lexical match rate 98.7%, semantic match rate 100%, and loss of information rate 0.00463%) and of the UIUC experiment (transfer rate 95.27%, semantic match rate 100%, and loss of information rate 4.729%), we conclude that CT performs well in achieving and improving metadata interoperability, minimizing loss of information and preserving the specificity and precision of the source metadata records.

These successful results are part of a series of mapping experiments with conversions involving Harvard (MARC), MIT (QDC), and UIUC (MARCXML) metadata records. The remaining conversion, Harvard (MARC) to CT 1.1, is under development. Its aim is to achieve and improve metadata interoperability at the record level among the three universities' libraries and among MARC, QDC, and CT. This work would not have been possible without the cooperation of the three universities. I sincerely thank the people at Harvard, MIT, and UIUC who cooperated with the CT project by providing their metadata records. The mapping experiments are explained in more detail below.

I. The Provided Metadata Records from Harvard, MIT, and UIUC

Co-Director David Weinberger of Harvard Law Library Innovation Lab provided the link to the Harvard MARC records, http://openmetadata.lib.harvard.edu/bibdata, in May 2013. According to him, they are “the complete MARC records for over 99% of all the works in Harvard Library.” A total of 12 million MARC records were provided in 14 files. The 10 million UIUC MARCXML records were provided by Myung-Ja Han, Metadata Librarian and Associate Professor, through the link https://uofi.box.com/s/77xpmaavo16xopqswqvj. They were generated in 2010 as 89 XML files. The Harvard and UIUC records were used to investigate MARC tag usage and to build the MARC to CT crosswalk, both of which are described in Chapter 3. They will also be used to conduct the mapping experiment with the Harvard (MARC) to CT conversion. Sample metadata records from Harvard and UIUC follow.

Harvard MARC Metadata Records

=LDR 00635nam a2200205Ki 4500

=001 007000001-8

=005 20020606163548.3

=008 960730s1975\\\\un\\\\\\\\\\\\000\0\rus\d

=035 0\$aocm82110934

=040 \\$aHLS$cHLS

=090 \\$aDS33.2$b.D45x

=100 1\$aTambovskii, Konstantin Ivanovich.

=245 10$aK beregam Kua-Kam /$cK. Tambovskii.

=260 \\$aOdessa :$bMaiak,$c1975.

=300 \\$a94 p. :$bill. ;$c17 cm.

=490 0\$aGlazami sovetskikh moriakov

=690 \9$aMerchant marine$zRussia$xPersonal narratives.$5wid

=690 \9$aVietnamese Conflict, 1961-1975$xPersonal narratives$xRussian.$5wid

=988 \\$a20020608

=906 \\$0MH

UIUC MARCXML Metadata Records

<?xml version="1.0" encoding="UTF-8"?>

<collection xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd" xmlns="http://www.loc.gov/MARC21/slim">

<record>

<leader>00655nam 2200193 i 4500</leader>

<controlfield tag="001">200001</controlfield>

<controlfield tag="005">20020415161359.0</controlfield>

<controlfield tag="008">770519s1965 ilu 00000 eng d</controlfield>

<datafield tag="035" ind1=" " ind2=" ">

<subfield code="a">(OCoLC)ocm02977467</subfield>

</datafield>

<datafield tag="035" ind1=" " ind2=" ">

<subfield code="9">AAV-1954</subfield>

</datafield>

<datafield tag="040" ind1=" " ind2=" ">

<subfield code="a">VVB</subfield>

<subfield code="c">VVB</subfield>

<subfield code="d">UIU</subfield>

</datafield>

<datafield tag="245" ind1="0" ind2="2">

<subfield code="a">A rating system to improve job performance.</subfield>

</datafield>

<datafield tag="260" ind1="0" ind2=" ">

<subfield code="a">Chicago :</subfield>

<subfield code="b">Public Personnel Association,</subfield>

<subfield code="c">1965.</subfield>

</datafield>

<datafield tag="300" ind1=" " ind2=" ">

<subfield code="a">12 p. ;</subfield>

<subfield code="c">28 cm.</subfield>

</datafield>

<datafield tag="410" ind1="2" ind2="0">

<subfield code="a">Public Personnel Association.</subfield>

<subfield code="t">Personnel report ;</subfield>

<subfield code="v">no. 651</subfield>

</datafield>

<datafield tag="500" ind1=" " ind2=" ">

<subfield code="a">Cover title.</subfield>

</datafield>

<datafield tag="650" ind1=" " ind2="0">

<subfield code="a">Public schools</subfield>

<subfield code="z">California</subfield>

<subfield code="z">San Diego.</subfield>

</datafield>

<datafield tag="650" ind1=" " ind2="0">

<subfield code="a">Employees</subfield>

<subfield code="x">Rating of.</subfield>

</datafield>

<datafield tag="955" ind1=" " ind2=" ">

<subfield code="a">UIU</subfield>

<subfield code="b">38888113432219</subfield>

<subfield code="c">352692</subfield>

<subfield code="d">Stacks</subfield>

<subfield code="e">351.1 P963P</subfield>

<subfield code="f">1</subfield>

<subfield code="g">am</subfield>

<subfield code="h">v.651(1964)</subfield>

</datafield>

</record>
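The investigation of MARC tag usage mentioned above can be illustrated with a short Python sketch. It is not the program used for this project. It assumes MARCXML input in the MARC21 slim namespace, as in the UIUC sample, and the function name count_tag_usage and the abbreviated SAMPLE record are introduced here for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace of MARCXML records, as declared in the UIUC sample above
MARC_NS = "{http://www.loc.gov/MARC21/slim}"


def count_tag_usage(marcxml_text):
    """Count how often each MARC tag occurs in a MARCXML collection."""
    root = ET.fromstring(marcxml_text)
    usage = Counter()
    for record in root.iter(MARC_NS + "record"):
        for field in record:
            tag = field.get("tag")
            if tag is not None:  # the leader element has no tag attribute
                usage[tag] += 1
    return usage


# Abbreviated, illustrative version of the UIUC sample record
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
<record>
<leader>00655nam 2200193 i 4500</leader>
<controlfield tag="001">200001</controlfield>
<datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)ocm02977467</subfield></datafield>
<datafield tag="035" ind1=" " ind2=" "><subfield code="9">AAV-1954</subfield></datafield>
<datafield tag="245" ind1="0" ind2="2"><subfield code="a">A rating system to improve job performance.</subfield></datafield>
</record>
</collection>"""

usage = count_tag_usage(SAMPLE)
for tag, count in sorted(usage.items()):
    print(tag, count)
```

Counting by tag only is the coarsest view. The same loop could be extended to count indicator and subfield code combinations if a finer breakdown is needed.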

MIT QDC Metadata Record

The case of the MIT Library is somewhat different. According to Carl Jones, the MIT digital library systems manager who provided the Qualified Dublin Core records, OAI harvesting of DSpace metadata gives only simple, unqualified Dublin Core. To get Qualified Dublin Core via OAI, a DSpace configuration setting has to be changed. They made that change and provided the URL to harvest QDC from dome-dev.mit.edu, http://dome-dev.mit.edu/oai/request?verb=ListRecords&metadataPrefix=qdc&set=hdl_1721.3_82443. He said this is “one of our production repositories, dome.mit.edu. Dome contains mostly image records but also some text items, including "digital objects" from the Institute Archives. I chose dome-dev because it was easy for me to test the configuration changes needed to output QDC. The change required a Tomcat application server restart, which is only done at scheduled intervals on our production servers.” In addition, three CSV files were provided for Sloan Working Papers, Open Access Articles, and one MIT theses community for the Dept. of Urban Studies and Planning, “which contains undergrad, masters, and PhD thesis collections.” In total, 20 thousand QDC records were obtained as 28 XML files harvested via OAI and 3 CSV files. They were used to investigate the usage of QDC element names and to build the (Q)DC to CT crosswalk, both of which are described in Chapter 3. They are the foundation for the mapping experiment, in which the MIT (QDC) to CT conversion uses the MIT (QDC) to CT crosswalk and measures the transfer, lexical match, and semantic match rates.

<record>

<header>

<identifier>oai:dome-dev.mit.edu:1721.3/44226</identifier>

<datestamp>2012-07-24T15:59:28Z</datestamp>

<setSpec>hdl_1721.3_44225</setSpec>

</header>

<metadata>

<dcterms:spatial xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2006/01/06/dcterms.xsd http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2006/01/06/dc.xsd" xml:lang="en_US">Site: Chicago (Illinois, United States)</dcterms:spatial>

<dcterms:temporal xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2006/01/06/dcterms.xsd http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2006/01/06/dc.xsd" xml:lang="en_US">creation date: 1890-1892</dcterms:temporal>

<dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2006/01/06/dcterms.xsd http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2006/01/06/dc.xsd" xml:lang="en_US">Rockefeller, John D</dc:creator>

<dc:date xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2006/01/06/dcterms.xsd http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2006/01/06/dc.xsd" xml:lang="en_US">1890-1892</dc:date>

<dcterms:dateAccepted xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2006/01/06/dcterms.xsd http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2006/01/06/dc.xsd">2009-10-16T21:12:49Z</dcterms:dateAccepted>

<dcterms:available xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2006/01/06/dcterms.xsd http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2006/01/06/dc.xsd">2009-10-16T21:12:49Z</dcterms:available>

….

II. The Mapping Experiment with 20,000 QDC records of Massachusetts Institute of Technology (MIT)

The conversions are based on the developed crosswalks (e.g., MARC to CT and (Q)DC to CT). In particular, the MIT (QDC) to CT crosswalk was designed from the (Q)DC to CT crosswalk in order to develop the conversion program that converts the (Q)DC records of MIT into CT. The full crosswalk is in the link.

2. The Designed Conversion Program in Python

The Python conversion program not only converts MIT QDC records into CT but also measures the transfer rate, lexical match rate, and semantic match rate. The reported rates are percentages computed over every metadata statement in the input records, counting the elements that are mapped and not mapped to CT. First, to convert MIT QDC to CT, the CT namespaces are defined as follows so that the output can be validated as XML:

<CT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:schemaLocation="http://courseweb.lis.illinois.edu/~sunjin/CT/1.1/ http://courseweb.lis.illinois.edu/~sunjin/CT/1.1/ct.xsd"

xmlns="http://courseweb.lis.illinois.edu/~sunjin/CT/1.1/">

The MIT QDC records in an XML file are split on '<record>' to count how many records the file contains. There are 20,278 records in 31 files: 28 XML files and 3 CSV files. Because there are two file formats, two functions were designed: MITQDCtoCTconversion and MITQDCcsvtoCTconversion. To retrieve element names and contents, Python's string split method is used. Because each element carries long namespace URLs, as the MIT QDC records above show, different methods are used to retrieve them. The values transferred from MIT QDC also include special characters that cannot appear unescaped in valid XML. These characters were therefore changed before being transferred: '&' into '&amp;', '<' into '&lt;', and '>' into '&gt;'. The CT converted from MIT QDC is saved into new XML files. The full conversion program, including the details of how the transfer, lexical match, and semantic match rates are measured, is the designed conversion program.
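The record counting and character escaping described above can be sketched in Python as follows. This is a minimal illustration, not the project's MITQDCtoCTconversion function. The helper names count_records and ct_element are hypothetical, and the standard library functions escape and quoteattr are used here in place of whatever replacement code the original program contains.

```python
from xml.sax.saxutils import escape, quoteattr


def count_records(xml_text):
    """Count records by splitting on the opening <record> tag."""
    # split() yields one leading chunk before the first record, hence the -1
    return len(xml_text.split("<record>")) - 1


def ct_element(name, value, **attrs):
    """Build one CT element, escaping &, < and > in the transferred value."""
    attr_text = "".join(f" {key}={quoteattr(val)}" for key, val in attrs.items())
    return f"<{name}{attr_text}>{escape(value)}</{name}>"


harvested = "<record><header/></record>\n<record><header/></record>"
print(count_records(harvested))

# Illustrative values only; the first one is made up to exercise the escaping
print(ct_element("subject", "Land use & planning <draft>"))
print(ct_element("identifier", "130578", source="MIT"))
```

The ampersand has to be replaced before the angle brackets, otherwise the '&' in an already produced '&lt;' would be escaped a second time. The escape function handles that ordering internally.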

3. Methodology to Measure Lexical and Semantic Match Rates

Based on the MIT (QDC) to CT crosswalk, the QDC element names considered at the schema level are reconsidered for the MIT (QDC) to CT mapping experiment, in order to clarify which ones are lexically and semantically matched, perfectly or partially, with the Common Terms (properties) and qualifiers (sub-properties) of the Common Terminology. To measure the lexical and semantic match rates, the degree to which each element name matches lexically and semantically is measured. The rates are percentages over every metadata statement in the input records that is mapped to CT. As Table 5 shows, an example of a perfect lexical and semantic match is dc.contributor.author (MIT), which is mapped into CT:contributor role="author" authority="LCMARCrelators". The author role is defined in the authority LCMARCrelators. In this case, both the contributor and author terms of QDC match CT exactly, lexically and semantically, so it is counted as a perfect lexical and semantic match. Notably, 'isreplacedby' in 'dc.relation.isreplacedby', which is mapped into CT:relation type="replacement", is considered lexically and semantically the same term as 'replacement'. Similarly, 'requires' in 'dc.relation.requires' is considered the same as 'requirement'. As an example of a partial lexical and semantic match, dc.contributor.approver (MIT) is mapped into CT:contributor role="other" authority="LCMARCrelators". Only contributor in QDC matches CT exactly, so it is considered a partial lexical and semantic match. Even when a term is not fully matched lexically, it is rechecked to see whether it is semantically related, perfectly or partially. Since the CT terms were chosen to maximize lexical and semantic interoperability, all QDC terms are semantically matched, either perfectly or partially. The following is a part of the conversion program that measures lexical match rates. The full program is in the link.
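The author's listing is not reproduced on this page. As a stand-in, the tallying logic described above can be sketched in Python under stated assumptions: each crosswalk entry carries a hand-assigned lexical label (perfect, partial, or none), and the rate for each label is its share of all mapped metadata statements. The CROSSWALK excerpt uses only mappings mentioned in this page, with authority attributes omitted, and the function name lexical_match_rates is hypothetical.

```python
from collections import Counter

# Illustrative excerpt: QDC element -> (CT mapping, hand-assigned lexical label)
CROSSWALK = {
    "dc.contributor.author": ('contributor role="author"', "perfect"),
    "dc.contributor.approver": ('contributor role="other"', "partial"),
    "dc.relation.isreplacedby": ('relation type="replacement"', "perfect"),
    "dc.source": ('relation type="original"', "none"),
}

LABELS = ("perfect", "partial", "none")


def lexical_match_rates(element_names):
    """Return the percentage of mapped statements falling under each label."""
    counts = Counter()
    for name in element_names:
        if name in CROSSWALK:  # unmapped elements are not part of these rates
            counts[CROSSWALK[name][1]] += 1
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: 100.0 * counts[label] / total for label in LABELS}


# Toy input: one entry per metadata statement found in the records
statements = [
    "dc.contributor.author",
    "dc.contributor.author",
    "dc.contributor.approver",
    "dc.source",
    "eprint.grantNumber",
]
rates = lexical_match_rates(statements)
print(rates)
```

Because judgments such as treating 'isreplacedby' as the same term as 'replacement' are editorial, this sketch stores the label in the crosswalk instead of trying to derive it by string comparison. The semantic match rate could be tallied the same way with a second label per entry.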

4. The Results of the Mapping Experiment with MIT (QDC) Metadata Records

Running the conversion program, which is based on the MIT (QDC) to CT crosswalk, on the sample MIT (QDC) record above produces the following converted CT.

4.1. The Converted CT for a sample MIT (QDC) record

<?xml version="1.0" encoding="UTF-8"?>

<!--filename:MIT01.xml records:100 totalMITrecords:100-->

<CT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://courseweb.lis.illinois.edu/~sunjin/CT/1.1/ http://courseweb.lis.illinois.edu/~sunjin/CT/1.1/ct.xsd"

xmlns="http://courseweb.lis.illinois.edu/~sunjin/CT/1.1/">

<identifier source="MIT">oai:dome-dev.mit.edu:1721.3/44226</identifier>

<date type="other">2012-07-24T15:59:28Z</date>

<identifier type="collection">hdl_1721.3_44225</identifier>

<subject type="spatial">Site: Chicago (Illinois, United States)</subject>

<subject type="temporal">creation date: 1890-1892</subject>

<contributor role="creator" authority="LCMARCRelators">Rockefeller, John D</contributor>

<date>1890-1892</date>

<date type="available">2009-10-16T21:12:49Z</date>

<date type="issued">1890-1892</date>

<identifier source="MIT">130578</identifier>

<identifier source="MIT">http://hdl.handle.net/1721.3/44226</identifier>

<description>aerial view, Midway Plaisance and campus, 5/28/2005</description>

<format type="medium">concrete</format>

<relation type="isPartOf">138234</relation>

<rights>© Alex S. MacLean / Landslides</rights>

<subject>Land use, Urban</subject>

<subject>Universities and colleges</subject>

<subject>Architecture --United States</subject>

<subject>College campuses</subject>

<subject>University of Chicago</subject>

<subject>Athletic fields</subject>

<subject>Aerial photography --United States</subject>

<title>University of Chicago</title>

<typeGenre authority="DCMItype">Image</typeGenre>

The converted record presents the MIT (QDC) data much more clearly using the generalized and concise Common Terminology. It can be understood and described easily by anyone, which is the strongest point of the developed CT. It also shows visibly where the resource comes from, through source="MIT" in CT:identifier. With authorities and qualifiers (sub-properties) such as authority="DCMItype", the values of CT can be defined and constrained by the stated authorities and qualifiers. The qualifiers allow description at a level of detail comparable to MARC and MODS, so they act as a bridge between detailed MARC and (Q)DC. More converted CT is in an Example of the converted CT from the MIT (QDC) xml file.

4.2. The Transfer rate, Lexical and Semantic Match Rates

The designed program produces statistics for the transfer rate and the lexical and semantic match rates. There are 20,278 MIT (QDC) records in total. Since the qualifiers of CT 1.1 were developed to meet communities' needs, as in QDC, the transfer rate of the MIT (QDC) to CT 1.1 mapping is very high at 99.99537%. The non-transfer rate of 0.00463% is the loss of information rate in mapping MIT (QDC) to CT. Only the eprint.grantNumber elements in the MIT QDC records, which have no mapping into CT, were not transferred. This is a remarkably low loss of information rate, one not previously seen in the metadata field.
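The relationship between the transfer rate and the loss of information rate can be made concrete with a small Python sketch. It assumes that both rates are computed over every metadata statement and that a statement counts as transferred when its element name has a crosswalk entry. The names transfer_rates and MAPPED_ELEMENTS are hypothetical, and the toy statements below are not the experiment's data.

```python
def transfer_rates(element_names, mapped_elements):
    """Return (transfer %, non-transfer %) over all metadata statements."""
    total = len(element_names)
    if total == 0:
        return 0.0, 0.0
    transferred = sum(1 for name in element_names if name in mapped_elements)
    transfer_pct = 100.0 * transferred / total
    # Every statement is either transferred or lost, so the rates sum to 100
    return transfer_pct, 100.0 - transfer_pct


# Illustrative set of QDC elements that have a mapping into CT
MAPPED_ELEMENTS = {"dc.title", "dc.contributor.author", "dc.date.created"}

# Toy input: three mapped statements and one eprint.grantNumber statement
statements = [
    "dc.title",
    "dc.contributor.author",
    "dc.date.created",
    "eprint.grantNumber",
]
transfer_pct, non_transfer_pct = transfer_rates(statements, MAPPED_ELEMENTS)
print(f"transfer {transfer_pct:.2f}%, non-transfer {non_transfer_pct:.2f}%")
```

With the real input, the reported 99.99537% and 0.00463% are the two values this kind of calculation returns, and they sum to 100% in the same way.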

The total lexical match rate, including the perfect and partial match rates, is very high at 98.7% in the MIT (QDC) to CT 1.1 mapping experiment, which significantly improves lexical interoperability. In detail, the perfect lexical match rate is 54.02%, where both the QDC element and its qualifier match the CT terms (e.g., 'description.abstract': 'description type="abstract"'; 'contributor.author': 'contributor role="author" authority="LCMARCRelators"'). The partial lexical match rate is 44.707%, where either the element or the qualifier matches (e.g., 'contributor.department': 'contributor name="corporate"'; 'coverage.spatial': 'subject type="spatial"'). The no lexical match rate is very low at 1.265%, covering mappings such as 'dc.source': 'relation type="original"' that do not match lexically at all. However, the QDC terms with a partial or no lexical match are reinvestigated to determine whether they are semantically related.

The total semantic match rate is 100%, including perfect and partial semantic matches. In detail, the perfect semantic match rate is high at 85.836%, including mappings such as 'dc.date.created': 'date type="issued"' and 'dc.description.statementofresponsibility': 'rights type="holder"'. Although these terms are very different lexically, they are semantically matched. The partial semantic match rate is low at 14.164%, including 'dc.contributor.advisor': 'contributor role="other" authority="LCMARCRelators"'. The improved lexical and semantic interoperability means that CT 1.1 minimizes loss of information at the schema and record levels. It also means that CT 1.1 significantly reduces the gap in generality and specificity among the four selected standards (MARC, MODS, DC, and QDC).

III. Another Mapping Experiment with 400,000 MARCXML records of University of Illinois at Urbana-Champaign (UIUC) Library

Another mapping experiment was conducted with 400,000 MARCXML records from the UIUC Library. The conversion program for the MARCXML to CT mapping was developed in Python during September 2014. It also measures the transfer rate, the non-transfer rate, the degree of match rates, and the semantic match rate. From the results of the experiment, we conclude that CT performs well in achieving and improving metadata interoperability, minimizing loss of information and preserving the specificity and precision of the source metadata records. More details are explained in the MARCXML to CT Mapping Experiment web page.
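The three degrees of match reported for this experiment in the summary at the top of the page (perfect, subcode, and general) can be sketched in Python as a lookup that tries the most specific crosswalk key first. This is an assumption about how such a lookup could work, not the project's code. The function name match_degree is hypothetical, and the CROSSWALK entries are placeholders, not the actual MARC to CT crosswalk.

```python
from collections import Counter


def match_degree(tag, ind1, ind2, code, crosswalk):
    """Classify one MARC subfield by the most specific crosswalk key it hits."""
    if (tag, ind1, ind2, code) in crosswalk:
        return "perfect"  # tag, indicators and subcode all match
    if (tag, code) in crosswalk:
        return "subcode"  # tag and subcode match
    if (tag,) in crosswalk:
        return "general"  # only the tag number matches
    return None  # no mapping, so the value is not transferred


# Placeholder entries for illustration only
CROSSWALK = {
    ("650", " ", "0", "a"): "subject",
    ("245", "a"): "title",
    ("500",): "description",
}

# Toy statements in the form (tag, ind1, ind2, subcode)
statements = [
    ("650", " ", "0", "a"),
    ("245", "0", "2", "a"),
    ("500", " ", " ", "a"),
    ("955", " ", " ", "b"),
]
degrees = Counter(match_degree(*s, CROSSWALK) for s in statements)
print(degrees)
```

Statements that return None would count toward the non-transfer rate, and the remaining three labels would be divided by the number of transferred statements to give the match rates.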

Last Modified October 14, 2014