Metadata Wrangling Workflow

Process for metadata wrangling

Metadata was submitted in four different formats:

    1. Controlled vocabulary list exported from CONTENTdm, with one value per line. These files were submitted as *.txt files. In addition to the authorized values, these files included some cross references in the format "Smith, John USE Smith, John, 1800-1880" (a parsing sketch follows this list).
    2. Full collection metadata export from CONTENTdm, submitted as a *.txt file. Because these files included all of the descriptive and technical metadata from the CONTENTdm collections, we had to identify the specific fields that were applicable to this project, i.e., those containing personal names or corporate bodies. To extract the useful data, we opened the text files in Excel and deleted the extra descriptive information.
    3. Spreadsheet of all names used in all CONTENTdm collections from that repository. Because this information duplicated the collection exports that were sent, we did not use the data from the spreadsheet.
    4. JSON-LD file of all names in the repository. Since the majority of the names we received for this project were in text files, we converted the JSON-LD files to CSV using http://www.convertcsv.com/json-to-csv.htm (a local conversion sketch also follows this list).
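
Because the cross references in the controlled vocabulary lists follow a simple "variant USE authorized" pattern, separating them from the authorized values is easy to script. The sketch below assumes one value per line and a literal " USE " separator, per the example above; the file name is invented.

    # Split a CONTENTdm controlled vocabulary export into authorized
    # values and cross references ("variant USE authorized").
    authorized = []
    cross_refs = []  # (variant, authorized) pairs

    with open("vocabulary.txt", encoding="utf-8") as f:  # hypothetical file name
        for line in f:
            value = line.strip()
            if not value:
                continue
            if " USE " in value:
                variant, preferred = value.split(" USE ", 1)
                cross_refs.append((variant.strip(), preferred.strip()))
            else:
                authorized.append(value)

    print(len(authorized), "authorized values")
    print(len(cross_refs), "cross references")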
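
The JSON-LD conversion can also be done locally instead of through the web converter. This is a minimal sketch that assumes the name records sit in a top-level "@graph" array with "@id" and "label" keys; the structure of the actual submitted files may differ, and the file names are invented.

    import csv
    import json

    # Flatten a JSON-LD names file into a two-column CSV.
    with open("names.jsonld", encoding="utf-8") as f:
        data = json.load(f)

    # Assumed layout: records under "@graph", or a bare top-level array.
    records = data.get("@graph", []) if isinstance(data, dict) else data

    with open("names.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "label"])
        for record in records:
            writer.writerow([record.get("@id", ""), record.get("label", "")])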

After all metadata had been submitted, we extracted all of the personal names and corporate bodies from the different files and compiled them into one large Excel sheet. When possible, we retained information about the institution that submitted the name, the collection the name was used in, the metadata field in which the name appeared, and the type of name (personal name or corporate body). This resulted in a list of nearly 500,000 names with many duplicates.
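
Although this compilation was done in Excel, the same step can be sketched in code. A rough pandas version, with invented file names, codes, and column layout:

    import pandas as pd

    # (institution, collection, field, path) for each extracted name list;
    # all values here are hypothetical.
    sources = [
        ("BYU", "DUP", "personal names", "byu_dup_names.csv"),
        ("USU", "Lee", "creator", "usu_lee_names.csv"),
    ]

    frames = []
    for institution, collection, field, path in sources:
        df = pd.read_csv(path, header=None, names=["name"])  # one name per line
        df["institution"] = institution
        df["collection"] = collection
        df["field"] = field
        frames.append(df)

    # One large table: one row per name occurrence, with its provenance.
    all_names = pd.concat(frames, ignore_index=True)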

In order to retain the institution, collection, and field information when the names were de-duplicated, we concatenated all of it into one column of the spreadsheet in the format "[Institution code];[collection code];[field name]" (e.g. BYU;DUP;personal names). When the names were de-duplicated, these values were compiled into one field separated by [space][double dash][space] (e.g. UU;EAD;n/a -- USU;Lee;creator).

In order to de-duplicate exact matches of names, the list was sorted alphabetically (so that duplicate names sat in adjacent rows) and a formula in Excel was used to compare the value of each cell with the value of the cell directly above it:

=IF(A2=A1,"",A2)

This formula does the following: if cell A2 is an exact match of cell A1, print nothing (the empty quotes); otherwise, print the value of A2. After this formula was applied to all of the names, the institution/collection/field data was concatenated for all rows that had identical names. This de-duplication reduced the number of names to 76,360.
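
For reference, the same de-duplication and provenance merge can be expressed in pandas. A minimal sketch; the sample rows simply reuse the example codes from above, and the column names follow the compilation sketch earlier on this page:

    import pandas as pd

    # Sample rows for illustration, in the provenance format described above.
    all_names = pd.DataFrame({
        "name": ["Lee, John", "Lee, John", "Young, Brigham"],
        "provenance": ["UU;EAD;n/a", "USU;Lee;creator", "BYU;DUP;personal names"],
    })

    # Collapse exact name matches, joining provenance with " -- ".
    deduped = (
        all_names.groupby("name", sort=True)["provenance"]
        .agg(" -- ".join)
        .reset_index()
    )
    print(deduped)  # "Lee, John" keeps "UU;EAD;n/a -- USU;Lee;creator"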

After de-duplicating the data, we gathered some basic statistics: how many institutions used each name, how many collections included it, and how many states had an institution using it.
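
Given provenance strings in the format described above, these counts can be derived per name. A minimal sketch, with a hypothetical institution-to-state lookup table:

    # Count institutions, collections, and states for one name's
    # combined provenance string.
    STATES = {"UU": "UT", "USU": "UT", "BYU": "UT"}  # hypothetical lookup

    def name_stats(provenance):
        entries = [p.split(";") for p in provenance.split(" -- ")]
        institutions = {e[0] for e in entries}
        collections = {(e[0], e[1]) for e in entries}
        states = {STATES[i] for i in institutions if i in STATES}
        return len(institutions), len(collections), len(states)

    print(name_stats("UU;EAD;n/a -- USU;Lee;creator"))  # (2, 2, 1)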

The next metadata wrangling step was to reconcile the names against the Library of Congress Name Authority File (LCNAF) using OpenRefine's reconciliation service.

The reconciliation workflow is based on Matt Carruthers's LCNAF-Named-Entity-Reconciliation.
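
That workflow runs inside OpenRefine, so it is not reproduced here. As a rough illustration of the underlying idea only — look each name up against LCNAF and keep candidate matches — the sketch below queries id.loc.gov's suggest service; the endpoint and its OpenSearch-style response format are assumptions, and Carruthers's recipe does the matching through OpenRefine rather than this way.

    import json
    import urllib.parse
    import urllib.request

    def lcnaf_candidates(name):
        # id.loc.gov suggest service (assumed OpenSearch-style response:
        # [query, [labels], [descriptions], [uris]]).
        url = ("https://id.loc.gov/authorities/names/suggest/?q="
               + urllib.parse.quote(name))
        with urllib.request.urlopen(url) as resp:
            _, labels, _, uris = json.load(resp)
        return list(zip(labels, uris))

    # Print up to three candidate LCNAF matches for a sample name.
    for label, uri in lcnaf_candidates("Young, Brigham, 1801-1877")[:3]:
        print(label, uri)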