Process for metadata wrangling
Metadata was submitted in four different formats:
After all metadata had been submitted, we extracted all of the personal names and corporate bodies from the different files and compiled into one large Excel sheet. When possible, we retained information about the institution that submitted the name, the collection the name was used in, the metadata field with that name, and then the type of name (whether personal name or corporate body). This resulted in a list of nearly 500,000 names with many duplicates.
In order to retain the information about the institution, collection, and field name when the names were de-duplicated, we concatenated all of that information into one column of the spreadsheet in the format "[Institution code];[collection code];[field name]" (e.g. BYU;DUP;personal names). When the names were de-duplicated, this information was all compiled into one field separated by [space][double dash][space] (e.g. UU;EAD;n/a -- USU;Lee;creator).
In order to deduplicate the data for exact matches of names, a formula in Excel was used to compare the value of a cell with the value of the cell directly above it.
=IF(A2=A1,"",A2)
This formula does the following: if cell A2 is an exact match of cell A1, then print nothing (the empty quotes), otherwise enter the value of A2. After this formula was used on all of the names, the insitution/collection/field data was concatenated for all rows that had names that were exactly the same. This de-duplication reduced the number of names to 76,360.
After de-duplicating the data, we gathered some basic statistics to see how many institutions used each name, how many collection included the names, and how many states had an institution using the names.
The next metadata wrangling step was to reconcile the names with the Library of Congress Name Authority File using OpenRefine's reconciliation service.
The reconciliation workflow is based on Matt Carruthers's LCNAF-Named-Entity-Reconciliation.