1. Cleaning up the Excel File

Each page of a publication is in its own row, which contains a title and the text. The metadata for each publication is in the row after the publication's last page and contains the title, contributor, subject, Watsonline record number, etc. The text will need to be exported and then merged to create one text file for each publication. We will use the Watsonline record number to name the exported text. The exported text should follow this scheme: [bibrecord]-[x].txt.

  1. Copy the Watsonline record number and paste it into the empty cell directly above it (the metadata comes after the last page, not before the first)
  2. Add a “-1” after it, e.g. b12345678-1
  3. Use auto-fill to populate the remaining pages (remember to drag up, not down)

The filename for the first page should reflect the total number of pages in the publication (if there are 148 pages, the first page should be b12345678-148). Repeat the process for the remaining publications in the Excel file.
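To check the result of the auto-fill, the expected names can be generated with a short script. This is only a sketch: the function name page_filenames is hypothetical, and it assumes the pages are listed in sheet order, with the last page directly above the metadata row.

```python
def page_filenames(bib_record: str, page_count: int) -> list[str]:
    """Return the expected filenames in sheet order (first page to last page).

    Assumption: the last page sits directly above the metadata row and gets -1,
    so the numbers count down from the total as you read down the sheet.
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    return [f"{bib_record}-{n}.txt" for n in range(page_count, 0, -1)]


names = page_filenames("b12345678", 148)
print(names[0])   # b12345678-148.txt (first page)
print(names[-1])  # b12345678-1.txt (last page, directly above the metadata)
```

If the first name printed does not match the first page in the sheet, the auto-fill was probably dragged in the wrong direction or stopped short.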

The order does not matter, since the text is being text-mined and not read by people. If you want to keep the pages in order, start with the first page instead of the last.

Note on multi-volumes:

Multi-volume items are usually designated with _v1, _v2, and so forth, appended to the page name. If the volumes follow one another in the Excel document, treat them as one item and auto-fill both volumes together. The total number of pages should reflect the total for the whole set, not for an individual volume.
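As a rough illustration of numbering a set as one item, the sketch below adds up the page counts of the volumes and numbers the whole run. The function name set_page_filenames and the page counts are made up for the example; it assumes the volumes sit one after another in the sheet and share a single record number.

```python
def set_page_filenames(bib_record: str, volume_page_counts: list[int]) -> list[str]:
    """Number all pages of a multi-volume set as a single item.

    Assumption: the volumes are adjacent in the sheet, so the numbering runs
    across volume boundaries and counts down from the total for the set.
    """
    total_pages = sum(volume_page_counts)
    return [f"{bib_record}-{n}.txt" for n in range(total_pages, 0, -1)]


# Hypothetical two-volume set: 120 pages in _v1 and 95 pages in _v2
names = set_page_filenames("b12345678", [120, 95])
print(len(names))  # 215
print(names[0])    # b12345678-215.txt
```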

If an item has a volume designation but isn't preceded or followed by another volume, add the volume designation when creating the filename, using the scheme [bibrecord]-[_v1]-[x].txt.
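A small sketch of the stand-alone volume scheme follows. It assumes the scheme is read literally, so the designation keeps its leading underscore and is joined to the other parts with hyphens; the function name volume_page_filename is hypothetical.

```python
def volume_page_filename(bib_record: str, volume: str, page_number: int) -> str:
    """Build a filename following [bibrecord]-[_v1]-[x].txt for a stand-alone volume.

    Assumption: the designation is passed as it appears in the page name
    (for example "_v1") and is kept unchanged in the filename.
    """
    if not volume.startswith("_v"):
        raise ValueError("volume designation is expected to look like _v1, _v2, ...")
    return f"{bib_record}-{volume}-{page_number}.txt"


print(volume_page_filename("b12345678", "_v1", 1))  # b12345678-_v1-1.txt
```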