Test Collection

We are now releasing SUSHI Test Collection for DryRun. You can download the test collection files with links below.

Subtask A: Searching Undigitized Content

Formal-runTest Collection Links:

Formal-run experiment control file (json file, version 1.1)
JSON for item-level metadata (json file, version 1.2)
JSON for folder-level metadata (json file, version 1.2)
SUSHI SNC translation file (MS-Excel file, version 1.3)
Qrels files for the 45 topics (dcoument, folder, box)
Qrels files for the 9 topics by sub assessors (dcoument, folder, box)

Dry-run Test Collection Links:

SUSHI document collection (Please be careful. This is a large 27GB zip file)
SUSHI Subtask A collection metadata file (MS-Excel file, version 1.1 updated on June 8)
Dry-run experiment control file (json file, version 1.1 updated on June 4)
Subtask A validation code (Python program)
Subtask A evaluation code (Python program, version 1.2 updated on June 30)
Subtask A folder qrels file (tsv file, version 1.1)
Subtask A box qrels file (tsv file, version 1.1)

Subtask B: Detecting Archival References

Formal-Run Test Collection Links:

Dry-Run Test Collection Links:

Dry-run citation file
Subtask B baseline & evaluation code (ipynb file, version 0.1 created on August 13)
Subtask B qrels file (tsv file, version 0.2 updated on August 13)
- This version of the Qrels only supports evaluation for archive reference detection and does not support evaluation for the span detection of an archive reference. The positions of the first and last characters correspond to those of a citation.

Subtask B: Annotations from Participants

File Links:

Manual annotations of the positives in the dry-run citations (json), created by Doug Oard
GPT-4o mini's annotations of the formal-run citations (tsv) created by Tokinori Suzuki
Manual annotation of the positives in the GPT-4o mini's annotations (json), created by Doug Oard and Tokinori Suzuki

Descriptions:

Manual Annotation of the Positive Examples in the Dry-Run Citations

Manually annotated the postitive examples in the qrel file in the Dry-Run citations based on the annotation rule in the bottom. In the json file, "1" indicates a positive citation and "0" indicates a negative citation.

GPT-4o mini Annotations of a portion of the Formal-Run Citations

Automatically annotated the first 10,604 citations in the Formal-Run citations using GPT-4o mini. The last column shows the judgements: "POS" for a positive citation and "NEG" for a negative citation. The prompt used for the GPT model is shown in the bottom.

GPT-4o mini Annotations of a portion of the Formal-Run Citations

Manually annotated the postitive outputs of the GPT-4o mini based on the annotation rule in the bottom. In the json file, "1" indicates a positive citation and "0" indicates a negative citation.

Doug's Annotation Rule

The basic annotation rule was that only citations that the annotor could recognize as being expected to be useful (in the context of the original paper from which the citation was extracted) for finding the LOCATION of specific information object in an archival repository would be marked as relevant.

According to this rule:

- A citation was marked as an archival reference if any part of the citation met the definition of an archival reference.

- An information object is a physical or digital container whose main purpose is to convey information. Examples include documents, films, photographs, or audio recordings. Physical objects that principally serve a different purpose (e.g., statues, jewelry, biological specimens, or artwork) were not considered information objects, even if they incidentally included information (e.g., as inscriptions on statues or signatures on paintings).

- A location in an archival repository could be general (the name of an archival repository) or specific (e.g., Fonds, Box or Folder).

- A location need not be complete to be useful because it was reasonable to assume that specific locations would have been contextualized within the document from which the citation had been extracted.

- The annotator was a native speaker of English with some knowledge of Spanish. The citation did not need to be written in English if the annotator could recognize that it specified a location in an archival repository.

- Content that the annotator (a native speaker of English) could not understand because of the character set (e.g., Cyrillic) would not be marked as archival references.

- Content that did not specify a location in an archive would be marked as not an archival reference, even if it described a kind of content (e.g., a very old document) that would likely be found in an archival repository, because only citations containing LOCATIONS in an archival repository were marked as archival references.

- Citations to documents were not marked as an archival reference unless it was clearly stated where that documents was located in an archival repository.

- Citations that described a document in which archival references could be found (e.g., the catalog of an archive) were marked as not an archival reference.

- Citations to information objects that were PRODUCED BY an archival repository were not marked as archival references. For example, if the US National Archives produced a pamphlet for public distribution, or if the British Library published a book, neither case would be considered an archival reference, even if it was clear that the pamphlet or the book was available from that archival repository.

- Cases in which a guess was needed (e.g., when an archive was named only using an acronym that the annotator did not recognize and when some suggestive sequence of letters and numbers might or might not have been a location in an archive) were marked as not an archival reference.

- Citations to published work that included content that had been obtained originally from an archive were not marked as archival references unless the location in an archival repository from which those materials had been obtained was included in the citation.

Prompt for GPT-4o mini

I would like to determine if a given text contains references to archival materials. Please follow these instructions:

If the text contains references to archives, respond with "Yes."

Analyze the references to identify any specific details, including:

Positions of the start and end of the reference in the text

Box number

Folder

Record Group

Names of series

Any other relevant archival information

If any of these details are present, return the information in JSON format.

If the text does not contain any references to archives, simply respond with "No."

Here is the text: [CITATION TEXT]

Page updated

Google Sites

Report abuse