We have designated a portion of the TIBKAT collection as the blind test dataset, where subject heading annotations will not be visible. Participants will receive a set of technical records of the same types as in the training dataset, in both English and German, to input into their systems. Each participant must submit a ranked list of the top 50 relevant subjects for each record, ordered by descending relevance. Capping the list at 50 subjects supports subject specialists by keeping the output manageable, given the practical constraints of user attention and the size of the GND subject vocabulary. The diversity of the technical records presents an opportunity to evaluate how well systems can prioritize a varied range of subjects across records, and even within a single record when applicable.
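As an illustration, the sketch below assembles a per-record ranked list of the top 50 subjects from raw model scores. The JSON layout and field names are assumptions made for illustration only; the official submission format is the one defined by the task organizers.

```python
# Minimal sketch of preparing a submission: for each test record, keep the top 50
# predicted GND subjects sorted by descending score. The "record_id"/"gnd:..."
# identifiers and JSON layout are illustrative assumptions, not the official format.
import json

def build_submission(predictions, k=50):
    """predictions: dict mapping a record ID to a list of (gnd_subject_id, score) pairs."""
    submission = {}
    for record_id, scored_subjects in predictions.items():
        ranked = sorted(scored_subjects, key=lambda pair: pair[1], reverse=True)
        submission[record_id] = [subject_id for subject_id, _ in ranked[:k]]
    return submission

if __name__ == "__main__":
    preds = {"rec-001": [("gnd:4123456-7", 0.92), ("gnd:4001183-5", 0.87)]}
    print(json.dumps(build_submission(preds), indent=2))
```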
System performance was assessed using the following metrics:
Avg. Precision@k (k = 5, 10, 15, 20, 25, 30, 35, 40, 45, 50)
Avg. Recall@k (k = 5, 10, 15, 20, 25, 30, 35, 40, 45, 50)
Avg. F1-score@k (k = 5, 10, 15, 20, 25, 30, 35, 40, 45, 50)
While each metric has its strengths and weaknesses, we selected avg. precision@k, recall@k, and F1-score@k because subject tagging does not involve a ranked ordering among applicable subjects: all relevant subjects are equally applicable.
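The following is a minimal sketch of how these metrics can be computed per record and averaged, assuming gold annotations are sets of GND subject identifiers and predictions are ranked lists; the released shared task evaluation script remains the authoritative reference.

```python
# Minimal sketch of the @k metrics. Gold subjects are a set of GND identifiers,
# predictions a list ranked by descending relevance. Precision@k divides by k and
# recall@k by the number of gold subjects; exact conventions may differ from the
# official evaluation script.
def precision_recall_f1_at_k(gold, ranked_predictions, k):
    top_k = ranked_predictions[:k]
    hits = sum(1 for subject in top_k if subject in gold)
    precision = hits / k if k else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def average_at_k(records, k):
    """records: list of (gold_subject_set, ranked_prediction_list) pairs."""
    scores = [precision_recall_f1_at_k(gold, preds, k) for gold, preds in records]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))

if __name__ == "__main__":
    records = [({"gnd:A", "gnd:B"}, ["gnd:A", "gnd:C", "gnd:B", "gnd:D", "gnd:E"])]
    for k in (5, 10, 15, 20, 25, 30, 35, 40, 45, 50):
        print(k, average_at_k(records, k))
```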
Given the nature of the LLMs4Subjects shared task dataset, evaluation scores were released at varying levels of granularity to provide sufficient insight into system performances. The evaluation granularities were:
Language-level: Separately for English and German
Record-level: For each of the five types of technical records
Combined Language and Record-levels: Detailed evaluations combining both language and record type
This approach aimed to encourage detailed discussion of participant systems in the eventual task overview and the respective system description papers. The shared task evaluation script was released for transparency in the calculations.
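To make the granularities concrete, a sketch such as the following groups per-record scores by language, record type, or both before averaging. It reuses `average_at_k` from the metric sketch above and assumes each record carries language and record-type labels under the field names shown; the released evaluation script defines the actual computation.

```python
# Illustrative grouping of records for the language-level, record-level, and
# combined evaluation reports. Field names ('gold', 'predicted', 'language',
# 'record_type') are assumptions for this sketch.
from collections import defaultdict

def grouped_averages(records, k, group_key):
    """records: list of dicts; group_key maps a record to its evaluation group.
    Reuses average_at_k from the metric sketch above."""
    groups = defaultdict(list)
    for rec in records:
        groups[group_key(rec)].append((rec["gold"], rec["predicted"]))
    return {group: average_at_k(items, k) for group, items in groups.items()}

# Language-level, record-level, and combined granularities:
# grouped_averages(records, 10, lambda r: r["language"])
# grouped_averages(records, 10, lambda r: r["record_type"])
# grouped_averages(records, 10, lambda r: (r["language"], r["record_type"]))
```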
A portion of the test dataset was sampled for manual evaluation by TIB subject specialists. The evaluation took place over a three-week cycle. During this period, test records were presented to the subject specialists via an interface that included the predicted ranked subjects, and the specialists selected all relevant subjects from the list. This setting aimed to emulate, as closely as possible, the dynamic real-world application scenario, assessing the usefulness of the system-generated results to the subject specialists. A summary of these evaluations was also presented using the average precision@k, recall@k, and F1-score@k metrics.
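Conceptually, the specialists' selections can be treated as the gold set for the same @k metrics, as in the short sketch below, which reuses `precision_recall_f1_at_k` from the metric sketch above; the interface and data fields are assumptions, not a description of the actual evaluation tooling.

```python
def score_against_specialist(ranked_predictions, specialist_selections, k):
    """specialist_selections: subjects the TIB specialist marked as relevant.
    Reuses precision_recall_f1_at_k from the metric sketch above."""
    gold = set(specialist_selections)
    return precision_recall_f1_at_k(gold, ranked_predictions, k)
```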
With these evaluations, we aimed to offer a well-rounded perspective of the solutions submitted to the LLMs4Subjects shared task.
More details will be provided soon.