LLM-based Subject Tagging for
the TIB Technical Library's Open-Access Catalog
Theme: The Development of Energy- and Compute-Efficient LLM Systems
The 2nd LLMs4Subjects Shared Task | GermEval'25 @ Konvens 2025, Hildesheim, Germany
A portion of the TIBKAT collection has been designated as the blind test dataset, with subject domain annotations hidden. Participants will receive technical records of various types in English and German, matching the training dataset, to process with their systems. Each participant must submit a list of relevant subject domains for each record. This setup evaluates how well systems prioritize diverse subjects across different records and, when applicable, within individual records.
For this multi-label classification subtask, participants will submit their system outputs via Codabench. The TIBKAT annotations will be hosted on the competition platform but will remain hidden throughout the evaluation phase. System performance will be assessed using the following metrics:
Macro-Averaged Precision / Recall / F1
Micro-Averaged Precision / Recall / F1
These metrics will provide a comprehensive evaluation of how well each system identifies relevant subject domains across different records. Evaluation results will be reported at multiple levels of granularity, including:
Language level: Separately for English and German
Record level: For each of the five types of technical records
Combined language and record level: Detailed evaluations combining both language and record type
The overall team ranking will be determined based on the F1 score computed across all records, regardless of granularity level. You can read an insightful introduction to these metrics here.
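For concreteness, here is a minimal sketch of how these macro- and micro-averaged scores can be computed, assuming the gold and predicted subject domains are available as label sets per record; scikit-learn and the toy domain names are illustrative choices, not the official scorer.

```python
# Minimal sketch of the subtask 1 metrics (not the official scorer), assuming
# gold and predicted subject domains are given as label sets per record.
# The domain names below are toy values, not actual TIBKAT annotations.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import precision_recall_fscore_support

gold = [{"Engineering", "Energy"}, {"Chemistry"}]        # gold domains per record
pred = [{"Engineering"}, {"Chemistry", "Mathematics"}]   # system output per record

mlb = MultiLabelBinarizer()
mlb.fit(gold + pred)                  # binarize over the union of observed labels
y_true = mlb.transform(gold)
y_pred = mlb.transform(pred)

for average in ("macro", "micro"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=average, zero_division=0
    )
    print(f"{average}-averaged: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```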
We have designated a portion of the TIBKAT collection as the blind test dataset, where subject heading annotations will not be visible. Participants will receive a set of technical records of various types, matching the training dataset, in both English and German, to input into their systems. Each participant must submit a ranked list of the top 20 relevant subjects for each record, ordered by descending relevance. This approach supports subject specialists by narrowing the focus to 20 subjects, considering the practical constraints of user attention and the extensive GND subject list. The diversity of the technical records presents an opportunity to evaluate how well systems can prioritize a varied range of subjects across records, and even within a single record when applicable.
As with the subtask 1 quantitative evaluation, participants will submit their system outputs via Codabench. The TIBKAT annotations will be hosted on the competition platform and will remain hidden from participants throughout the evaluation phase to ensure a fair assessment.
Evaluation metrics will be computed across multiple values of k (i.e., k = 5, 10, 15, 20) to comprehensively assess the performance of participating systems.
The primary evaluation metric used to assess system performance will be:
nDCG@k (Normalized Discounted Cumulative Gain), which will account for the ranked ordering of the applicable subjects, rewarding systems that place more relevant subjects higher in the list. This will be the primary metric in this iteration of the shared task and will provide a more nuanced evaluation of how well systems rank subjects in relation to the input records. More details can be found here.
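As an illustration only, and not the official scorer, the sketch below computes nDCG@k for a single record under a binary-relevance assumption (a predicted GND subject earns credit only if it appears in the gold annotations); the GND identifiers shown are hypothetical placeholders.

```python
# Illustrative nDCG@k for one record with binary gains and log2 discounting.
# Assumes a predicted subject is relevant iff it is in the gold set; the GND
# identifiers are hypothetical placeholders, and the official scorer may differ.
import math

def ndcg_at_k(ranked_preds, gold, k):
    gains = [1.0 if s in gold else 0.0 for s in ranked_preds[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold), k)))
    return dcg / idcg if idcg > 0 else 0.0

gold = {"gnd:4014224-5", "gnd:4047390-9"}                                      # toy gold subjects
ranked = ["gnd:4014224-5", "gnd:4123037-1", "gnd:4047390-9", "gnd:4059205-4"]  # toy system ranking
for k in (5, 10, 15, 20):
    print(f"nDCG@{k} = {ndcg_at_k(ranked, gold, k):.3f}")
```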
In addition to nDCG@k, the following metrics will be reported as secondary metrics, allowing for comparison with the previous iteration of the shared task:
Average Precision@k
Average Recall@k
Average F1-score@k
You can read an insightful introduction to these metrics here.
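A matching sketch for the secondary metrics on a single record, again under the binary-relevance assumption; averaging these per-record values over the test set yields the Average Precision@k / Recall@k / F1-score@k figures. The toy data are hypothetical and the official definitions may differ in detail.

```python
# Illustrative Precision@k / Recall@k / F1@k for one record, assuming binary
# relevance against the gold subject set; averaging over records gives the
# reported "Average" variants. Not the official scorer.
def prf_at_k(ranked_preds, gold, k):
    top_k = ranked_preds[:k]
    hits = sum(1 for s in top_k if s in gold)
    precision = hits / k
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1

gold = {"gnd:4014224-5", "gnd:4047390-9"}                                      # toy gold subjects
ranked = ["gnd:4014224-5", "gnd:4123037-1", "gnd:4047390-9", "gnd:4059205-4"]  # toy system ranking
for k in (5, 10, 15, 20):
    p, r, f1 = prf_at_k(ranked, gold, k)
    print(f"k={k}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```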
Given the nature of the LLMs4Subjects shared task dataset, evaluation scores will be released at varying levels of granularity to provide sufficient insight into system performance. The evaluation granularities will be:
Language level: Separately for English and German
Record level: For each of the five types of technical records
Subject-domain level: Evaluation across 28 different subject domains
Combined language and record level: Detailed evaluations combining both language and record type
Combined language and subject-domain level: Detailed evaluations combining both language and subject domain
Combined record and subject-domain level: Detailed evaluations combining both record type and subject domain
Combined language, record, and subject-domain level: Detailed evaluations combining language, record type, and subject domain
All listed metrics will be computed at each of these granularity levels to offer a detailed understanding of system performance. In addition to the individual granularity-level evaluations, the overall team ranking will be calculated based on the average nDCG score across all records, irrespective of granularity level.
This approach aims to encourage detailed discussion of participant systems in the eventual task overview and the respective system description papers.
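As a purely hypothetical illustration of how such granularity-level scores could be aggregated, the sketch below assumes per-record nDCG@20 values have already been computed and stored alongside language and record-type metadata; the column names, record types, and values are invented, and the official evaluation may aggregate differently.

```python
# Hypothetical aggregation of per-record scores at different granularities.
# Column names, record types, and values are invented for illustration only.
import pandas as pd

scores = pd.DataFrame([
    {"lang": "en", "record_type": "Article", "ndcg_at_20": 0.81},
    {"lang": "de", "record_type": "Book",    "ndcg_at_20": 0.74},
    {"lang": "en", "record_type": "Book",    "ndcg_at_20": 0.69},
    {"lang": "de", "record_type": "Report",  "ndcg_at_20": 0.77},
])

print(scores.groupby("lang")["ndcg_at_20"].mean())                    # language level
print(scores.groupby("record_type")["ndcg_at_20"].mean())             # record level
print(scores.groupby(["lang", "record_type"])["ndcg_at_20"].mean())   # combined level
print(scores["ndcg_at_20"].mean())                                    # overall ranking score
```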
A portion of the test dataset will be sampled for manual evaluation by TIB subject specialists. The evaluation will take place over a one- or two-week cycle. During this period, test records will be presented to the subject specialists via an interface that includes the predicted ranked subjects. Specialists will have the ability to select all relevant subjects from the list as well as add any missing ones. This setting aims to emulate, as closely as possible, the dynamic real-world application scenario, assessing the usefulness of the system-generated results to the subject specialists. A summary of these evaluations will also be presented using the average precision@k, recall@k, F1-score@k and nDCG@k metrics.
With these evaluations, we aim to offer a well-rounded perspective of the solutions submitted to the LLMs4Subjects shared task. We encourage participants to develop high-quality semantic subject comprehension systems for the GND taxonomy. While you may use the TIBKAT training set annotations, we also welcome systems that recognize limitations in these annotations and focus solely on achieving a thorough understanding of GND subjects.
We are hosting submissions and evaluation through the Codabench platform. You can submit your predictions for each subtask and view the leaderboard here: https://www.codabench.org/competitions/8373/. Please follow the submission format guidelines provided in the "evaluation" section on Codabench or at https://github.com/sciknoworg/llms4subjects/tree/main/submission-format.