LLM-based Subject Tagging for
the TIB Technical Library's Open-Access Catalog
Theme: The Development of Energy- and Compute-Efficient LLM Systems
The 2nd LLMs4Subjects Shared Task | GermEval'25 @ Konvens 2025, Hildesheim, Germany
LLMs4Subjects challenges the research community to develop cutting-edge LLM-based solutions for subject tagging of technical records from Leibniz University’s Technical Library (TIBKAT). Participants are tasked with leveraging large language models (LLMs) to tag technical records using the GND taxonomy. The task involves bilingual language modeling, as systems must process technical documents in both German and English. Successful solutions may be integrated into the operational workflows of TIB, the Leibniz Information Centre for Science and Technology.
TIB, the Leibniz Information Centre for Science and Technology and University Library, champions free knowledge access, information sharing, and open scientific publications and data. As the German National Library for Science and Technology, as well as for Architecture, Chemistry, Computer Science, Mathematics, and Physics, it maintains a globally unique collection, including audiovisual media and research data.
TIBKAT is TIB’s open-access bibliographic database for science and technology. All metadata generated for TIBKAT, including freely available electronic collections, are released under the CC0 1.0 Universal Public Domain Dedication license. This allows free use of these materials. More information can be found at TIB Open Data Services (https://www.tib.eu/en/services/open-data), particularly in the section "TIBKAT data and Metadata of freely available electronic collections."
The TIBKAT open-access bibliographic database organizes its records into 28 predefined domains, with each record potentially assigned to multiple domains. This classification is based on the Fachsystematik LinSearch, a domain-specific taxonomy developed using Annif, an open-source toolkit for automated subject indexing. More information about the Fachsystematik LinSearch domains is available on the TIB Terminology Service platform (https://terminology.tib.eu/ts/ontologies/linsearch).
The GND (Gemeinsame Normdatei in German or Integrated Authority File in English) is an international authority file used mainly by German-speaking libraries to catalog and link information about people, organizations, topics, and works. It is publicly available for download in various formats under a CC0 license. The GND's records cover entities such as persons, corporate bodies, conferences, geographic locations, subject headings, and works relevant to cultural and scientific collections.
For the LLMs4Subjects shared task, only the GND subject heading (Sachbegriff) records are relevant. To simplify access for participants, we provide a preprocessed, human-readable GND taxonomy that is available for direct download. More information is available on the Data and Tasks page.
Participants in the LLMs4Subjects shared task are invited to develop LLM-based systems that classify TIB technical records into domains and index them with subjects drawn from the entire GND subjects collection. The input to the systems will be a technical record's title and abstract, and the expected output depends on the subtask. For subject classification (subtask 1), the expected output is one or more of the 28 predefined domains. For subject indexing (subtask 2), the expected output is a customizable top-k list of relevant GND subjects. Since input technical records can be in either English or German, systems should be capable of bilingual semantic language processing. The dataset for subtask 1 will be derived from the datasets used in subtask 2, ensuring consistency across classification and indexing efforts.
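The input and output described above can be illustrated with a minimal Python sketch. It is not part of the official task tooling: the `TechnicalRecord` class, the function names, the 0.5 threshold, and the placeholder domain and subject labels are all assumptions made for illustration, and the scores stand in for whatever an actual LLM-based system would produce.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TechnicalRecord:
    """Input to a system: a record's title and abstract (hypothetical schema)."""
    title: str
    abstract: str

    def as_prompt_text(self) -> str:
        # Join the two input fields into a single text for a model.
        return f"{self.title}\n\n{self.abstract}"


def classify_domains(domain_scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Subject classification: return one or more domains for a record.

    Keeps every domain scoring at or above the threshold (an assumed value);
    if none qualifies, falls back to the single best-scoring domain.
    """
    selected = sorted(d for d, s in domain_scores.items() if s >= threshold)
    if not selected and domain_scores:
        selected = [max(domain_scores, key=domain_scores.get)]
    return selected


def top_k_subjects(subject_scores: Dict[str, float], k: int) -> List[str]:
    """Subject indexing: return the k best-scoring subjects, best first."""
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(subject_scores.items(), key=lambda item: (-item[1], item[0]))
    return [subject for subject, _ in ranked[:k]]


if __name__ == "__main__":
    record = TechnicalRecord(title="Example title", abstract="Example abstract.")
    print(record.as_prompt_text())
    # Scores would come from a model; these values are made up for illustration.
    print(classify_domains({"domain-a": 0.8, "domain-b": 0.3}))
    print(top_k_subjects({"subject-x": 0.4, "subject-y": 0.9, "subject-z": 0.7}, k=2))
```

Keeping `k` as a parameter mirrors the "customizable top-k" wording of the indexing subtask, while the threshold-plus-fallback rule is only one of several ways to guarantee at least one domain per record.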
LLMs4Subjects aims to explore the untapped potential of LLM-based solutions for subject classification or tagging. The task is based on TIBKAT, TIB's open-access collection, which comprises over 100,000 records of types such as technical reports, publications, and books, available mainly in English and German, and classified according to the GND subjects taxonomy.
The opportunities presented by LLMs for subject classification include their ability to comprehend natural language at an unprecedented scale and depth, enabling nuanced semantic distinction and classification of complex and interdisciplinary subjects. This capability can significantly enhance the accuracy and efficiency of organizing large collections, directly impacting the accessibility and discoverability of information.
The solutions developed for this shared task can influence the application of LLMs in modern digital library systems, promoting innovation and setting new standards in the field, similar to projects like Annif (https://annif.org/). Furthermore, this shared task evaluates a novel application of computational semantics crucial for organizing and accessing information efficiently. For details on the planned tasks in this iteration of LLMs4Subjects, please see the "Data & Tasks" page.