LLMs4Subjects is the first shared task of its kind, challenging the research community to develop LLM-based solutions for subject tagging of technical records from TIBKAT, the open-access catalog of TIB, the Leibniz Information Centre for Science and Technology. Participants use large language models (LLMs) to tag technical records with subjects from the GND taxonomy. The task involves bilingual language modeling, as systems must process technical documents in both German and English. Successful solutions may be integrated into TIB’s operational workflows.
TIB, the Leibniz Information Centre for Science and Technology and University Library, champions free knowledge access, information sharing, and open scientific publications and data. As the German National Library for Science & Technology, as well as for Architecture, Chemistry, Computer Science, Mathematics, and Physics, it maintains a globally unique collection, including audiovisual media and research data.
TIBKAT is TIB’s open access bibliographic database for science and technology. All metadata generated for TIBKAT, including freely available electronic collections, are released under the CC0 1.0 Universal Public Domain Dedication license. This allows free use of these materials. More information can be found at TIB Open Data Services (https://www.tib.eu/en/services/open-data), particularly in the section "TIBKAT data and Metadata of freely available electronic collections."
The GND (Gemeinsame Normdatei in German or Integrated Authority File in English) is an international authority file used mainly by German-speaking libraries to catalog and link information about people, organizations, topics, and works. It is publicly available for download in various formats under a CC0 license. The GND's records cover entities such as persons, corporate bodies, conferences, geographic locations, subject headings, and works relevant to cultural and scientific collections.
For the LLMs4Subjects shared task, only the GND subject heading (Sachbegriff) records are relevant. To simplify access for participants, we provide a preprocessed, human-readable version of the GND subject taxonomy for direct download. More information is available on the "Data & Tasks" page.
Participants in the LLMs4Subjects shared task are invited to develop LLM-based systems that recommend the most relevant subjects from the entire GND subjects collection to tag a given TIB technical record. The input to the systems will be a technical record's title and abstract, and the expected output is a customizable top-k list of relevant GND subjects. Since input technical records can be either in English or German, systems should be capable of bilingual semantic language processing.
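The input/output contract described above can be sketched in a few lines of Python. This is only an illustration of the interface, not a reference system: the names `Subject`, `recommend_subjects`, and `TOY_SUBJECTS` are hypothetical, the subject codes are placeholders rather than real GND identifiers, and plain token overlap stands in for the LLM-based relevance scoring that participants are expected to build.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Subject:
    code: str   # placeholder identifier, NOT a real GND ID
    label: str  # subject label as it would appear in the taxonomy


# Toy stand-in for the GND subjects collection (assumption for illustration).
TOY_SUBJECTS = [
    Subject("toy-1", "Maschinelles Lernen"),
    Subject("toy-2", "Neuronales Netz"),
    Subject("toy-3", "Bibliothek"),
    Subject("toy-4", "Quantenmechanik"),
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and extract letter-only tokens (umlauts included)."""
    return set(re.findall(r"[a-zäöüß]+", text.lower()))


def recommend_subjects(
    title: str, abstract: str, subjects: list[Subject], k: int = 5
) -> list[Subject]:
    """Return up to k subjects for one record, most relevant first.

    Relevance here is the number of tokens shared between the record text
    and the subject label. A real system would swap in an LLM-based scorer.
    """
    record_tokens = tokenize(f"{title} {abstract}")
    scored = [(len(record_tokens & tokenize(s.label)), s) for s in subjects]
    # Drop subjects with no overlap; sort by score (descending), then by
    # label so that ties are broken deterministically.
    ranked = sorted(
        (pair for pair in scored if pair[0] > 0),
        key=lambda pair: (-pair[0], pair[1].label),
    )
    return [subject for _, subject in ranked[:k]]


if __name__ == "__main__":
    top = recommend_subjects(
        title="Maschinelles Lernen in der Bibliothek",
        abstract="Ein Überblick über neuronales Lernen.",
        subjects=TOY_SUBJECTS,
        k=2,
    )
    for subject in top:
        print(subject.code, subject.label)
```

The parameter `k` makes the length of the returned list customizable, matching the top-k output described above. A German or English record goes through the same function, so bilingual handling is left entirely to the scorer.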
LLMs4Subjects aims to explore the untapped potential of LLM-based solutions for subject classification or tagging. The task is based on TIBKAT, TIB's open-access collection, which comprises over 100,000 records of various types, such as technical reports, publications, and books. The records are available mainly in English and German and are classified according to the GND subjects taxonomy.
The opportunities presented by LLMs for subject classification include their ability to comprehend natural language at an unprecedented scale and depth, enabling nuanced semantic distinction and classification of complex and interdisciplinary subjects. This capability can significantly enhance the accuracy and efficiency of organizing large collections, directly impacting the accessibility and discoverability of information.
The solutions developed for this shared task can influence the application of LLMs in modern digital library systems, promoting innovation and setting new standards in the field, similar to projects like Annif (https://annif.org/). Furthermore, this shared task is highly relevant to the SemEval series because it evaluates a novel application of computational semantics that is crucial for organizing and accessing information efficiently. For details on the planned tasks in this first iteration of LLMs4Subjects, please visit the "Data & Tasks" page via the menu.