To assess the submitted solutions, participants' systems were evaluated through both quantitative and qualitative assessments, providing a comprehensive view of their performance.
The quantitative evaluation focused on precision, recall, and F1 scores at various thresholds (k = 5 to 50) for two dataset categories: all-subjects and tib-core. Systems were ranked by their average recall across these thresholds, reflecting the emphasis on retrieving all relevant subjects.
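As an illustration, the following is a minimal Python sketch of these metrics. It assumes gold and predicted GND codes are available per record; the function names and the step of 5 between thresholds are illustrative assumptions, not part of the official evaluation scripts.

```python
from typing import Dict, List


def precision_recall_f1_at_k(gold: List[str], predicted: List[str], k: int):
    """Compute precision@k, recall@k, and F1@k for a single record."""
    top_k = predicted[:k]
    hits = len(set(top_k) & set(gold))
    precision = hits / k if k else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


def average_recall(gold_by_record: Dict[str, List[str]],
                   pred_by_record: Dict[str, List[str]],
                   ks=range(5, 55, 5)) -> float:
    """Average recall@k over all records and thresholds (used here for ranking)."""
    recalls = []
    for rec_id, gold in gold_by_record.items():
        predicted = pred_by_record.get(rec_id, [])
        for k in ks:
            recalls.append(precision_recall_f1_at_k(gold, predicted, k)[1])
    return sum(recalls) / len(recalls) if recalls else 0.0
```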
For the qualitative assessment, 14 distinct subject classifications were used: Architecture (arc), Chemistry (che), Electrical Engineering (elt), Material Science (fer), History (his), Computer Science (inf), Linguistics (lin), Literature Studies (lit), Mathematics (mat), Economics (oek), Physics (phy), Social Sciences (sow), Engineering (tec), and Traffic Engineering (ver). Within each classification, 10 record files were selected, and the top 20 GND codes from participants' submissions were extracted. Subject librarians then evaluated these codes for relevance and accuracy.
During the qualitative evaluation, subject librarians marked each prediction with one of the following codes: Y (correct keyword), I (technically correct but irrelevant keyword), and N or blank (incorrect). Based on these codes, two qualitative results were computed: in the first case, both Y and I were counted as correct, while in the second case, only Y was counted as correct. The two cases were reported in separate files to provide a comprehensive view of system performance.
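The two scoring cases can be expressed compactly as below; this is a hedged sketch assuming the librarian labels for one record's top-20 predictions are available as a list of strings, with blank entries treated as incorrect. The function name and example labels are hypothetical.

```python
from typing import Dict, List


def qualitative_scores(labels: List[str]) -> Dict[str, float]:
    """Score librarian judgements (Y / I / N or blank) for one record.

    Returns the fraction of predictions counted as correct under the two
    cases: (1) Y and I both correct, (2) only Y correct.
    """
    cleaned = [label.strip().upper() for label in labels]
    total = len(cleaned)
    case_1 = sum(label in ("Y", "I") for label in cleaned)  # Y and I correct
    case_2 = sum(label == "Y" for label in cleaned)          # only Y correct
    return {
        "case_1_y_and_i": case_1 / total if total else 0.0,
        "case_2_y_only": case_2 / total if total else 0.0,
    }


# Hypothetical judgements for one record's top-20 predictions
labels = ["Y", "I", "N", "Y", "", "Y", "I", "N", "Y", "Y",
          "", "I", "Y", "N", "Y", "", "Y", "I", "N", "Y"]
print(qualitative_scores(labels))
```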