Given a user question and an LLM response with its gold citation spans, classify each citation with a binary label: relevant or non-relevant. A citation is relevant only if it directly answers the question. All other cases — citations that answer indirectly, citations that are topically related but do not answer, and citations that are entirely irrelevant — are non-relevant. A citation may be authentic and accurately quoted yet still non-relevant.
This binary label is derived from a four-tier annotation rubric (direct answer, indirect answer, relevant but no answer, and non-relevant) developed for the ground-truth annotation of Qur'an and Hadith citations. Only the direct-answer tier maps to relevant; the other three tiers all map to non-relevant.
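The collapse from the four-tier rubric to the binary label can be sketched as follows. This is a minimal illustration, not the organizers' tooling: the `Tier` enum, its member names, and the `to_binary` helper are hypothetical names introduced here, and the label strings are taken from the wording of the task description.

```python
from enum import Enum


class Tier(Enum):
    """Four-tier annotation rubric (member names are illustrative assumptions)."""
    DIRECT_ANSWER = "direct answer"
    INDIRECT_ANSWER = "indirect answer"
    RELEVANT_NO_ANSWER = "relevant but no answer"
    NON_RELEVANT = "non-relevant"


def to_binary(tier: Tier) -> str:
    """Collapse a rubric tier to the binary task label.

    Only a direct answer counts as relevant; every other tier,
    including indirect answers, is non-relevant.
    """
    return "relevant" if tier is Tier.DIRECT_ANSWER else "non-relevant"


if __name__ == "__main__":
    for tier in Tier:
        print(f"{tier.value!r} -> {to_binary(tier)}")
```

Note that an indirect answer is treated the same as an entirely irrelevant citation, so a system tuned for topical similarity alone would likely over-predict the relevant class.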
Participating systems are limited to models of 13B parameters or fewer.
Submissions are evaluated with Macro-F1, computed for each question and then averaged across all questions. Qur'an and Hadith citations are scored separately.
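The scoring procedure described above might look roughly like the sketch below. It is not the official scorer. Several details are assumptions made for illustration: the record layout `(source, question_id, gold, pred)`, the function names, grouping by source before averaging, averaging only over the classes that appear in a question's gold or predicted labels, and scoring an undefined F1 as 0.0.

```python
from collections import defaultdict

LABELS = ("relevant", "non-relevant")


def f1_for_label(gold, pred, label):
    """F1 of one class, treating `label` as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    denom = 2 * tp + fp + fn
    # Assumption: an undefined F1 (no support, no predictions) counts as 0.0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over classes seen in gold or pred.

    Assumption: classes absent from both lists are skipped; the official
    scorer may handle this case differently.
    """
    present = [l for l in LABELS if l in gold or l in pred]
    return sum(f1_for_label(gold, pred, l) for l in present) / len(present)


def score(records):
    """Average per-question Macro-F1, reported separately per source.

    `records` is an iterable of (source, question_id, gold, pred) tuples,
    one per citation (a hypothetical layout, not the official format).
    """
    grouped = defaultdict(lambda: ([], []))
    for source, qid, gold, pred in records:
        golds, preds = grouped[(source, qid)]
        golds.append(gold)
        preds.append(pred)

    per_source = defaultdict(list)
    for (source, _qid), (golds, preds) in grouped.items():
        per_source[source].append(macro_f1(golds, preds))
    return {s: sum(v) / len(v) for s, v in per_source.items()}


if __name__ == "__main__":
    demo = [
        ("quran", "q1", "relevant", "relevant"),
        ("quran", "q1", "non-relevant", "non-relevant"),
        ("quran", "q2", "relevant", "non-relevant"),
        ("quran", "q2", "non-relevant", "non-relevant"),
        ("hadith", "q1", "relevant", "relevant"),
    ]
    print(score(demo))
```

Because the average is taken per question rather than over all citations pooled together, a question with two citations weighs as much as one with twenty, so errors on short citation lists are comparatively costly.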