EXPLAINITA is divided into two independent subtasks. Participants can submit systems for either or both subtasks.
The first subtask is a generative task. It is defined as follows: for each latent feature, we consider the set of tokens that activate it. For every such token, we also take into account its surrounding context, i.e., a window of tokens in which the activating token can appear at any position. The goal of the task is to generate an explanation that captures the underlying concept represented by the latent feature and that is semantically appropriate to describe all token–context pairs associated with it.
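For illustration only, the sketch below shows how the input to Subtask 1 can be thought of; the field names, the example latent, and the sample explanation are assumptions, not the official data schema.

```python
# Hypothetical illustration of the Subtask 1 input (field names are assumptions,
# not the official data format): one latent feature together with the tokens
# that activate it and their surrounding context windows.
latent_example = {
    "latent_id": 1234,
    "activations": [
        # each entry is a (token, context) pair; the activating token may
        # appear at any position inside its context window
        {"token": "piano", "context": "she practised the piano every evening after school"},
        {"token": "violin", "context": "the violin solo opened the second movement"},
        {"token": "drums", "context": "he kept time on the drums during rehearsal"},
    ],
}

# A Subtask 1 system must produce a single explanation that is appropriate
# for all token–context pairs of the latent, e.g.:
predicted_explanation = "references to musical instruments being played"
```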
Participating systems are free to use any external knowledge or model enhancement techniques (such as distillation, synthetic data generation, and similar approaches) for training. The use of generative large language models is not required.
Participating systems will be scored against a human-annotated test set. Specifically, we will use a similarity-based evaluation, namely BERTScore, comparing system outputs with explanations provided by human annotators for the same latents.
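As a rough sketch of this similarity-based evaluation, the snippet below uses the open-source bert-score package; the scoring model, language setting, and aggregation actually used by the organisers may differ.

```python
# Minimal sketch of a BERTScore-based comparison between system explanations
# and human reference explanations, assuming the bert-score package.
from bert_score import score

system_explanations = ["references to musical instruments being played"]
human_explanations = ["mentions of playing a musical instrument"]

# P, R, F1 are tensors with one entry per latent; the F1 component is the
# value typically reported.
P, R, F1 = score(system_explanations, human_explanations, lang="en")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```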
The baseline system is an Explainer model akin to the one used in [1]. Specifically, it generates explanations by prompting an LLM, with no additional tuning. The LLM used is hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4.
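To make the baseline setup concrete, here is a minimal sketch of a prompting-based Explainer built on the transformers text-generation pipeline; the prompt wording, decoding settings, and helper function are assumptions and not the official baseline implementation.

```python
# Sketch of a prompting-based Explainer (no tuning), assuming the
# transformers text-generation pipeline with the quantised Llama model.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
)

def explain_latent(token_context_pairs):
    # token_context_pairs: list of (token, context) tuples for one latent
    examples = "\n".join(f"- token: {t} | context: {c}" for t, c in token_context_pairs)
    messages = [
        {"role": "user",
         "content": "The following tokens all activate the same latent feature.\n"
                    f"{examples}\n"
                    "Write a short explanation of the concept this latent represents."},
    ]
    output = generator(messages, max_new_tokens=64, do_sample=False)
    # the pipeline returns the chat history with the model's reply appended last
    return output[0]["generated_text"][-1]["content"]
```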
The second subtask is a classification task. It focuses on developing a strong scoring model that can distinguish between correct and incorrect explanations of latents.
For each latent feature, an explanation is provided, together with a set of example sequences, each of which is equally likely to contain tokens that the explanation should cover. The goal of the task is to determine, for each sequence in the set, whether it is correctly described by the explanation.
The classification has two possible outcomes:
Explained (class 1): the entire sequence activates the latent that the explanation is meant to describe.
Not explained (class 0): the entire sequence does not activate that latent.
The evaluation for Subtask 2 will be conducted by scoring each individual prediction independently. This means that systems will still receive credit even if some of their predictions for a given latent explanation are incorrect.
We will compute Accuracy, Precision, Recall and F1-score for all participating systems. The final ranking will be based on Accuracy.
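For reference, the metrics can be computed as in the short sketch below, assuming scikit-learn; the gold and predicted labels shown are made-up values for illustration.

```python
# Sketch of the Subtask 2 metrics, with every (explanation, sequence)
# prediction scored independently.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

gold = [1, 0, 1, 1, 0, 0]         # 1 = explained, 0 = not explained
predictions = [1, 0, 0, 1, 0, 1]  # one prediction per example sequence

print("Accuracy :", accuracy_score(gold, predictions))   # used for the final ranking
print("Precision:", precision_score(gold, predictions))
print("Recall   :", recall_score(gold, predictions))
print("F1-score :", f1_score(gold, predictions))
```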
For this task, participants can leverage any external knowledge and any classification system, including generative ones.
The baseline system is a Scorer model akin to the one used in [1]. Specifically, it generates predictions for an explanation by prompting an LLM in a zero-shot setting. The LLM used is hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4.
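The following is a rough sketch of such a zero-shot Scorer, again assuming the transformers text-generation pipeline; the prompt and the yes/no parsing rule are assumptions, not the official baseline.

```python
# Sketch of a zero-shot Scorer: prompt the LLM once per (explanation, sequence)
# pair and map its answer to class 1 (explained) or 0 (not explained).
from transformers import pipeline

scorer = pipeline(
    "text-generation",
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
)

def score_sequence(explanation, sequence):
    messages = [
        {"role": "user",
         "content": f"Explanation of a latent feature: {explanation}\n"
                    f"Sequence: {sequence}\n"
                    "Does the explanation correctly describe this sequence? Answer yes or no."},
    ]
    output = scorer(messages, max_new_tokens=5, do_sample=False)
    answer = output[0]["generated_text"][-1]["content"].strip().lower()
    return 1 if answer.startswith("yes") else 0
```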
Commercial-only AI systems (e.g., GPT, Claude, Gemini, Grok) CANNOT be used directly to obtain explanations (Subtask 1) and/or labels for items (Subtask 2); solutions that require API calls to commercial models will not be accepted as part of the final system.
Commercial-only AI systems CAN, however, be used during development for data augmentation, distillation, and similar purposes.
Participants can submit any number of runs/systems to the competition. However, for each task they must choose one run/system to mark as the "Primary" one upon submission. Details on how to submit the Primary and Additional runs/systems are provided on the Submission page.
Participants will be asked to make their systems available to the organisers during the evaluation window. They can use a platform of their choice (e.g., HuggingFace, GitHub) to give access to their model, together with detailed instructions on how to replicate the results. This ONLY applies to the Primary run.
Note that the rules may be updated during the challenge, e.g., in response to FAQs.