The training/development data for both subtasks will be made available on September 22.
A sample of the training data will be made available earlier; please check back on this page in the next few days.
The dataset for both subtasks was collected using a Sparse Autoencoder (SAE) trained on Minerva-1B-base-v1.0.
Below we detail the main steps included in the data collection and annotation process.
The SAE model was trained on the residual stream of Minerva-1B, using the Sparsify library. It is a k-Sparse Autoencoder with a top-k activation function and an expansion factor of 32.
To train the model, the "tiny" portion of Clean Italian MC4 was used, amounting to roughly 6B tokens.
Please check our paper for more details on the training of the SAE. The paper will be presented at CLiC-it 2025.
The SAE model is available on HuggingFace.
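As a rough illustration of this architecture (a minimal sketch, not the Sparsify implementation; the value of k, the dimensions, and all names below are assumptions for demonstration only), a top-k sparse autoencoder over the residual stream can be written as:

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal sketch of a k-sparse autoencoder (illustrative, not the Sparsify code)."""

    def __init__(self, d_model: int, expansion: int = 32, k: int = 32):
        super().__init__()
        # Latent dimension = expansion factor * residual-stream width (32x in the task setup).
        d_latent = expansion * d_model
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)
        self.k = k  # number of latents kept active per token (placeholder value)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # Encode, then keep only the k largest pre-activations per token (top-k activation).
        pre = self.encoder(x)
        topk = torch.topk(pre, self.k, dim=-1)
        latents = torch.zeros_like(pre)
        latents.scatter_(-1, topk.indices, torch.relu(topk.values))
        # Reconstruct the residual-stream vector from the sparse latent code.
        recon = self.decoder(latents)
        return latents, recon
```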
We collect latent (feature) activations from Minerva-1B using the SAE, considering only Layer 14 of the model. We chose a layer near the end of the model stack because initial evaluations showed more "semantic" features in later layers.
We collect activations by passing data from the Italian split of Wikipedia through the model, using the Delphi library. For each latent, we collect all tokens that activate it, their surrounding contexts, and the strength of the activation.
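Conceptually, this step amounts to encoding the Layer 14 residual stream with the SAE and recording which latents fire on which tokens. The sketch below only illustrates the idea; it is not the Delphi pipeline, and the helper names (`get_residual`, `tokenizer`, `sae`) are assumptions:

```python
from collections import defaultdict

def collect_activations(texts, tokenizer, get_residual, sae):
    """Toy illustration of latent-activation collection (not the Delphi pipeline).

    get_residual(text) is assumed to return the Layer 14 residual-stream vectors
    for each token of `text`; `sae` is the trained sparse autoencoder.
    """
    records = defaultdict(list)  # latent index -> list of (token, context, strength)
    for text in texts:
        tokens = tokenizer.tokenize(text)
        latents, _ = sae(get_residual(text))  # shape: [num_tokens, num_latents]
        for pos, token in enumerate(tokens):
            for latent_idx in latents[pos].nonzero().flatten().tolist():
                strength = latents[pos, latent_idx].item()
                records[latent_idx].append((token, text, strength))
    return records
```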
Explanations for latents were obtained semi-automatically. This is also reflected in the setup of the training set for both subtasks.
The core of the explanations was obtained using GPT-5.
Specifically, we provided GPT-5 with examples of activations in context, where activating words were highlighted between "<<" and ">>"; in addition, a list of (word, activation strength) pairs was also provided to the model. We prompted GPT-5 to "analyze text and provide an explanation that thoroughly encapsulates possible patterns found in it", by looking at the activating words; the model was also given a few examples in the prompt.
Part of the explanations provided by GPT-5 was then manually revised and corrected by us.
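For reference, the "<<" / ">>" highlighting used in the prompts (and in the released examples, see below) can be reproduced roughly as follows; this is a simplified sketch, not the script actually used for data preparation:

```python
def highlight(tokens, activations):
    """Wrap activating tokens in << >>, merging contiguous activating spans.

    `activations` has one value per token; for this sketch any non-zero value
    is treated as an activation.
    """
    out, i = [], 0
    while i < len(tokens):
        if activations[i] > 0:
            span = []
            while i < len(tokens) and activations[i] > 0:
                span.append(tokens[i])
                i += 1
            out.append("<<" + "".join(span) + ">>")
        else:
            out.append(tokens[i])
            i += 1
    return "".join(out)

# Purely illustrative tokens and activation values:
# highlight(["Il", " gatto", " dorme", " sul", " divano"], [0, 0, 0, 4, 9])
# -> "Il gatto dorme<< sul divano>>"
```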
For both subtasks, we provide three different splits:
📚🥇TRAIN-GOLD: Smaller set of training examples, with explanations manually annotated by organisers. It will include a few hundred training examples - Release: September 22.
📚🥈TRAIN-SILVER: Larger set of training examples, with explanations provided by GPT-5. It will include a few thousand training examples - Release: September 22.
📝 TEST: Set of test examples with explanations manually annotated by organisers. It will include a few hundred test examples - Release: Evaluation window (see Important Dates).
The data for Subtask 1 will be provided as a single JSON file for each split (TRAIN-GOLD, TRAIN-SILVER, and TEST). Each item in the split has the following fields:
Latent ID [str]: the ID of the latent. For example, "layers.14_latent8" for the eighth latent of layer 14.
examples [list]: a list of examples of activations for the latent. The number of examples per latent varies, but on average each latent will have around 40 examples. Each example is a dictionary with the following fields:
text [str]: the text of the example, with activating tokens highlighted between "<<" and ">>". Note that if two or more contiguous tokens activate the latent, they are kept together, e.g., << like this>>.
tokens [list]: list of tokens (strings) in the example, as tokenized by the original Minerva-1B-base-v1.0 model
activations [list]: list of activating tokens found in the example. Each is a dictionary with the following keys:
token [str]: the activating token
strength [int]: strength of activation for the token, normalized to the range [0, 10]
explanation [str]: the plain text explanation for the latent. For TRAIN-GOLD, the explanation is manually annotated; for TRAIN-SILVER, the explanation is generated by GPT-5; for TEST, the explanation is left blank.
Here is an example:
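(The item below is illustrative only: the latent ID format follows the description above, but the texts, values, and exact key spellings are invented to show the structure and are not taken from the released data.)

```json
{
  "latent_id": "layers.14_latent8",
  "examples": [
    {
      "text": "Il gatto dorme<< sul divano>> tutto il giorno.",
      "tokens": ["Il", " gatto", " dorme", " sul", " divano", " tutto", " il", " giorno", "."],
      "activations": [
        {"token": " sul", "strength": 4},
        {"token": " divano", "strength": 9}
      ]
    }
  ],
  "explanation": "References to household furniture and places where one rests."
}
```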
For Subtask 1, participants must provide a single explanation for each latent.
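As a minimal sketch of how such a split could be read and inspected (the file name and the exact key spellings below are assumptions until the data is released):

```python
import json

# Placeholder file name: the actual file names will be announced with the release.
with open("subtask1_train_gold.json", encoding="utf-8") as f:
    latents = json.load(f)

for item in latents:
    # One explanation is expected per latent; the TRAIN splits already contain it.
    print(item["latent_id"], len(item["examples"]), "examples ->", item["explanation"])
```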
The data for Subtask 2 will be provided as a single JSON file for each split (TRAIN-GOLD, TRAIN-SILVER, and TEST). Each item in the split has the following fields:
Latent ID [str]: the ID of the latent. For example, "layers.14_latent8" for the eighth latent of layer 14.
explanation [str]: the plain text explanation for the latent. For TRAIN-GOLD, the explanation is manually annotated; for TRAIN-SILVER, the explanation is generated by GPT-5; for TEST, the explanation is left blank.
examples [list]: a list of examples, both positive and negative, of sentences (and tokens) that do or do not activate the latent. The number of examples per latent is around 100, equally divided between activating and non-activating. Each example is a dictionary with the following fields:
text [str]: the text of the example.
tokens [list]: list of tokens (strings) in the example, as tokenized by Minerva-1B-base-v1.0.
activations [list]: list of activations, with one value for each token. A zero corresponds to no activation; a value greater than zero corresponds to an activation. For the test set, activations will be an empty list.
activating [bool]: True if the example contains tokens that activate the latent, False otherwise. For the test set, the label will remain hidden.
Here is an example:
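(Again, the item below is illustrative only: the texts, values, and exact key spellings are invented to show the structure and are not taken from the released data.)

```json
{
  "latent_id": "layers.14_latent8",
  "explanation": "References to household furniture and places where one rests.",
  "examples": [
    {
      "text": "Il gatto dorme sul divano tutto il giorno.",
      "tokens": ["Il", " gatto", " dorme", " sul", " divano", " tutto", " il", " giorno", "."],
      "activations": [0, 0, 0, 4, 9, 0, 0, 0, 0],
      "activating": true
    },
    {
      "text": "La riunione è stata rimandata a lunedì.",
      "tokens": ["La", " riunione", " è", " stata", " rimandata", " a", " lunedì", "."],
      "activations": [0, 0, 0, 0, 0, 0, 0, 0],
      "activating": false
    }
  ]
}
```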
For Subtask 2, participants must provide a prediction for each of the examples, for each latent.
Note that the test set for Subtask 2 will be formatted slightly differently. Specifically, we will provide participants with <explanation, example> pairs; the system will have to classify whether the example activates the latent described by the explanation. This means that the test set will include multiple data points for each explanation/latent.
More information on the test set will be available upon release of the data.