[1] C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, S. Carter, Zoom in: An introduction to circuits, Distill 5 (2020) e24.
[2] B. A. Olshausen, D. J. Field, Sparse coding with an overcomplete basis set: A strategy employed by V1?, Vision Research 37 (1997) 3311–3325.
[3] H. Cunningham, A. Ewart, L. Riggs, R. Huben, L. Sharkey, Sparse autoencoders find highly interpretable features in language models, arXiv preprint arXiv:2309.08600 (2023).
[4] A. Templeton, T. Conerly, J. Marcus, J. Lindsey, T. Bricken, B. Chen, A. Pearce, C. Citro, E. Ameisen, A. Jones, H. Cunningham, N. L. Turner, C. McDougall, M. MacDiarmid, C. D. Freeman, T. R. Sumers, E. Rees, J. Batson, A. Jermyn, S. Carter, C. Olah, T. Henighan, Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet, Transformer Circuits Thread (2024). URL: https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html.
[5] L. Gao, T. Dupré la Tour, H. Tillman, G. Goh, R. Troll, A. Radford, I. Sutskever, J. Leike, J. Wu, Scaling and evaluating sparse autoencoders, arXiv preprint arXiv:2406.04093 (2024).
[6] G. Paulo, A. Mallen, C. Juang, N. Belrose, Automatically interpreting millions of features in large language models, arXiv preprint arXiv:2410.13928 (2024).
[7] A. Bondielli, L. Passaro, A. Lenci, Sparse autoencoders find partially interpretable features in Italian small language models, in: Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025), 2025.
[8] T. Lieberum, S. Rajamanoharan, A. Conmy, L. Smith, N. Sonnerat, V. Varma, J. Kramár, A. Dragan, R. Shah, N. Nanda, Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2, in: Y. Belinkov, N. Kim, J. Jumelet, H. Mohebbi, A. Mueller, H. Chen (Eds.), Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Miami, Florida, US, 2024, pp. 278–300. URL: https://aclanthology.org/2024.blackboxnlp-1.19/. doi:10.18653/v1/2024.blackboxnlp-1.19.