Discovering Modular Transformer Circuits with Sparse Autoencoders

Motivation: Polysemantic Neurons and the Need for Disentangled Features

Mechanistic interpretability research seeks to identify “circuits” – sparse sub-networks inside models that implement specific behaviors or capabilities. A longstanding challenge is that individual neurons in transformers are often polysemantic, meaning they activate for multiple unrelated features rather than a single coherent concept. This entanglement makes it hard to pinpoint concise, human-understandable components of the model’s computation. One hypothesized cause of polysemantic neurons is superposition – the network represents more independent features than it has neurons by combining multiple features into single neuron directions. In other words, features are “stacked” in the same dimensions, obscuring interpretability. Addressing this problem requires disentangling the model’s internal features into more meaningful units than raw neurons. This is the theoretical motivation behind sparse autoencoder circuit discovery: to find a new basis of monosemantic features (each representing a single concept) that can serve as the building blocks of interpretable circuits.

Sparse Autoencoders for Feature Extraction in Transformers

A Sparse Autoencoder (SAE) is an unsupervised neural network trained to compress and reconstruct data while activating only a small fraction of its latent units at a time. When applied to the activations of a transformer model, an SAE learns an alternate representation of those activations in terms of sparse “code” vectors. The encoder maps a high-dimensional activation (e.g. a residual stream or layer output) into a set of latent features, and the decoder reconstructs the original activation from those features. Crucially, a sparsity penalty (such as an L1 regularizer) ensures that each input triggers only a few latent features, encouraging each feature to capture a distinct pattern in the data. This approach is essentially dictionary learning on the model’s internal representations. The learned dictionary consists of basis vectors (features) that are linear combinations of the original neurons, but unlike raw neurons, these features tend to be monosemantic and highly interpretable. In practice, researchers have found that a single transformer layer with D neurons can be “decomposed” into many more sparse features – for example, a 512-dimensional activation space can yield thousands of meaningful features when a sufficiently wide sparse autoencoder is used. Each such feature corresponds to a recognizable pattern or concept (for instance, a specific style of text, a category of content, or a grammatical construct) that was previously hidden inside a polysemantic combination of neurons. By training SAEs on transformer activations, interpretability researchers essentially rotate the model’s basis to uncover these latent components, providing a view of the network’s computation that is far more modular and intelligible than the original neuron basis.
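
To make the setup concrete, below is a minimal sketch of such a sparse autoencoder in PyTorch. The dimensions (`d_model`, `d_dict`), the L1 coefficient, and the single training step are illustrative assumptions, not the exact configurations used in the cited work; in practice the autoencoder is trained over a large corpus of activations collected from the layer of interest.

```python
# A minimal sparse autoencoder sketch in PyTorch. All names and hyperparameters
# (d_model, d_dict, l1_coeff, learning rate) are illustrative assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_dict: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activation -> latent features
        self.decoder = nn.Linear(d_dict, d_model)  # latent features -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # non-negative, mostly-zero feature activations
        x_hat = self.decoder(f)           # reconstructed activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    recon = (x - x_hat).pow(2).mean()
    sparsity = f.abs().mean()
    return recon + l1_coeff * sparsity

# One training step on a batch of residual-stream activations of shape (batch, d_model),
# e.g. collected with a forward hook on the transformer layer being studied.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 512)               # placeholder standing in for real activations
x_hat, f = sae(acts)
loss = sae_loss(acts, x_hat, f)
loss.backward()
opt.step()
```

The decoder’s weight columns form the learned dictionary: each column is a direction in the original activation space corresponding to one feature, which is what makes the features usable as interpretable units in later analyses.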

Identifying Modular Circuits in Transformer Models

Once we have extracted a library of interpretable sparse features, we can begin to map out how the model’s internal computations are structured in terms of these components. Each feature can be seen as a candidate node in a circuit: it detects a specific pattern in the input or intermediate state, and we can analyze how it influences the model’s later activations and outputs. In transformers, this often means examining how these features flow through layers – e.g. which downstream neurons or attention heads respond to a given feature’s activation, or how multiple features interact to produce a higher-level outcome. Because the features are modular and disentangled, we can isolate circuits more cleanly than before. For example, recent work showed that sparse features make it possible to pinpoint the exact latent features responsible for a model’s behavior on a challenging coreference task (the indirect object identification problem) more precisely than earlier circuit analyses. In that case, the SAE-derived features allowed researchers to identify which specific feature encoded the concept of a certain name or entity and which encoded grammatical position, illuminating the circuit the model used to resolve references. More generally, sparse autoencoder methods let us trace how information is processed: by identifying which feature-circuits activate for a given input, we can chart a path of causal dependence (e.g. “feature A activates, which causes feature B to activate in the next layer, leading the model to exhibit behavior X”). This hierarchical tracing of circuits has been demonstrated on tasks like subject–verb agreement and factual recall, where the approach helps isolate the sub-networks (across attention and MLP layers) that implement those capabilities. In essence, the sparse features serve as a “circuit map,” breaking the black-box transformer into legible components and connections.
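
As a concrete illustration of this kind of tracing, the sketch below captures a layer’s activations with a standard PyTorch forward hook, asks a trained SAE (as in the earlier sketch) which features fire most strongly for a prompt, and shows how a single feature’s contribution can be zero-ablated for a follow-up causal test. `model`, `layer`, and `input_ids` are hypothetical placeholders for whatever transformer and hook point are being analyzed.

```python
# Sketch: which SAE features fire for a prompt, and how to ablate one of them.
# Assumes `sae` from the earlier sketch; `model`, `layer`, `input_ids` are placeholders.
import torch

def top_features(acts: torch.Tensor, sae, k: int = 5):
    # Rank features by their mean activation over the captured positions.
    _, f = sae(acts)
    vals, idxs = f.mean(dim=0).topk(k)
    return list(zip(idxs.tolist(), vals.tolist()))

def ablate_feature(acts: torch.Tensor, sae, feature_idx: int):
    # Zero one feature, re-decode, and add back the SAE's reconstruction error
    # so that only that feature's contribution is removed from the activations.
    x_hat, f = sae(acts)
    error = acts - x_hat
    f = f.clone()
    f[:, feature_idx] = 0.0
    return sae.decoder(f) + error

# Capture the layer's output with a forward hook, then inspect the features.
captured = {}
def save_hook(module, inputs, output):
    captured["acts"] = output.detach()

handle = layer.register_forward_hook(save_hook)
_ = model(input_ids)          # forward pass on the prompt of interest
handle.remove()

flat = captured["acts"].flatten(0, -2)    # (positions, d_model)
print(top_features(flat, sae))
# For a causal test, patch `ablate_feature(flat, sae, idx)` back into the layer
# (e.g. via a hook that returns the modified output) and compare the model's behavior.
```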

Improving Interpretability and Model Diagnosis

The sparse autoencoder approach offers several practical benefits for interpretability and debugging of large language models:

- Feature-level explanations: each discovered feature can be given a short human-readable label, so model behavior can be described in terms of meaningful concepts rather than opaque neuron indices.
- Circuit mapping: features act as nodes for tracing how information flows across layers, helping localize which sub-network implements a given capability.
- Causal testing and steering: because features live in the model’s own activation space, they can be ablated or amplified directly to test hypotheses about behavior and to nudge the model’s outputs (see the sketch after this list).
- Auditing and diagnosis: the unsupervised procedure can surface unexpected or safety-relevant features (such as the code-vulnerability and deception features discussed below) that no one thought to look for in advance.
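
The causal-testing bullet above can be made concrete with a small sketch: by adding (or subtracting) a feature’s decoder direction at the layer the SAE was trained on, one can check whether boosting or removing that feature changes the model’s output in the expected way. As before, `sae`, `model`, `layer`, `input_ids`, the feature index, and the scale are illustrative assumptions rather than a prescribed recipe.

```python
# Sketch of a feature-level intervention: boost one SAE feature's direction
# in the hooked layer's output during the forward pass. Names are illustrative.
import torch

def make_steering_hook(sae, feature_idx: int, scale: float = 5.0):
    # Column `feature_idx` of the decoder weight is that feature's direction
    # in the original activation space (its dictionary element).
    direction = sae.decoder.weight[:, feature_idx].detach()

    def hook(module, inputs, output):
        # Returning a value from a PyTorch forward hook replaces the module's output.
        return output + scale * direction
    return hook

handle = layer.register_forward_hook(make_steering_hook(sae, feature_idx=123))
steered_output = model(input_ids)   # compare against an un-hooked run
handle.remove()
```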

Comparison with Traditional Probing Methods

Sparse autoencoder-based circuit discovery differs significantly from traditional probing techniques in interpretability. Probing usually involves training an external classifier (often a linear probe) on internal activations to predict a predefined concept or attribute. While useful, this approach is supervised and limited to concepts we explicitly ask about. It can confirm whether a model encodes a known feature (like part-of-speech or sentiment) in some distributed form, but it does not automatically reveal new latent factors in the model’s representation. In contrast, the sparse autoencoder method is unsupervised and data-driven: it uncovers whatever prominent features exist in the activations, including unexpected or high-level concepts that researchers might not have anticipated. This makes it a powerful discovery tool for interpretability – for example, it has revealed features such as “backdoor vulnerability in code” or “treacherous turn (deception)” in a model without being specifically instructed to look for them.

Another key difference lies in disentanglement and basis alignment. Probing often finds that certain information is present in the neural activations, but that information may be spread across many neurons in an entangled way. The probe’s classifier effectively pulls out a direction in activation space that correlates with the concept, but this direction is not guaranteed to be uniquely devoted to that concept – it could partly overlap with others. In contrast, sparse autoencoders directly learn a set of basis directions that aim to be aligned with individual concepts. By enforcing sparsity, SAEs ensure that each discovered feature activates only in a limited set of contexts, making it far more likely to correspond to a single interpretable cause. Importantly, this means sparsity provides an explainability boost: when a feature fires, we can attach a simple description to it (e.g. “this feature indicates an HTML tag is present in the text”) and be relatively confident that the description holds across all occurrences of that feature. Traditional probes typically offer no such guarantee of consistency or purity in what is being measured.
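
To make the contrast concrete, the sketch below shows the two workflows side by side: a supervised linear probe that needs researcher-provided labels for one predefined concept, versus simply reading which SAE features are active with no labels at all. `acts`, `labels`, and `sae` are illustrative placeholders carried over from the earlier sketches.

```python
# Probing vs. reading SAE features (illustrative; `acts`, `labels`, `sae` assumed).
import torch
import torch.nn as nn

# (1) Linear probe: supervised, answers one question we chose in advance.
probe = nn.Linear(512, 2)                                 # d_model -> binary concept
probe_opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss = nn.functional.cross_entropy(probe(acts), labels)   # labels supplied by the researcher
loss.backward()
probe_opt.step()
# probe.weight is *a* direction correlated with the concept, but nothing forces it
# to be devoted exclusively to that concept.

# (2) SAE features: unsupervised; see which learned features fire, then inspect
# and label the top candidates by hand.
_, f = sae(acts)
activation_rate = (f > 0).float().mean(dim=0)             # how often each feature fires
print(activation_rate.topk(10))
```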

Finally, sparse autoencoder features can be used within the model for causal experiments (as discussed above), whereas probing is usually a passive analysis. Probing tells us whether a concept exists in the network; sparse feature analysis tells us how the concept is embedded in the network’s computations, and even allows us to manipulate it. In summary, sparse autoencoder circuit discovery offers a more comprehensive and interpretable picture of transformer internals than probing: it identifies a whole set of latent dimensions that the model actually uses, disentangles them into human-meaningful units, and facilitates mapping those units into circuits – all without requiring labeled data or prespecified hypotheses. This approach, as demonstrated in recent work, is rapidly becoming a foundational tool in interpretability research, enabling AI models to be examined and guided at the level of their true computational ingredients rather than just their opaque parameters.

References: Recent examples and foundational work in this area include Anthropic’s “Towards Monosemanticity” and “Scaling Monosemanticity” studies, which used sparse autoencoders to discover thousands of interpretable features in transformer models, as well as earlier work on circuits and superposition that motivated these techniques. More recently, Cunningham et al. have shown that SAE-discovered features can be leveraged to pinpoint the causes of specific model behaviors and improve our ability to reverse-engineer neural circuits. Ongoing research (e.g. “transcoder” networks and hierarchical circuit tracing) is extending these ideas to handle non-linear transformations and to scale the analysis to even larger models. Together, these efforts show how sparse autoencoder-based circuit discovery is helping transform interpretability from an art into a more systematic science, where we identify and understand the modular components of transformer-based LLMs for better transparency, reliability, and control.