Discovering Modular Transformer Circuits with Sparse Autoencoders
Motivation: Polysemantic Neurons and the Need for Disentangled Features
Mechanistic interpretability research seeks to identify “circuits” – sparse sub-networks inside models that implement specific behaviors or capabilities. A longstanding challenge is that individual neurons in transformers are often polysemantic, meaning they activate for multiple unrelated features rather than a single coherent concept. This entanglement makes it hard to pinpoint concise, human-understandable components of the model’s computation. One hypothesized cause of polysemantic neurons is superposition – the network represents more independent features than it has neurons by combining multiple features into single neuron directions. In other words, features are “stacked” in the same dimensions, obscuring interpretability. Addressing this problem requires disentangling the model’s internal features into more meaningful units than raw neurons. This is the theoretical motivation behind sparse-autoencoder-based transformer circuit discovery: to find a new basis of monosemantic features (each representing a single concept) that can serve as the building blocks of interpretable circuits.
Sparse Autoencoders for Feature Extraction in Transformers
A sparse autoencoder (SAE) is an unsupervised neural network trained to compress and reconstruct data while activating only a small fraction of its latent units at a time. When applied to the activations of a transformer, an SAE learns an alternate representation of those activations in terms of sparse “code” vectors. The encoder maps a high-dimensional activation (e.g. a residual-stream vector or layer output) into a set of latent features, and the decoder reconstructs the original activation from those features. Crucially, a sparsity penalty (such as an L1 regularizer) ensures that each input triggers only a few latent features, encouraging each feature to capture a distinct pattern in the data. This approach is essentially dictionary learning on the model’s internal representations. The learned dictionary consists of basis vectors (features) that are linear combinations of the original neurons, but unlike raw neurons, these features tend to be monosemantic and highly interpretable. In practice, researchers have found that a single transformer layer with D neurons can be “decomposed” into many more sparse features – for example, a 512-dimensional activation space can yield thousands of meaningful features when using a sufficiently wide sparse autoencoder. Each such feature corresponds to a recognizable pattern or concept (for instance, a specific style of text, a category of content, or a grammatical construct) that was previously hidden inside polysemantic combinations of neurons. By training SAEs on transformer activations, interpretability researchers essentially rotate the model’s basis to uncover these latent components, providing a view of the network’s computation that is far more modular and intelligible than the original neuron basis.
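To make the setup concrete, here is a minimal PyTorch sketch of such an autoencoder. The 512-dimensional activation space, 4096-feature dictionary, and L1 coefficient are illustrative choices for this sketch, not the exact settings used in any of the published studies.

```python
# Minimal sparse autoencoder sketch for transformer activations.
# Dimensions and the L1 coefficient below are illustrative assumptions.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_dict: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # activation -> sparse feature code
        self.decoder = nn.Linear(d_dict, d_model)   # sparse feature code -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))             # non-negative feature activations
        x_hat = self.decoder(f)                     # reconstructed activation
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    recon = ((x - x_hat) ** 2).mean()               # reconstruction fidelity
    sparsity = f.abs().mean()                       # L1 penalty: only a few features fire
    return recon + l1_coeff * sparsity


# One training step; `acts` stands in for a batch of residual-stream
# activations collected from the transformer layer being studied.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 512)                         # placeholder activation batch
opt.zero_grad()
x_hat, f = sae(acts)
loss = sae_loss(acts, x_hat, f)
loss.backward()
opt.step()
```

After training, each column of the decoder weight matrix is one dictionary element: a direction in activation space that, ideally, corresponds to a single interpretable concept.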
Identifying Modular Circuits in Transformer Models
Once we have extracted a library of interpretable sparse features, we can begin to map out how the model’s internal computations are structured in terms of these components. Each feature can be seen as a candidate node in a circuit: it detects a specific pattern in the input or intermediate state, and we can analyze how it influences the model’s later activations and outputs. In transformers, this often means examining how these features flow through layers – e.g. which downstream neurons or attention heads respond to a given feature’s activation, or how multiple features interact to produce a higher-level outcome. Because the features are modular and disentangled, we can isolate circuits more cleanly than before. For example, recent work showed that sparse features can pinpoint the exact latents responsible for a model’s behavior on a challenging coreference task (the indirect object identification problem) more precisely than earlier circuit analyses. In that case, the SAE-derived features allowed researchers to identify which specific feature encoded a particular name or entity and which encoded grammatical position, illuminating the circuit the model used to resolve pronoun references. More generally, sparse autoencoder methods let us trace how information is processed: by identifying which feature-circuits activate for a given input, we can chart a path of causal dependence (e.g. “feature A activates, which then causes feature B to activate in the next layer, leading the model to exhibit behavior X”). This hierarchical tracing of circuits has been demonstrated on tasks like subject–verb agreement and factual recall, where the approach helps isolate the sub-networks (across attention and MLP layers) that implement those capabilities. In essence, the sparse features serve as a “circuit map,” breaking the black-box transformer into legible components and connections.
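As a concrete illustration of one step in this kind of tracing, the sketch below tests a single feature’s causal role by ablating it in the SAE reconstruction. Here `sae` is assumed to be a trained autoencoder like the one sketched above, and `acts` a batch of activations from the layer under study; both are hypothetical handles for this example.

```python
# Feature-ablation sketch: zero out one SAE feature and return the patched
# activation. `sae` and `acts` are assumed to come from the setup above.
import torch


def ablate_feature(sae, acts: torch.Tensor, feature_idx: int) -> torch.Tensor:
    """Reconstruct activations with one SAE feature zeroed out."""
    with torch.no_grad():
        f = torch.relu(sae.encoder(acts))    # sparse feature activations
        f[..., feature_idx] = 0.0            # knock out the candidate feature
        return sae.decoder(f)                # patched activation to feed downstream
```

In practice the patched activation would be substituted back into the transformer’s forward pass (e.g. via a forward hook on the chosen layer), and the resulting change in the model’s output logits measures how much that feature contributes to the behavior being studied.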
Improving Interpretability and Model Diagnosis
The sparse autoencoder approach offers several practical benefits for interpretability and debugging of large language models:
Disentangled, Monosemantic Features: By resolving polysemantic neurons into sparse features, we obtain units that each correspond to a single concept or signal in the model. These features are far less ambiguous than raw neurons – for instance, instead of one neuron firing for both legal terminology and financial data, an SAE might yield separate features for each domain. This makes it much easier to reason about what the model “knows” or is doing internally. In Anthropic’s experiments, many such features turned out to be remarkably interpretable, representing concepts like DNA sequence patterns, cities and countries, or source-code type signatures, none of which were apparent from individual neurons.
Modularity and Circuit Clarity: Each sparse feature can be treated as a module that plugs into the model’s computation. This modularity helps in understanding and visualizing circuits – one can see which features co-activate or feed into each other for a given task. As a result, researchers can form hypotheses about model behavior by examining which feature-circuits are involved. For example, if a language model suddenly produces a biased or toxic output, we might find that a certain “bias feature” was active in the prior layer’s representation. This gives a concrete target for analysis (Why did this feature activate? Which earlier text triggered it? How does it influence the next layer?) that would be hard to obtain from entangled neuron activations alone.
Interventions and Steering: Perhaps most excitingly, these disentangled features open the door to controlling and diagnosing model behavior. Since the features exist in the model’s activation space, we can attempt interventions such as ablating or activating specific features to see how the model’s output changes (see the steering sketch following this list). Researchers have shown that features can indeed be used to “steer” the model’s generation – for example, turning on a feature for a specific style or content makes the model produce more of it, while turning off a safety-critical feature might prevent a certain harmful behavior. In Anthropic’s study of a large model (Claude 3), the team identified features related to security vulnerabilities in code, instances of deception, bias, and even the model’s self-identity. Although preliminary, this suggests we could detect and monitor safety-relevant internal activations. In the long run, such insight might allow us to reliably flag when a model is, say, engaging in deceptive reasoning or about to produce disallowed content, by observing its internal feature activations. At the very least, knowing these features exist and understanding when they activate is a first step toward safer and more transparent AI systems.
Faithfulness and Completeness: Unlike many interpretability tools, the sparse autoencoder is trained directly on the model’s own activations and attempts to faithfully reconstruct them. This means the discovered features collectively capture essentially all the information in the original activation (barring a small reconstruction error). In contrast, traditional probing methods only capture a slice of information (for example, whether a specific property can be linearly predicted). The SAE’s features form a basis for the activation space, which encourages a more complete and non-arbitrary breakdown of the network’s computation. In practice, this has enabled finer-grained analyses – for instance, isolating the exact feature responsible for an observed behavior, rather than merely noting that “some combination of neurons correlates with the behavior.” This fidelity to the model’s internal representations gives us greater confidence that we are studying the model on its own terms (i.e. the features it actually uses), which is crucial for reliable interpretability.
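As a rough illustration of the intervention idea above, the following sketch adds a scaled copy of one feature’s decoder direction to a layer’s activations. Here `sae` is the trained autoencoder from the earlier sketch, and `feature_idx` and `scale` are hypothetical values an experimenter would choose; none of this reproduces a specific published setup.

```python
# Feature-steering sketch: push activations along (or away from) one
# feature's decoder direction. `sae` is assumed trained as sketched earlier.
import torch


def steer(sae, acts: torch.Tensor, feature_idx: int, scale: float) -> torch.Tensor:
    """Add a scaled copy of one feature's decoder direction to the activations."""
    # Column `feature_idx` of the decoder weight is that feature's direction
    # in the model's activation space.
    direction = sae.decoder.weight[:, feature_idx].detach()
    return acts + scale * direction          # positive scale amplifies, negative suppresses
```

A forward hook that applies `steer` to a chosen layer’s output during generation would nudge the model toward (or away from) whatever concept that feature encodes, which is the basic mechanism behind the steering experiments described above.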
Comparison with Traditional Probing Methods
Sparse autoencoder-based circuit discovery differs significantly from traditional probing techniques in interpretability. Probing usually involves training an external classifier (often a linear probe) on internal activations to predict a predefined concept or attribute. While useful, this approach is supervised and limited to concepts we explicitly ask about. It can confirm whether a model encodes a known feature (like part of speech or sentiment) in some distributed form, but it does not automatically reveal new latent factors in the model’s representation. In contrast, the sparse autoencoder method is unsupervised and data-driven: it uncovers whatever prominent features exist in the activations, including unexpected or high-level concepts that researchers might not have anticipated. This makes it a powerful discovery tool for interpretability – for example, it revealed features like “backdoor vulnerability in code” or “treacherous turn (deception)” in a model without being specifically instructed to look for them.
Another key difference is in disentanglement and basis alignment. Probing often finds that certain information is present in the neural activations, but that information may be spread across many neurons in an entangled way. The probe’s classifier effectively pulls out a direction in activation space that correlates with the concept, but this direction is not guaranteed to be uniquely devoted to that concept – it may partly overlap with others. In contrast, sparse autoencoders directly learn a set of basis directions that aim to align with individual concepts. By enforcing sparsity, SAEs ensure that each discovered feature activates in a limited set of contexts, making it far more likely to correspond to a single interpretable cause. Importantly, this means sparsity provides an explainability boost: when a feature fires, we can attach a simple description to it (e.g. “this feature indicates an HTML tag is present in the text”) and be relatively confident that the description is valid across all occurrences of that feature. Traditional probes typically offer no such guarantee of consistency or purity in what is being measured.
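To make the contrast concrete, here is a small sketch of the supervised probing workflow next to the unsupervised SAE workflow. The activations, labels, and the concept being probed are placeholders invented for this example.

```python
# Probing vs. SAE features: the probe needs labels for a pre-chosen concept;
# the SAE dictionary is learned without labels and inspected afterwards.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
acts = rng.standard_normal((1000, 512))        # placeholder layer activations
labels = rng.integers(0, 2, size=1000)         # placeholder labels for one concept

# Supervised probe: answers only whether this one concept is linearly
# decodable from the activations.
probe = LogisticRegression(max_iter=1000).fit(acts, labels)
print("probe accuracy:", probe.score(acts, labels))

# SAE route (see the earlier sketch): train the dictionary on `acts` with no
# labels, then inspect each feature's top-activating inputs to discover what
# concepts the model represents, rather than testing one hypothesis at a time.
```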
Finally, sparse autoencoder features can be used within the model for causal experiments (as discussed above), whereas probing is usually a passive analysis. Probing tells us whether a concept exists in the network; sparse feature analysis tells us how the concept is embedded in the network’s computations, and even allows us to manipulate it. In summary, sparse autoencoder circuit discovery offers a more comprehensive and interpretable picture of transformer internals than probing: it identifies a whole set of latent dimensions that the model actually uses, disentangles them into human-meaningful units, and facilitates mapping those units into circuits – all without requiring labeled data or prespecified hypotheses. As demonstrated in recent work, this approach is rapidly becoming a foundational tool in interpretability research, enabling models to be examined and guided at the level of their true computational ingredients rather than just their opaque parameters.
References: Recent examples and foundational work in this area include Anthropic’s “Towards Monosemanticity” and “Scaling Monosemanticity” studies, which used sparse autoencoders to discover thousands of interpretable features in transformer models, as well as earlier work on circuits and superposition that motivated these techniques. More recently, Cunningham et al. have shown that SAE-discovered features can be leveraged to pinpoint the causes of specific model behaviors and improve our ability to reverse-engineer neural circuits. Ongoing research (e.g. “transcoder” networks and hierarchical circuit tracing) is extending these ideas to handle non-linear transformations and to scale analysis to even larger models. Together, these efforts show how sparse autoencoder-based circuit discovery is helping transform interpretability from an art into a more systematic science, where we identify and understand the modular components of transformer-based LLMs for better transparency, reliability, and control.