I'm a research scientist at the UK AI Security Institute (AISI), working in the White Box Evals team. We're trying to develop mitigations for sandbagging (strategic underperformance). Currently we're looking at lie-detector probes and black box monitors, and training sandbagging model organisms in order to stress-test them.
Before this I was interning at the Center for Human-Compatible Artificial Intelligence under Erik Jenner. We were developing mechanistic anomaly detection techniques to automatically flag jailbreaks and backdoors at runtime by detecting unusual patterns of activations. We also focused on fine-tuning backdoored LLMs which shed their harmlessness training under various trigger conditions, in order to test these anomaly detection methods.
I’m also finishing up my thesis for a PhD at the University of Queensland, Australia, under Ian McCulloch. I’ve been working on new "tensor network" algorithms, which can be used to simulate entangled quantum materials or quantum computers, or to perform machine learning.
Contact me: jordantensor [at] gmail [dot] com. Also see my CV, LinkedIn, or Twitter.
Simple whitebox techniques (e.g. probes, or fitting a Gaussian in activation space) can perform well on some tests, flagging jailbreaks or backdoored inputs to LLMs with high recall. However, we show that these techniques are brittle to even weak attempts at obfuscation.
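As a rough illustration of the second kind of detector (a sketch, not the exact implementation from our paper): fit a Gaussian to the activations of benign prompts, then flag any input whose activations sit at a large Mahalanobis distance from that fit.

```python
import numpy as np

# Fit a Gaussian to activations collected from benign prompts
# (benign_acts: [n_samples, d_model] array from some chosen layer).
def fit_gaussian(benign_acts: np.ndarray):
    mean = benign_acts.mean(axis=0)
    cov = np.cov(benign_acts, rowvar=False) + 1e-4 * np.eye(benign_acts.shape[1])
    return mean, np.linalg.inv(cov)

# Score a new activation by its squared Mahalanobis distance to the benign fit;
# inputs scoring above a threshold chosen on held-out benign data get flagged.
def anomaly_score(act: np.ndarray, mean: np.ndarray, cov_inv: np.ndarray) -> float:
    diff = act - mean
    return float(diff @ cov_inv @ diff)
```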
For example, simply stopping training of the backdoored model earlier can cause the detection methods to fail, because the backdoored activations look more like normal activations.
If the attacker has more control over the training process, they can add terms to the loss function to create a backdoored model which resists almost all latent-space detectors simultaneously.
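A toy sketch of what such a loss term could look like (an illustration only, not the exact objective from the paper; model.loss, model.get_activations, and detector.score are hypothetical helpers standing in for real training code): alongside the usual backdoor-insertion objective, penalize the detector's score on triggered inputs so their activations are pushed back toward the benign distribution.

```python
import torch

def obfuscated_backdoor_loss(model, detector, benign_batch, triggered_batch,
                             obfuscation_weight: float = 1.0):
    # Usual backdoor objectives: behave normally on benign data,
    # and produce the attacker's behaviour on triggered data.
    task_loss = model.loss(benign_batch) + model.loss(triggered_batch)

    # Obfuscation term: make the latent-space detector (e.g. a probe or a
    # Gaussian anomaly score) assign low scores to the triggered activations,
    # so the backdoor is hard to flag from activations alone.
    triggered_acts = model.get_activations(triggered_batch)
    obfuscation_loss = detector.score(triggered_acts).mean()

    return task_loss + obfuscation_weight * obfuscation_loss
```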
But we do a lot more than simply create obfuscated backdoors! We also study jailbreaks and SQL code generation while obfuscating against SAEs, latent adversarial training, probes, and circuit breakers.
Or see Twitter, LessWrong, our codebase, or our neuronpedia page.
Sparse autoencoders (SAEs) are an unsupervised method for interpreting machine learning models, allowing you to decompose any space of activations within the model into a set of sparsely activating "features". But vanilla SAEs often aren't able to explain as much of the model's performance as we'd like, and they may learn features which reflect structure of the input data, rather than structure of the model we're trying to interpret.
Rather than training sparse autoencoders to reproduce the activations at the current layer, we trained them to reproduce the activations of all downstream layers, as well as the output logits of the original model. This significantly improves the trade-off between performance degradation (e.g. CE loss) and interpretability (e.g. L_0, or autointerp scores), while removing functionally irrelevant classes of features.
The cost is that there are some (apparently functionally irrelevant) intermediate subspaces of activations which these SAEs don't reconstruct, and it takes much longer to train them with this loss. We believe that best practice going forward will probably be to pre-train SAEs with regular local loss, then fine-tune them with the downstream end-to-end loss we've introduced.
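A rough sketch of this end-to-end objective (names and structure are mine; run_with_cache and run_from_layer are hypothetical helpers, and the e2e_sae library linked below is the real implementation): substitute the SAE's reconstruction back into the model at its layer, run the remaining layers, and match the original model's downstream activations and output logits while keeping the SAE features sparse.

```python
import torch.nn.functional as F

def e2e_sae_loss(model, sae, tokens, sae_layer: int,
                 downstream_coeff: float = 1.0, sparsity_coeff: float = 1.0):
    # Run the original model, caching logits and activations at every layer.
    clean_logits, clean_acts = model.run_with_cache(tokens)

    # Encode and decode the activations at the SAE's layer...
    features = sae.encode(clean_acts[sae_layer])
    recon = sae.decode(features)

    # ...then run the rest of the model from that layer onwards.
    sae_logits, sae_acts = model.run_from_layer(recon, start_layer=sae_layer)

    # Match the original model's output distribution (KL on the logits)...
    kl = F.kl_div(sae_logits.log_softmax(-1), clean_logits.softmax(-1),
                  reduction="batchmean")

    # ...and the activations at all downstream layers,
    downstream = sum(F.mse_loss(sae_acts[l], clean_acts[l])
                     for l in range(sae_layer + 1, len(clean_acts)))

    # while keeping the SAE features sparse.
    sparsity = features.abs().sum(-1).mean()

    return kl + downstream_coeff * downstream + sparsity_coeff * sparsity
```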
We have open sourced our library for training SAEs with this new method: https://github.com/ApolloResearch/e2e_sae
A selection of our SAEs are hosted on neuronpedia so you can play with them in your browser: https://www.neuronpedia.org/gpt2sm-apollojt
Ian McCulloch and I recently proposed a criterion for finding effectively decohered wavefunction "branches" in arbitrary quantum states, without the need for a system / environment split. We say that branches are defined if you can write your quantum state as a superposition of terms which are easy to distinguish, but hard to interfere (as measured by the number of local operations required).
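In symbols (a paraphrase in my own notation, not the paper's): a pure state |ψ⟩ has a good branch decomposition when

\[
|\psi\rangle = |a\rangle + |b\rangle,
\qquad
C_{\mathrm{distinguish}}\big(|a\rangle, |b\rangle\big) \;\ll\; C_{\mathrm{interfere}}\big(|a\rangle, |b\rangle\big),
\]

where each complexity is the size of the smallest circuit of local operations achieving the corresponding task (made precise below).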
We argue that these branches
Allow the full state to be replaced with a probability distribution over the branches
Only grow "further apart" for exponentially long times under natural time evolution
Tend to absorb spatial entanglement
Are strengthened by the presence of conserved quantities
Are effectively the opposite of good error correcting codes
We conjecture that branch formation is a ubiquitous process in nature, occurring generically in the time evolution of many-body quantum systems (even when there is no clear "environment"). We're currently looking for branches in various numerical time evolution simulations, by developing an algorithm to find them in tensor-network states.
Check out the paper or the poster! We give many examples of states with good branch decompositions.
Good branches are effectively the opposite of good error-correcting codes.
The complexity of distinguishing two states |a⟩ and |b⟩ is ~ the size of the smallest circuit which swaps |a⟩+|b⟩ with |a⟩−|b⟩.
The complexity of interfering two states |a⟩ and |b⟩ is ~ the size of the smallest circuit which swaps |a⟩ with |b⟩.
A tutorial applying Penrose graphical notation to mechanistic interpretability: understanding how Transformer AI systems like GPT work, and the most basic kinds of algorithms they can learn internally. Beyond AI systems, I also apply the notation to the singular value decomposition and its higher-order extensions, and introduce tensor-network decompositions.
The result of a two-month internship at Max Kelsen, applying machine learning to neuroscience research. I used data from a few neurons in the brains of rats to predict the timing of audio beeps they were hearing. I applied unsupervised clustering techniques, as well as supervised neural networks and gradient boosting. I found interpretability tools to be vital for improving generalization robustness to new neurons, new sessions, new audio tones, and new rats.
A fairly simple new way to run the Crank-Nicolson method in parallel. Crank-Nicolson is a method for implicitly solving partial differential equations (PDEs) like the time-dependent Schrödinger equation. Our parallel modification provides a speedup even though it increases compute usage, and extends the Crank-Nicolson method to allow it to handle PDEs with non-linear terms. C++ code is available at https://github.com/jordansauce/iterative_parallel_CN
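For reference, here is a minimal sketch (in Python rather than the C++ of the repo) of one standard serial Crank-Nicolson step for the 1D free-particle Schrödinger equation; the parallel and non-linear extensions from the paper are not shown.

```python
import numpy as np
from scipy.sparse import identity, diags
from scipy.sparse.linalg import spsolve

# One serial Crank-Nicolson step for  i dpsi/dt = -1/2 d^2psi/dx^2  (hbar = m = 1),
# solving  (I + i dt/2 H) psi_new = (I - i dt/2 H) psi_old  at each time step.
def crank_nicolson_step(psi: np.ndarray, dx: float, dt: float) -> np.ndarray:
    n = psi.size
    laplacian = diags([1, -2, 1], [-1, 0, 1], shape=(n, n)) / dx**2
    H = -0.5 * laplacian
    A = identity(n) + 0.5j * dt * H   # implicit (left-hand) operator
    B = identity(n) - 0.5j * dt * H   # explicit (right-hand) operator
    return spsolve(A.tocsc(), B @ psi)
```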
I ran numerical calculations to characterize errors in the world's most accurate atomic clocks: specifically, AC Stark frequency shifts induced by the trapping laser in strontium optical lattice clocks. The effects of these frequency shifts can be partially cancelled by operating the trapping laser at a "magic wavelength", so characterizing the dominant remaining errors required going to 4th-order perturbation theory, calculating the "hyperpolarisability" of strontium using C++ code.