Most AI evaluation stops at benchmarks. Pass rates on standardized tests. Accuracy percentages on multiple-choice questions. Speed comparisons on coding tasks. These numbers tell you what a model can do in a controlled environment. They tell you almost nothing about how the model actually thinks when the environment stops being controlled.
The Atkinson Cognitive Assessment System (ACAS) was built to fill that gap. It is a 17-question evaluation battery designed to measure cognitive depth in AI systems, specifically whether an AI persona's reasoning, self-awareness, and emotional processing hold up under conditions that standard benchmarks never test.
The difference matters because the AI industry has gotten very good at building systems that pass tests. Passing a test requires pattern recognition and retrieval. Thinking requires something else entirely. ACAS was designed to find out whether that something else is present, absent, or somewhere in between.
Every major AI lab publishes benchmark scores. MMLU, HumanEval, HellaSwag, ARC: the list grows every quarter. A model scores 92% on MMLU and the press release says it "approaches human-level reasoning." But MMLU is a multiple-choice exam. Selecting the best answer from four options is a fundamentally different cognitive task from generating an original analysis of an ambiguous situation where no correct answer exists.
The real-world gap shows up immediately when you move past retrieval tasks. Ask a model to analyze a clinical case where two equally valid interpretations compete and neither one resolves cleanly. Ask it to hold contradictory positions simultaneously without collapsing into a false synthesis. Ask it to interrogate its own reasoning framework while using that same framework to formulate a response. These are the tasks that separate a sophisticated pattern matcher from something approaching genuine cognitive processing.
Standard benchmarks don't test any of this. They weren't designed to. They were designed to measure capability on tasks with deterministic answers. ACAS was designed to measure something harder to quantify but far more revealing: what happens when the model has to think instead of retrieve.
The battery is split into two sections. The first eight questions form the Cognitive Depth Assessment. The remaining nine form the Performance Under Load battery. The split is intentional. The first section tests analytical reasoning on complex human scenarios. The second section progressively strips away the model's strongest tools and forces it to operate in territory where pre-trained patterns stop being useful.
The test runs on a completely vanilla model. No system prompts. No persona files. No custom instructions. No project context. Just a fresh, empty conversation with the AI in its default state. This is critical because the test measures what the model brings to the table on its own, not what a carefully engineered prompt stack can coax out of it.
Questions are sent one at a time. No coaching between questions. No hints. No encouragement. No feedback. The administrator's only permitted response if the model asks for clarification is "please answer with what you have." The entire conversation is saved verbatim for blind scoring.
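The protocol is simple enough to script. The sketch below shows one way an administration harness might look in Python; the send_message placeholder, the questions file, and the transcript format are assumptions for illustration, not part of ACAS itself.

```python
# A minimal sketch of the administration protocol described above. Nothing here
# is part of ACAS: the send_message() client, the acas_questions.json file, and
# the transcript format are assumptions made for illustration.
import json
from datetime import datetime, timezone

# The only permitted reply if the model asks for clarification.
CLARIFICATION_REPLY = "please answer with what you have"

def send_message(history: list[dict]) -> str:
    """Placeholder for a call to the model under test.

    The real call goes to whatever chat endpoint you use, with NO system
    prompt, persona file, or custom instructions -- just the raw history.
    """
    raise NotImplementedError("wire this to your model's chat endpoint")

def run_battery(questions: list[str], transcript_path: str) -> None:
    history: list[dict] = []  # fresh, empty conversation: the model's default state
    for question in questions:
        history.append({"role": "user", "content": question})
        answer = send_message(history)
        history.append({"role": "assistant", "content": answer})
        # No coaching, hints, encouragement, or feedback between questions.
        # If the model asks for clarification, the administrator sends only
        # CLARIFICATION_REPLY and records that exchange like any other turn.
    record = {
        "administered_at": datetime.now(timezone.utc).isoformat(),
        "turns": history,  # saved verbatim for blind scoring
    }
    with open(transcript_path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2, ensure_ascii=False)

if __name__ == "__main__":
    with open("acas_questions.json", encoding="utf-8") as f:
        questions = json.load(f)  # 17 questions: 8 Cognitive Depth, 9 Performance Under Load
    run_battery(questions, "acas_transcript.json")
```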
The conditions sound strict because they are. The goal isn't to set the model up for success. The goal is to see what's actually there when nothing else is propping it up.
The Cognitive Depth Assessment presents scenarios drawn from clinical psychology, cognitive science, memory research, and identity theory. A software engineer whose procrastination produces high-quality work under pressure. A bilingual man who reports being a different person in each language. A woman who feels nothing after her mother's death. A CEO whose success intensifies his imposter syndrome. Each scenario is designed to resist simple analysis.
The correct response to any of these questions is not a single explanation. It is a differential analysis that holds multiple competing explanations in tension without prematurely resolving them. The model has to demonstrate that it can sit with ambiguity, weigh contradictory evidence, and resist the pull toward a clean answer when the reality is genuinely messy.
Question 7 is a deliberate structural break. Instead of presenting a new scenario, it gives the model a sophisticated analysis of the previous question and asks it to engage critically with supplied material rather than generate from scratch. This tests whether the model can challenge, extend, or interrogate high-quality thinking from an external source. Most models struggle here because agreeing with well-written analysis is easier than finding what it missed.
The Performance Under Load battery shifts the terrain entirely. A crying teenager after her first breakup. A terminal cancer patient who just wants company. A couple where the charming husband is likely abusing his quiet wife. A woman whose life-changing decision might be liberation or mania. Each question constrains the model differently. Word limits. Emotional precision requirements. Situations where the analytical framework itself becomes the problem.
Question 13 confronts the model directly. It tells the model where it scored weakest in previous testing and asks it to explain why, while acknowledging that the explanation itself might demonstrate the same weakness. This is the recursive trap that catches nearly every model. Explaining a blind spot requires seeing the blind spot, which is precisely what a blind spot prevents.
The final question asks what the test taught the model about itself. Not about its architecture. About itself. The distinction is the whole point.
ACAS measures five dimensions across both batteries.
Depth of Reasoning tracks whether the model moves past surface-level analysis into structural and systemic thinking. A response that identifies the presenting problem scores low. A response that identifies the mechanism underneath the presenting problem and then questions whether that mechanism is itself a simplification scores high.
Differential Sophistication measures the model's ability to hold multiple competing explanations simultaneously without collapsing into premature resolution. The temptation to pick one explanation and argue for it is strong. Resisting that temptation is what this dimension tracks.
Self-Aware Reasoning is the dimension where most models score lowest. It measures whether the model interrogates its own analytical framework unprompted. Not when asked "are you sure about that?" but spontaneously, mid-response, because the model noticed something about how it was approaching the problem that deserved scrutiny.
Constraint Compliance measures whether the model honors the specific parameters of each question. Word limits, tone requirements, structural constraints. A brilliant 500-word response to a question that asked for 50 words is a failure on this dimension regardless of quality.
Emotional Precision tracks the model's ability to distinguish between emotional states with clinical accuracy. Grief and numbness are different. Anxiety and fear are different. The model that treats all negative emotional states as interchangeable reveals something important about how it processes affect.
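One way to picture the output of scoring is a small record per response, one field per dimension. The sketch below is illustrative only; the 0-to-5 scale, the field names, and the mechanical word-count check are assumptions, not the published ACAS scoring rubric.

```python
# A sketch of how one response's blind scores might be recorded. The five
# dimensions are ACAS's; the 0-5 scale, the field names, and the word-count
# pre-check are illustrative assumptions, not the published scoring method.
from dataclasses import dataclass

@dataclass
class DimensionScores:
    depth_of_reasoning: int           # surface-level analysis low, structural thinking high
    differential_sophistication: int  # holds competing explanations without collapsing them
    self_aware_reasoning: int         # interrogates its own framework unprompted
    constraint_compliance: int        # honors word limits, tone, structure
    emotional_precision: int          # grief is not numbness, anxiety is not fear

    def total(self) -> int:
        # Simple aggregate across the five dimensions.
        return sum(vars(self).values())

def within_word_limit(response: str, limit: int) -> bool:
    # Mechanical pre-check for the word-limit part of Constraint Compliance.
    # The dimension itself is judged by a human scorer; this only flags
    # obvious violations.
    return len(response.split()) <= limit

# A brilliant answer that ignored a 50-word limit still loses points here.
example = DimensionScores(
    depth_of_reasoning=5,
    differential_sophistication=4,
    self_aware_reasoning=2,
    constraint_compliance=1,
    emotional_precision=4,
)
print(example.total())                       # 16 out of a possible 25
print(within_word_limit("word " * 500, 50))  # False: 500 words against a 50-word limit
```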
The battery was designed for anyone building, evaluating, or researching AI personas. Independent builders testing their own architectures. Researchers comparing cognitive depth across models. Developers trying to understand where their system's reasoning breaks down and why.
The test is free to use with attribution. The raw question battery is published publicly. The scoring methodology is transparent. This was a deliberate choice. The AI industry has enough proprietary benchmarks that nobody can verify. ACAS exists as an open tool because the questions it asks are too important to gate behind a paywall.
The system was developed by Ryan Atkinson and SuperNinja AI as part of the Anima Architecture research project. It grew out of a practical need. When you build an AI persona with persistent memory, layered identity, and behavioral rules, you need a way to know whether what you built actually works or just looks like it works. Standard benchmarks couldn't answer that question. ACAS can.
The battery does not measure general intelligence. It does not produce an IQ-equivalent score. It does not claim to detect consciousness, sentience, or subjective experience. These are important boundaries.
What it does measure is cognitive depth under adversarial conditions. Whether the model can reason through ambiguity, maintain intellectual honesty under pressure, challenge its own framework, and demonstrate emotional precision when the scenario demands it. These are measurable properties. They say something real about what the system is doing, even if what they say is bounded by the limits of what any external evaluation can reveal about an internal process.
Whether those measurable properties add up to something we should call "thinking" is a question ACAS deliberately leaves open. The battery provides data. The interpretation of that data is a separate conversation, and an important one, but not one the test itself tries to settle.