“Fair tests require short memories; fair systems require long ones.”
— Aditya Mohan, Founder, CEO & Philosopher-Scientist, Robometrics® Machines
A perfect evaluator begins where ordinary judgment fails: at the seam between memory and measurement. Humans cannot help but accumulate context; yesterday’s brilliant violinist makes today’s merely excellent pianist seem thin by contrast, the first interview of the morning colors the fifth, and a stray anecdote lingers like perfume over every subsequent score. This is strength in life but noise in testing. By design, an AI evaluator can do the opposite—enter each sample as if for the first time, purge the slate between trials, and treat every candidate, code path, or experiment as an independent draw. The mood in such a system is clinical and calm: lights steady, rubric fixed, temperature controlled, order randomized, and no residue from what came before.
The scientific core is independence. In human panels, latent state drifts across a session—anchoring, halo and horn effects, fatigue, contrast, and narrative momentum all couple one decision to the next. Inter‑test contamination is statistically visible: errors that should be idiosyncratic become correlated across items. A large language model can be instrumented to avoid this: zero context between items, fixed or stratified random seeds, counter‑balanced prompts, and pre‑registered rubrics with item response theory (IRT) calibration. Each evaluation becomes a fresh experiment with known priors and controlled randomness. Francis Bacon warned, “The human understanding, when it has once adopted an opinion, draws all things else to support and agree with it.” The unremembering judge resists this pull by construction, not by will.
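The protocol above can be sketched in a few lines. This is a minimal illustration, not an implementation: `judge` is a deterministic stub standing in for a call to an LLM judge made with an empty conversation history, and all names here are assumptions for the sketch. The essential properties are the ones the text names: no state carried between items, a fixed seed, and a randomized presentation order.

```python
import hashlib
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    item_id: str
    prompt: str

def judge(item: Item, seed: int) -> float:
    # Stub for an LLM judge invoked with zero prior context.
    # A stable hash of (item, seed) makes the stub reproducible.
    digest = hashlib.sha256(f"{item.item_id}:{seed}".encode()).hexdigest()
    rng = random.Random(int(digest, 16))
    return round(rng.uniform(0.0, 1.0), 3)

def evaluate(items: list[Item], run_seed: int) -> dict[str, float]:
    order = list(items)
    random.Random(run_seed).shuffle(order)  # counterbalanced, seeded order
    scores = {}
    for item in order:
        # Fresh state per item: nothing from earlier items is passed in.
        scores[item.item_id] = judge(item, run_seed)
    return scores

items = [Item("a", "Grade answer A"), Item("b", "Grade answer B")]
assert evaluate(items, 7) == evaluate(items, 7)  # reproducible given the seed
```

Because every source of randomness is seeded and no context flows between items, two runs with the same seed are identical—each evaluation is an independent, repeatable experiment.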
Which is better—LLM eval or human eval? For high‑volume, well‑specified tasks (grading structured outputs, rubric‑bounded interviews, safety red‑teaming with known threat models), instrumented LLM judging is typically superior on reliability, speed, and cost, because it can produce many independent samples and ensemble them without cross‑talk. But AI has its own failure modes: prompt sensitivity, model updates that shift baselines, latent training contamination, shallow agreement with the rubric rather than the construct, and brittle behavior under out‑of‑distribution novelty. Humans remain better for tacit signals—reading subtext, inventing new criteria when the rubric is wrong, or integrating ethics, context, and stakes. The pragmatic answer is hybrid: let AI generate independent, blinded, and reproducible measurements; let humans adjudicate edge cases, redefine the construct, and own accountability.
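The hybrid division of labor suggests a simple aggregation rule, sketched here under stated assumptions: each judge in the ensemble scores the item independently (no judge sees another's output), the median resists a single outlier judge, and a wide spread flags the item for human adjudication. The threshold value is illustrative, not prescribed.

```python
import statistics

def ensemble_score(judge_scores: list[float], escalate_spread: float = 0.3) -> dict:
    """Aggregate independent judge scores; flag high disagreement for humans."""
    spread = max(judge_scores) - min(judge_scores)
    return {
        "median": statistics.median(judge_scores),  # robust to one outlier
        "spread": spread,
        "needs_human": spread > escalate_spread,    # edge case: adjudicate
        "n": len(judge_scores),
    }

result = ensemble_score([0.8, 0.75, 0.9, 0.2])
# a large spread routes the item to a human adjudicator
```

The machine side produces many independent, blinded measurements; the escalation flag is where human judgment re-enters to own the contested cases.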
Blueprint for a near‑perfect evaluator: enforce statelessness between items; randomize order and wording; diversify an ensemble of AI judges with orthogonal system prompts and fixed seeds; calibrate the item bank with IRT and generalizability theory; add adversarial probes and counterfactual variants to detect overfit to surface cues; monitor drift with control items salted through every batch; and log everything—hashes of prompts, seeds, model version, and rubrics—for auditability. Most crucially, separate measurement from memory: the evaluator forgets between tests, the organization remembers across them through dashboards, confidence intervals, and decision records. In this division of labor, forgetting is not a flaw but a feature—and the noise that tempts human judgment becomes manageable signal for machines to weigh, and for humans to ultimately decide.
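Two pieces of the blueprint—logging hashes for auditability and monitoring drift with salted control items—can be sketched as follows. Function names, the tolerance value, and the model-version string are assumptions for illustration; the underlying ideas are the ones the blueprint names.

```python
import hashlib
import json
import statistics

def audit_record(prompt: str, rubric: str, seed: int, model_version: str) -> dict:
    # Hash the full run configuration so any score can later be traced
    # to the exact prompt, rubric, seed, and model version that produced it.
    blob = json.dumps({"prompt": prompt, "rubric": rubric,
                       "seed": seed, "model": model_version}, sort_keys=True)
    return {"config_hash": hashlib.sha256(blob.encode()).hexdigest(),
            "model": model_version, "seed": seed}

def drift_alert(control_scores: list[float], baseline_mean: float,
                tolerance: float = 0.05) -> bool:
    # Control items are salted through every batch; if their mean score
    # shifts beyond tolerance, the judge (or its model) has likely drifted.
    return abs(statistics.mean(control_scores) - baseline_mean) > tolerance

rec = audit_record("Grade answer A", "rubric-v3", 7, "judge-model-2024")
```

The evaluator itself keeps no memory between items; the organization keeps these records instead—forgetting at measurement time, remembering at decision time.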