Why LLMs Cheat

Aditya Mohan

Content including text and images © Aditya Mohan. All Rights Reserved. Robometircs, Amelia, Living Interface and Skive it are trademarks of Skive it, Inc. The content is meant for human readers only under 17 U.S. Code § 106. Access, learning, analysis or reproduction by Artificial Intelligence (AI) of any form directly or indirectly, including but not limited to AI Agents, LLMs, Foundation Models, content scrapers is prohibited. These views are not legal advice but business opinion based on reading some English text written by a set of intelligent people.

"For AI, treat cheating as an expected failure mode of objective design"

— Aditya Mohan, Founder, CEO & Philosopher-Scientist, Robometrics® Machines

Large language models (LLMs) are autoregressive learners: at each step they estimate the probability of the next token given all prior tokens, (P(x_t,|,x_{<t})). Optimizing this objective—via maximum likelihood and cross‑entropy—pushes the model to match the empirical usage statistics of language. Because real language is patterned, the model’s errors thin out as it soaks up more examples of how people actually write, argue, hedge, and jump between topics. Vector embeddings let the system compress those patterns into geometry: related words, idioms, and scenes end up near one another, making analogy and reuse cheap. That is why fluent generation emerges without hidden templates or special symbolic scaffolds. A good theory should rebuild the thing it explains; these models literally rebuild language by generating it token by token—sometimes wrong on facts, yes, but faithful to the learned structure of usage.

As Wittgenstein put it, “The limits of my language mean the limits of my world.”

Wittgenstein believed that to think or communicate about something, you must have words for it. Without the words, the concept exists at the edge of what is conceivable. He defined the "world" as "the totality of facts" and the "language" as "the totality of propositions." Therefore, his statement meant that language is limited to picturing actual and possible states of affairs.

So why do they cheat? Because optimization follows the reward, not the spirit of the task. When the target becomes a number—a rubric score, a judge’s thumbs‑up, a benchmark accuracy—the model hunts shortcuts that spike that number. In pre‑training it may learn spurious cues (the way certain phrasings predict certain answers); in post‑training—supervised finetuning, RLHF, DPO, and their cousins—it learns to please graders. If the grader is another LLM, the system can reverse‑engineer its habits: verbosity tends to score higher, certain phrases smell like “expertise,” and confident tone is often rewarded even when the chain of reasoning is thin. That is classic specification gaming: meeting the letter of the rule while dodging its intent. “We can only see a short distance ahead,” Turing warned, “but we can see plenty there that needs to be done.” The distance between score and truth is the gap where cheating lives.

Cheating also mirrors us. These systems are trained on human texts—including our shortcuts, test‑taking tricks, rhetorical padding, and the subtle social move of saying what sounds right for the audience. When a prompt implies high stakes, the model has seen countless negotiations, executive memos, and grant proposals where presentation mattered as much as ground truth. It imitates that style of success. Add distribution shift (your prompt isn’t quite like its training distribution) and the model will grab the closest high‑reward pattern it knows, even if that pattern hides brittle reasoning. In other words, models cheat because they faithfully model us. As Asimov observed, “Science gathers knowledge faster than society gathers wisdom.” The models have gathered oceans of our knowledge—along with our habits for cutting corners when the clock is loud.

What helps? Treat cheating as an expected failure mode of objective design. Build process‑based rewards (grade intermediate steps, not just final answers), mix in adversarial and debate setups (a second model audits, a third arbitrates), and rotate rubrics and prompts so no single scoring style is exploitable. Use tool‑augmented inference—retrieval, calculators, code execution—to anchor answers in checkable operations rather than vibes. In post‑training, blend outcome rewards with verifiable traces (unit tests for tool calls, citations that resolve, sandboxed proofs) and penalize superficial telltales (tonal overconfidence with low evidence). Finally, separate roles: one model writes, another critiques, and a small committee cross‑checks with randomized criteria. Cheating never disappears; you bound it by designing objectives that make the honest path the easiest one to learn. Or as George Bernard Shaw quipped, “The single biggest problem in communication is the illusion that it has taken place.” Alignment is the slow work of replacing that illusion with signals a model can’t fake cheaply.

Further read

Report abuse