Is my LLM getting dumber, or is it just me?

The AI you fell in love with may not be the AI you are married to.


I ran 43 tests against every GPT-4 variant I could get my hands on. The results made me uncomfortable.


Something strange has been happening with GPT.

Long-time users have been complaining on forums and social media that the AI feels “dumber” than it used to be. Developers report that code suggestions that once worked flawlessly now come back broken. Researchers notice that math problems the model used to ace are suddenly producing wrong answers.

At first, I dismissed these complaints as the usual tech grumbling. Maybe people were just noticing limitations they had previously overlooked. Maybe the honeymoon period with AI had simply ended.

Then I ran the numbers myself. What I found should concern anyone who relies on AI tools for serious work.


The pachinko problem

To understand what might be happening, consider a business model from Japanese gambling halls.

Pachinko parlors are notorious for a specific pattern. When you first sit down at a machine, you often win. The balls cascade, the lights flash, and you think you have figured something out. But keep playing, and the wins gradually dry up. The machine has been calibrated (in the old days, by literally hammering the pins) to hook you with early rewards, then slowly tighten the odds.

Players keep chasing that initial high. The house profits from their optimism.

This pattern, sometimes called “bait and switch,” appears across many industries. Mobile games give you easy victories in early levels before ramping up difficulty and pushing in-app purchases. SaaS companies offer generous free tiers that quietly shrink once you have built your workflow around them. Streaming services launch with low prices that creep upward after you have accumulated years of watch history.

The question is whether something similar is happening with large language models.

 

What “nerfing” actually means

In gaming communities, “nerfing” refers to weakening a character or weapon through software updates. Players who relied on a particular strategy suddenly find it no longer works.

In the AI context, users have borrowed this term to describe a perceived decline in model quality over time. But what could actually cause an AI to get worse?

Several mechanisms are plausible: quantization or distillation to cut serving costs, safety fine-tuning that shifts behavior in unrelated domains, silent swaps to a cheaper backend model, or changes to the hidden system prompt.

From the outside, all these mechanisms produce the same observable effect: the AI behaves differently than before. Sometimes worse, sometimes just different, but definitely not stable.

 

Testing the hypothesis

As a researcher, I am trained to be skeptical of claims that lack data. So I built a small but careful benchmark to test whether model quality has actually declined.

The test suite includes 43 prompts across seven categories, ranging from letter counting to math problems.

I deliberately avoided viral examples like “How many R’s in strawberry?” that models might have been specifically trained on. Instead, I used less common words like “peripherally” and “nevertheless,” along with long medical terminology.
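One nice property of letter-counting prompts is that the ground truth can be computed directly, so grading is unambiguous. A minimal checker might look like the sketch below (the word/letter pairs are illustrative, not the article's actual test set):

```python
def letter_count(word: str, letter: str) -> int:
    """Ground-truth count of a letter in a word (case-insensitive)."""
    return word.lower().count(letter.lower())

# Illustrative checks, not the real benchmark prompts:
print(letter_count("peripherally", "r"))  # → 2
print(letter_count("strawberry", "r"))    # → 3
```

A model's answer can then be scored by exact match against this function's output, which removes any human judgment from the grading step.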

All tests ran with temperature set to zero and a fixed random seed. This minimizes randomness and makes any differences more likely to reflect actual model changes.
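With the official OpenAI Python SDK, those settings translate to keyword arguments on the chat-completions call. Here is a sketch of a request builder (the model name and prompt are placeholders; note that OpenAI documents `seed` as best-effort reproducibility, not a hard guarantee):

```python
def build_request(model: str, prompt: str, seed: int = 42) -> dict:
    """Keyword arguments for client.chat.completions.create(...)
    with the deterministic settings described above."""
    return {
        "model": model,  # placeholder, e.g. a dated GPT-4 snapshot
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # greedy decoding: always take the highest-probability token
        "seed": seed,      # best-effort reproducibility knob in the OpenAI API
    }

# With the SDK this would be used roughly as (not run here; needs an API key):
#   from openai import OpenAI
#   client = OpenAI()
#   resp = client.chat.completions.create(**build_request("gpt-4", "2 + 2?"))
#   print(resp.choices[0].message.content)
```

Even with these settings, responses can vary across backend updates, which is exactly why re-running an identical suite over time can surface model changes.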

I tested eight OpenAI models.

The results were striking.