The spread across all 24 hours was only 3%. That’s statistical noise. There’s no magic hour. The pattern is flat: every hour hurts by roughly the same amount.
. . .
This makes sense once you understand how LLMs actually work.
They have no internal clock. When you send a prompt to GPT or Claude or Gemini, the model receives tokens and produces tokens. That’s it. It has no awareness of when the request arrived or what timezone you’re in.
That might sound like a bug, but it isn’t. It’s a deliberate design choice.
If you want the model to know the current date, you tell it. OpenAI, Anthropic, and Google all handle time this way. The model doesn’t “feel” different at 3 AM versus 3 PM. The timestamp you inject is just more text.
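To make that concrete, here’s roughly what timestamp injection looks like from the caller’s side. This is a sketch using the OpenAI Python client; the model name, the wording of the system line, and its placement are my own illustrative choices, not what any vendor actually injects.

```python
from datetime import datetime, timezone

from openai import OpenAI

client = OpenAI()

# The model only "knows" the time because we put it in the prompt text.
now = datetime.now(timezone.utc).strftime("%A, %B %d, %Y, %I:%M %p UTC")

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    temperature=0.0,
    messages=[
        {"role": "system", "content": f"It is currently {now}."},
        {"role": "user", "content": "How many times does the letter 'e' appear in 'nevertheless'?"},
    ],
)
print(response.choices[0].message.content)
```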
So why does adding it hurt performance?
“It is currently 2:00 PM” looks like the opening of a human-to-human conversation. So the model completes it like one. You’re not degrading the model. You’re invoking a different version of it: same weights, different behavior region.
. . .
The culprit is verbosity.
The model pattern-matches your context to its training data. Conversational preambles trigger conversational completions. It gets chatty. It explains itself. It formats things nicely, exactly what you’d want in a human-facing chatbot, but not what you want when your pipeline expects a single token like “4” or “positive.”
Here’s what I mean.
The prompt: “How many times does the letter ‘e’ appear in ‘nevertheless’?”
Correct answer: 4.
Without time context, the model responds:
4
With time context:
The letter e appears 4 times in ‘nevertheless’.
Both contain the right answer. But my evaluation expected a clean number. The verbose response failed.
In other words, most of the performance loss is a formatting issue, not a reasoning issue. The model knows the answer. But the extra words break simple exact-match evaluation pipelines and downstream parsers.
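A toy version of the scoring makes the failure obvious. This isn’t my actual harness, just the shape of a typical exact-match check:

```python
def exact_match(response: str, expected: str) -> bool:
    # Typical exact-match scoring: strip whitespace, compare verbatim.
    return response.strip() == expected

expected = "4"
print(exact_match("4", expected))  # True
print(exact_match("The letter e appears 4 times in 'nevertheless'.", expected))  # False
```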
This is exactly the problem researchers face when building LLM-based measures. If you ask GPT to classify an earnings call as “positive,” “negative,” or “neutral,” you need it to return one word. If it returns “I would classify this as positive because...,” your parsing breaks.
The model knows the answer. It just expresses it in a way that defeats automated extraction.
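The same failure, sketched as a naive (hypothetical) label parser:

```python
VALID_LABELS = {"positive", "negative", "neutral"}

def parse_label(response: str) -> str | None:
    # Naive parser: expects the response to be exactly one label, nothing else.
    label = response.strip().lower().rstrip(".")
    return label if label in VALID_LABELS else None

print(parse_label("positive"))  # "positive"
print(parse_label("I would classify this as positive because..."))  # None -> downstream code breaks
```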
. . .
Here’s where it gets interesting.
Governance frameworks require audit trails. You need to log what the model received and what it produced. Timestamps become mandatory.
But if injecting timestamps degrades accuracy by 10%, then the compliance mechanism undermines the output being audited. You’re paying a tax on accuracy to satisfy a governance requirement.
The audit trail exists to ensure quality. Yet creating the trail reduces the quality of the thing being recorded.
Academic research faces the same tension. Journals increasingly expect you to document your LLM usage. Good for reproducibility. But if researchers start embedding metadata in prompts to satisfy documentation requirements, they may introduce bias into the measure being documented.
. . .
This matters for anyone deploying Copilot or similar enterprise AI tools.
Many organizations roll out these tools without visibility into what happens behind the scenes. The system prompt is proprietary. You don’t know what contextual information gets injected automatically: timestamps, user IDs, session metadata, organizational context.
If your compliance team validated the tool in a clean test environment and then deployed it with additional context injection, there’s a gap between your tested accuracy and your production accuracy, and nobody may have measured it.
From a controls perspective, that’s equivalent to silently changing the questionnaire after you’ve validated the survey instrument.
For tax compliance, this is concrete.
Imagine using AI to classify transactions for GST purposes. Standard-rated versus exempt versus zero-rated. A model that correctly classifies 85% of transactions in testing might only hit 75% in production if the production system injects timestamps.
That ten-percentage-point gap represents real errors. Under-reporting. Over-reporting. Misclassified supplies. The kind of discrepancies that surface during audits and require explanation.
The same logic applies to transfer pricing documentation, corporate tax computations, and any workflow where AI extracts structured data from messy documents. If the production prompt differs from the validation prompt, your validation results don’t mean what you think they mean.
For researchers, the implication is methodological. Your validation sample needs to match your production environment exactly. Reviewers won’t be able to replicate your results if their prompt structure differs from yours.
This connects to findings I documented in an earlier analysis of GPT model drift. The model behind the API can change without notice. Prompt sensitivity compounds that instability.
. . .
So what should you do?
Log externally. Keep timestamps in your logging infrastructure, not in prompts. The audit record exists. The model never sees it.
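A minimal sketch of the pattern, using Python’s standard logging module. The function names are mine and the model call is a stub:

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("llm_audit")

def call_model(prompt: str) -> str:
    # Stand-in for the real API call; the prompt contains no timestamp.
    return "4"

def audited_call(prompt: str) -> str:
    output = call_model(prompt)
    # The timestamp lives in the audit record, never in the text the model sees.
    audit_log.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "output": output,
    }))
    return output

audited_call("How many times does the letter 'e' appear in 'nevertheless'?")
```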
Reinforce format instructions. If you must include context, add explicit output requirements: “Respond with only a number,” “Answer with one of: positive, negative, neutral,” or “Return valid JSON only.” This counteracts the conversational drift.
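For example, a small (hypothetical) helper that bolts the constraint onto whatever context you’re stuck with:

```python
def with_format_constraint(task: str, constraint: str) -> str:
    # Append an explicit output requirement to counteract conversational drift.
    return f"{task}\n\n{constraint}"

prompt = with_format_constraint(
    "It is currently 2:00 PM. How many times does the letter 'e' appear in 'nevertheless'?",
    "Respond with only a number. No explanation, no punctuation.",
)
print(prompt)
```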
Test your actual production environment. Don’t assume validation results hold. Run your test suite against the real system prompt and context injection used in production.
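Here’s the shape of such a test, with a stubbed model call standing in for the real client and an invented production system prompt:

```python
from dataclasses import dataclass

@dataclass
class Case:
    prompt: str
    expected: str

def call_model(system_prompt: str, user_prompt: str) -> str:
    # Stand-in for the real production client, system prompt and all.
    return "4"

def accuracy(cases: list[Case], system_prompt: str) -> float:
    correct = sum(
        call_model(system_prompt, c.prompt).strip() == c.expected for c in cases
    )
    return correct / len(cases)

# Score against what production actually sends, not a stripped-down test prompt.
production_system_prompt = "You are a helpful assistant. It is currently 2:00 PM."
cases = [Case("How many times does the letter 'e' appear in 'nevertheless'?", "4")]
print(f"Production accuracy: {accuracy(cases, production_system_prompt):.0%}")
```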
Accept the tradeoff consciously. Sometimes the audit requirement is non-negotiable. Document the known performance impact. A 10% accuracy penalty may be acceptable if the alternative is no audit trail.
. . .
The broader lesson is simple.
“More context is better” breaks down in subtle ways. Not all context helps. Some actively hurts.
To be clear: if the task genuinely depends on time (say, applying a tax rule that changed on a specific date), you absolutely should tell the model. The problem here is irrelevant context, not context in general.
Time-of-day information has no bearing on counting letters or classifying transactions. Including it doesn’t help the model reason. It just makes the model think you want a different kind of response.
We often add controls and documentation requirements without measuring their second-order effects. The control becomes ritual rather than safeguard. We check the box without asking whether the box makes things better or worse.
The next time you design an AI prompt, ask yourself: does this piece of context help the model perform the task, or is it just there for us?
Sometimes the best thing you can give an AI is less.
. . .
Methodology note: Each prompt has a single algorithmically verifiable answer. For arithmetic, the output must match the expected integer exactly. Temperature was 0.0. The 10,000 calls: 20 prompts × 25 time conditions × 20 runs per combination.
Raw data available on request. For related findings on model drift, see: “Is my LLM getting dumber, or is it just me?”