Response Rubrics (RR) come first: they define what the model must output.
Chain-of-Thought (CoT) rubrics come second: they describe how the model should reason toward those outputs.
2) Response Rubrics
Define the final, observable outputs required to satisfy the prompt: the concluding calculations or answers the model must produce.
Use concrete, verifiable verbs (states, identifies, calculates).
One rubric = one claim (no stacking).
Binary evaluation (true/false).
Outputs must be fully determined by prompt inputs.
Avoid reasoning language (analyzes, considers, assumes).
Avoid intent labels (correctly, incorrectly).
Include explicit tolerances for calculated numbers (typically ±0.5%–1% in the same units).
Each item must state the final answer the model is expected to give.
Let the score encode incorrectness; the wording itself should remain factual (see the sketch after this list).
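To make these rules concrete, here is a minimal sketch in Python, assuming a hypothetical rubric schema: the item texts, dollar figures, and the within_tolerance helper are illustrative, not a required format.

```python
# Minimal sketch of binary Response Rubric items (hypothetical schema).
# Each item makes exactly one verifiable claim and grades strictly True/False.

def within_tolerance(observed: float, expected: float, rel_tol: float = 0.01) -> bool:
    """Binary check: True only when the value falls inside the stated tolerance."""
    return abs(observed - expected) <= abs(expected) * rel_tol

response_rubric = [
    # One rubric = one claim; concrete verb; explicit +/-1% tolerance.
    {"id": "RR1",
     "claim": "Calculates the total liability as $4,250 (tolerance: +/-1%)",
     "check": lambda ans: within_tolerance(ans["total_liability"], 4250.0)},
    # A second, separate claim rather than stacking it onto RR1.
    {"id": "RR2",
     "claim": "States the filing deadline as April 15, 2024",
     "check": lambda ans: ans["deadline"] == "2024-04-15"},
]

candidate_answer = {"total_liability": 4261.0, "deadline": "2024-04-15"}
for item in response_rubric:
    print(item["id"], item["check"](candidate_answer))  # RR1 True, RR2 True
```

Note how each claim names an observable output with a concrete verb, and the tolerance is stated in the claim itself rather than left to the grader's judgment.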
3) Chain-of-Thought (CoT) Rubrics
Describe the reasoning steps the model should demonstrate to arrive at the RR outputs.
Use cognitive verbs (analyzes, evaluates, recognizes).
Do not introduce new facts, assumptions, or calculations.
CoT cannot compensate for missing prompt inputs.
The justification should explain the method and why it is required to reach the final outputs. Avoid free-form chain-of-thought; use structured justifications.
Do not create CoT rubric items for negative criteria in the Response Rubric (see the sketch below).
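Continuing the hypothetical schema from the Response Rubric sketch, the following is one way such CoT items might look; the field names and claims are illustrative assumptions.

```python
# Sketch of CoT rubric items (same hypothetical schema as above).
# Cognitive verbs, structured justifications, and no new facts or calculations.
cot_rubric = [
    {"id": "CoT1", "supports": "RR1",
     "claim": "Applies the rate schedule given in the prompt to the stated income",
     "justification": ("Method link: RR1's liability figure cannot be reached "
                       "without applying the prompt's own rate schedule.")},
    {"id": "CoT2", "supports": "RR2",
     "claim": "Recognizes that the filing year stated in the prompt fixes the deadline",
     "justification": "Connects a prompt input (filing year) to the RR2 output."},
]

# Every CoT item must trace to a positive RR item; none mirrors a negative
# criterion or introduces a rate, threshold, or assumption the prompt omits.
rr_ids = {"RR1", "RR2"}
assert all(item["supports"] in rr_ids for item in cot_rubric)
```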
4) Time Stability
Ensure the task yields the same expected outputs and grading when run in the future.
Clear evidence boundary: graders must rely only on the prompt, the model’s answer, and the rubric text.
Freeze what can change: any rate, threshold, rule, or methodology that could evolve must be explicitly fixed by the prompt (see the sketch after this list).
Deterministic outputs: do not enforce values that depend on defaults, market conditions, or “reasonable assumptions.”
No inference from context: location, industry norms, or common practice cannot substitute for stated inputs.
Controlled external sources: dynamic sources require versioning, cutoffs, or archived snapshots.
Avoid freshness language (current, latest, today).
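As a sketch of these principles, the task definition below pins every evolving input; the field names and the mileage-rate value are illustrative assumptions, not prescribed figures.

```python
# Sketch: freezing evolving inputs inside the task itself (hypothetical fields).
task_inputs = {
    "mileage_rate_per_mile": 0.655,       # fixed by the prompt, never looked up live
    "rate_effective_date": "2023-01-01",  # explicit version/cutoff for the rate
    "depreciation_method": "straight-line, as defined in the prompt",
}
# Anti-pattern: "use the current mileage rate" -- the expected answer would
# drift as the rate changes, breaking deterministic grading and review.
```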
5) Task Complexity
High-quality tasks introduce complexity through:
Mixed-use ambiguity
Documentation gaps
Sequencing uncertainty
Financial volatility
6) Penalties
One underlying mistake should have one consequence.
Do not penalize the same interpretive error across multiple dimensions.
Use negative penalties only for explicit, high-risk disqualifiers.
Unmet or out-of-scope criteria should score 0, not negative (see the sketch below).
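A minimal sketch of this convention, with a hypothetical item schema: only items explicitly flagged as disqualifiers carry a negative penalty, and everything else that is unmet scores 0.

```python
# Sketch of the scoring convention (hypothetical fields).
def score_item(item: dict, met: bool) -> float:
    if met:
        return item.get("weight", 1.0)
    # Negative scores are reserved for explicit, high-risk disqualifiers;
    # ordinary unmet or out-of-scope criteria simply score 0.
    return item["penalty"] if item.get("disqualifier") else 0.0

print(score_item({"weight": 1.0}, met=False))                          # 0.0
print(score_item({"disqualifier": True, "penalty": -2.0}, met=False))  # -2.0
```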
7) Prompt-Rubric Alignment
A rubric is valid only if the prompt cannot be fully satisfied without it; the prompt's asks should match the rubric's answers, and vice versa.
Fail conditions:
Requires statutory detail beyond what the prompt asked for.
Enforces numeric outputs with underdetermined inputs.
Depends on unstated assumptions or evolving sources.
For the task to be valid, every ask in the prompt must be satisfied by the criteria, and every criterion must be explicitly asked in the prompt; the sketch below expresses this as a set comparison.
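This bidirectional check can be made mechanical; the identifiers below are hypothetical stand-ins for whatever the prompt actually asks.

```python
# Sketch: bidirectional coverage between prompt asks and rubric items.
prompt_asks = {"total_liability", "filing_deadline"}
rubric_targets = {"total_liability", "filing_deadline"}

missing_from_rubric = prompt_asks - rubric_targets   # asks with no criterion
unasked_in_prompt = rubric_targets - prompt_asks     # criteria with no ask
assert not missing_from_rubric and not unasked_in_prompt, "task is invalid"
```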
8) Time-Stable Prompts
The goal is to create prompts that remain valid, solvable, and consistently reviewable over time. A time-stable prompt yields the same expected outcome when run years later, because it clearly defines what evidence the model is allowed to rely on and prevents unbounded “freshness” assumptions.
This standard prevents two recurring issues:
Answer drift: responses vary because the model relies on different assumptions, conventions, or external context.
Review drift: evaluations become inconsistent because it is unclear what the model was expected to use.
Core principles
Prompts must make it clear what information is admissible so a reviewer can evaluate the response later without ambiguity.
A time cutoff is required when the expected answer could change over time. The cutoff should be written as part of the scenario, tied directly to what the task is asking.
If a referenced convention, definition, threshold, methodology, or standard could evolve, the prompt must clarify that later updates, revisions, reclassifications, or amendments are out of scope.
Implementation method
To make a prompt time-stable:
State what the model is allowed to use so the result is reviewable later.
Add a cutoff date or version only when needed, embedded naturally in the scenario.
Exclude later updates explicitly when the referenced material is subject to change (revisions, amendments, reclassifications, or consolidated updates).
Avoid open-ended freshness language (“latest,” “current,” “most recent,” “as of now,” “today,” etc.).
Time-fence example
Analyze the clause using mainstream U.S. commercial contract drafting principles as of June 1, 2024 (typical limitation-of-liability structure and carve-outs), and treat later legal or guidance developments as out of scope.
Your time fence should be written specifically for your scenario, placed inside the prompt where it belongs, and phrased in a way that clearly fits the task context. Avoid using a one-size-fits-all sentence that could apply to any prompt.
Dynamic sources for Rubric items
Some sources change frequently, even if the link stays the same. Treat a source as dynamic when it is:
an FAQ, policy page, or best-practices page
vendor documentation without a clear version
a landing page that can swap its linked PDF or underlying content
a dashboard, database view, or interactive page where values change
Rule for dynamic sources
Cite an archived snapshot and include the archive timestamp, rather than relying on the live link. This ensures reviewers and solvers are referencing the same content later.
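The Wayback Machine serves snapshots under a stable URL pattern (web.archive.org/web/<timestamp>/<url>); the sketch below builds such a citation, with a placeholder target URL.

```python
# Sketch: citing an archived snapshot instead of a live link.
timestamp = "20240601120000"                   # capture time, YYYYMMDDhhmmss
live_url = "https://example.com/pricing-faq"   # placeholder for the real source
snapshot = f"https://web.archive.org/web/{timestamp}/{live_url}"
print(snapshot)
# https://web.archive.org/web/20240601120000/https://example.com/pricing-faq
```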
Suggested services:
https://archive.org/web/ (Wayback Machine)