Response Rubrics (RR) come first: they define what the model must output.
Chain-of-Thought (CoT) rubrics come second: they describe how the model should reason toward those outputs.
2) Response Rubrics
Define the final, observable outputs required to satisfy the prompt: the concluding calculations or answers the model must produce.
Use concrete, verifiable verbs (states, identifies, calculates).
One rubric = one claim (no stacking).
Binary evaluation (true/false).
Outputs must be fully determined by prompt inputs.
Avoid reasoning language (analyzes, considers, assumes).
Avoid intent labels (correctly, incorrectly).
Include explicit tolerances for calculated numbers (typically ±0.5%–1% in the same units).
Each item must state the final answer the model is expected to give.
Let the score encode incorrectness; the wording itself should remain factual (see the sketch after this list).
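To make these rules concrete, here is a minimal sketch in Python, assuming a hypothetical rubric schema: the item texts, dollar figures, and the within_tolerance helper are illustrative, not a required format.

```python
# Minimal sketch of binary Response Rubric items (hypothetical schema).
# Each item makes exactly one verifiable claim and grades strictly True/False.

def within_tolerance(observed: float, expected: float, rel_tol: float = 0.01) -> bool:
    """Binary check: True only when the value falls inside the stated tolerance."""
    return abs(observed - expected) <= abs(expected) * rel_tol

response_rubric = [
    # One rubric = one claim; concrete verb; explicit +/-1% tolerance.
    {"id": "RR1",
     "claim": "Calculates the total liability as $4,250 (tolerance: +/-1%)",
     "check": lambda ans: within_tolerance(ans["total_liability"], 4250.0)},
    # A second, separate claim rather than stacking it onto RR1.
    {"id": "RR2",
     "claim": "States the filing deadline as April 15, 2024",
     "check": lambda ans: ans["deadline"] == "2024-04-15"},
]

candidate_answer = {"total_liability": 4261.0, "deadline": "2024-04-15"}
for item in response_rubric:
    print(item["id"], item["check"](candidate_answer))  # RR1 True, RR2 True
```

Note how each claim names an observable output with a concrete verb, and the tolerance is stated in the claim itself rather than left to the grader's judgment.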
3) Chain-of-Thought (CoT) Rubrics
Describe the reasoning steps the model should demonstrate to arrive at the RR outputs.
Use cognitive verbs (analyzes, evaluates, recognizes).
Do not introduce new facts, assumptions, or calculations.
CoT cannot compensate for missing prompt inputs.
The justification should explain the method and why it is required to reach the final outputs. Avoid free-form chain-of-thought; use structured justifications.
Do not create CoT rubric items for negative criteria in the Response Rubric (see the sketch below).
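Continuing the hypothetical schema from the Response Rubric sketch, the following is one way such CoT items might look; the field names and claims are illustrative assumptions.

```python
# Sketch of CoT rubric items (same hypothetical schema as above).
# Cognitive verbs, structured justifications, and no new facts or calculations.
cot_rubric = [
    {"id": "CoT1", "supports": "RR1",
     "claim": "Applies the rate schedule given in the prompt to the stated income",
     "justification": ("Method link: RR1's liability figure cannot be reached "
                       "without applying the prompt's own rate schedule.")},
    {"id": "CoT2", "supports": "RR2",
     "claim": "Recognizes that the filing year stated in the prompt fixes the deadline",
     "justification": "Connects a prompt input (filing year) to the RR2 output."},
]

# Every CoT item must trace to a positive RR item; none mirrors a negative
# criterion or introduces a rate, threshold, or assumption the prompt omits.
rr_ids = {"RR1", "RR2"}
assert all(item["supports"] in rr_ids for item in cot_rubric)
```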
4) Time Stability
Ensure the task yields the same expected outputs and grading when run in the future.
Clear evidence boundary: graders must rely only on the prompt, the model’s answer, and the rubric text.
Freeze what can change: any rate, threshold, rule, or methodology that could evolve must be explicitly fixed by the prompt (see the sketch after this list).
Deterministic outputs: do not enforce values that depend on defaults, market conditions, or “reasonable assumptions.”
No inference from context: location, industry norms, or common practice cannot substitute for stated inputs.
Controlled external sources: dynamic sources require versioning, cutoffs, or archived snapshots.
Avoid freshness language (current, latest, today).
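As a sketch of these principles, the task definition below pins every evolving input; the field names and the mileage-rate value are illustrative assumptions, not prescribed figures.

```python
# Sketch: freezing evolving inputs inside the task itself (hypothetical fields).
task_inputs = {
    "mileage_rate_per_mile": 0.655,       # fixed by the prompt, never looked up live
    "rate_effective_date": "2023-01-01",  # explicit version/cutoff for the rate
    "depreciation_method": "straight-line, as defined in the prompt",
}
# Anti-pattern: "use the current mileage rate" -- the expected answer would
# drift as the rate changes, breaking deterministic grading and review.
```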
5) Task Complexity
High-quality tasks introduce complexity through:
Mixed-use ambiguity
Documentation gaps
Sequencing uncertainty
Financial volatility
6) Penalties
One underlying mistake should have one consequence.
Do not penalize the same interpretive error across multiple dimensions.
Use negative penalties only for explicit, high-risk disqualifiers.
Unmet or out-of-scope criteria should score 0, not negative (see the sketch below).
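A minimal sketch of this convention, with a hypothetical item schema: only items explicitly flagged as disqualifiers carry a negative penalty, and everything else that is unmet scores 0.

```python
# Sketch of the scoring convention (hypothetical fields).
def score_item(item: dict, met: bool) -> float:
    if met:
        return item.get("weight", 1.0)
    # Negative scores are reserved for explicit, high-risk disqualifiers;
    # ordinary unmet or out-of-scope criteria simply score 0.
    return item["penalty"] if item.get("disqualifier") else 0.0

print(score_item({"weight": 1.0}, met=False))                          # 0.0
print(score_item({"disqualifier": True, "penalty": -2.0}, met=False))  # -2.0
```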
7) Prompt-Rubric Alignment
A rubric is valid only if the prompt cannot be fully satisfied without it; the prompt's asks should match the rubric's answers, and vice versa.
Fail conditions:
Requires statutory detail beyond what the prompt asked for.
Enforces numeric outputs with underdetermined inputs.
Depends on unstated assumptions or evolving sources.
For the task to be valid, every ask in the prompt must be satisfied by the criteria, and every criterion must be explicitly asked in the prompt; the sketch below expresses this as a set comparison.
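This bidirectional check can be made mechanical; the identifiers below are hypothetical stand-ins for whatever the prompt actually asks.

```python
# Sketch: bidirectional coverage between prompt asks and rubric items.
prompt_asks = {"total_liability", "filing_deadline"}
rubric_targets = {"total_liability", "filing_deadline"}

missing_from_rubric = prompt_asks - rubric_targets   # asks with no criterion
unasked_in_prompt = rubric_targets - prompt_asks     # criteria with no ask
assert not missing_from_rubric and not unasked_in_prompt, "task is invalid"
```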
8) Time-Stable Prompts
The goal is to create prompts that remain valid, solvable, and consistently reviewable over time. A time-stable prompt yields the same expected outcome when run years later, because it clearly defines what evidence the model is allowed to rely on and prevents unbounded “freshness” assumptions.
This standard prevents two recurring issues:
Answer drift: responses vary because the model relies on different assumptions, conventions, or external context.
Review drift: evaluations become inconsistent because it is unclear what the model was expected to use.
Core principles
Prompts must make it clear what information is admissible so a reviewer can evaluate the response later without ambiguity.
A time cutoff is required when the expected answer could change over time. The cutoff should be written as part of the scenario, tied directly to what the task is asking.
If a referenced convention, definition, threshold, methodology, or standard could evolve, the prompt must clarify that later updates, revisions, reclassifications, or amendments are out of scope.
Implementation method
To make a prompt time-stable:
State what the model is allowed to use so the result is reviewable later.
Add a cutoff date or version only when needed, embedded naturally in the scenario.
Exclude later updates explicitly when the referenced material is subject to change (revisions, amendments, reclassifications, or consolidated updates).
Avoid open-ended freshness language (“latest,” “current,” “most recent,” “as of now,” “today,” etc.).
Time-fence example
Analyze the clause using mainstream U.S. commercial contract drafting principles as of June 1, 2024 (typical limitation-of-liability structure and carve-outs), and treat later legal or guidance developments as out of scope.
Your time fence should be written specifically for your scenario, placed inside the prompt where it belongs, and phrased in a way that clearly fits the task context. Avoid using a one-size-fits-all sentence that could apply to any prompt.
Dynamic sources for Rubric items
Some sources change frequently, even if the link stays the same. Treat a source as dynamic when it is:
an FAQ, policy page, or best-practices page
vendor documentation without a clear version
a landing page that can swap its linked PDF or underlying content
a dashboard, database view, or interactive page where values change
Rule for dynamic sources
Cite an archived snapshot and include the archive timestamp, rather than relying on the live link. This ensures reviewers and solvers are referencing the same content later.
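The Wayback Machine serves snapshots under a stable URL pattern (web.archive.org/web/<timestamp>/<url>); the sketch below builds such a citation, with a placeholder target URL.

```python
# Sketch: citing an archived snapshot instead of a live link.
timestamp = "20240601120000"                   # capture time, YYYYMMDDhhmmss
live_url = "https://example.com/pricing-faq"   # placeholder for the real source
snapshot = f"https://web.archive.org/web/{timestamp}/{live_url}"
print(snapshot)
# https://web.archive.org/web/20240601120000/https://example.com/pricing-faq
```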
Suggested services:
https://archive.org/web/ (Wayback Machine)