Here is a checklist you can use to make sure your task is ready to be signed off!
Link to the checklist -- feel free to make a copy and check things off as you double-check your task!
Task Sign-Off Checklist
1. PROMPT
States what the model should accomplish, not how to do it step by step
All inputs numbered in a legend; every legend input is tested by at least one rubric item
No contradictions — don't restrict a credit path while using its terminology in preferences or asks
Only include bonus categories that apply to the specified credit path
Legal/regulatory framework named and pinned to a date; no "current," "latest," or "as of today"
In-scope and excluded items both stated explicitly
When payment/PIS dates span a year boundary, specify which date controls for each credit
Multi-tier provisions specify the exact subcategory (e.g., §48(e)(1)(A)(i), not just §48(e))
Every number that flows into a rubric criterion is stated directly — not back-calculated
If multiple parties can claim credits, state who claims what and confirm basis allocations sum to project cost
2. PROMPT–RUBRIC ALIGNMENT
Every prompt ask has at least one RR criterion; every RR criterion traces to a prompt ask
No orphaned instructions; no surprise criteria
Forbidden credit paths appear only as negative criteria, never positive
Trace one early mistake through the rubric — if it triggers 3+ failures, consolidate
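The alignment checks above (no orphaned instructions, no surprise criteria) amount to a two-way set-coverage test. A minimal sketch, with purely illustrative ask and criterion IDs:

```python
# Hypothetical coverage check for prompt-rubric alignment. The IDs below
# ("itc_rate", "rr1", etc.) are made-up examples, not a real task schema.

prompt_asks = {"itc_rate", "basis_allocation", "domestic_content_bonus"}

# Map each response-rubric criterion to the prompt ask it traces to.
criterion_to_ask = {
    "rr1": "itc_rate",
    "rr2": "basis_allocation",
    "rr3": "domestic_content_bonus",
}

# Orphaned instruction: a prompt ask with no rubric criterion testing it.
orphaned_asks = prompt_asks - set(criterion_to_ask.values())

# Surprise criterion: a rubric item that traces to no prompt ask.
surprise_criteria = set(criterion_to_ask.values()) - prompt_asks

assert not orphaned_asks and not surprise_criteria
```

If either set is non-empty, fix the prompt or the rubric before sign-off.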
3. RESPONSE RUBRIC
One claim per item, pass/fail, with a point weight
All numeric criteria include ground truth values and explicit tolerances
Observable verbs only (States, Computes, Lists) — no subjective language
Rubric language matches the actual legal mechanism (unavailable ≠ zero rate ≠ zero after proration)
Sources are primary (statute, regulation, agency guidance); citations match the credit path in the prompt
Inflation-adjusted rates and state credits verified for the PIS year
Justifications show math or logic, not restatements
Negative items catch real model mistakes; don't stack multiple penalties on the same error; the total score is floored at zero, never negative
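The scoring mechanics above (pass/fail items with point weights, explicit numeric tolerances, and a zero floor for negative items) can be sketched as follows. This is an illustrative model, not the actual grader; the function names and item format are assumptions:

```python
# Hypothetical sketch of the rubric scoring rules described above.

def within_tolerance(value, ground_truth, tolerance):
    """Pass/fail check for a numeric criterion with an explicit tolerance."""
    return abs(value - ground_truth) <= tolerance

def score_rubric(items):
    """items: list of (met, points) pairs. A positive item awards its points
    when its claim is met; a negative item's penalty (points < 0) applies
    when the mistake it targets occurs. The total is floored at zero."""
    total = sum(points for met, points in items if met)
    return max(total, 0.0)
```

For example, one passed 2-point item plus one triggered 5-point penalty scores 0.0, not -3.0 -- penalties can wipe out credit but never drive the task score negative.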
4. CHAIN OF THOUGHT RUBRIC
Tests reasoning, not outputs — if it reads like an RR item, move it
No duplicates of RR items; no CoT items for negative RR criteria
Don't test reasoning for credit paths that are ineligible for the component
Cognitive verbs only: "Explains why," "Evaluates whether" — not "States" or "Calculates"
5. DIFFICULTY & SCORING
Average pass rate across target models is below ~60%
Difficulty comes from the scenario, not artificial constraints
Total positive and negative points calculated; validated against target models
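The totals and difficulty target above reduce to simple arithmetic. A minimal sketch with made-up weights and run scores (the numbers are illustrative, not from a real task):

```python
# Hypothetical point totals and pass-rate check for section 5.

# Point weights for each rubric item; negatives are penalty items.
weights = [3.0, 2.0, 1.0, -2.0]
total_positive = sum(w for w in weights if w > 0)  # points available
total_negative = sum(w for w in weights if w < 0)  # maximum penalty

def average_pass_rate(run_scores, max_score):
    """Mean fraction of available points earned across validation runs."""
    return sum(s / max_score for s in run_scores) / len(run_scores)

# Example: four validation runs against target models.
runs = [2.0, 4.0, 3.0, 4.2]
rate = average_pass_rate(runs, total_positive)
assert rate < 0.60  # within the target difficulty band
```

If the average pass rate lands above the target, add difficulty through the scenario itself rather than bolting on artificial constraints.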
6. FINAL VALIDATION
Tested against latest target models; grading produces expected pass rates