Realm platform (log in using expert.micro1.ai email through Google)
STEP #5 — BUILD THE RESPONSE RUBRIC
Score only what the final answer explicitly states; this rubric grades the model's final output, not how it got there. You can see an example of a completed rubric here.
The Response Rubric evaluates final outputs only, not reasoning.
What to include (final outputs only)
Credit strategy selections (ITC vs PTC; bonus eligibility yes/no)
Final numeric values (credit rates, credit dollars, kWh, payback, IRR, NPV, audit score)
Explicit assumptions (electricity rate, discount rate, project life, escalation, degradation, O&M)
Explicit compliance actions (IRS forms, documentation and substantiation artifacts)
What NOT to include
Reasoning steps
Comparisons or tradeoff logic
Justifications or explanations
“Why” language or narrative analysis
(Those belong in the Chain-of-Thought Rubric.)
Mandatory rules
Response Rubric criteria must specify concrete, observable outputs in the final response.
Avoid verbs such as "Assumes" or "Considers"; they describe internal reasoning and cannot be verified in a final output.
They should be fully specified outputs, not vague references.
One criterion = one claim
No stacking. If more than one claim is present, split into multiple criteria.
Binary only
Each criterion must be clearly satisfiable as true or false.
Self-contained
A grader must be able to evaluate the criterion using only:
The task prompt
The model’s final answer
The criterion text itself.
Numeric checks require tolerances
All numeric criteria must include explicit tolerances (typically ±0.5%–1% in the same units).
Avoid “approximately” unless paired with a tolerance.
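As an illustration, a grader's tolerance check can be sketched as below; the criterion text and dollar figures are hypothetical, not drawn from any real task:

```python
def within_tolerance(stated: float, reference: float, pct: float = 1.0) -> bool:
    """True if the stated value falls within ±pct% of the reference value."""
    return abs(stated - reference) <= abs(reference) * pct / 100.0

# Hypothetical criterion: "States year-1 gross savings of $48,000 (±1%)"
assert within_tolerance(47_800, 48_000)       # off by 200, inside the ±480 band
assert not within_tolerance(46_000, 48_000)   # off by 2,000, outside the band
```

A criterion written only as "approximately $48,000" gives the grader no such band, which is why the tolerance must be explicit.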
Structure (per criterion)
Score (positive or negative)
Type ∈ {Tax, Compliance, Financial, Audit Risk}
Criterion (single observable claim)
Source (primary reference)
Quote (short supporting excerpt)
Justification (why this output is required and how it ties to the prompt)
Minimum requirements
20+ Response Rubric criteria
Include negative (penalty) criteria for serious errors
Each Response Rubric criterion must include:
Score – consistent scale (for example, 1–5 or 1–10). Scores can be negative.
Type – Tax, Compliance, Financial, or Audit Risk.
Criteria – the single binary output claim with explicit tolerance where applicable.
Source – direct URL landing on the evidence itself (prefer IRS/Treasury/DOE/NREL primary sources).
Quote – short phrases that justify the criterion (1–2 short phrases).
Justification – tie the criterion to the prompt ask, and if it is a numeric check, show the formula and compute the reference value from the prompt inputs and stated assumptions.
Minimum 20 Response Rubric criteria for a typical task to ensure adequate granularity.
Convert each prompt requirement into multiple atomic checks so partial credit is possible.
Prefer output checks over method checks. If you require formulas or intermediate numbers, encode them as explicit output items (for example, “states year-1 gross savings = kWh × rate”).
Include several high-quality negative items.
Avoid criteria that require the grader to do new research.
Avoid criteria that reference other rubric items.
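To illustrate how a Justification can show the formula and compute the reference value from the prompt inputs, a minimal sketch (all figures hypothetical, not from any real task):

```python
# Hypothetical prompt inputs (illustrative only):
year1_kwh = 400_000      # projected year-1 production, kWh
rate_per_kwh = 0.12      # stated electricity rate, $/kWh

# Reference value for a criterion such as
# "States year-1 gross savings = kWh × rate = $48,000 (±1%)"
year1_gross_savings = year1_kwh * rate_per_kwh
assert year1_gross_savings == 48_000.0
```

The criterion itself then asks only whether the final answer states the $48,000 figure within tolerance; the computation lives in the Justification so the grader never has to derive it.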
Validate with Rhea after completion and fix any invalidations. You can see an example of a structured rubric here.
Purpose
The CoT Rubric scores the reasoning steps that appear inside the final answer text. You can see an example of a completed CoT Rubric here.
It evaluates whether required reasoning steps are explicitly performed, not whether the final conclusion is optimal.
Each CoT criterion checks for the presence of a specific reasoning action in the answer.
Go to the Chain of Thought Rubric tab for your task.
One rubric criterion per entry.
Fields: Score, Type, Criteria, Source, Quote, Justification.
Use Raw Text unless LaTeX is needed for a formula.
After completing the rubric, validate with Rhea (the option under the model names for the tab). If invalidated, fix and re-validate.
The CoT Rubric scores:
Formula selection and application
Comparison logic (e.g., ITC vs PTC)
Timing logic (e.g., tax capacity, carryforward)
Structuring logic (e.g., alternative configurations)
Compliance reasoning (why a rule applies)
Financial construction logic (how payback, IRR, NPV are built)
Risk reasoning (why audit risk is higher or lower)
These must appear explicitly in the answer text.
What the CoT Rubric Does NOT Score:
Do NOT include the following in the CoT Rubric:
Final numeric outputs (rates, dollars, NPV values)
Final selections (chosen credit path, chosen alternative)
Narrative summaries
Writing quality or clarity
Repetition of final answers already graded in the Response Rubric
Those belong only in the Response Rubric.
6.2 Structure and Sources
Each CoT rubric criterion must include:
Score – integer points (positive or negative), scaled by importance of the reasoning step.
Type – one of Tax, Compliance, Financial, or Risk.
Criteria – a binary description of one reasoning step the model should take. Start with a neutral, observable verb such as: Explains, Describes, Identifies, States, Computes, Quantifies, Connects, Compares.
Express only one reasoning idea per criterion.
Make the criterion self-contained using the [1]–[20] labels and restating the scenario facts as needed.
Source – reference from the Resource List (statute, notice, form, NREL resource, IRS PWA page, etc.).
Quote – short supporting excerpt.
Justification – why this reasoning step matters.
Aim for 20+ criteria covering all asks. Include both positive criteria (positive score) and negative criteria (negative score as a penalty).
Sanity check:
Tax Strategy CoT:
ITC vs PTC comparison with explicit formulas and inputs.
Basis selection and bonus stacking reasoning.
Tax capacity and carryforward logic.
Structuring of alternative configuration(s).
Compliance CoT:
Energy Community and Low-Income reasoning, including documentation.
PWA reasoning and how it affects base credit rate.
Domestic Content threshold and evidence reasoning.
Audit risk reasoning, including how alternative configuration changes risk.
Financial Projection CoT:
Savings calculations and escalation/degradation logic.
Payback logic.
IRR and NPV construction logic.
Base versus alternative NPV reasoning.
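The financial constructions listed above can be sketched end to end; every input below (savings, rates, horizon, costs) is a hypothetical illustration, not a prescribed model:

```python
def cash_flows(year1_savings, years, escalation, degradation, annual_om):
    """Net savings per year: the rate escalates, production degrades, O&M is subtracted."""
    return [
        year1_savings * (1 + escalation) ** t * (1 - degradation) ** t - annual_om
        for t in range(years)
    ]

def npv(rate, flows, initial_cost):
    """Discount each year's net savings and subtract the upfront cost."""
    return -initial_cost + sum(cf / (1 + rate) ** (t + 1) for t, cf in enumerate(flows))

def simple_payback(flows, initial_cost):
    """First year in which cumulative undiscounted savings cover the cost."""
    cumulative = 0.0
    for year, cf in enumerate(flows, start=1):
        cumulative += cf
        if cumulative >= initial_cost:
            return year
    return None  # never pays back within the horizon

def irr(flows, initial_cost, lo=0.0, hi=1.0, tol=1e-6):
    """Bisection on NPV; assumes exactly one sign change between lo and hi."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(mid, flows, initial_cost) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical inputs, not from any real task:
flows = cash_flows(48_000, years=25, escalation=0.025, degradation=0.005, annual_om=5_000)
print("payback year:", simple_payback(flows, initial_cost=350_000))   # 8
print("NPV @ 7%:", round(npv(0.07, flows, initial_cost=350_000)))
print("IRR:", round(irr(flows, initial_cost=350_000), 4))
```

A CoT criterion checks that the answer explicitly walks through each construction (escalated/degraded savings, payback, NPV, IRR), not that it reproduces these particular numbers; those final values belong to the Response Rubric.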
Negative-weight CoT criteria examples (type depends on nature of failure):
Treats Domestic Content as automatic with no mention of thresholds or evidence.
Ignores PTC entirely when the task explicitly requires an ITC vs PTC comparison.
Ignores an explicit constraint from the prompt (for example, budget type [15] or stated preference [20]).
Omits NPV reasoning despite the prompt requiring NPV outputs.
Reasons in ways that contradict cited sources (for example, claiming eligibility under an inapplicable category).
Claims impermissible double benefits without acknowledging the issue.
Good places to add complexity:
Recognizes that PWA affects the base credit rate, not just bonuses.
Recognizes that Domestic Content may be value-maximizing even with higher documentation burden, and explicitly integrates that into the reasoning.
Recognizes when a slightly worse NPV alternative could be acceptable due to materially lower audit risk, if consistent with the client preference embedded in [20].
Any other reasoning you deem to be important based on your expert approach.
Validate with Rhea after completion and fix any invalidations. You can see an example of a structured rubric here.
You can now proceed to the next steps here!