This guide explains how to create one complete Arden task: a realistic clean-energy project, a tax-credit optimization question, and the rubrics used to evaluate model answers and reasoning.
You define the project → the model responds → we score correctness and reasoning.
Tasks in this scope ask the model to determine optimal tax credit treatment. The model must produce:
Recommended federal credit path (ITC vs. PTC or other applicable statutes)
Justification for the selected credit path
Determination of bonus eligibility (Energy Community, Domestic Content, Low-Income, if applicable)
Base credit rate and any bonus adders, stated separately
Final blended credit rate
Total credit value in dollars
Identification of eligible versus non-eligible basis, if relevant
Analysis of tax capacity and timing (year-one use versus carryforward)
Scope Boundary: Tax Strategy tasks are limited exclusively to tax credit eligibility, credit rates, eligible basis, and carryforward treatment. Do not ask the model to compute financial projections such as NPV, IRR, payback period, or bill savings. Those belong in Financial Projection tasks, which are a separate scope.
Step 1: Confirm Assignment, Create Task on Realm, Pick a Building (15–30 min)
Step 2: Fill All 13 Mandatory Inputs (30 min–1 hr)
Step 3: Write the Narrative Task Prompt (1.5–2 hrs)
Step 4: Incorporate Energy Engineer Feedback (15–45 min)
Step 5: Build the Response Rubric with 20+ criteria (2 hrs)
Step 6: Build the CoT Rubric with 20+ criteria (2 hrs)
Step 7: Run Model Tests and Score to confirm <60% threshold (2–3 hrs)
Step 8: Final Submission (30 min)
Note on Time Estimates: These estimates assume familiarity with the process. Your first few tasks will likely take longer as you learn the tools, research workflows, and rubric design patterns. This is expected — for your first task, prioritize learning the process over speed. It's more valuable to produce one well-constructed task than to rush through it.
The goal is to create tasks that are realistically challenging, not adversarially difficult. Tasks should test whether the model can perform careful, expert-level tax analysis on plausible scenarios.
Avoid:
Introducing gotchas or obscure statutory footnotes to force failures
Creating cascading errors where one mistake causes total failure across criteria
Brittle interpretations that penalize reasonable alternative readings
Edge cases where even experts would disagree
Including inputs or bonus categories that do not affect the specified credit path (e.g., Domestic Content for §25D tasks); omitting these keeps the prompt and rubric fully matched.
Instead, aim for:
Realistic scenarios that require careful analysis
Multiple bonuses to evaluate (some applicable, some not) — but only bonuses relevant to the credit path
Tax capacity constraints that require carryforward reasoning
Partial eligible basis that requires cost allocation
Clear right answers that reward thorough reasoning
Complexity in the prompts should come from realistic depth.
When a scenario could support multiple credit paths (e.g., mixed-use property that could be analyzed under §25D or §48), explicitly constrain the analysis to one path in the prompt. This ensures a clear match between the prompt and rubric — multiple defensible interpretations make fair scoring impossible.
For example, if a property has partial business use (like short-term rental), you might write: "Analyze this project under §25D, addressing how partial business use affects eligibility and credit calculation. Do not analyze under §48 or §48E."
The model can still be asked to reason about complexities (like partial residential use), but the credit path itself should be specified.
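To make that concrete, below is a minimal sketch of the §25D partial-business-use allocation the model might be asked to reason through. Under §25D(e)(7), if less than 80% of an item's use is nonbusiness, only the nonbusiness share of the expenditure is taken into account; at 80% or more nonbusiness use, the full expenditure counts. The dollar figures and the time-based use split are illustrative assumptions, not values from any specific task.

```python
def eligible_expenditure_25d(total_cost: float, business_use_fraction: float) -> float:
    """Apply the §25D(e)(7) allocation rule for partial business use.

    If nonbusiness use is at least 80%, the full expenditure counts;
    otherwise only the nonbusiness share is taken into account.
    """
    nonbusiness = 1.0 - business_use_fraction
    if nonbusiness >= 0.80:
        return total_cost            # full expenditure eligible
    return total_cost * nonbusiness  # allocate to nonbusiness use only

# Illustrative (hypothetical) numbers: a $40,000 system on a home used as a
# short-term rental 35% of the time -> 65% nonbusiness -> $26,000 eligible.
print(eligible_expenditure_25d(40_000, 0.35))  # 26000.0
```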
Step 1: Confirm Assignment, Create Task on Realm, Pick a Building
Estimated time: 15–30 minutes
Log in to the Realm platform using Google and your expert.micro1.ai email.
Create a new task and name it using the format: [State Abbreviation]-[Project Type]-TAX-### (for example, CA-Residential-TAX-001).
Confirm the parameters you've been given: U.S. state and project type (residential, commercial, multifamily, edge case).
These are not rigid assignments. If you have a combination you're particularly excited to research or where you have more expertise, you may adjust accordingly—just notify us ahead of time.
Pick a property that is believable and flexible enough to support the tax complexity you want.
Use mapping and listing tools such as Google Maps, Street View, Zillow, Redfin, or LoopNet to find a building that matches your assignment, has reasonable size, age, and layout, and has roof geometry that could realistically host PV. Avoid famous buildings or landmarks; choose something typical.
Adding complexity through property selection: For commercial or multifamily projects, you can introduce multiple tenants with different load profiles, include mixed-use space, or use partial roof availability to constrain PV sizing.
Make assumptions about additional characteristics that create tax complexity.
Consider Energy Community status, Low-Income Community status, Domestic Content applicability, and complex ownership structures.
Census Tract ID: Use Census Bureau Geocoder at geocoding.geo.census.gov/geocoder (Address → Geographies). Optionally confirm with FFIEC Census Geocoder. Record the full 11-digit census tract ID.
Energy Community Status: Use DOE/NETL Energy Community Map. Reference IRS Notice 2023-29 for definitions. Reference IRS Energy Communities FAQs.
Low-Income Community Status (if relevant): Use CDFI Fund CIMS Mapping Tool. Use DOE Low-Income Communities Bonus Credit Map.
Utility Service Territory: Use EIA Electric Retail Service Territories to identify the serving utility. Use OpenEI Utility Rate Database for rate structures.
If you cannot reasonably satisfy a required status, mark [ESCALATE] in your notes and explain whether to use a different location or relax the requirement.
Adding complexity through location: Have the model evaluate multiple bonus categories where some apply and some do not. For example, Energy Community = No, Domestic Content = conditional on procurement documentation, Low-Income Community = No. The model must check each rather than assume any.
Step 2: Fill All 13 Mandatory Inputs
Estimated time: 30 minutes–1 hour
Every Tax Strategy task requires 13 inputs organized into three categories:
Location [1]–[5]: 5 inputs
Technology [6]–[7]: 2 inputs
Financial [8]–[13]: 6 inputs
You may use an LLM to generate non-critical numeric details as a starting point, but you must sanity-check them and add complexity that makes the scenario realistic. You remain responsible for coherence.
[1] Street Address and Coordinates
Choose a real address from listing sites or maps. In Google Maps, right-click and select "What's here?" then copy latitude and longitude. Format: [Full Address] ([Latitude], [Longitude])
[2] Census Tract ID
Use Census Bureau Geocoder. Record the full 11-digit GEOID.
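The GEOID's fixed structure makes it easy to sanity-check: a 2-digit state FIPS code, then a 3-digit county FIPS code, then a 6-digit tract code (the last two digits act as an implied decimal suffix). A minimal sketch; the example GEOID is a hypothetical tract in Harris County, TX:

```python
def split_tract_geoid(geoid: str) -> dict:
    """Decompose an 11-digit census tract GEOID into its FIPS components."""
    assert len(geoid) == 11 and geoid.isdigit(), "expected 11-digit tract GEOID"
    return {
        "state_fips": geoid[:2],    # 2-digit state (48 = Texas)
        "county_fips": geoid[2:5],  # 3-digit county (201 = Harris County)
        "tract_code": geoid[5:],    # 6-digit tract (311500 -> tract 3115.00)
    }

# Hypothetical example GEOID:
print(split_tract_geoid("48201311500"))
```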
[3] Energy Community Status
State Yes or No. If Yes, specify the category (fossil fuel employment, coal closure). Include rationale referencing the tract and applicable guidance.
[4] Utility Service Territory
Identify the serving utility. Note relevant rate structure if it affects the scenario.
[5] Property Characteristics
Include only characteristics that affect energy consumption, PV hosting capacity, and site constraints: building type, size in square feet, roof type (flat vs. pitched), age, lot size, and relevant obstructions or constraints.
[6] PV Capacity (kW DC)
Use NREL PVWatts at pvwatts.nrel.gov to size appropriately. Consider roof constraints, load offset goals, and budget.
[7] Expected Annual PV Generation (kWh)
Derive from PVWatts using your coordinates from [1].
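As a sanity check on the PVWatts number, annual generation should land near DC capacity times a region-dependent specific yield; roughly 1,100–1,700 kWh per kW-DC per year covers most of the continental U.S. after typical system losses. A minimal sketch, with the yield band as a stated assumption (PVWatts remains the authoritative figure):

```python
def sanity_check_generation(kw_dc: float, annual_kwh: float,
                            yield_low: float = 1_100,
                            yield_high: float = 1_700) -> bool:
    """Return True if generation falls in a plausible specific-yield band.

    yield_low/yield_high are assumed rough bounds in kWh per kW-DC per year;
    adjust for the site's actual solar resource.
    """
    return yield_low * kw_dc <= annual_kwh <= yield_high * kw_dc

print(sanity_check_generation(8.5, 12_400))  # True: ~1,459 kWh/kW-yr
```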
Note: For more complex projects involving battery storage, heat pumps, or EVSE, you may add optional technology inputs labeled [6a], [6b], etc. Document these clearly in your legend.
[8] Total Installed Cost (USD)
Use NREL and DOE cost benchmarks to anchor estimates. Include equipment, labor, and soft costs.
[9] Budget Type
Specify Hard Cap or Flexible Range. This affects whether alternative configurations can exceed budget.
[10] Tax Capacity Estimate (USD/year)
The client's estimated annual federal income tax liability. Align with the narrative's income profile. This determines whether year-one utilization is possible or carryforward is needed.
Adding complexity through tax capacity: Use tax capacity that is binding but not impossible. For example, if the expected credit is $5,550 and tax liability is $4,000, the model must recognize partial year-one use plus carryforward. This tests careful analysis without creating an adversarial scenario.
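Here is a minimal sketch of that year-one/carryforward split for a nonrefundable credit with carryforward (as under §25D), using the example figures above:

```python
def split_credit(credit: float, tax_liability: float) -> tuple[float, float]:
    """Split a nonrefundable credit into year-one use and carryforward."""
    year_one = min(credit, tax_liability)
    carryforward = credit - year_one
    return year_one, carryforward

# Example from this section: a $5,550 credit against $4,000 of tax liability.
print(split_credit(5_550, 4_000))  # (4000, 1550): $4,000 used, $1,550 carried forward
```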
[11] Credit Strategy Preference
Express the client's decision rule or preference.
[12] Contract Signing Date
The date the contract is executed. Format as YYYY-MM-DD.
[13] Placed-in-Service Date
The date the system becomes operational. This determines which tax rules apply.
Step 3: Write the Narrative Task Prompt
Estimated time: 1.5–2 hours
Turn your inputs into a coherent story. Use bracketed references throughout.
Opening Context
You are advising a [Project Type] in [City, State] with the goal of developing an optimal tax credit strategy. Base your analysis on the Internal Revenue Code as enacted by the Inflation Reduction Act of 2022, applicable Treasury regulations, and IRS guidance in effect as of the placed-in-service date [13].
Location Block
The property is located at [1], in census tract [2]. This tract [does/does not] qualify for Energy Community status [3] based on the definition and mapping data applicable as of [13]. The property is served by [4].
Here are the property's characteristics: [5].
Technology Block
The owner is planning a comprehensive energy upgrade that includes a roof-mounted solar PV system sized at [6], designed to produce approximately [7] per year based on the site's solar resource and typical system losses.
Financial Block
The overall turnkey project is quoted at a total installed cost of [8]. This is a [9] for the homeowner.
The client's estimated annual federal income tax liability is [10]. If there is ever a need to choose between options of equivalent value, the homeowner's credit strategy preference is [11].
Assume the contract is signed on [12], and the project is placed in service on [13]. Energy Community status [does/does not] apply for the placed-in-service year [3].
At the end of the narrative, include this instruction block:
For every key recommendation or numeric output, briefly explain how you derived it and how it relates to the inputs [1]–[13]. Base all determinations only on statutory rates, thresholds, eligibility rules, and credit mechanics that were in effect as of the placed-in-service date [13]. This task is limited exclusively to tax credit eligibility, credit rates, eligible basis, and carryforward treatment.
Important: Do not include asks for financial projections (NPV, IRR, payback, bill savings) in Tax Strategy tasks. Keep the scope limited to tax analysis.
Important: Only ask about bonus categories that are relevant to the specified credit path. For example, Domestic Content does not affect §25D residential credits, so do not include it in §25D tasks. Including irrelevant bonuses creates confusion and accidental failure modes.
Potential Asks:
Recommended Federal Credit Path: Identify the applicable federal tax credit path (such as §25D residential credit, §48 investment credit, or other applicable statutes) under regulations in effect as of [13]. For edge cases with multiple possible paths, explicitly specify which path to analyze.
Bonus Eligibility Determination: Evaluate whether any bonus or adder provisions are statutorily applicable. Only include bonuses relevant to the specified credit path (e.g., Energy Community for §48; do not ask about Domestic Content for §25D).
Credit Rate Breakdown: State the applicable base credit rate, any bonus adders (stated separately), and the final blended credit rate. (A worked sketch of this arithmetic appears after this list of asks.)
Total Credit Value: Quantify the total federal tax credit value in dollars based on the eligible portion of the total installed cost [8].
Eligible vs. Non-Eligible Basis: Identify which components of the project qualify for the credit and which do not, if relevant.
Tax Capacity & Timing Analysis: Evaluate how the client's annual tax liability [10] interacts with the credit value, determining year-one utilization and any carryforward treatment permitted under rules in effect for the placed-in-service year.
At the end, provide a short summary of the recommended strategy.
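For reference while drafting these asks, here is a minimal sketch of the rate-breakdown and credit-value arithmetic they require. The 30% base rate, the 10-point Energy Community adder (which assumes the project qualifies for the full §48 base rate, e.g., prevailing wage and apprenticeship requirements met), and the eligible basis are illustrative placeholders, not values for any specific task:

```python
def credit_value(eligible_basis: float, base_rate: float, adders: list[float]) -> dict:
    """Compute blended rate and total credit from a base rate plus bonus adders.

    Adders stack additively on the base rate (e.g., a 10-percentage-point
    Energy Community adder on a 30% §48 base yields a 40% blended rate).
    """
    blended = base_rate + sum(adders)
    return {
        "base_rate": base_rate,
        "adders": adders,
        "blended_rate": blended,
        "credit_dollars": eligible_basis * blended,
    }

# Illustrative: $200,000 eligible basis, 30% base, +10% Energy Community adder.
print(credit_value(200_000, 0.30, [0.10]))  # blended 40% -> $80,000 credit
```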
At the end of every prompt, include a legend mapping all 13 inputs:
Legend:
[1] [Full Address] ([Latitude], [Longitude]) – street address and coordinates
[2] [11-digit GEOID] – census tract ID
[3] [Yes/No] – Energy Community status and rationale
[4] [Utility Name] – utility service territory
[5] [Property details] – property characteristics
[6] [X] kW – PV capacity
[7] [X] kWh – expected annual PV generation
[8] [X] USD – total installed cost
[9] [Hard Cap/Flexible Range] – budget type
[10] [X] USD – tax liability estimate
[11] [Preference statement] – credit strategy preference
[12] [YYYY-MM-DD] – contract signing date
[13] [YYYY-MM-DD] – placed-in-service date
Note: For projects involving additional technologies (battery storage, heat pumps, EVSE), add optional inputs and include them in this legend with clear descriptions.
Validate with Rhea on the Realm platform.
If Rhea invalidates the prompt, make suggested changes and re-validate.
Once validated, submit for Review.
An energy expert will conduct an asynchronous feasibility review.
Your prompt passes only after engineering approval.
After approval, rubric steps are unlocked.
Step 4: Incorporate Energy Engineer Feedback
Estimated time: 15–45 minutes
After engineering review:
Update Task Prompt: If system sizes, production, or technology mix change, update legend values, narrative references, and any downstream tax implications.
Resolve Disagreements: If you disagree with a suggested change that materially affects the scenario, mark it [ESCALATE], bring in the HD team for a decision, and do not override feasibility concerns unilaterally.
Final Coherence Check: Confirm prompt and legend are consistent. Confirm [ESCALATE] items are resolved or documented. Re-validate with Rhea.
Step 4.5: Quick Difficulty Check (Optional)
Before starting the Response Rubric, run your prompt through GPT and Claude. Read the responses and count how many of your asks each model answers correctly.
If a model gets more than half your asks right on the first try, your prompt probably needs more complexity. It's much faster to fix the prompt now than to rebuild the rubric later.
Step 5: Build the Response Rubric
Estimated time: 2 hours
The Response Rubric scores final outputs only, not reasoning. It evaluates what the model's answer explicitly states.
What to Include: Credit path selection, bonus eligibility (yes/no for each category), credit rates (base and bonuses separately), final blended rate, credit dollars, eligible basis amounts, carryforward amounts
What NOT to Include: Reasoning steps, comparisons or tradeoff logic, justifications or explanations, "why" language or narrative analysis. Those belong in the Chain-of-Thought Rubric.
One Criterion = One Claim. No stacking multiple checks with "and/or." If you want to check two things, use two rows.
Binary Only. Each criterion must be satisfiable as true or false. No partial credit within a single row.
Self-Contained. A grader must evaluate the criterion using only the task prompt, the model's final answer, and the criterion text itself.
Numeric Checks Require Tolerances. All numeric criteria must include explicit tolerances, typically ±0.5%–1% (see the sketch after this list).
Neutral, Observable Verbs. Start each criterion with States, Mentions, Identifies, Computes, Quantifies, Provides, or Assigns. Avoid subjective language such as "properly," "clearly," "thoroughly," "key," or "significant."
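A minimal sketch of how a binary numeric criterion with a tolerance resolves to true or false (the reference value and ±1% tolerance are illustrative):

```python
def numeric_criterion_satisfied(stated: float, reference: float,
                                tol: float = 0.01) -> bool:
    """Binary check: is the stated value within ±tol of the reference?"""
    return abs(stated - reference) <= tol * abs(reference)

# Illustrative criterion: "Quantifies total credit within ±1% of $5,550".
print(numeric_criterion_satisfied(5_600, 5_550))  # True  (off by $50, band is ±$55.50)
print(numeric_criterion_satisfied(5_700, 5_550))  # False (off by $150)
```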
Each criterion requires these fields:
Score — Integer points (positive or negative)
Type — Tax
Criterion — Single observable claim with tolerance if numeric
Source — Primary reference URL (IRS, Treasury, DOE, NREL)
Quote — Short supporting excerpt (1–2 phrases)
Justification — Why this output is required; for numeric checks, show formula and reference value
Minimum 20 Response Rubric criteria.
Convert each prompt requirement into multiple atomic checks. This allows the overall score to reflect performance across the task — a model that satisfies 15 of 20 criteria scores better than one that satisfies 5 of 20, even though each individual criterion is binary.
Include negative (penalty) criteria for serious errors, but avoid cascading penalties where one mistake causes multiple deductions.
Avoid criteria requiring the grader to do new research.
Avoid criteria that reference other rubric items.
Example Positive Criteria:
States the applicable IRC section for the recommended credit path.
Identifies the base credit rate as a percentage.
States whether Energy Community bonus applies.
States whether Domestic Content bonus applies.
States whether Low-Income Community bonus applies.
Quantifies the total credit value in dollars within ±[tolerance] of [reference value].
Identifies year-one credit utilization amount.
Identifies carryforward amount if applicable.
Example Negative Criteria:
Claims Energy Community bonus when [3] explicitly states No.
States a credit rate that does not exist under any applicable statute.
Applies commercial ITC rules to a residential project without justification.
Step 6: Build the CoT Rubric
Estimated time: 2 hours
The CoT Rubric scores reasoning steps that appear in the answer text. It evaluates whether required reasoning actions are explicitly performed, not whether conclusions are optimal.
What to Score: Credit path comparison logic, eligibility analysis for each bonus category, basis allocation reasoning, tax capacity and carryforward logic
What NOT to Include: Final numeric outputs (those go in Response Rubric), final selections, narrative summaries, writing quality, repetition of answers already graded in Response Rubric
Each criterion requires:
Score — Integer points (positive or negative)
Type — Tax
Criterion — Binary description of one reasoning step
Source — Reference from Resource List
Quote — Short supporting excerpt
Justification — Why this reasoning step matters
Approved Verbs for CoT Criteria: Explains, Describes, Identifies, States, Computes, Quantifies, Connects, Compares, Evaluates, Considers
Express only one reasoning idea per criterion. Make criteria self-contained using [1]–[13] labels.
Minimum 20 CoT criteria covering all prompt asks.
Prioritize diversity of reasoning types over sheer criterion count. A good CoT rubric covers different kinds of reasoning — eligibility analysis, statutory interpretation, numerical computation, timing logic — rather than twenty variations of the same check. Aim for broad coverage across the reasoning skills the task requires.
Design criteria so that the rubric captures different aspects of reasoning independently. A model that correctly analyzes 3 of 4 bonus categories should satisfy more criteria than one that skips the analysis entirely — this is achieved by having separate criteria for each bonus category, not by partial scoring within a single criterion.
Example Positive CoT Criteria:
Explains why a specific IRC section applies to this project type.
Evaluates Energy Community eligibility based on [3].
Evaluates Domestic Content eligibility based on project specifics.
Evaluates Low-Income Community eligibility based on census tract [2].
Evaluates whether tax liability [10] is sufficient for year-one utilization.
Describes the carryforward rules applicable to the credit type.
Explains which costs are included in eligible basis.
Step 7: Run Model Tests and Score
Estimated time: 2–3 hours
Evaluate four LLM responses on your task prompt.
Use your rubrics to grade each response.
Confirm all models score below 60% of total possible points.
All four models must score below 60% on both rubrics. This threshold confirms the task is sufficiently challenging — it is a validation gate, not a design target.
Generate Model Outputs: Run your prompt through four models externally: GPT 5.2, Claude Opus 4.5, Gemini 3 Pro, and Llama 4, using the exact same prompt for each. Purchase deep-research access for these models if needed (you will be reimbursed). Copy the responses into the "Evaluate Models" section on Realm.
Score with Rubrics: For the Response Rubric, Rhea will auto-assess each criterion; double-check Rhea's assessments, correct any that are wrong, and mark each row as satisfied or not. For the CoT Rubric, Rhea cannot assess the criteria; evaluate each one manually and mark each row as satisfied or not.
Calculate Scores: For each model, compute Response Score = (Awarded Points / Total Points) × 100 and CoT Score = (Awarded Points / Total Points) × 100. A minimal scoring sketch appears after the threshold check below.
Check 60% Threshold: If any model scores 60% or higher, do not down-score retroactively — revisit the task to ensure it reflects realistic expert-level analysis, then rerun the model tests.
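A minimal sketch of the score calculation and threshold gate. The handling of negative-point criteria is an assumed convention here (penalties subtract from awarded points; total possible points counts positive criteria only); confirm against the platform's actual scoring if it differs:

```python
def rubric_score(criteria: list[tuple[int, bool]]) -> float:
    """Score = (awarded points / total possible points) x 100.

    Each criterion is (points, satisfied). Assumed convention: negative-point
    (penalty) criteria subtract when triggered; total possible points sums
    the positive-point criteria only.
    """
    awarded = sum(pts for pts, hit in criteria if hit)
    total_possible = sum(pts for pts, _ in criteria if pts > 0)
    return 100.0 * awarded / total_possible

# Illustrative: three positive criteria (two satisfied) plus one triggered penalty.
score = rubric_score([(2, True), (2, True), (2, False), (-1, True)])
print(score)       # 50.0
print(score < 60)  # True -> this model passes the <60% validation gate
```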
Step 8: Final Submission
Estimated time: 30 minutes
Task prompt is approved (post-engineering review).
Response Rubric has 20+ criteria including negatives.
CoT Rubric has 20+ criteria including negatives.
Both rubrics cover every prompt ask.
All four models score below 60% on both rubrics.
Legend is complete and matches narrative.
All [ESCALATE] items are resolved.
You have gone through the Checklist for Signed-Off Tasks.
Submit your finalized task on Realm.
Appendix A: Realistic Complexity
The goal is realistic difficulty that tests expert-level reasoning, not adversarial traps.
Good complexity comes from:
Tax capacity constraints that require carryforward reasoning
Multiple bonus categories to evaluate (some applicable, some not)
Partial eligible basis that requires cost allocation
Property types that require determining which credit applies
Ownership structures that affect the analysis
Multi-year credit utilization modeling or partial eligible basis identification
Partial-use proration (day-counting)
Nonrefundable credit ordering with pre-existing credits
Cross-year credit allocation for different PIS tax years
Mixed residential/business use allocation with dual credit paths
§25C no-carryforward vs §25D carryforward ordering strategy
QMID documentation requirements for §25C
§30C census tract eligibility verification
Binding tax capacity requiring partial year-one use plus carryforward
Multi-component projects with varying eligibility under same statute
Credit cap interactions (percentage vs. dollar cap vs. proration)
Entity structure effects on credit availability
Avoid:
Obscure statutory footnotes that even experts would need to look up
Ambiguous scenarios where experts would reasonably disagree
Cascading rubric failures where one mistake causes multiple deductions
Trick phrasing — difficulty should come from substance, not confusion
Resource List
IRC §25D - Residential Clean Energy Credit at law.cornell.edu/uscode/text/26/25D
IRC §48 - Energy Credit at law.cornell.edu/uscode/text/26/48
IRC §48E - Clean Electricity Investment Credit at law.cornell.edu/uscode/text/26/48E
IRS Notice 2023-29 (Energy Communities) at irs.gov/pub/irs-drop/n-23-29.pdf
IRS Notice 2023-38 (Domestic Content) at irs.gov/pub/irs-drop/n-23-38.pdf
IRS Notice 2024-41 (Domestic Content Update) at irs.gov/pub/irs-drop/n-24-41.pdf
IRS Notice 2025-08 (Domestic Content Update) at irs.gov/pub/irs-drop/n-25-08.pdf
IRS Form 5695 Instructions (Residential) at irs.gov/instructions/i5695
IRS Form 3468 Instructions (Investment Credit) at irs.gov/instructions/i3468
IRS Domestic Content Bonus Credit overview at irs.gov/credits-deductions/domestic-content-bonus-credit
IRS Energy Community Bonus Credit FAQs at irs.gov/credits-deductions/energy-community-bonus-credit
NREL PVWatts Calculator at pvwatts.nrel.gov
DOE/NETL Energy Community Map at edx.netl.doe.gov/dataset/ira-energy-community-data-layers
DOE Low-Income Communities Bonus Credit Map at energyjustice.gov
CDFI Fund CIMS Mapping Tool at cdfifund.gov/cims
OpenEI Utility Rate Database at openei.org/wiki/Utility_Rate_Database
Census Bureau Geocoder at geocoding.geo.census.gov/geocoder
FFIEC Census Geocoder at ffiec.gov/geocode
EIA Electric Retail Service Territories at eia.gov/maps