Subtask 1: Full-Report Compliance Matching
Objective: Automatically identify the relevant pages within a full ESG report that correspond to each SASB metric and verify whether the disclosures meet the specified category and unit of measure defined by the SASB guidelines.
Task Description: Each system is provided with a full ESG report (which may exceed 200 pages) and the complete list of SASB disclosure requirements. Each metric describes a specific type of expected disclosure (e.g., total greenhouse gas emissions, energy usage, governance-related discussions). The system must:
Locate the pages in the report that are relevant to each SASB metric;
Determine whether the content on those pages fulfills the disclosure requirement;
Check whether the disclosure uses the correct category (e.g., Quantitative, Discussion and Analysis) and unit of measure (e.g., metric tonnes CO₂e, percentage).
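The three steps above can be sketched as a minimal keyword-and-regex pipeline. This is only an illustrative assumption, not an official baseline: the helper names, the keyword matching, and the unit patterns are all hypothetical placeholders for a real retrieval and classification system.

```python
import re

# Hypothetical unit-of-measure patterns; real SASB metrics would need a
# fuller mapping than this two-entry sketch.
UNIT_PATTERNS = {
    "metric tonnes CO2e": r"(?:metric\s+)?tonnes?\s+CO2e?",
    "percentage": r"\d+(?:\.\d+)?\s?%",
}

def find_relevant_pages(pages, metric_keywords):
    """Step 1: return indices of pages mentioning any of the metric's keywords."""
    hits = []
    for i, text in enumerate(pages):
        lowered = text.lower()
        if any(kw.lower() in lowered for kw in metric_keywords):
            hits.append(i)
    return hits

def check_unit(page_text, unit):
    """Step 3: check whether the expected unit of measure appears on the page."""
    pattern = UNIT_PATTERNS.get(unit)
    return bool(pattern and re.search(pattern, page_text, re.IGNORECASE))

# Toy two-page "report" to exercise the pipeline.
report = [
    "Our governance framework is described in this section.",
    "Scope 1 emissions were 12,400 metric tonnes CO2e in FY2023.",
]
pages = find_relevant_pages(report, ["emissions"])
compliant = [p for p in pages if check_unit(report[p], "metric tonnes CO2e")]
```

In practice, step 1 would use dense or hybrid retrieval rather than keyword lookup, and step 2 (fulfillment of the disclosure requirement) needs a learned classifier rather than the string matching shown here.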
Challenges:
Handling long, complex documents with varied layouts.
Extracting information from both text and visual elements like charts and tables.
Dealing with multilingual content and differing ESG reporting practices across regions and industries.
Subtask 2: Single-Page Metric Verification
Objective: Given a single SASB metric and a single page from an ESG report, determine whether that page contains relevant information and, if so, whether it complies with the specified category and unit of measure.
Task Description: Each system receives a single SASB metric and a corresponding page from an ESG report. The system must:
Decide whether the metric is addressed on the given page;
If it is, verify whether the information aligns with the correct category and unit of measure specified in the SASB standard.
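The two-stage decision can be sketched as a single function. As with the full-report sketch, the function name, keyword matching, and regex-based unit check are assumptions used purely for illustration; a participating system would replace both stages with learned models.

```python
import re

def verify_page(page_text, metric_keywords, unit_regex):
    """Two-stage check: is the metric addressed on the page, and if so,
    does the disclosure use the expected unit of measure?"""
    relevant = any(kw.lower() in page_text.lower() for kw in metric_keywords)
    if not relevant:
        return {"relevant": False, "compliant": False}
    compliant = bool(re.search(unit_regex, page_text, re.IGNORECASE))
    return {"relevant": True, "compliant": compliant}

# Toy example: a page that addresses energy use with a percentage figure.
page = "Total energy consumed: 1.2 million GJ, of which 34% renewable."
result = verify_page(page, ["energy consumed"], r"\d+(?:\.\d+)?\s?%")
```

Returning a structured verdict (rather than a single boolean) mirrors the subtask's two questions, so the relevance and compliance errors can be scored separately.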
Challenges:
Limited context: only one page is available.
Requires fine-grained understanding of local content.
Useful for evaluating the model’s precision and error patterns on a micro-level.
Evaluation Metric (for both subtasks)
Both subtasks are treated as classification tasks. The primary evaluation metric is F1-score, which balances precision and recall. Additional metrics like accuracy, precision, and recall may also be reported to provide a more complete performance picture.
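For reference, the F1-score named above can be computed directly from true-positive, false-positive, and false-negative counts. This is the standard definition, shown here in plain Python:

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: 8 correct positive predictions, 2 spurious, 4 missed
# gives precision 0.8, recall 2/3, and F1 = 8/11.
score = f1_score(8, 2, 4)
```

Libraries such as scikit-learn (`sklearn.metrics.f1_score`) implement the same computation from label arrays, along with the accuracy, precision, and recall metrics mentioned above.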
AIMS-TEL is a pilot task focusing on explainable regulatory document understanding in modern slavery disclosures. Participants must assess whether required reporting elements from the International Reporting Template (Levels 1–2) are addressed in corporate statements published in response to the Modern Slavery Acts, and link each prediction to supporting evidence in the text. Systems should classify each template element as {present / partially present / absent}, provide rationales and spans, and estimate uncertainty. An optional subtask encourages template auto-population, gap detection, and longitudinal analysis (e.g., comparing promised vs. delivered actions across reporting years).
Training/dev data come from AIMS.au (ICLR-2025), comprising 5,700+ annotated Australian statements with sentence-level labels and additional annotations of content such as tables, figures, and infographics. This opens the possibility of an optional multi-modal subtask, where participants apply vision-language models to assess whether template-related information is conveyed through non-textual formats. Test-only data includes statements from UK, Australian, and Canadian public registries. Note: Both the training and testing datasets include high-level compliance annotations based on mandatory disclosure criteria under respective Modern Slavery Acts and do not contain fine-grained labels for each template element.
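A prediction for AIMS-TEL bundles a label, a rationale, evidence spans, and an uncertainty estimate per template element. The schema below is a hypothetical sketch of such an output record; the field names and types are assumptions, since the task description does not prescribe a submission format.

```python
from dataclasses import dataclass, field

# The three-way label set required by the task.
LABELS = {"present", "partially present", "absent"}

@dataclass
class ElementPrediction:
    """One prediction for one International Reporting Template element."""
    element_id: str                  # template element identifier (Levels 1-2)
    label: str                       # one of LABELS
    rationale: str                   # free-text justification for the label
    evidence_spans: list = field(default_factory=list)  # (start, end) offsets
    confidence: float = 0.0          # uncertainty estimate in [0, 1]

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label}")

pred = ElementPrediction(
    element_id="structure-operations",
    label="partially present",
    rationale="Supply chain described, but corporate structure is not.",
    evidence_spans=[(120, 184)],
    confidence=0.7,
)
```

Keeping the evidence spans as character offsets into the statement text makes the "link each prediction to supporting evidence" requirement directly checkable against the source document.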
The A3CG (Aspect–Action Analysis with Cross-Category Generalization) task focuses on analyzing sustainability disclosures in corporate ESG (Environmental, Social, and Governance) reports. Companies often present their sustainability performance in vague or exaggerated ways (a practice known as greenwashing), which makes it difficult to interpret whether reported claims reflect meaningful action. The task addresses this challenge by requiring systems to clarify sustainability statements through explicit linking of aspects (what is being addressed) and actions (how it is being addressed).
Task Aim
The aim is to provide a fine-grained representation of sustainability disclosures that distinguishes between planned commitments, implemented actions, and ambiguous claims. By making company statements more transparent, the task moves toward mitigating greenwashing patterns, even if not explicitly solving them. This helps ensure sustainability analysis is grounded in explicit actions tied to clear aspects, rather than vague or non-committal language.
Input
A sustainability statement from a company report. Each statement may contain one or more sustainability aspects.
Output
For each statement:
Identify aspect–action pairs.
Aspect: A focal sustainability entity, activity, or goal (e.g., carbon emissions, workplace diversity).
Action: The company’s level of engagement with that aspect, categorized into one of the following labels, which separate clear actions from empty claims by differentiating between commitments, executions, and ambiguous statements:
Planning: A commitment or plan to act. Addressing or engaging with an aspect here involves incorporating it into operations if it is an activity/entity, or advancing it if it is a goal/sub-area.
Implemented: An action that has already been taken to address or engage with the aspect.
Indeterminate: Vague, non-committal, or non-attributable language.
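The expected output for a statement is a set of aspect–action pairs with actions restricted to the three labels above. The following is an assumed representation for illustration; the exact output format is defined by the official data release, not by this sketch.

```python
# The three action labels defined by A3CG.
ACTIONS = {"planning", "implemented", "indeterminate"}

def make_pair(aspect, action):
    """Build one aspect-action pair, validating the action label."""
    action = action.lower()
    if action not in ACTIONS:
        raise ValueError(f"unknown action label: {action}")
    return {"aspect": aspect, "action": action}

# Example statement with one aspect and a commitment-level action:
statement = "We plan to reduce carbon emissions by 30% by 2030."
pairs = [make_pair("carbon emissions", "Planning")]
```

A statement mentioning several aspects (e.g., emissions and workplace diversity) would simply yield several such pairs, each labeled independently.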
Experimental Setups
The task involves two experimental setups for aspect–action classification:
1. Full Dataset Testing
o As a baseline, the full dataset is first split into train, validation, and test sets while maintaining balanced aspect category distributions. This full-dataset evaluation verifies training stability and provides a reference for performance before moving to the more challenging cross-category generalization setup.
2. Cross-Category Generalization
o To further strengthen robustness, A3CG introduces a cross-category generalization setting. Models are trained on seen categories but tested on unseen categories held out during training, requiring them to clarify sustainability statements in thematic areas not observed during training. This setup reflects real-world dynamics, where companies may strategically shift their reporting emphasis across ESG categories. By evaluating models on previously unencountered sustainability themes, A3CG ensures that clarification of aspect–action pairs remains effective even when reporting content changes. To facilitate this, the dataset is split into three cross-validation folds, each comprising its own set of seen/unseen categories.
o The official splits and folds for Full Dataset Testing and Cross-Category Generalization are available here: https://github.com/keanepotato/a3cg_greenwash
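The essence of the cross-category setting is that held-out categories never appear in training data. A minimal sketch of such a split (field name `category` is an assumption; the official folds come from the repository linked above):

```python
def cross_category_split(examples, unseen_categories):
    """Partition examples so held-out categories appear only in the test set."""
    train = [ex for ex in examples if ex["category"] not in unseen_categories]
    test = [ex for ex in examples if ex["category"] in unseen_categories]
    return train, test

# Toy examples, each tagged with an aspect category.
examples = [
    {"text": "We cut emissions by 12%.", "category": "emissions"},
    {"text": "We aim to improve board diversity.", "category": "diversity"},
    {"text": "Water recycling was expanded.", "category": "water"},
]
train, test = cross_category_split(examples, unseen_categories={"water"})
```

Repeating this with three different choices of `unseen_categories` yields the three-fold validation described above, with each fold defining its own seen/unseen partition.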
Notes
To provide more information on potential model design and research directions, the authors have run a series of experiments on the dataset. Full details are available here: https://aclanthology.org/2025.acl-long.723.pdf