This page explains the evaluation methods.
Annotations were made by human experts and LLM-as-a-Judge.
Human experts were divided into marketing and technical groups. The marketing group evaluated market aspects (need validity, market size), while the technical group evaluated technical aspects (validity, innovativeness, competitive advantages). In material chemistry, the same experts participated in both groups.
Evaluation methods have been updated to ensure greater consistency in annotations across all annotators.
From pairwise annotation to scoring
During the annotation process, we observed inconsistencies in pairwise comparisons made by different annotators. To address this, we changed our annotation strategy from pairwise comparison to scoring, enabling more standardized judgments. We then automatically converted all scored pairs from the same annotator into pairwise judgments by comparing the two scores, as sketched below. Scores for each criterion were defined here.
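As a minimal sketch of this conversion (the function name and the tie handling are illustrative assumptions, not taken from the actual pipeline), the step only needs to compare the two scores an annotator gave to the two ideas in a pair:

```python
from typing import Optional

def scores_to_pairwise(score_a: float, score_b: float) -> Optional[str]:
    """Turn two scores from the same annotator into a pairwise judgment.

    Returns "A" if idea A received the higher score, "B" if idea B did,
    and None when the scores are equal (treated here as a tie).
    """
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return None  # tie: equal scores carry no preference


# Example: the annotator scored idea A with 4 and idea B with 2 on one criterion.
print(scores_to_pairwise(4, 2))  # -> A
```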
Annotation filtering
It is challenging to assign scores to unqualified ideas. For instance, ideas that lack specificity cannot be scored from the perspective of technical validity. This observation led us to introduce an annotation order and skipping criteria.
Due to the high volume of submitted ideas, we selected specific patents for annotation rather than annotating every idea for every patent. For each submitted idea on a selected patent, we assigned two annotators from each of the marketing and technical groups. However, due to assignment issues, a few ideas were annotated by only one annotator.
Number of human experts
NLP / CS
- Technical group: 5 (Organizer members)
- Marketing group: 7 (Consultants from Stockmark Inc.)

Material science
- Technical group & marketing group: 4 in total (Tech experts from Asahikasei TENAC R&D Dept.)
For LLM-as-a-Judge, we used the following three models to generate scores. Due to the high volume of ideas, we opted not to use commercial APIs such as OpenAI GPT.
To ensure robust evaluation, we ran inference five times with distinct seeds for each prompt. The results were then aggregated as described below.
Instruction #1 is designed for pairwise comparison. To mitigate position bias, we created two prompts for each pair, one with the positions of the two ideas reversed. After running inference across both positions and all seeds, we used majority voting to determine the final judgment.
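A rough sketch of this aggregation, assuming each run's answer has already been mapped back to the original idea order (the labels "A"/"B" and the tie handling below are illustrative assumptions):

```python
from collections import Counter
from typing import Optional

def majority_vote(judgments: list[str]) -> Optional[str]:
    """Aggregate per-run judgments ("A" or "B") into a final label.

    `judgments` holds one label per (position, seed) run, already mapped back
    to the original idea order, so an answer of "first idea" on the reversed
    prompt counts as a vote for "B". Returns None on an exact tie.
    """
    ranked = Counter(judgments).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no majority across runs
    return ranked[0][0]


# Example: 2 positions x 5 seeds = 10 runs for one pair of ideas.
runs = ["A", "A", "B", "A", "B", "A", "A", "B", "A", "A"]
print(majority_vote(runs))  # -> A (7 votes to 3)
```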
Instructions #2 and #3 are designed for scoring. We performed inference with five different seeds, and for each idea the mean score across seeds was used as the final score.
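The scoring aggregation reduces to a per-idea mean over the seeded runs; a small sketch, with the list layout assumed for illustration:

```python
from statistics import mean

def final_score(seed_scores: list[float]) -> float:
    """Average the scores one model produced for one idea across seeded runs."""
    return mean(seed_scores)


# Example: the same idea scored under five different seeds.
print(final_score([4.0, 3.0, 4.0, 4.0, 5.0]))  # -> 4.0
```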