System Validation

Abstract

Given a text (t1) and a hypothesis (t2), your system judges whether t1 entails t2 or not.

The system is given two types of sentence pairs:

    • Pairs in which several linguistic phenomena are involved in deciding whether t1 entails t2.
    • Pairs in which a single linguistic phenomenon is involved in deciding whether t1 entails t2.

Each sentence pair of the former type is broken down into a list of sentence pairs of the latter type.

While the RITE task targets systems that integrate semantic and contextual processing, this focus makes it difficult to pursue research on a specific linguistic phenomenon.

This subtask provides a data set that includes a breakdown of linguistic phenomena that are necessary for recognizing relations between t1 and t2.

See also the "linguistic phenomena" page.

Test Collection

The following test collections are provided for participants.

The data written in red are newly created for RITE-VAL.

The data written in black are the same as those used in RITE-1 or RITE-2.

See also the "NTCIR-11 Test Collections: data sets for NTCIR-11 Workshop Participants" page on the NTCIR-11 website.

Data Format

The file format for the System Validation subtask is the same as that of the NTCIR-10 RITE-2 BC/MC subtasks.

<?xml version="1.0" encoding="UTF-8"?>
<dataset type="bc">
  <pair id="1" label="Y">
    <t1>プロメーテウスは人類に火を渡し、張り付けにされた。</t1>
    <t2>プロメテウスは人類に火を齎して罰を受けた。</t2>
  </pair>
  <pair id="2" label="Y">
    <t1>伊坂幸太郎は直木賞候補になった2003年の『重力ピエロ』で一般読者に広く認知されるようになった。</t1>
    <t2>『重力ピエロ』は伊坂幸太郎による小説で直木賞候補作品だった。</t2>
  </pair>
  <pair id="3" label="N">
    <t1>中央アジアで作られる馬乳酒は、少量のアルコールを含んだ飲むヨーグルトといえる。</t1>
    <t2>飲むヨーグルトは、酒の一種だ。</t2>
  </pair>
  :
</dataset>

A <pair> element in the training data has a @label attribute, while a <pair> element in the formal-run data does not.
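As a minimal sketch of reading this format, the following uses Python's standard xml.etree.ElementTree module. The function name read_pairs and the two-pair SAMPLE are illustrative assumptions, not part of the task distribution. A missing @label is returned as None, which covers formal-run data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: pair 1 mimics training data (has a label),
# pair 2 mimics formal-run data (no label attribute).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<dataset type="bc">
<pair id="1" label="Y">
<t1>text one</t1>
<t2>hypothesis one</t2>
</pair>
<pair id="2">
<t1>text two</t1>
<t2>hypothesis two</t2>
</pair>
</dataset>
""".encode("utf-8")


def read_pairs(xml_bytes):
    """Return (dataset type, list of pair dicts) from RITE-style XML bytes."""
    root = ET.fromstring(xml_bytes)
    pairs = []
    for pair in root.iter("pair"):
        pairs.append({
            "id": pair.get("id"),
            "label": pair.get("label"),  # None when the attribute is absent
            "t1": pair.findtext("t1"),
            "t2": pair.findtext("t2"),
        })
    return root.get("type"), pairs


dataset_type, pairs = read_pairs(SAMPLE)
```

For a file on disk, ET.parse(path).getroot() could replace ET.fromstring in the same way.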

For the JA subtask, the t2 texts in "RITE-VAL_JA_test_systemval.xml" (SV-test) are drawn from the following two sets:

  1. A subset of the t2 texts in "RITE2_JA_dev_examsearch.xml" (FV-dev). A sentence pair whose t2 had id="N" in the FV-dev file was tagged with id="dev-N-xx" in the SV-test file, where xx is a suffix number representing a sub-ID.
  2. A subset of the t2 texts in "RITE-VAL_JA_test_factval.xml" (FV-test). A sentence pair whose t2 had id="M" in the FV-test file was tagged with id="test-M-yy" in the SV-test file, where yy is a suffix number representing a sub-ID.
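The ID scheme above can be split back into its parts with a short sketch like the following. The helper name split_sv_id is hypothetical, and the pattern assumes that the sub-ID consists only of digits and is the last hyphen-separated field; the actual files may differ.

```python
import re

# Assumed shape: "dev-N-xx" or "test-M-yy", with a numeric sub-ID at the end.
SV_ID = re.compile(r"^(dev|test)-(.+)-(\d+)$")


def split_sv_id(pair_id):
    """Split an SV-test pair ID into (source, original t2 ID, sub-ID)."""
    match = SV_ID.match(pair_id)
    if match is None:
        raise ValueError(f"not an SV-test style ID: {pair_id!r}")
    source, original_id, sub_id = match.groups()
    # The sub-ID stays a string so that any leading zeros are preserved.
    return source, original_id, sub_id
```

For example, the made-up ID "dev-12-01" would be split into the source "dev", the original FV-dev ID "12", and the sub-ID "01".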

Evaluation Method

Binary Classification (for CS, CT and JA)

Your system answers "Y" or "N" for a given sentence pair (t1 and t2).

    • If a human reading t1 would infer that t1 entails t2, answer "Y".
    • Otherwise, answer "N".

Performance of the system is evaluated by macro F1 for the two labels.

The evaluation tool distributed on the RITE-VAL and RITE-2 websites is used for this evaluation.
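For orientation, macro F1 is conventionally the unweighted mean of the per-label F1 scores. The sketch below follows that textbook definition and is not the official evaluation tool. The function name macro_f1 and the choice to score a label with no correct predictions as 0 are assumptions.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1 over the given labels."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)  # assumption: undefined F1 counts as 0
    return sum(scores) / len(scores)


# Made-up gold labels and system answers for four pairs.
gold = ["Y", "Y", "N", "N"]
pred = ["Y", "N", "N", "N"]
score = macro_f1(gold, pred, labels=("Y", "N"))
```

Here the F1 for "Y" is 2/3 and the F1 for "N" is 4/5, so the macro F1 is 11/15, or about 0.733. The same function applies to other label sets by changing the labels argument.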

Result submission format:

"Text Pair ID" [SPACE] "Label" [SPACE] "Confidence"

"Text Pair ID" [SPACE] "Label" [SPACE] "Confidence"

"Text Pair ID" [SPACE] "Label" [SPACE] "Confidence"

:

Example file obeying the format:

1 Y 0.852

2 N 0.994

3 Y 0.789

4 Y 1.000

:
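A result file in this format could be produced with a sketch like the one below. The function name format_submission is hypothetical, and printing the confidence with three decimal places is an assumption taken from the example above, not a stated requirement.

```python
def format_submission(rows):
    """Render (pair ID, label, confidence) rows as space-separated lines."""
    lines = []
    for pair_id, label, confidence in rows:
        # Three decimals mirrors the example file; the precision is assumed.
        lines.append(f"{pair_id} {label} {confidence:.3f}")
    return "\n".join(lines) + "\n"


text = format_submission([("1", "Y", 0.852), ("2", "N", 0.994), ("4", "Y", 1.0)])
```

Writing the returned string to a file opened with encoding="utf-8" keeps non-ASCII pair IDs intact, should any occur.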

Multi-Classification (for CS and CT)

Your system answers "F", "B", "C" or "I" for a given sentence pair (t1 and t2).

    • If a human reading t1 would infer that t1 entails t2 and a human reading t2 would infer that t2 entails t1, answer "B".
    • If a human reading t1 would infer that t1 entails t2 but a human reading t2 would not infer that t2 entails t1, answer "F".
    • If a human reading t1 would infer that t1 and t2 contradict each other, that is, they cannot both be true at the same time, answer "C".
    • Otherwise, answer "I".
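The four rules above can be read as a small decision procedure over three judgments. The sketch below assumes a system that already produces those judgments as booleans. The function name mc_label and the order in which the checks are applied (entailment before contradiction) are assumptions that follow the order of the list.

```python
def mc_label(t1_entails_t2, t2_entails_t1, contradiction):
    """Map three boolean judgments to a multi-class label."""
    if t1_entails_t2 and t2_entails_t1:
        return "B"  # entailment holds in both directions
    if t1_entails_t2:
        return "F"  # entailment holds only from t1 to t2
    if contradiction:
        return "C"  # t1 and t2 cannot both be true
    return "I"      # everything else, including t2-to-t1 entailment only
```

Under these rules, a pair where only t2 entails t1 falls under "I", because no rule covers that direction on its own.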

Performance of the system is evaluated by macro F1 for the four labels.

The evaluation tool distributed on the RITE-VAL and RITE-2 websites is used for this evaluation.

Result submission format:

"Text Pair ID" [SPACE] "Label" [SPACE] "Confidence"

"Text Pair ID" [SPACE] "Label" [SPACE] "Confidence"

"Text Pair ID" [SPACE] "Label" [SPACE] "Confidence"

:

Example file obeying the format:

1 F 0.852

2 F 0.994

3 C 0.789

4 I 1.000

:
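Before submitting, each line of a result file can be checked against the format with a sketch like this one. The function name parse_result_line is hypothetical, and the check only covers what the format states: three space-separated fields, a label from the allowed set, and a numeric confidence.

```python
def parse_result_line(line, allowed=("B", "F", "C", "I")):
    """Parse one result line into (pair ID, label, confidence)."""
    parts = line.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    pair_id, label, confidence = parts
    if label not in allowed:
        raise ValueError(f"unexpected label {label!r} in line {line!r}")
    # float() raises ValueError itself if the confidence is not numeric.
    return pair_id, label, float(confidence)
```

Passing allowed=("Y", "N") would apply the same check to a binary classification result file.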

Evaluation Tool