This page gives more details about the evaluation of the analysis rules than were reported in the paper. It describes how the test set was created and provides links to the test set and the annotations.
100 sentences that are representative of the challenges of the corpus were selected. In total, these sentences contained 311 propositions.
Propositions are composed of an actor, a predicate, and a negotiation point that is related to the actor via the predicate.
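As an illustration, a proposition can be represented as a simple record; the field names below are our own and not necessarily those used in the system's data model:

```python
from dataclasses import dataclass

@dataclass
class Proposition:
    """One proposition: an actor related to a negotiation point via a predicate."""
    actor: str              # e.g. a DBpedia entity such as "European_Union"
    predicate: str          # the relation expressed in the sentence
    negotiation_point: str  # the issue the actor takes a position on
```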
The sentences in the reference set contain structures like:
As can be expected, the corpus contains sentences whose agent is a generic noun phrase such as the delegates or most delegates, or a moderator role such as the Chair. Although the system does output propositions with such actors, we did not annotate them for the evaluation, and the system was configured not to output propositions for those actors (based on an attribute in the data model that encodes the actor type).
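A minimal sketch of such a filter, assuming a hypothetical actor_type attribute with values like "named", "generic", and "moderator" (the actual attribute and its values in the data model may differ):

```python
# Hypothetical actor-type values used only for illustration.
EXCLUDED_ACTOR_TYPES = {"generic", "moderator"}

def keep_for_evaluation(prop: dict) -> bool:
    """Drop propositions whose actor is a generic NP or a moderator role."""
    return prop.get("actor_type") not in EXCLUDED_ACTOR_TYPES

system_output = [
    {"actor": "European_Union", "actor_type": "named"},
    {"actor": "the delegates", "actor_type": "generic"},
]
evaluated = [p for p in system_output if keep_for_evaluation(p)]
# keeps only the European_Union proposition
```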
Actor mentions were normalized to the DBpedia entity representing them; e.g., a mention like The EU appears as European_Union in the reference set.
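The following sketch shows this normalization with a hand-written mention-to-entity table; the table entries are examples only, and the actual linking to DBpedia entities may be done differently:

```python
# Illustrative mention-to-entity mapping; entries are examples only.
MENTION_TO_DBPEDIA = {
    "the eu": "European_Union",
    "the european union": "European_Union",
}

def normalize_actor(mention: str) -> str:
    """Map a surface mention to its DBpedia entity name, if known."""
    return MENTION_TO_DBPEDIA.get(mention.lower().strip(), mention)

print(normalize_actor("The EU"))  # -> European_Union
```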
As described in the paper, an output was considered correct if all of its components matched the reference exactly.
To compare the reference and system results, all characters were lowercased, and leading and trailing whitespace and punctuation were stripped from all values. These modifications are immaterial to the task, but they avoid counting an error just because, for example, the manual annotation contained a trailing space.
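A sketch of this comparison, assuming propositions are represented as (actor, predicate, negotiation point) triples of strings; the example values are illustrative:

```python
import string

def normalize(value: str) -> str:
    """Lowercase and strip leading/trailing whitespace and punctuation."""
    return value.lower().strip(string.whitespace + string.punctuation)

def matches(system_prop, reference_prop) -> bool:
    """A system proposition is correct only if every component matches the reference exactly."""
    return all(normalize(s) == normalize(r)
               for s, r in zip(system_prop, reference_prop))

system = ("European_Union", "supports", "some negotiation point.")
reference = ("European_Union", "supports", "some negotiation point")
print(matches(system, reference))  # -> True: the trailing period is immaterial
```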