This page presents our full set of guidelines for annotating visually descriptive language (VDL) in text. They were first presented and summarised in the paper below:

Robert Gaizauskas, Josiah Wang, Arnau Ramisa. Defining Visually Descriptive Language. In Proceedings of the Fourth Workshop on Vision and Language (VL'15), 2015.

Please visit the project website for our corpus annotated based on the guidelines below.

Annotation Guidelines

This document presents a set of guidelines for annotating visually descriptive language (VDL) in text. We begin with our definition of VDL and some elaboration and clarification of concepts in it. We then define several possible annotation tasks relating to the definition. Finally, we consider a set of problem cases and recommendations in relation to them.

1.0 Definition of VDL

Our intuition is that a segment of text is visual if we can determine whether what it says is true or false by visual sense alone. More precisely:

Definition: A text segment is visually descriptive iff it asserts one or more propositions about either (a) a specific scene or entity whose truth can be confirmed or disconfirmed through direct visual perception (example (1)), or (b) a class of scenes or entities whose truth with respect to any instance of the class of scenes or entities can be confirmed or disconfirmed through direct visual perception (example (2)).

Examples of visual sentences:

(1) “John carried the bowl of pasta across the kitchen and placed it on the counter.”
(2) “Tigers have a pattern of dark vertical stripes on reddish-orange fur with a lighter underside.”

Notes:

By text segment here we mean a phrase, clause, sentence or sequence of sentences, i.e. a sequence of contiguous words.
By “direct visual perception” we mean that:
1. The observer can determine the truth of the relevant proposition without intervening in the scene so as to allow for additional visual inputs. E.g. the truth of “John weighs 65 kilos” might be determined visually by placing John on a weighing scale and observing the needle on the scale, but if this scale and John’s standing on it are not part of the scene then this sentence is not visual.
2. Any inference that needs to be carried out to confirm or disconfirm the proposition is such that it would unquestioningly be made by the vast majority of observers drawn from the population of intended readers of the text without knowledge of the preceding textual content, i.e. the observer’s judgement is “context-free”. For example, most observers of a scene that includes a boy sitting on the end of a dock holding a fishing rod whose line disappears into the water before him would infer without question that the boy is fishing, allowing them to confirm the truth of “The boy sat fishing on the dock” directly from the scene and without knowledge of earlier parts of the text in which the sentence is embedded. This example illustrates just how tightly coupled inference and perception are and that “what we see” is a product of both. Also, note how our definition is analogous to that of textual entailment, where, given a pair of textual expressions T and H, “We say that T entails H if, typically, a human reading T would infer that H is most likely true”. That is, we rely on a judgement that would typically be made about what is going on in the scene.
3. Our imaginary observer can visually identify any named entities.
By “asserts a proposition” we mean that the sentence or sentence fragment must express, explicitly or implicitly, a predication of some sort, i.e. something that may be judged true or false. Thus, sentences or clauses with tensed verbs are candidates, as are noun phrases that, in order to correctly refer, predicate something of an entity. Thus we rule out bare noun phrases, such as “the man”, since nothing is predicated here, but include phrases such as “the tall man” or “a man wearing a green shirt”.

One consequence of the first note above (that a text segment is a contiguous sequence of words) is that phrases like

(3) the tall, well-educated man

are not visual, since while they contain a mix of visual (“tall”) and non-visual (“well-educated”) attributes, they do not form a contiguous sequence of words which is visually confirmable as a whole. The visually descriptive phrase

(4) “the tall man”

could be derived from such a phrase (and is textually entailed by it) but the original phrase (3) is not visually descriptive as it stands.

Note that phrases such as (3) are distinct from sentences such as (5) where there are both visual and non-visual phrasal subcomponents:

(5) While he walked beside the canal, John thought about his mother.

Text segments like (5) we refer to as partially visual, since they contain at least one sub-segment (“he walked beside the canal”) which is visually descriptive, by our definition. By contrast (3) is not visually descriptive since it does not contain a sub-segment that meets our definition of VDL.

Cases such as (3) are frequently observed and should thus be accommodated in our scheme. We call segments such as (3) impure visually descriptive language (IVDL). To be IVDL a segment S1 must contain discontinuous subsequences that if conjoined form a segment S2 such that

1. S2 is VDL, and
2. in context S2 asserts a proposition that is entailed by the proposition S1 asserts (this rules out conjoining of unrelated subsequences).

Point 2. above rules out cases such as

(6) A tall wardrobe beside the well-educated man

as from (6) we cannot derive (4), since the predication (4) expresses is not entailed by those expressed by (6).
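
The IVDL conditions can be made concrete with a small sketch. The Python below is only an illustration and not part of the annotation scheme: it assumes a segment is held as a list of tokens and that each sub-sequence is a hypothetical (start, end) pair of token offsets with an exclusive end. It derives (4) from (3) and checks that the sub-sequences really are discontinuous. Condition 2 (entailment in context) is a human judgement and is deliberately not modelled.

```python
from typing import List, Tuple

# Assumed convention for this sketch: token offsets, end-exclusive.
Span = Tuple[int, int]


def conjoin(tokens: List[str], spans: List[Span]) -> List[str]:
    """Concatenate the tokens covered by each span, in textual order."""
    out: List[str] = []
    for start, end in sorted(spans):
        out.extend(tokens[start:end])
    return out


def is_discontinuous(spans: List[Span]) -> bool:
    """True if at least one token is skipped between consecutive spans."""
    ordered = sorted(spans)
    return any(
        prev_end < next_start
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    )


# Example (3), tokenised by hand for illustration.
s1 = ["the", "tall", ",", "well-educated", "man"]
spans = [(0, 2), (4, 5)]  # "the tall" + "man"

s2 = conjoin(s1, spans)
print(" ".join(s2))             # the candidate S2, i.e. example (4)
print(is_discontinuous(spans))  # whether S1 would be IVDL rather than VDL
```

Whether S2 is itself VDL, and whether S1 entails it, still has to be decided by the annotator; the sketch only covers the mechanical part of the definition.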

We also consider modal and counterfactual sentences to be impure visually descriptive language since their truth value is not directly determinable through visual sense, but it does relate to that of a visually descriptive sentence. For example, we cannot “see” now that Bill will play tennis tonight. But the truth value of “Bill is playing tennis” can be determined visually, and what the truth value is at particular times is a key element in determining whether “Bill will play tennis tonight” is true. This point is further discussed in section 3, subsection 5 below (hypotheticals, counterfactuals, modals and subjunctives).

2.0 Annotation Tasks

Given the definition of VDL, we can specify several different annotation tasks. Here we distinguish two, which we refer to as sentence-level annotation and segment-level annotation. Each has several sub-variants depending on whether one wishes to capture pure VDL only or impure VDL as well.

2.1 Sentence-Level Annotation

We define a sentence-level annotation task as follows. Each sentence in a document is assigned one of three values:

“0” if it contains no VDL;
“1” if the entire sentence is VDL;
“2” if the sentence contains one or more proper sub-segments which are VDL, but the single segment comprising the whole sentence is not VDL.
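
As a rough illustration of the three codes, the sketch below derives a sentence-level value from a list of VDL spans already marked within the sentence. The function name `sentence_code`, the helper `span_of`, the use of character offsets and the decision to ignore sentence-final punctuation outside a span are all assumptions made for this example, not part of the guidelines.

```python
from typing import List, Tuple

# Assumed convention: character offsets within the sentence, end-exclusive.
Span = Tuple[int, int]


def span_of(sentence: str, phrase: str) -> Span:
    """Hypothetical helper: locate the first occurrence of phrase."""
    start = sentence.index(phrase)
    return (start, start + len(phrase))


def sentence_code(sentence: str, vdl_spans: List[Span]) -> int:
    """Return 0 (no VDL), 1 (whole sentence is VDL) or 2 (contains VDL)."""
    if not vdl_spans:
        return 0
    for start, end in vdl_spans:
        before = sentence[:start].strip()
        # Sentence-final punctuation left outside the span is ignored here.
        after = sentence[end:].strip(" \t.!?")
        if not before and not after:
            return 1
    return 2


whole = "Bob often goes to the park for a picnic."
partial = "Rob believed that Anne was playing in the garden."
none = "The Eiffel Tower is in the 7th Arrondissement in Paris."

print(sentence_code(whole, [span_of(whole, "Bob often goes to the park for a picnic")]))
print(sentence_code(partial, [span_of(partial, "Anne was playing in the garden")]))
print(sentence_code(none, []))
```

The three sentences are taken from the examples later in these guidelines; a variant that also tracks impure VDL would need an extra flag per span.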

Variants of the task may be defined depending on whether VDL is taken to include pure VDL only or both pure and impure VDL. In many texts there are significant numbers of impure VDL segments, so omitting them leads to the loss of a substantial quantity of potentially valuable VDL; on the other hand, including them requires substantially more annotation effort and is only likely to be useful if accurate automatic techniques for extracting pure from impure segments can be developed.

One variant of the annotation task is as defined above, understanding VDL to be pure VDL only; another is to take VDL to include both pure and impure VDL, keeping the coding above the same; another is to keep the codes 0, 1 and 2 for pure VDL and introduce another code, 3, for sentences that are or contain impure VDL. A further variant would be to use the three-value scheme above with VDL in cases “0” and “2” being pure or impure VDL, while restricting case “1” to pure VDL. This last variant, if automatable, would allow directly “usable” VDL to be gathered from sentences coded “1”, while sentences coded “0” could be discarded and those coded “2” retained for potential future use.

2.2 Segment-Level Annotation

In segment-level annotation, the exact words comprising a VDL segment are annotated using a swipe and click annotation tool. Variants arise depending on whether one wants to (1) allow pure VDL only or both pure and impure VDL, and/or (2) restrict the scope of the annotation to a single sentence or allow it to extend over multiple sentences.

If the annotation task is restricted to pure VDL, annotation is straightforward: select the word sequence to be annotated using a mouse and click to indicate VDL. If the task includes impure VDL then it is more complex: the multiple sub-sequences making up the pure sequence contained within the impure VDL segment must be selected and their association recorded. The latter is considerably more laborious but both can be supported in modern annotation tools, such as the brat rapid annotation tool. A slightly simpler, but less informative alternative to the latter is to swipe and click the full segment extent for both pure and impure VDL segments, but to associate a distinct code for each. This would allow pure and impure VDL segments to be collected and distinguished, but allows the task of identifying the multiple sub-sequences making up the pure sequence within the impure VDL segment to be deferred to a later time.

A decision as to the scope of the annotation -- within sentence versus across multiple sentences -- is likely to be based on the intended use of the annotation. Sequences of actions are likely to be described across multiple sentences. So, if the purpose of the annotation is to gather visual action descriptions (e.g. for interpreting or generating descriptions of video) then multi-sentence annotation will be useful; if the purpose is to gather descriptions of static scenes or snapshots of activities, then single sentence annotation will suffice.

Note that deciding to extend the scope of annotation to multiple sentences will not just affect begin and end points of annotations; it may affect the content of the annotations too. Consider (7):

(7) John took a sip of coffee. He read the newspaper for a minute then took a second sip.

In a multi-sentence annotation task both these sentences would be annotated as a single VDL segment. If the task is limited to single-sentence annotation, then there would be two VDL extents annotated here: (a) “John took a sip of coffee”, and (b) “He read the newspaper for a minute then took a sip”. Note that the word “second” is omitted from (b) (hence we are assuming an annotation task in which impure VDL segments are annotated). This is because in the scene corresponding to both sentences the two actions are observed, while when examining the scene corresponding to the second sentence in isolation we cannot verify that the sip is a second one.

3.0 Problem Cases

Inevitably, various difficult cases emerge. While it is to be expected that some areas of variation between annotators will unavoidably remain, consistency across annotators is increased and annotation decisions simplified if a standard approach is taken to various anticipated difficult cases. Here we list a number of these and recommend ways to annotate them. We proceed on the assumption VDL is being annotated at the segment level, sentence-by-sentence (i.e. the scope of the annotation is no more than a single sentence). To indicate the extent of VDL to be marked up in examples we use the tags <VDL> and </VDL>. For text segments that are impure VDL we use the notation <IVDL id=n seg=k> … </IVDL> to indicate the k-th sub-segment of the n-th impure VDL segment, omitting the id attribute where this is clear in context.
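
The inline notation can be read mechanically. The following Python sketch is only an illustration: it extracts <VDL> segments and reassembles each impure segment from its <IVDL> sub-segments. It assumes tags are never nested, treats a missing id as 1, and numbers sub-segments in order of appearance when seg is omitted; the function name `extract_segments` is hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Matches <VDL>...</VDL> and <IVDL id=n seg=k>...</IVDL> (no nesting assumed).
TAG = re.compile(r"<(VDL|IVDL)((?:\s+\w+=\w+)*)\s*>(.*?)</\1>", re.DOTALL)
ATTR = re.compile(r"(\w+)=(\w+)")


def extract_segments(marked_up: str) -> Tuple[List[str], Dict[int, str]]:
    """Return (pure VDL segments, impure segments keyed by id)."""
    vdl: List[str] = []
    parts: Dict[int, List[Tuple[int, str]]] = defaultdict(list)
    for kind, attrs, text in TAG.findall(marked_up):
        text = " ".join(text.split())  # normalise whitespace and line breaks
        if kind == "VDL":
            vdl.append(text)
            continue
        attr = dict(ATTR.findall(attrs))
        seg_id = int(attr.get("id", 1))  # id may be omitted when clear in context
        seg_no = int(attr.get("seg", len(parts[seg_id]) + 1))
        parts[seg_id].append((seg_no, text))
    # Conjoin the sub-segments of each impure segment in seg order.
    ivdl = {i: " ".join(t for _, t in sorted(p)) for i, p in parts.items()}
    return vdl, ivdl


example_14a = (
    "<IVDL id=1 seg=1>Liz</IVDL> would have <IVDL id=1 seg=2>finished "
    "planting the flowers</IVDL>, if <IVDL id=2 seg=1>Leo</IVDL> had not "
    "<IVDL id=2 seg=2>kicked over the wheelbarrow</IVDL>."
)
vdl, ivdl = extract_segments(example_14a)
print(vdl)
print(ivdl)
```

Applied to example (14)(a) from the problem cases below, the sketch yields no pure VDL segments and two reassembled impure segments, one per id.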

Metaphor.

In general, judgements that “A is like B”, “X appeared to be Y”, “C was as if D” etc. will not be VDL, since the judgement of similarity underlying such statements is not something that is likely to be shared by an observer viewing the entity to which the metaphor is applied. However, the expression describing the entity to which the metaphor is applied and the expression supplying the metaphor may themselves be VDL.

(8) the pews appeared to be <VDL>broad stairs in a long dungeon</VDL>

(9) he panted like <VDL>a big dog that has been running too long</VDL>

(10) <VDL>The steeple leaned backward, while the church advanced</VDL> like <VDL>a headless creature in a long, shapeless coat</VDL>.

For example, in (8), the full sentence is not VDL since the judgement that the pews appeared like stairs is a subjective response of the viewer in this situation and cannot be assumed to be shared by an arbitrary observer in the same situation. However, the phrase “broad stairs in a long dungeon” is VDL since most observers could recognise visually broad stairs in a long dungeon.

Similarly in (9), the full sentence is not VDL since the judgement involved in the comparison is subjective; but the phrase “a big dog that has been running too long” is VDL since most observers could visually identify situations in which this is true.

(10) is a difficult case that illustrates several points. As with (8) and (9), the clause “the church advanced like a headless creature in a long, shapeless coat” is not VDL since it is unlikely an arbitrary observer viewing the church would agree either that it was advancing or that it was similar to a headless creature in a long, shapeless coat. However, the phrase “a headless creature in a long, shapeless coat” is VDL since most observers could recognise visually a headless creature in a long, shapeless coat.

In context it is clear (10) is a description of a boy’s experience of viewing a particular church, and that the steeple is not really leaning backward nor is the church advancing -- i.e. this is a metaphorical description that conveys his subjective experience. However, if the annotation task is single sentence annotation then “The steeple leaned backward, while the church advanced” is VDL since prior context is ignored, no particular church is assumed and an observer could determine visually whether a steeple is leaning backward and a church advancing.

Words with mixed visual/aural or visual/experiential meanings.

Many words mix visual and aural or visual and experiential senses. For example, verbs like “shout”, “shuffle” and “pant” have an aural and a visual component, not necessarily in the same proportion. Verbs like “shudder” and “flinch”, adjectives like “insolent” (“insolent green eyes”) and “sombre” and adverbs like “deathly” (“deathly pale”) signal not just movement or appearance but also underlying emotional experience or response.

Can situations appropriate for the use of such words be determined on the basis of visual input alone? The answer to this question is the key to determining whether they should be annotated as VDL or not. Annotators should imagine themselves immersed in the setting of the sentence being annotated and decide (a) whether they would be able to apply the word unproblematically without aural information and, (b) in the case of emotions projected onto participants or an emotional response by the author, whether the emotional element of the word could be unambiguously determined from, e.g., the facial behaviour of the participants, or would be universally shared by observers in the setting (e.g. “a dreary housing estate”).

Multiple visual perspectives.

Sometimes one sentence may contain information that is visually confirmable, but only from more than one distinct perspective or frame of reference. For example:

(11) Billy climbed the tree wearing his backpack, which contained his slingshot, some pebbles and a magnifying glass.

An observer could visually confirm that Billy was climbing a tree wearing his backpack. And, he or she could visually confirm that the backpack contained various objects. But the imaginary position from which the climbing could be confirmed would not allow the observer to visually confirm the contents of the backpack.

In such cases we advocate annotating distinct VDL segments, one for each visual perspective or frame of reference, as in (11’):

(11’) <VDL>Billy climbed the tree wearing his backpack</VDL>, <VDL>which contained his slingshot, some pebbles and a magnifying glass</VDL>.

The reason for this is that we want to derive models of VDL usage that can be used to help interpret or describe images or video that will be taken from a single perspective (at any given time point). Therefore descriptions that mix perspectives are more likely to be confusing than helpful.

Intentional contexts.

For the most part, sentences expressing propositional attitudes will not be visual. However, the sub-constituent that expresses the proposition towards which the speaker has an attitude may well be.

(12) Rob believed that <VDL>Anne was playing in the garden</VDL>.

For example, in (12), while the sentence as a whole is not visually descriptive, the embedded sentence ``Anne was playing in the garden” is.

Hypotheticals/counterfactuals/modals/subjunctives.

(13) If <VDL>Leo sets the table</VDL> then <VDL>Rob serves dinner</VDL>.

(14) (a) <IVDL id=1 seg=1>Liz</IVDL> would have <IVDL id=1 seg=2>finished planting the flowers</IVDL>, if <IVDL id=2 seg=1>Leo</IVDL> had not <IVDL id=2 seg=2>kicked over the wheelbarrow</IVDL>.

(b) If <IVDL id=1 seg=1>John</IVDL> were to <IVDL id=1 seg=2>cut the grass</IVDL> the lawn would be happier.

(15) (a) <IVDL id=1 seg=1>James</IVDL> may <IVDL id=1 seg=2>practice Tai Chi in the garden</IVDL>

(b) <IVDL id=1 seg=1>James</IVDL> might have <IVDL id=1 seg=2>practiced Tai Chi in the garden</IVDL>

garden</IVDL>

(d) <IVDL id=1 seg=1>James</IVDL> will <IVDL id=1 seg=2>practice Tai Chi in the garden</IVDL>

Hypothetical or conditional propositions assert something to be the case provided something else is the case. We cannot literally see a conditional, so sentences expressing such propositions are not VDL. However, the antecedents and consequents of conditionals may be visual, as in (13).

In cases of modal (including negation and future tense) and counterfactual propositions and irrealis propositions expressed via the subjunctive mood, again we do not “see” whether the proposition as a whole is true. However, in such cases there may be an underlying VDL sentence whose evaluation as true or false (e.g. in some or all possible worlds, should one adopt a possible worlds semantics for modals) is logically key to the truth value of the overall proposition. In these cases we mark up the underlying discontinuous segments forming the VDL segment as impure VDL. See (14) and (15).

Statements of purpose.

Components of sentences that express an agent’s purpose in doing something should not be annotated as VDL:

(16) <VDL>Billy climbed to the rooftop</VDL> to shoot at crows

Locational Information.

(17) The Episcopal Church was one block down Sussex Street

(18) The Eiffel Tower is in the 7th Arrondissement in Paris.

(19) <VDL>The Episcopal Church stood across the street.</VDL>

Some locational information is visually determinable, some is not. In general we mark up cases where the locational information is clearly visual, as in (19), but do not mark up cases where the locational information may not be visually confirmable, as in (17) and (18). As a general rule any locational information that relies upon geopolitical naming, street plans or compass directions is not marked as visual.

Dialogues (Direct/Indirect speech)

Text segments that report dialogues do so using either indirect (e.g. 20) or direct (e.g. 21) quotation.

(20) Dorothy said that <VDL>Toto was running away</VDL>.

(21) Dorothy said, “<VDL>Toto is running away</VDL>”.

In both these cases we mark the segment spoken as VDL, if it is VDL. As a matter of convention we do not mark the words reporting who spoke, even if we could determine visually whether that person was speaking. This is because (a) these segments are of little interest, and (b) there are many verbs that express fine shades of meaning with respect to spoken utterances, many of which are not visually determinable (e.g. “reply”, “ask”, “exhort”, “assert”) and it is easiest just to rule them all out.

Temporal Adverbials of Frequency and Duration.

Temporal adverbs of frequency (e.g. often, sometimes, usually) determine how frequently an activity takes place. For example:

(22) <VDL>Bob often goes to the park for a picnic</VDL>.

We mark such examples as VDL because our imaginary observer could determine visually, over a period of time, how frequently the activity takes place and make an assessment of whether the temporal term applies.

Note this does not apply to temporal adverbials that reference calendrical units. So in

(23) On Tuesdays <VDL>Bob goes to the park for a picnic</VDL>.

we do not mark the full sentence as VDL. This is because we cannot directly see that it is a Tuesday.

Temporal adverbials of duration (e.g. for an hour) indicate how long an activity takes. For example:

(24) <VDL>John cleaned up in the kitchen for a few minutes then went into the lounge</VDL>.

We mark as VDL cases where the duration is intuitively/informally assessable as part of the viewing process. Cases where reference to a watch or calendar would be needed for precision or for tracking the extent of the activity are not marked. For example:

(25) <VDL>Usain ran the race</VDL> in 9.58 seconds.

(26) <VDL>Leon went skiing</VDL> for two weeks.

Imperative and interrogative sentences.

Imperative and interrogative sentences do not assert propositions and therefore, by our definition, cannot be VDL as a whole. However, they may contain components which are VDL or IVDL.

For example (27) is not VDL. However, in (28) we see an embedded component which is visual. In this case, the annotated text expresses a declarative which is an implicature of the interrogative.

(27) “Come out to the field and call us”, said the Queen.

(28) How did <IVDL>you</IVDL> manage to <IVDL>escape the great Wildcat</IVDL>?

Participial Phrases.

As noted above in Section 1, participial phrases may express predications, as in <VDL>a man wearing a green shirt</VDL>, where they occur within a noun phrase. This example is straightforward to annotate as it is a well-formed noun phrase. However, in some cases participial phrases may be extraposed and function not so much as a reduced relative clause but as a sentence adverbial.

For example:

(29) <VDL>Walking slowly across the ice, John</VDL> thought about his mother.

In this case we annotate across phrasal boundaries, in order to capture the argument of the activity described in the participial phrase, i.e. the entity about which something visual is being predicated.
