Benchmark
Find all the details about our benchmark at
Motivation
Understanding the semantics in images is an integral part of computer vision. Over the past decades, our community has defined a diverse set of semantic tasks, such as image classification, object detection, semantic & panoptic segmentation, action & attribute recognition, image captioning, visual question answering, and visual grounding. Treating these tasks individually leads to specialized models that can reason only about a limited type and range of semantics. The next generation of computer vision models should understand a broader spectrum of labels in a unified way.
We define three groups of category descriptions:
Plain category names or nouns like “person”, “cat” or “animal”. Note that we include nouns at all levels of the object hierarchy (like “corgi”, “dog” and “animal”), unlike label spaces in many common datasets like COCO or Objects-365.
Modifying descriptions that further specify categories by using attributes (“blue cars”), actions (“dogs sitting”), functions (“edible items”) or relations (“person on skateboard”). Note that such modifiers can operate at all levels of the object hierarchy and, hence, abstract across label spaces of standard datasets like COCO or Objects-365. For instance, “edible items” would include, among others, apples, bananas or hot dogs.
Compositions of descriptions from the first two groups, e.g., “woman with yellow handbag” or “man wearing white hat walking next to fire hydrant” (see the sketch after this list).
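The sketch below illustrates the three groups with the examples mentioned above. The dictionary layout is purely illustrative and is our own assumption, not the benchmark's data format.

```python
# Illustrative only: example object descriptions for each of the three groups.
# The grouping mirrors the text above; the structure itself is an assumption.
DESCRIPTION_GROUPS = {
    "plain_categories": ["person", "cat", "animal", "corgi", "dog"],
    "modified_categories": [
        "blue cars",             # attribute
        "dogs sitting",          # action
        "edible items",          # function
        "person on skateboard",  # relation
    ],
    "compositions": [
        "woman with yellow handbag",
        "man wearing white hat walking next to fire hydrant",
    ],
}
```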
Novel benchmark
As a first step towards our vision, we are introducing a new benchmark for object detection and are organizing a challenge with the workshop. The key contribution of our new benchmark is a comprehensive evaluation protocol and dataset that goes beyond standard object detection and referring expression datasets. The figure below illustrates the novel task and the desired output we expect from proposed models.
Difference from object detection benchmarks: The main difference is the label space, which is both more complex (natural-text object descriptions) and dynamic (the size of the label space changes for every test image). Standard object detectors fail at this task because they assume a fixed label space.
Difference from referring expression benchmarks: While the task definition is similar, there are important differences in the data and the evaluation. First, the object descriptions Di range from plain categories to highly specific descriptions. Second, in most referring expression benchmarks each description refers to exactly one instance in the image; PhraseCut is the only exception, where expressions can refer to multiple instances. Our task and evaluation data are defined more broadly: any object description can refer to zero, one, or multiple instances in the image. Even a specific description like “person wearing red shirt and sunglasses” may not be present in an image, in which case the model M needs to output an empty set. Third, the object descriptions we collect are more challenging and also include negations, e.g., of an attribute or relation, as in “cup NOT on the table”.
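To make the expected input/output contract concrete, here is a minimal sketch of the interface a model M could expose for this task. The class and method names (e.g., DescribedObjectDetector, detect) and the box format are our own illustration under stated assumptions, not part of the benchmark API.

```python
from typing import Dict, List, Tuple

# Assumption: a box is (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


class DescribedObjectDetector:
    """Hypothetical interface for the described-object-detection task.

    Given an image and a per-image set of free-form object descriptions Di,
    the model returns, for every description, the set of matching boxes,
    which may be empty.
    """

    def detect(self, image, descriptions: List[str]) -> Dict[str, List[Box]]:
        raise NotImplementedError


# Example of the dynamic, per-image label space (usage sketch):
#
#   detector = MyDetector()  # hypothetical implementation
#   result = detector.detect(
#       image,
#       [
#           "person",                                   # may match many instances
#           "person wearing red shirt and sunglasses",  # may match none -> []
#           "cup NOT on the table",                     # negated description
#       ],
#   )
#   # Every queried description is present in the output, possibly with an
#   # empty list of boxes.
#   assert set(result.keys()) == {
#       "person",
#       "person wearing red shirt and sunglasses",
#       "cup NOT on the table",
#   }
```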