Can Transformers Capture Spatial Relations between Objects?

IIIS, Tsinghua University; University of Pennsylvania; Shanghai Artificial Intelligence Laboratory; Shanghai Qi Zhi Institute

Abstract

Spatial relationships between objects represent key scene information for humans to understand and interact with the world. To study the capability of current computer vision systems to recognize physically grounded spatial relations, we start by proposing precise relation definitions that permit consistently annotating a benchmark dataset. Despite the apparent simplicity of this task relative to others in the recognition literature, we observe that existing approaches perform poorly on this benchmark. We propose new approaches exploiting the long-range attention capabilities of transformers for this task and evaluate key design principles. We identify a simple “RelatiViT” architecture and demonstrate that it outperforms all current approaches. To our knowledge, this is the first method to convincingly outperform naive baselines on spatial relation prediction in in-the-wild settings.

Spatial Relation Prediction Benchmark

Our Spatial Relation Prediction (SRP) benchmark includes two datasets: Rel3D and SpatialSense+. In particular, SpatialSense+ is relabeled from the original SpatialSense dataset with more precise and physically grounded relation definitions. The dataset statistics are shown below.
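Concretely, each SRP example provides an image, the bounding boxes of a subject and an object, and a predicate; the task is to decide whether the stated relation holds. The snippet below is a minimal illustrative sketch of this format; the `SRPExample` dataclass and its field names are our assumptions, not the benchmark's actual data format.

```python
# Illustrative only: this dataclass and its field names are assumptions,
# not the benchmark's actual data format.
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SRPExample:
    image_path: str                            # RGB image containing both objects
    subject_name: str                          # e.g. "cup"
    predicate: str                             # e.g. "on", "behind", "next to"
    object_name: str                           # e.g. "table"
    subject_bbox: Tuple[int, int, int, int]    # (x1, y1, x2, y2) in pixels
    object_bbox: Tuple[int, int, int, int]     # (x1, y1, x2, y2) in pixels
    label: bool                                # does the stated relation hold?

# The SRP task: given the image, the two bounding boxes, and the predicate,
# predict the binary label.
```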

Transformer-centric Architectures

We consider four design axes: feature extraction, query localization, context aggregation, and pair interaction. Along these axes, we design four different architectures for the spatial relation prediction task.

Experimental Results

Comparison between Different Designs:

RelatiViT performs significantly better than the other three designs because it more effectively reads out relation information from the pre-trained Vision Transformer.
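To make this idea concrete, here is a minimal sketch of reading out a relation from a pre-trained ViT: encode the whole image, pool the patch tokens that fall inside each query box, and classify the relation from the pooled subject and object embeddings. This is only an illustrative sketch under our own assumptions (timm's `vit_base_patch16_224`, mask pooling over output patch tokens, a small MLP head), not the exact RelatiViT architecture; see the paper for the precise design.

```python
# Illustrative sketch of reading out a spatial relation from a pre-trained ViT.
# NOT the exact RelatiViT architecture; backbone choice, pooling scheme, and
# classifier head are assumptions for illustration.
import torch
import torch.nn as nn
import timm


class ViTRelationReadout(nn.Module):
    def __init__(self, num_predicates: int, vit_name: str = "vit_base_patch16_224"):
        super().__init__()
        # Pre-trained ViT backbone (assumes a recent timm, where forward_features
        # returns the full token sequence including the CLS token).
        self.vit = timm.create_model(vit_name, pretrained=True)
        dim = self.vit.embed_dim
        self.predicate_emb = nn.Embedding(num_predicates, dim)
        self.head = nn.Sequential(nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def _pool_box(self, patch_tokens, box, grid):
        # Average the patch tokens whose grid cells fall inside the normalized box.
        ys = torch.linspace(0, 1, grid, device=patch_tokens.device).view(grid, 1).expand(grid, grid)
        xs = torch.linspace(0, 1, grid, device=patch_tokens.device).view(1, grid).expand(grid, grid)
        x1, y1, x2, y2 = box
        mask = ((xs >= x1) & (xs <= x2) & (ys >= y1) & (ys <= y2)).flatten().float()
        mask = mask / mask.sum().clamp(min=1.0)
        return (patch_tokens * mask.unsqueeze(-1)).sum(dim=0)

    def forward(self, image, subj_box, obj_box, predicate_id):
        # image: (1, 3, 224, 224); boxes: (x1, y1, x2, y2) normalized to [0, 1];
        # predicate_id: LongTensor of shape (1,).
        tokens = self.vit.forward_features(image)       # (1, 1 + N, dim)
        patch_tokens = tokens[0, 1:]                    # drop the CLS token -> (N, dim)
        grid = int(patch_tokens.shape[0] ** 0.5)        # 14 for 224 / 16
        subj = self._pool_box(patch_tokens, subj_box, grid)
        obj = self._pool_box(patch_tokens, obj_box, grid)
        pred = self.predicate_emb(predicate_id)[0]
        return self.head(torch.cat([subj, obj, pred]))  # logit: does the relation hold?
```

Where exactly query localization, context aggregation, and pair interaction happen (before, inside, or after the pre-trained encoder) is precisely what the four design axes above vary.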

Comparison with Baseline Methods:

Our best model, RelatiViT, outperforms all baselines on both datasets, including even strong VLMs such as Gemini and GPT-4V.

RelatiViT is the first model to beat the naive bbox-only baseline, demonstrating its capability to capture spatial relations from visual information!
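For context, the bbox-only baseline predicts the relation from the two bounding boxes alone, without ever looking at the pixels. The sketch below illustrates such a baseline; the layer sizes and the predicate embedding are our assumptions, not the exact baseline used in the benchmark.

```python
# Illustrative sketch of a bbox-only baseline: it ignores the image entirely and
# uses only normalized box coordinates plus the predicate. Layer sizes and the
# predicate embedding are assumptions, not the paper's exact baseline.
import torch
import torch.nn as nn


class BboxOnlyBaseline(nn.Module):
    def __init__(self, num_predicates: int, hidden: int = 256):
        super().__init__()
        self.predicate_emb = nn.Embedding(num_predicates, hidden)
        self.mlp = nn.Sequential(
            nn.Linear(8 + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),               # logit: does the relation hold?
        )

    def forward(self, subj_box, obj_box, predicate_id):
        # subj_box, obj_box: (B, 4) boxes normalized to [0, 1]; predicate_id: (B,)
        feats = torch.cat([subj_box, obj_box, self.predicate_emb(predicate_id)], dim=-1)
        return self.mlp(feats)
```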

Citation

If you find our work useful for your research, please cite our paper using this BibTeX:

@inproceedings{wen2024can,
  title={Can Transformers Capture Spatial Relations between Objects?},
  author={Wen, Chuan and Jayaraman, Dinesh and Gao, Yang},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}