G-VUE

Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation

Jiangyong Huang1,3∗, William Yicheng Zhu1∗, Baoxiong Jia1,2, Zan Wang1,4, Xiaojian Ma1,2, Qing Li1, Siyuan Huang1

1Beijing Institute for General Artificial Intelligence, 2University of California, Los Angeles, 3Peking University, 4Beijing Institute of Technology

∗ indicates equal contribution

In this work, we present a novel, comprehensive benchmark for general-purpose vision: General-purpose Visual Understanding Evaluation (G-VUE), which consists of 11 meticulously chosen tasks. G-VUE covers the full spectrum of visual skills across four domains: Perceive, Ground, Reason, and Act. We further introduce a general encoder-decoder framework that supports the evaluation of arbitrary visual representations on all 11 tasks. With G-VUE, we evaluate prevalent visual representations pre-trained with different architectures, learning paradigms, and data sources. In particular, we find that (1) Transformer-based visual representations trained on massive data dominate on most visual tasks, and (2) despite the differences among visual tasks, performance across them is highly correlated, indicating a degree of generality in the representations. These findings may shed light on future foundation models for general-purpose vision.

An overview of G-VUE. The key idea of G-VUE is to evaluate visual representations against a general-purpose standard, where P (Perceive), G (Ground), R (Reason), and A (Act) represent the four functional domains of a general-purpose vision system.

Benchmark

An overview of the tasks in G-VUE. We categorize the tasks by functional domain and list the selected datasets, train/val/test split sizes, input/output modalities, and evaluation metrics.

Framework

Because the tasks in G-VUE are diverse, evaluating visual representations on all of them is challenging. To this end, we propose a general encoder-decoder framework that makes it possible to adapt arbitrary visual representations to all 11 tasks. Specifically, we examine the input and output formats of the visual tasks and, based on them, implement five types of decoders that together cover our full-spectrum evaluation.

A schematic of our encoder-decoder framework, together with illustrations of the tasks. The text encoder produces an embedding of the textual input for vision-language tasks. The visual representation, i.e., the output of an image encoder such as ResNet-50, is sent to the task decoders, together with the text embedding where applicable, to accomplish the corresponding tasks.
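The data flow in this schematic can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released G-VUE code: `run_task`, the toy encoders, and the toy decoder are hypothetical names, and fusing image and text features by concatenation is an assumption made only for this example.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def run_task(
    image: Sequence[float],
    image_encoder: Callable[[Sequence[float]], Vector],
    decoder: Callable[[Vector], float],
    text: Optional[str] = None,
    text_encoder: Optional[Callable[[str], Vector]] = None,
) -> float:
    """Encode the image, optionally append a text embedding, then decode."""
    features = list(image_encoder(image))
    if text is not None:
        if text_encoder is None:
            raise ValueError("a text encoder is required when text is given")
        # Assumption: fuse modalities by simple concatenation.
        features = features + list(text_encoder(text))
    return decoder(features)


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_image_encoder(pixels: Sequence[float]) -> Vector:
    return [sum(pixels) / len(pixels), max(pixels)]


def toy_text_encoder(sentence: str) -> Vector:
    return [float(len(sentence.split()))]


def toy_decoder(features: Vector) -> float:
    return sum(features)


# Vision-only task: the decoder sees image features alone.
vision_only = run_task([0.0, 0.5, 1.0], toy_image_encoder, toy_decoder)

# Vision-language task: the decoder also sees the text embedding.
vision_language = run_task(
    [0.0, 0.5, 1.0],
    toy_image_encoder,
    toy_decoder,
    text="pick up the red block",
    text_encoder=toy_text_encoder,
)
```

Keeping the encoder and the decoder behind plain callables is what lets any image encoder be swapped in without touching the task-specific part.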

Experiment

Visual Representations. We analyze three factors that may affect the quality of visual representations: architecture, pre-training mechanism, and source data. The architectures include two standard models: ResNet and ViT. The pre-training mechanisms cover the extent of supervision (e.g., supervised vs. self-supervised) and the learning objective (e.g., discriminative vs. generative). The source data include ImageNet, large-scale image-text pairs, and Ego4D videos. To explore the effects of these factors, we select 7 candidate visual representations. As a basic setting, we freeze these visual encoders and feed the resulting fixed representations to decoders, which are trained to accomplish the specific tasks.
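The frozen-encoder setting can be illustrated with a toy example. The sketch below is an assumption-laden stand-in, not the paper's training code: `encode` is a fixed two-dimensional feature map playing the role of a pre-trained encoder, the decoder is a linear head fit by full-batch gradient descent on squared error, and all names and hyperparameters are hypothetical.

```python
# Toy stand-in for a pre-trained encoder: its weights are fixed and never updated.
ENCODER_WEIGHTS = (0.5, -1.0)


def encode(x):
    """Map a scalar input to a 2-d feature vector using the frozen weights."""
    return [ENCODER_WEIGHTS[0] * x, ENCODER_WEIGHTS[1] * x + 1.0]


def train_decoder(data, steps=2000, lr=0.1):
    """Fit a linear decoder on frozen features with full-batch gradient descent."""
    w, b = [0.0, 0.0], 0.0
    n = len(data)
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in data:
            f = encode(x)  # features only; the encoder receives no updates
            err = w[0] * f[0] + w[1] * f[1] + b - y
            grad_w[0] += err * f[0] / n
            grad_w[1] += err * f[1] / n
            grad_b += err / n
        # Only the decoder parameters (w, b) are changed.
        w = [w[0] - lr * grad_w[0], w[1] - lr * grad_w[1]]
        b -= lr * grad_b
    return w, b


def predict(x, w, b):
    f = encode(x)
    return w[0] * f[0] + w[1] * f[1] + b


# Synthetic task: recover y = 3x + 2 from the frozen features.
train = [(x, 3.0 * x + 2.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w, b = train_decoder(train)
```

Because only `w` and `b` change during training, differences in downstream performance can be attributed to the representation rather than to encoder fine-tuning.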

An overview of evaluated visual representations.

Results. The experimental results are shown in the following table. In addition to the metrics of each task, we provide a summary score (the bottom row) as an overall measure across all tasks. We highlight two major findings: (1) Compared with the other candidates, the two ViT-CLIPs dominate on most tasks, which demonstrates the strength of Transformer-based visual representations trained on massive data. (2) In general, the performance of a visual representation is consistent across tasks. In other words, a visual representation that performs well on one task is unlikely to struggle on the others. This implies that, despite the differences among visual tasks, performance across them is highly correlated, indicating a degree of generality in the representations. Notably, RN-Ego lags drastically behind the others, which is probably due to the large distribution gap of ego-view data.

Quantitative results of visual representations on G-VUE. To save space, we abbreviate task names (e.g., “Cam. Pose.” for camera pose estimation, “I-T Retr.” for image-to-text retrieval, “Phr. Grnd.” for phrase grounding, “Sem. Seg.” for semantic segmentation, “Com. Res.” for common sense reasoning, “Abs. Res.” for abstract reasoning, “Nav.” for navigation, and “Manip.” for manipulation). Summary scores are given in the bottom row.

Metric correlations. To avoid misunderstanding, we point out that these correlations only apply to visual representations with unbiased training. For example, a visual representation fine-tuned on the depth estimation task will be biased and will inevitably suffer a performance drop on other tasks.
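One way to quantify how strongly two tasks agree is the Pearson correlation of their scores across the evaluated representations. The sketch below is illustrative only: the choice of Pearson correlation is an assumption, the task names are hypothetical, and the scores are placeholder values, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Placeholder scores (NOT from the paper): one entry per visual representation.
scores = {
    "task_a": [1.0, 2.0, 3.0, 4.0],
    "task_b": [2.0, 4.0, 6.0, 8.0],
    "task_c": [4.0, 3.0, 2.0, 1.0],
}

# Pairwise correlation between every pair of tasks.
tasks = sorted(scores)
correlations = {
    (a, b): pearson(scores[a], scores[b])
    for i, a in enumerate(tasks)
    for b in tasks[i + 1:]
}
```

When a task's metric is an error (lower is better), its sign would need to be flipped before it is compared with accuracy-style metrics.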

Conclusion

In this work, we present G-VUE, a novel general-purpose vision benchmark that consists of 11 meticulously chosen tasks. G-VUE covers the full spectrum of visual skills across four domains: Perceive, Ground, Reason, and Act. We further introduce a general encoder-decoder framework that supports the evaluation of arbitrary visual representations on all 11 tasks. With G-VUE, we evaluate 7 prevalent visual representations pre-trained with different architectures, learning paradigms, and data sources. In particular, we find that (1) Transformer-based visual representations trained on massive data dominate on most visual tasks, and (2) despite the differences among visual tasks, performance across them is highly correlated, indicating a degree of generality in the representations. These findings may shed light on the path toward building a unified model for general-purpose vision.

We release the code at https://github.com/wllmzhu/G-VUE. We also host a public leaderboard at https://eval.ai/web/challenges/challenge-page/1791/overview. We hope our efforts will promote the study of general-purpose representation learning and encourage the computer vision community to pursue general-purpose visual models that are universal and easily adaptable.

BibTeX

@article{huang2022perceive,
  title={Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation},
  author={Huang, Jiangyong and Zhu, William Yicheng and Jia, Baoxiong and Wang, Zan and Ma, Xiaojian and Li, Qing and Huang, Siyuan},
  journal={arXiv preprint arXiv:2211.15402},
  year={2022}
}
