UniVS: Unified and Universal Video Segmentation with Prompts as Queries

Minghan Li, Shuai Li, Xindong Zhang, and Lei Zhang

Hong Kong Polytechnic University, OPPO Research Institute

👏 Read our arXiv paper: https://arxiv.org/abs/2402.18115

🎉 Code on GitHub: https://github.com/MinghanLi/UniVS/

Abstract

Despite the recent advances in unified image segmentation (IS), developing a unified video segmentation (VS) model remains a challenge. This is mainly because generic category-specified VS tasks need to detect all objects and track them across consecutive frames, while prompt-guided VS tasks require re-identifying the target with visual/text prompts throughout the entire video, making it hard to handle the different tasks with the same architecture. We make an attempt to address these issues and present a novel unified VS architecture, namely UniVS, by using prompts as queries. UniVS averages the prompt features of the target from previous frames as its initial query to explicitly decode masks, and introduces a target-wise prompt cross-attention layer in the mask decoder to integrate prompt features in the memory pool. By taking the predicted masks of entities from previous frames as their visual prompts, UniVS converts different VS tasks into prompt-guided target segmentation, eliminating the heuristic inter-frame matching process. Our framework not only unifies the different VS tasks but also naturally achieves universal training and testing, ensuring robust performance across different scenarios. UniVS shows a commendable balance between performance and universality on 10 challenging VS benchmarks, covering video instance, semantic, panoptic, object, and referring segmentation tasks. Code can be found at https://github.com/MinghanLi/UniVS.
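To make the "prompts as queries" idea concrete, below is a minimal PyTorch-style sketch of one mask-decoder layer and of how the initial query is formed; the names (e.g. `PromptAsQueryDecoderLayer`, `prompt_memory`) are illustrative assumptions, not the exact identifiers used in the UniVS repository.

```python
# Minimal sketch of "prompts as queries" (illustrative names, not the official UniVS code).
import torch
import torch.nn as nn

class PromptAsQueryDecoderLayer(nn.Module):
    """One mask-decoder layer: the target query first cross-attends to its own
    prompt features stored in the memory pool (target-wise prompt cross-attention),
    then cross-attends to the image features of the current frame."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.prompt_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, query, prompt_memory, image_feats):
        # query:         (B, N_targets, C)    -- one query per target
        # prompt_memory: (B, N_targets, L, C) -- prompt features of each target from previous frames
        # image_feats:   (B, HW, C)           -- flattened image embeddings of the current frame
        B, N, L, C = prompt_memory.shape
        q = query.reshape(B * N, 1, C)
        mem = prompt_memory.reshape(B * N, L, C)
        q = q + self.prompt_cross_attn(q, mem, mem)[0]   # target-wise prompt cross-attention
        q = q.reshape(B, N, C)
        q = q + self.image_cross_attn(q, image_feats, image_feats)[0]
        return q + self.ffn(q)

def initial_queries(prompt_memory):
    """The initial query of each target is the average of its prompt features
    accumulated from previous frames."""
    return prompt_memory.mean(dim=2)  # (B, N_targets, C)
```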

Overview

Fig. 1. Training process of our unified video segmentation (UniVS) framework. UniVS contains three main modules: the Image Encoder (grey rectangle), the Prompt Encoder (purple rectangle), and the Unified Video Mask Decoder (yellow rectangle). The Image Encoder transforms the input RGB images to the feature space and outputs image embeddings. Meanwhile, the Prompt Encoder translates the raw visual/text prompts into prompt embeddings. The Unified Video Mask Decoder explicitly decodes the masks for any entity or prompt-guided target in the input video by using prompts as queries (striped triangles, hexagons and circles).
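A short sketch of how the three modules in Fig. 1 might compose during training; the class and method names are assumptions for illustration only, not the repository's actual API.

```python
# High-level composition of the three UniVS modules (illustrative only).
def training_step(model, clip, prompts):
    # Image Encoder: RGB frames -> image embeddings.
    image_embeds = [model.image_encoder(frame) for frame in clip]
    # Prompt Encoder: raw visual/text prompts -> prompt embeddings.
    prompt_embeds = model.prompt_encoder(prompts)
    # Unified Video Mask Decoder: prompts serve as queries to decode the masks
    # of any entity or prompt-guided target in the clip.
    masks = [model.mask_decoder(feat, prompt_embeds) for feat in image_embeds]
    return masks
```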

Fig. 2. Inference process of our UniVS on prompt-specified and category-specified video segmentation tasks, respectively.
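For category-specified tasks, the inference process of Fig. 2 can be summarized with the pseudocode-style sketch below (the model methods such as `detect_entities` and `decode_with_prompts` are placeholders): entities are detected in the first frame, and afterwards each entity's features from previous frames serve as its visual prompt, so tracking reduces to prompt-guided segmentation without any heuristic inter-frame matching.

```python
# Illustrative inference loop for category-specified VS tasks
# (placeholder model methods, not the exact UniVS API).
import torch

def segment_video(frames, model):
    prompt_memory = {}   # entity id -> list of prompt features from previous frames
    results = []
    for t, frame in enumerate(frames):
        feats = model.image_encoder(frame)            # image embeddings of frame t
        if t == 0:
            # First frame: detect all entities with learnable queries.
            masks, embeds = model.detect_entities(feats)
            prompt_memory = {i: [e] for i, e in enumerate(embeds)}
        else:
            # Later frames: the averaged prompt features of each entity form its
            # initial query, so tracking becomes prompt-guided segmentation.
            # (Detection of newly appearing entities is omitted in this sketch.)
            init_queries = {i: torch.stack(mem).mean(0) for i, mem in prompt_memory.items()}
            masks, embeds = model.decode_with_prompts(feats, init_queries, prompt_memory)
            for i, e in enumerate(embeds):
                prompt_memory[i].append(e)
        results.append(masks)
    return results
```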

Results of VIS / VSS / VPS (Category-specified VS tasks) 


Results of VOS: visual prompts for thing entities 


Visual prompt in the first frame


Results of PVOS: visual prompts for thing and stuff entities


Visual prompts in the first frame

Visual prompts for the newly appeared entity

Video

Ground Truth

UniVS

Results of PVOS: visual prompts for thing and stuff entities

Visual prompts in the first frame

There are no newly appeared entities in this sequence

Video

Ground Truth

UniVS

Results of RefVOS: text prompts

Prompt: "a brown kangaroo is on the green grass looking behind"

Prompt: "a person walking behind a kangaroo"

Prompt: "a frog is holded by a person in his hand and place near the another frog"

Prompt: "a human hand picking a frog"

Citation

@misc{li2024univs,
      title={UniVS: Unified and Universal Video Segmentation with Prompts as Queries},
      author={Minghan Li and Shuai Li and Xindong Zhang and Lei Zhang},
      year={2024},
      eprint={2402.18115},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}