Invited Talk: Promptable Vision Foundation in the Wild: From Head to Tail
Jianwei Yang, Microsoft Research, Redmond, USA
Abstract.
Recent advancements in large language models (LLMs) have revolutionized human-AI interaction systems. Inspired by the generality and capabilities exhibited by LLMs, this presentation explores the potential of building a generalist vision foundation capable of handling a wide range of vision tasks. Vision tasks present unique challenges due to their diversity in model input, output, and task granularity. In this presentation, I will describe my efforts to overcome these challenges by building promptable vision foundations in the wild. In particular, I will discuss three main threads: (1) how to build interactive, promptable vision systems that serve pixel-level vision tasks; (2) how to make the model more controllable and semantic-aware; and (3) how to extend the model to specific domains such as biomedical images. Finally, I will envision the vast potential of leveraging these generalist vision foundations to build intelligent AI agents in both the digital and physical worlds.