Speakers

Nicu Sebe

Biography:

Nicu Sebe is a professor at the University of Trento, Italy, where he leads research in multimedia information retrieval and human-computer interaction in computer vision applications. He received his PhD from the University of Leiden, The Netherlands, and previously held positions at the University of Amsterdam, The Netherlands, and the University of Illinois at Urbana-Champaign, USA. He has been involved in organizing major conferences and workshops on the computer vision and human-centered aspects of multimedia information retrieval, including serving as General Co-Chair of the IEEE Automatic Face and Gesture Recognition Conference (FG 2008), the ACM International Conference on Multimedia Retrieval (ICMR 2017), and ACM Multimedia 2013. He was a program chair of ACM Multimedia 2007 and 2011, ECCV 2016, ICCV 2017, and ICPR 2020, and a general chair of ACM Multimedia 2022. He is a program chair of ECCV 2024. He is a fellow of ELLIS and IAPR and a Senior Member of ACM and IEEE.

Title: Cross-modal understanding and generation of multimodal content

Abstract: Video generation consists of generating a video sequence so that an object in a source image is animated according to some external information (a conditioning label, a driving video, a piece of text). In this talk I will present some of our recent achievements in generating videos without using any annotation or prior information about the specific object to animate. Once trained on a set of videos depicting objects of the same category (e.g., faces, human bodies), our method can be applied to any object of that class. Building on this, I will present our framework for training game-engine-like neural models solely from monocular annotated videos. The result, a Learnable Game Engine (LGE), maintains states of the scene and of the objects and agents in it, and enables rendering the environment from a controllable viewpoint. Like a game engine, it models the logic of the game and the underlying rules of physics, making it possible for a user to play the game by specifying both high- and low-level action sequences. Our LGE can also unlock a director's mode, where the game is played by plotting behind the scenes: specifying high-level actions and goals for the agents in the form of language and desired states. This requires learning a "game AI", encapsulated by our animation model, that can navigate the scene under high-level constraints, play against an adversary, and devise strategies to win a point.

Gül Varol

Biography: 

Gül Varol is a permanent researcher in the IMAGINE team at École des Ponts ParisTech. Previously, she was a postdoctoral researcher at the University of Oxford (VGG). She obtained her PhD from the WILLOW team of Inria Paris and École Normale Supérieure (ENS). Her thesis received PhD awards from ELLIS and AFRIF. Her research focuses on computer vision, specifically video representation learning, human motion analysis, and sign languages.


Title: Is Human Motion a Language without Words?

Abstract: This talk will summarize our recent work on bridging the gap between natural language and 3D human motion. I will first show results on text-to-motion synthesis, i.e., text-conditioned generative models for controllable motion synthesis, with a special focus on compositionality to handle fine-grained textual descriptions. Second, I will present results from our text-to-motion retrieval model. The relevant papers are ACTOR, TEMOS, TMR [Petrovich 2021, 2022, 2023] and TEACH, SINC [Athanasiou 2022, 2023].