The 1st International Workshop on Human-Centered Vision and Media Technologies
Date
Friday, April 12, 2024, 12:30 - 18:30
Venue
Institute of Industrial Science, the University of Tokyo
Seminar Rooms An401/402 (Talk), An403, An404 (Poster)
Registration
Participation in the workshop is free, but please register using the form below.
If the number of participants exceeds the room capacity, we may close registration early.
Program
12:30 - 13:30 Invited Talk: Yuki M. Asano
13:30 - 15:00 Poster Session 1
15:00 - 16:00 Invited Talk: Pascal Mettes
16:00 - 17:30 Poster Session 2
17:30 - 18:30 Invited Talk: Xucong Zhang
Invited Speakers
Yuki M. Asano (University of Amsterdam)
Title: Self-Supervised Learning in the age of CLIP et al.
Abstract: I will talk about new developments in self-supervised learning that will form the core of the next generation of foundation models. First, I will talk about how pretraining on videos can enable models that outperform models such as DINO by leveraging temporality. Second, I will show how ideas from self-supervised learning can be leveraged to drastically reduce the amount of paired image-text data needed for essential vision-language models.
Bio: Yuki M. Asano is an assistant professor for computer vision and machine learning at the QUVA lab at the University of Amsterdam. Prior to this, he completed his PhD at the Visual Geometry Group (VGG) at the University of Oxford where he worked with Andrea Vedaldi and Christian Rupprecht. He has served as an AC for NeurIPS/CVPR/ECCV/ICCV and is the main organiser of the SSLWIN workshops at ECCV, BigMAC at ICCV and co-organises the SSL workshop at NeurIPS.
Pascal Mettes (University of Amsterdam)
Title: Hyperbolic Deep Learning
Abstract: From linear layers and convolutions to self-attention, deep learning is implicitly Euclidean. But should it be? In this talk, I will dive into hyperbolic geometry for deep learning. I will discuss what hyperbolic geometry is and what is different compared to Euclidean geometry. I will then outline the strong potential of hyperbolic deep learning, from learning hierarchical representations to uncertainty and robustness to out-of-distribution and adversarial samples. Lastly, I will show our ongoing efforts towards fully hyperbolic networks and how to get started in this field with our new hyperbolic learning software library.
Bio: Pascal Mettes is an assistant professor at the University of Amsterdam on the topic of knowledge-aware visual understanding. His research focuses on hyperbolic deep learning for computer vision. He received his PhD (2017) and was a postdoc (2018-2019) in computer vision at the University of Amsterdam and was previously affiliated with Columbia University (2016) and the University of Tübingen (2021). He organised the ICCV’21 workshop on Structured Representations for Video Understanding, the Netherlands Conference on Computer Vision 2022, and the ECCV’22 + CVPR’23 tutorials on Hyperbolic Representation Learning in Computer Vision.
Xucong Zhang (Delft University of Technology)
Title: Visually Humans Measurement
Abstract: The natural interaction between humans and AI agents is critical as well as challenging for human-centered intelligent systems, such as personalized robots and virtual characters in AR/VR-based telepresence systems. It significantly influences user acceptance of AI agents, which in turn determines the development of these AI technologies. For example, an intelligent social robot should recognize the intention of a person aided with audio and non-verbal cues, and react naturally like another human being to keep the engagement with the user. I aim to develop an approach for the natural human-AI interaction that accurately perceives human behavior and generates human-like responses to drive AI agents. The technical core of this project is the development of a holistic model to handle multiple behavior features including facial expression, eye gaze, body postures, hand gestures, and speech to faithfully reflect the subtle movements and audiovisual signals.
Bio: Xucong Zhang is an assistant professor at TU Delft and an ELLIS society member. He was a postdoctoral researcher from 2018 to 2021 in the Advanced Interaction Technologies Lab at ETH Zurich, led by Prof. Otmar Hilliges. He did his PhD research (summa cum laude) from 2013 to 2018 at the Max Planck Institute for Informatics under the supervision of Prof. Andreas Bulling. Before that, he obtained his Master's degree (2013) at Beihang University and his Bachelor's degree (2010) from the Honors Program at China Agricultural University. His core research interest is human-centered computing, developing techniques to sense and serve human users.
Poster Presentations
Poster Session 1
Poster ID: Presenter (Affiliation), Title
P1-1: Ziling Huang (UTokyo/NII), Referring Image Segmentation via Joint Mask Contextual Embedding Learning and Progressive Alignment Network
P1-2: Zhixiang Wang (UTokyo/NII), Neural Cameras
P1-3: Xiangyu Chen (UTokyo/NII), Diagnosis of Critical Factors in Video Relation Detection
P1-4: Zhijing Wan (WuhanU/NII), Contributing Dimension Structure of Deep Feature for Coreset Selection
P1-5: Elise Lincker (CNAM/NII), Multimodal Content Extraction and Enrichment: Textbooks as a Use Case
P1-6: Zhaohui Zhu (UTokyo/NII), Computer-Assisted Noise Pareidolia Tests through Patient Emulation
P1-7: Shengzhou Yi (UTokyo), Assessment of Oral Presentation Skills and Their Practical Implementations
P1-8: Ling Xiao (UTokyo), Advanced Fashion Intelligence
Poster Session 2
Poster ID: Presenter (Affiliation), Title
P2-1: Takehiko Ohkawa (UTokyo), AssemblyHands Benchmark and Challenge for Egocentric 3D Hand Pose Estimation
P2-2: Ryosuke Furuta (UTokyo), Seeking Flat Minima with Mean Teacher on Semi- and Weakly-Supervised Domain Generalization for Object Detection
P2-3: Yifei Huang (UTokyo), Understanding Actions in Videos with Limited Training Labels
P2-4: Mingfang Zhang (UTokyo), Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition
P2-5: Masatoshi Tateno (UTokyo), Learning Object States from Actions via Large Language Models
P2-6: Zhehao Zhu (UTokyo), Prompt-augmented Boundary Attentive Learning for Weakly-supervised Temporal Sentence Grounding
P2-7: Liangyang Ouyang (UTokyo), ActionVOS: Actions as Prompts for Video Object Segmentation
P2-8: Jiawei Qin (UTokyo), Domain-Adaptive Full-Face Gaze Estimation via Novel-View-Synthesis and Feature Disentanglement
P2-9: Yilin Wen (UTokyo), Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos
Organizers
Yoichi Sato (The University of Tokyo)
Shin'ichi Satoh (National Institute of Informatics)
Toshihiko Yamasaki (The University of Tokyo)
Yusuke Sugano (The University of Tokyo)
Ryosuke Furuta (The University of Tokyo)