Program

Program Schedule in EST

Live stream link: https://www.youtube.com/watch?v=GzIphByhXDc

Speakers

Bryan Plummer (Boston University)

Bryan Plummer is an Assistant Professor in the Department of Computer Science at Boston University, and a core faculty member of the Artificial Intelligence Research (AIR) Initiative in the Rafik B. Hariri Institute for Computing and Computational Science & Engineering. He obtained his PhD in the computer vision group at the University of Illinois at Urbana-Champaign, where he received a 3M Foundation Fellowship and an NSF GRFP honorable mention. Bryan's research includes significant work on vision-language understanding, with applications in phrase grounding, multilingual reasoning, detecting manipulated media, and disentangled and structured representation learning.

Lisa Anne Hendricks (DeepMind)

Lisa is a research scientist on the Language Team at DeepMind. She received her PhD from UC Berkeley in May 2019 and a BSEE (Bachelor of Science in Electrical Engineering) from Rice University in 2013. Her research focuses on the intersection of language and vision. She is particularly interested in analyzing why models work, explainability, and measuring and mitigating bias in AI models.

Alec Radford (OpenAI)

Alec Radford is a research scientist at OpenAI. His research focuses on generative models and scalable learning methods that leverage natural language supervision. His work includes DCGAN, the GPT series of language models, and most recently CLIP.

Cordelia Schmid (INRIA)

Cordelia Schmid has held a permanent research position at Inria since 1997, where she is a research director. Since 2018 she has held a joint appointment with Google Research. She has published more than 300 articles, mainly in computer vision. She was editor-in-chief of IJCV (2013–2018), a program chair of IEEE CVPR 2005 and ECCV 2012, and a general chair of IEEE CVPR 2015, ECCV 2020, and ICCV 2023. In 2006, 2014, and 2016, she was awarded the Longuet-Higgins Prize for fundamental contributions in computer vision that have withstood the test of time. She is a fellow of the IEEE. She was awarded an ERC Advanced Grant in 2013, the Humboldt Research Award in 2015, and the Inria & French Academy of Science Grand Prix in 2016. She was elected to the German National Academy of Sciences Leopoldina in 2017. In 2018 she received the Koenderink Prize for fundamental contributions in computer vision, and in 2020 the Royal Society Milner Award.

Mohit Bansal (UNC Chapel Hill)

Dr. Mohit Bansal is the John R. & Louise S. Parker Associate Professor and the Director of the MURGe-Lab (UNC-NLP Group) in the Computer Science department at the University of North Carolina (UNC) Chapel Hill. Prior to this, he was a research assistant professor (3-year endowed position) at TTI-Chicago. He received his PhD from UC Berkeley in 2013 (advised by Dan Klein) and his BTech from IIT Kanpur in 2008. His research expertise is in statistical natural language processing and machine learning, with a particular focus on multimodal, grounded, and embodied semantics (i.e., language with vision and speech, for robotics), human-like language generation and Q&A/dialogue, and interpretable and generalizable deep learning. He is a recipient of the 2020 IJCAI Early Career Spotlight, 2019 DARPA Director's Fellowship, 2019 Google Focused Research Award, 2019 Microsoft Investigator Fellowship, 2019 NSF CAREER Award, 2018 ARO Young Investigator Award (YIP), 2017 DARPA Young Faculty Award (YFA), 2017 ACL Outstanding Paper Award, 2014 ACL Best Paper Award Honorable Mention, 2018 COLING Area Chair Favorites Paper Award, and 2019 ACL Best Short Paper Nomination. His service includes Program Co-Chair for CoNLL 2019, Senior Area Chair for several ACL and EMNLP conferences, Americas Sponsorship Co-Chair for the ACL, and Associate/Action Editor for the TACL, Computational Linguistics (CL), IEEE/ACM TASLP, and CSL journals.

Accepted Papers: Archival Track

Egocentric Biochemical Video-and-Language Dataset

Taichi Nishimura (Kyoto University)*; Kojiro Sakoda (Kyoto University); Atsushi Hashimoto (OMRON SINIC X Corp.); Yoshitaka Ushiku (OMRON SINIC X Corp.); Natsuko Tanaka (Osaka Medical and Pharmaceutical University); Fumihito Ono (Osaka Medical and Pharmaceutical University); Hirotaka Kameko (Kyoto University); Shinsuke Mori (Academic Center for Computing and Media Studies, Kyoto University)

CIGLI: Conditional Image Generation from Language & Image

Xiaopeng Lu (Carnegie Mellon University)*; Lynnette Ng (Carnegie Mellon University); Jared Fernandez (Carnegie Mellon University); Hao Zhu (Carnegie Mellon University)

Semi-Autoregressive Transformer for Image Captioning

Yuanen Zhou (Hefei University of Technology)*; Yong Zhang (Tencent AI Lab); Zhenzhen Hu (Hefei University of Technology); Meng Wang (Hefei University of Technology)

Latent Variable Models for Visual Question Answering

Zixu Wang (Imperial College London)*; Yishu Miao (Imperial College London); Lucia Specia (Sheffield/Imperial College London)

What You Say Is Not What You Do: Studying Visio-Linguistic Models for TV Series Summarization

Alison Reboud (EURECOM)*; Raphael Troncy (EURECOM)

Visual Question Answering with Textual Representations for Images

Yusuke Hirota (Osaka University)*; Noa Garcia (Osaka University); Mayu Otani (CyberAgent, Inc.); Chenhui Chu (Kyoto University); Yuta Nakashima (Osaka University); Ittetsu Taniguchi (Osaka University); Takao Onoye (Osaka University)

Language-guided Multi-Modal Fusion for Video Action Recognition

Jenhao Hsiao (OPPO US Research Center)*; YiKang Li (OPPO US Research Center); Chiuman Ho (OPPO US Research Center)

Accepted Papers: Non-Archival Track

On the hidden treasure of dialog in video question answering

Deniz Engin (Inria)*; Francois Schnitzler (InterDigital); Quang-Khanh-Ngoc Duong (InterDigital); Yannis Avrithis (Inria)

Resilient Data Augmentation Approaches to Multimodal Verification in the News Domain

John Cadigan (SRI International)*; Karan Sikka (SRI International); Meng Ye (SRI International); Martin Graciarena (SRI International)

Exploring Long Tail Visual Relationship Recognition with Large Vocabulary

Sherif Abdelkarim (KAUST)*; Aniket Agarwal (IIT Roorkee); Panos Achlioptas (Stanford University); Jun Chen (King Abdullah University of Science and Technology); Jiaji Huang (Baidu Research); Boyang Li (Nanyang Technological University); Kenneth W Church (Baidu); Mohamed Elhoseiny (KAUST)

Neural Event Semantics for Grounded Language Understanding

Shyamal Buch (Stanford University)*; Li Fei-Fei (Stanford University); Noah Goodman (Stanford University)

Understanding of Emotion Perception from Art

Digbalay Bose (University of Southern California)*; Krishna Somandepalli (University of Southern California); Souvik Kundu (University of Southern California); Rimita Lahiri (University of Southern California); Jonathan Gratch (University of Southern California); Shrikanth Narayanan (University of Southern California)