Accepted Papers
Track 1 (unpublished work):
Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos
Takehiko Ohkawa (The University of Tokyo); Takuma Yagi (National Institute of Advanced Industrial Science and Technology (AIST)); Taichi Nishimura (LY Corporation); Ryosuke Furuta (The University of Tokyo); Atsushi Hashimoto (OMRON SINIC X Corp.); Yoshitaka Ushiku (OMRON SINIC X Corp.); Yoichi Sato (University of Tokyo)
[paper]
Multi-Sentence Grounding for Long-term Instructional Video
Zeqian Li (Cooperative Medianet Innovation Center, Shanghai Jiao Tong University); Qirui Chen (Shanghai Jiao Tong University); Tengda Han (University of Oxford; Google DeepMind); Ya Zhang (Cooperative Medianet Innovation Center, Shanghai Jiao Tong University); Yan-Feng Wang (Cooperative Medianet Innovation Center, Shanghai Jiao Tong University); Weidi Xie (Shanghai Jiao Tong University)
[paper]
Fine-Grained Action Understanding with Tools in Instructional Videos
Saelyne Yang (KAIST); Jaesang Yu (KAIST); Jae Won Cho (Sejong University); Juho Kim (KAIST)
[paper]
Agglomerative Clustering of Atomic Actions for Unsupervised Action Segmentation
Pulkit Kumar (University of Maryland); Austin Myers (Google); Anurag Arnab (Google); David A Ross (Google); Abhinav Shrivastava (University of Maryland); Sudheendra Vijayanarasimhan (Google Research)
[paper]
What to Say and When to Say it: A Video-Language Model and Benchmark for Situated Interactions
Apratim Bhattacharyya (Qualcomm AI Research); Sunny P Panchal (Qualcomm AI Research); Guillaume J. F. Berger (Qualcomm Technologies Inc.); Antoine Mercier (Qualcomm Technologies Inc.); Cornelius Böhm (Aignostics GmbH); Florian Dietrichkeit (LifeBonus Gesundheitsmanagement GmbH); Xuanlin Li (UCSD); Reza Pourreza (Qualcomm); Pulkit Madan (Qualcomm); Sanjay Haresh (Qualcomm AI Research); Mingu Lee (Qualcomm AI Research); Mark Todorovich (Qualcomm); Ingo Bax (Qualcomm AI Research); Roland Memisevic (Qualcomm AI Research)
[paper]
Learning Object States from Actions via Large Language Models
Masatoshi Tateno (Institute of Industrial Science, The University of Tokyo); Takuma Yagi (National Institute of Advanced Industrial Science and Technology (AIST)); Ryosuke Furuta (The University of Tokyo); Yoichi Sato (University of Tokyo)
[paper]
Ordering Mistake Detection in Assembly Tasks
Guodong Ding (National University of Singapore); Fadime Sener (University of Bonn); Shugao Ma (Meta Reality Labs); Angela Yao (National University of Singapore)
[paper]
Track 2 (published work):
What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Brian Chen (Samsung Research America); Nina Shvetsova (Goethe University Frankfurt); Andrew Rouditchenko (MIT CSAIL); Daniel P Kondermann (Quality Match GmbH); Samuel Thomas (IBM Research AI); Shih-Fu Chang (Columbia University); Rogerio Feris (MIT-IBM Watson AI Lab, IBM Research); James Glass (Massachusetts Institute of Technology); Hilde Kuehne (University of Bonn)
[paper] @CVPR 2024
YTCommentQA: Video Question Answerability in Instructional Videos
Saelyne Yang (KAIST); Sunghyun Park (LG AI Research); Yunseok Jang (University of Michigan); Moontae Lee (University of Illinois at Chicago)
[paper] @AAAI 2024
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
Chaoyi Zhang (University of Sydney); Kevin Lin (Microsoft); Zhengyuan Yang (Microsoft); Jianfeng Wang (Microsoft); Linjie Li (Microsoft); Chung-Ching Lin (Microsoft); Zicheng Liu (Microsoft); Lijuan Wang (Microsoft)
[paper] @CVPR 2024
Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos
Ravindu Y Nagasinghe (Mohamed bin Zayed University of Artificial Intelligence); Honglu Zhou (Computer Science Department, Rutgers University); Malitha Gunawardhana (Auckland Bioengineering Institute, University of Auckland); Martin Renqiang Min (NEC Labs America-Princeton); Daniel Harari (Weizmann Institute of Science); Muhammad Haris Khan (Mohamed Bin Zayed University of Artificial Intelligence)
[paper] @CVPR 2024
Retrieval-Augmented Egocentric Video Captioning
Jilan Xu (Fudan University); Yifei Huang (The University of Tokyo); Junlin Hou (Fudan University); Guo Chen (Nanjing University); Yuejie Zhang (Fudan University); Rui Feng (Fudan University); Weidi Xie (Shanghai Jiao Tong University)
[paper] @CVPR 2024
EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World
Yifei Huang (The University of Tokyo); Guo Chen (Nanjing University); Jilan Xu (Fudan University); Mingfang Zhang (The University of Tokyo); Lijin Yang (The University of Tokyo); Baoqi Pei (Zhejiang University); Lu Dong (University of Science and Technology of China); Hongjie Zhang (Nanjing University); Yali Wang (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences); Limin Wang (Nanjing University); Yu Qiao (Shanghai AI Laboratory)
[paper] @CVPR 2024
AssistGUI: Task-Oriented PC Graphical User Interface Automation
Difei Gao (NUS); Lei Ji (Microsoft); Zechen Bai (NUS); Mingyu Ouyang (NUS); Peiran Li (NUS); Dongxing Mao (National University of Singapore); Qinchen Wu (NUS); Weichen Zhang (NUS); Peiyi Wang (NUS); Xiangwu Guo (NUS); Hengxu Wang (NUS); Luowei Zhou (Microsoft); Mike Zheng Shou (National University of Singapore)
[paper] @CVPR 2024
PREGO: online mistake detection in PRocedural EGOcentric videos
Alessandro Flaborea (Sapienza University of Rome); Guido Maria D'Amely di Melendugno (Sapienza University of Rome); Leonardo Plini (Sapienza University of Rome); Luca Scofano (Sapienza University of Rome); Edoardo De Matteis (Sapienza University of Rome); Antonino Furnari (University of Catania); Giovanni Maria Farinella (University of Catania); Fabio Galasso (Sapienza University of Rome)
[paper] @CVPR 2024
FACT: Frame-Action Cross-Attention Temporal Modeling for Efficient Action Segmentation
Zijia Lu (Northeastern University); Ehsan Elhamifar (Northeastern University)
[paper] @CVPR 2024
Progress-Aware Online Action Segmentation for Egocentric Procedural Task Videos
Yuhan Shen (Northeastern University); Ehsan Elhamifar (Northeastern University)
[paper] @CVPR 2024
Error Detection in Egocentric Procedural Task Videos
Shih-Po Lee (Northeastern University)*; Zijia Lu (Northeastern University); Zekun Zhang (Stony Brook University); Minh Hoai (Stony Brook University); Ehsan Elhamifar (Northeastern University)
[paper] @CVPR 2024
Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation
Ming Xu (Australian National University)*; Stephen Gould (Australian National University, Australia)
[paper] @CVPR 2024
Activity Grammars for Temporal Action Segmentation
Dayoung Gong (POSTECH)*; Joonseok Lee (POSTECH); Deunsol Jung (POSTECH); Suha Kwak (POSTECH); Minsu Cho (POSTECH)
[paper] @NeurIPS 2023
Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation
Muhammad Hamza (ETH Zurich); Xi Wang (ETH Zurich); Otmar Hilliges (ETH Zurich); Luc Van Gool (ETH Zurich); Kaichun Mo (NVIDIA); Alexey Gavryushin (ETH Zurich)*; Razvan E Pasca (ETHZ); Yen-Ling Kuo (University of Virginia)
[paper] @CVPR 2024
RMem: Restricted Memory Banks Improve Video Object Segmentation
Junbao Zhou (UIUC)*; Ziqi Pang (UIUC); Yu-Xiong Wang (University of Illinois at Urbana-Champaign)
[paper] @CVPR 2024
Learning to Predict Activity Progress by Self-Supervised Video Alignment
Gerard L Donahue (Northeastern University)*; Ehsan Elhamifar (Northeastern University)
[paper] @CVPR 2024
Detours for Navigating Instructional Videos
Kumar Ashutosh (UT Austin)*; Zihui Xue (The University of Texas at Austin); Tushar Nagarajan (FAIR, Meta); Kristen Grauman (Facebook AI Research)
[paper] @CVPR 2024
Learning Object State Changes in Videos: An Open-World Perspective
Zihui Xue (The University of Texas at Austin)*; Kumar Ashutosh (UT Austin); Kristen Grauman (Facebook AI Research)
[paper] @CVPR 2024