1st Workshop

DeepView: Global Multi-Target Visual Surveillance Based on

Real-Time Large-Scale Analysis

in conjunction with IEEE International Conference on Advanced Video and Signal-based Surveillance (AVSS 2021)

Accepted papers will be published in IEEE Xplore!

November 16th Virtually

🌐 DeepView@AVSS'21 Virtual is now live HERE.

News & Updates

  • November 16, 2021: DeepView@AVSS'21 Virtual is now live HERE.

  • November 12, 2021: Prof. Ming Jin has accepted our invitation to give a keynote presentation.

  • November 10, 2021: Prof. Fabio Cuzzolin has accepted our invitation to give a keynote presentation. See this for details!

  • November 3, 2021: The workshop program schedule is now available. See this for details!

  • November 1, 2021: Final decisions on the manuscripts have been sent to the authors. Please prepare your camera-ready copies.

  • October 20, 2021: The key dates of DeepView2021 have been changed. See this for details!

  • September 10, 2021: Prof. Hanseok Ko has accepted our invitation to give a keynote presentation.

  • August 12, 2021: Prof. Mike Shou has accepted our invitation to give a keynote presentation.

  • June 06, 2021: The DeepView paper submission website is now open! Go to submission!

  • May 26, 2021: The website for DeepView2021 is now open. Please check back soon for more information.


In recent years, demand has grown rapidly for visual surveillance systems and intelligent cities capable of providing accurate traffic measurements and essential information for user-friendly monitoring and real-world applications. Such systems are practical and essential, and they are built on large-scale camera networks that combine object detection, tracking, re-identification, and human behavior analysis. However, many emerging applications still face major challenges posed by real-world scenes captured by large-scale camera networks, such as illumination changes, dynamic backgrounds, poor data quality, and the lack of high-quality models. To tackle these challenges, researchers and engineers strive to develop robust algorithms that can be applied to large-scale surveillance systems. Building on this foundation, we want to deepen our understanding of the topic through cooperation with a wide range of researchers. In this workshop, we seek original contributions reporting the most recent progress on computer vision methodologies for surveillance analysis of large-scale visual content and its wide range of applications that will help build smart systems.


(Pacific Time Zone, UTC+09:00)

4 Invited Talks + 5 Oral Presentations + DeepView'21 Challenge Results Announcement

Invited Speakers

Hanseok Ko

Professor of Electrical and Computer Engineering at Korea University

[Topic] Video Analytics of Human Behavior for Combatting Pandemic in Crowded Space

[Biography / Abstract]


Hanseok Ko is a Professor of Electrical and Computer Engineering at Korea University, Seoul, and has been serving as the Director of the Intelligent Signal Processing Lab, sponsored by industries and national grants, which conducts research on video analytics and multimodal (audio and visual) technologies for surveillance, robotics, and interface applications. Dr. Ko was a founding editor of the Journal of Communication and Networks, served as editor of Sensors, and co-chaired IEEE ICASSP 2018 in Calgary. He has been actively engaged in research efforts to develop solutions to multimodal-based technology issues, including human-machine interaction problems. Over the course of his audio-visual multimodal research, his recent interest has been in developing social behavior analytic tools. Such work has resulted in two books, Multisensor Fusion and Integration for Intelligent Systems (2008) and Multisensor Fusion and Integration in the wake of big data, deep learning, and cyber-physical systems (2018), which have received significant readership over time. Employing multimodal-based human-computer interactive engagement of people can be an effective tool for observing social behaviors. He is currently serving as General Chair of IEEE ICASSP 2024 and Interspeech 2022, both key flagship conferences in signal processing and speech technology.


Video analytics has emerged as a powerful tool to help people stay safe during the Covid-19 pandemic. The premise is that, with video surveillance, video analytics can detect virus-carrying individuals as well as analyze people's adherence to mask-wearing and social distancing rules in city streets, shopping malls, and crowded public transportation platforms, including airports and metro and railroad stations. With the timely acquisition of social behaviors such as emotion, engagement, and body motion of humans both in health and illness, we can detect and track the presence of virus-infected individuals and predict the spread of the disease in crowded areas, in order to help health agencies react immediately. In addition to video surveillance, employing multimodal-based human-computer interactive engagement of people can be an effective tool for further screening infected people from a crowd by observing their social behaviors. In the past, a large number of papers have been published on human behavior understanding in videos, which can be divided into the following components: segmentation and tracking, human social behavior analysis in passive surveillance, and social engagement using the multimodal-based interface. In this talk, a comprehensive survey of recent developments in the key algorithms for analyzing human social behaviors will be presented. The challenges of developing such algorithms will be discussed to identify possible future research directions in this emerging area.

Mike Shou

Assistant Professor at National University of Singapore

[Topic] Long-form Video Understanding

[Biography / Abstract]


Mike Shou is a tenure-track Assistant Professor at the National University of Singapore. He was a Research Scientist at Facebook AI in the Bay Area. He obtained his Ph.D. degree at Columbia University in the City of New York. He was awarded the Wei Family Private Foundation Fellowship. He received a best student paper nomination at CVPR'17. His team won first place in the International Challenge on Activity Recognition (ActivityNet) 2017. He is a Fellow of the National Research Foundation (NRF) Singapore, Class of 2021.


In this talk, I will present a novel task together with a new benchmark for detecting generic, taxonomy-free event boundaries that segment a whole video into chunks. Conventional work in temporal video segmentation and action detection focuses on localizing pre-defined action categories and thus does not scale to generic videos. Cognitive science has known since the last century that humans consistently segment videos into meaningful temporal chunks. This segmentation happens naturally, without pre-defined event categories and without being explicitly asked to do so. Here, we repeat these cognitive experiments on mainstream CV datasets; with our novel annotation guideline, which addresses the complexities of taxonomy-free event boundary annotation, we introduce the task of Generic Event Boundary Detection (GEBD) and the new benchmark Kinetics-GEBD. Our Kinetics-GEBD has the largest number of boundaries (e.g. 32x that of ActivityNet, 8x that of EPIC-Kitchens-100), which are in-the-wild, open-vocabulary, cover generic event change, and respect human perception diversity. We view GEBD as an important stepping stone towards understanding the video as a whole, and believe it has been previously neglected due to a lack of proper task definition and annotations. Through experiments and a human study we demonstrate the value of the annotations. Further, we benchmark supervised and unsupervised GEBD approaches on the TAPOS dataset and our Kinetics-GEBD. We release our annotations and baseline code at the CVPR'21 LOVEU Challenge: https://sites.google.com/view/loveucvpr21.

Fabio Cuzzolin

Professor at Oxford Brookes University

[Topic] ROAD: The ROad event Awareness Dataset for autonomous Driving

[Biography / Abstract]


Fabio Cuzzolin was born in Jesolo, Italy. He received the laurea degree magna cum laude from the University of Padova, Italy, in 1997 and a Ph.D. degree from the same institution in 2001, with a thesis entitled “Visions of a generalized probability theory”. He was a researcher with the Image and Sound Processing Group of the Politecnico di Milano in Milan, Italy, and a postdoc with the UCLA Vision Lab at the University of California at Los Angeles, California. He later joined the Perception team at INRIA Rhone-Alpes, Grenoble, as a Marie Curie fellow.

He joined the Department of Computing of Oxford Brookes University in September 2008. He took on the role of Head of the Artificial Intelligence and Vision research group in September 2012. The group took on the name of Visual Artificial Intelligence Laboratory, part of the School of Engineering, Computing and Mathematics, in 2018. He has been a Professor of Artificial Intelligence since January 2016. Since 2020 he has been on the Board of the Institute for Ethical AI.

The Visual AI Lab currently runs on a budget of £3 million, with eight live projects funded by the European Union (2), Innovate UK (2), UKIERI, the ECM School, Huawei Technologies and the Leverhulme Trust.

In 2021 the team is projected to comprise around 35 members, including five faculty, nine research fellows, two KTP associates, six Ph.D. students, six MSc and final year students and six external collaborators.

Fabio is a world leader in the field of imprecise probabilities and random set theory, to which he contributed an original geometric approach. His Lab's research spans artificial intelligence, machine learning, computer vision, surgical robotics, autonomous driving, AI for healthcare as well as uncertainty theory. The team is pioneering frontier topics such as machine theory of mind, epistemic artificial intelligence, predicting future actions and behaviour, neurosymbolic reasoning, self-supervised learning and federated learning.

Fabio is the author of 110+ publications, published or under review, including 4 books, 13 book chapters, and 27 journal papers.

He is a four-term member of the Board of Directors of the Belief Functions and Applications Society (BFAS) and was Executive Editor of the Society for Imprecise Probabilities and Their Applications (SIPTA). Fabio has served on the Technical Program Committee of 100+ international conferences, including UAI, BMVC, ECCV, ICCV (as Area Chair), IJCAI, CVPR, NeurIPS, AAAI, ICML. He has been on the boards of IEEE Fuzzy Systems, IEEE SMC, IJAR, Information Fusion, IEEE TNN and Frontiers.


Autonomous vehicles (AVs) employ a variety of sensors to identify roadside infrastructure and other road users, with much of the existing work focusing on scene understanding and robust object detection. Human drivers, however, approach the driving task in a more holistic fashion which entails, in particular, recognising and understanding the evolution of road events. Testing an AV’s capability to recognise the actions undertaken by other road agents is thus crucial to improve its situational awareness and facilitate decision making.

In this talk we introduce the ROad event Awareness Dataset (ROAD) for Autonomous Driving, to our knowledge the first of its kind. ROAD is explicitly designed to test the ability of an autonomous vehicle to detect road events, defined as triplets composed of a moving agent, the actions it performs (possibly more than one, e.g. a car concurrently turning left, blinking, and moving-away) and the associated locations. ROAD comprises 22 videos captured as part of the Oxford RobotCar Dataset, which we annotated with bounding boxes to show the location in the image plane of each road event, and is designed to provide an information-rich playground for validating a variety of tasks related to the understanding of road user behaviour, including cyclists and pedestrians.
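
A road event as defined above (an agent, one or more concurrent actions, and the associated locations, tied to a bounding box) could be held in a small record. The following is a hypothetical Python sketch only: the class name, field names, the "junction" location label, and the box coordinates are illustrative assumptions and do not reflect the actual ROAD annotation format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RoadEvent:
    """Hypothetical record for one road event (not the ROAD file format)."""
    agent: str                                # e.g. "car"
    actions: Tuple[str, ...]                  # one or more concurrent actions
    locations: Tuple[str, ...]                # where the event takes place
    box: Tuple[float, float, float, float]    # (x1, y1, x2, y2) in the image plane


# The car example from the abstract; the location label and box are made up.
event = RoadEvent(
    agent="car",
    actions=("turning left", "blinking", "moving-away"),
    locations=("junction",),
    box=(120.0, 80.0, 260.0, 190.0),
)
```

Keeping the actions as a tuple reflects the point that a single agent may perform several actions at once.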

The dataset comes with a new baseline algorithm for online road event awareness capable of working incrementally, an essential feature for autonomous driving. Our baseline, inspired by the success of 3D CNNs and single-stage object detectors, is based on inflating RetinaNet along the temporal direction and achieves a mean average precision of 16.8% and 6.1%, respectively, for frame-level event bounding box detection and video-level event tube detection at 50% overlap. Further significant results have been achieved during the recent ROAD @ ICCV 2021 challenge, where a number of participant teams competed to achieve the best performance on three tasks: agent, action and event detection. While promising, these figures do highlight the challenges faced by realistic situation awareness in autonomous driving.
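
The "50% overlap" criterion quoted above is commonly measured as intersection-over-union (IoU) between a predicted box and a ground-truth box. The following is a minimal Python sketch of that frame-level check under this assumption; it is not the ROAD evaluation code, and the (x1, y1, x2, y2) box format and the function names are chosen here for illustration.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, threshold=0.5):
    """True if the predicted box overlaps the ground truth by at least `threshold` IoU."""
    return box_iou(pred, gt) >= threshold
```

For instance, `is_match((0, 0, 4, 4), (0, 0, 4, 3))` holds because the IoU is 12 / 16 = 0.75, while two boxes that share only half of each other's area have an IoU of 1/3 and do not count as a match.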

Finally, ROAD is ready for scholars to conduct research on exciting new tasks, such as the understanding of complex (road) activities, the anticipation of future road events, and the modelling of sentient road agents in terms of mental states. Further extensions in the form of ROAD-like annotation for other datasets such as Waymo and PIE, logical scene constraints for neuro-symbolic reasoning and intent / trajectory prediction baselines are underway.


Moongu Jeon

Gwangju Institute of Science and Technology (GIST)

Yuseok Bae

Electronics and Telecommunications Research Institute (ETRI)

Kin-Choong Yow

University of Regina

Joonki Paik

Chung-Ang University

Sung-Jea Ko

Korea University

Jinyoung Moon

Electronics and Telecommunications Research Institute (ETRI)

Du Yong Kim

RMIT University

Jeonghwan Gwak

Korea National University of Transportation

Jongmin Yu

Korea Advanced Institute of Science and Technology (KAIST)

Muhammad Aasim Rafique


Younkwan Lee

Gwangju Institute of Science and Technology (GIST)