Scene understanding

High-level understanding of scenes with people

A general understanding of human environments is fundamental for real-world applications focused on monitoring and security. The number of surveillance cameras installed in public spaces, such as train stations, airports, and mass events, has been increasing steadily in recent years, and the sheer volume of information they collect makes manual processing intractable. Distributed multi-target tracking algorithms automate this process, determining the position of every target in every frame without requiring a central server to carry out all the computations.

Distributed systems process information locally: they demand less communication, are more robust to failures, and process data faster than centralized ones. However, they are more challenging to implement, since each node holds a different estimate and must make local decisions based on partial information.

Our framework [1] overcomes these challenges by exploiting the synergies between complementary modules that perform well on simpler versions of the problem but had previously only been used independently. We present an approach for multi-target tracking in a distributed camera network that combines a Distributed Kalman Filter with a novel distributed method for high-level information management to reach consensus between nodes, and a local data association process that uses a re-identification network and geometric information to make local decisions with partial information.
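
To give a feel for the consensus machinery, the sketch below shows one averaging round on information-form track estimates exchanged between neighboring camera nodes. The function name, the gain value, and the plain averaging rule are illustrative assumptions, not the exact protocol of [1].

    import numpy as np

    def consensus_step(Y_i, y_i, neighbor_info, eps=0.2):
        """One consensus iteration on a node's information-form track estimate.

        Y_i, y_i: this node's information matrix/vector (Y = P^-1, y = P^-1 x).
        neighbor_info: list of (Y_j, y_j) pairs received from neighboring cameras.
        eps: consensus gain; must be small enough for stability (eps * degree < 1).
        """
        Y_new = Y_i + eps * sum(Y_j - Y_i for Y_j, _ in neighbor_info)
        y_new = y_i + eps * sum(y_j - y_i for _, y_j in neighbor_info)
        return Y_new, y_new

    # After a few rounds all nodes converge toward a common state estimate:
    # x_hat = np.linalg.solve(Y_i, y_i)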

In environments where the cameras cover large areas, they commonly do not have overlapping views. In these scenarios, re-identifying whether an individual is already known to the system becomes highly significant. Traditional re-identification methods reduce the problem to finding a match between a new query and the set of existing labeled images, known as the gallery. To be effective in the real world, the gallery should adapt and evolve as new knowledge arrives.
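
A minimal sketch of this query-to-gallery matching, assuming unit-norm appearance embeddings and cosine similarity; the threshold value and data layout are illustrative choices rather than the exact procedure of [2].

    import numpy as np

    def reidentify(query, gallery, threshold=0.6):
        """Match a query descriptor against the gallery by cosine similarity.

        query: (d,) appearance embedding produced by a re-id network.
        gallery: dict mapping person id -> list of stored unit-norm descriptors.
        Returns the best-matching id, or None for a previously unseen person.
        """
        q = query / np.linalg.norm(query)
        best_id, best_sim = None, threshold
        for pid, descriptors in gallery.items():
            sim = max(float(q @ g) for g in descriptors)
            if sim > best_sim:
                best_id, best_sim = pid, sim
        return best_id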


We propose a novel framework for real-world person re-identification [2] built around a self-adaptive gallery that evolves over time in an unsupervised fashion. The gallery is expanded dynamically to identify new individuals and build their appearance models with representative information, making efficient use of system resources. The samples that provide the most representative appearance descriptor of each person are selected for inclusion in the gallery.
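
Building on the matcher sketched above, here is a hedged illustration of how such a self-adaptive gallery might grow: unmatched queries open a new identity, while matched identities keep only a compact set of descriptors. The replacement heuristic is an illustrative stand-in for the representative-sample selection of [2].

    import numpy as np

    def update_gallery(query, gallery, next_id, threshold=0.6, max_samples=5):
        """Assign a query to an identity and adapt the gallery in place.

        gallery: dict mapping person id -> list of unit-norm descriptors.
        Returns the assigned id and the next free id.
        """
        q = query / np.linalg.norm(query)
        pid = reidentify(q, gallery, threshold)  # matcher sketched above
        if pid is None:
            gallery[next_id] = [q]               # new individual enters the gallery
            return next_id, next_id + 1
        samples = gallery[pid]
        if len(samples) < max_samples:
            samples.append(q)                    # room left in this appearance model
        else:
            # Replace the stored sample most similar to the query, since it
            # adds the least new appearance information (illustrative heuristic).
            redundant = int(np.argmax([float(q @ s) for s in samples]))
            samples[redundant] = q
        return pid, next_id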

One of the biggest problems in developing perception solutions is the arduous task of collecting and labeling real-world data. To overcome this bottleneck, photorealistic simulators are becoming increasingly popular: they let you design specific scenes with control over environmental conditions such as lighting, and they produce automatically labeled data. However, all that glitters is not gold! Getting started with any simulator is time-consuming: there is a steep learning curve, and part of the development is the tedious process of collecting and blending all the actors needed in the scenarios.

To help researchers get started with photorealistic environments, we present a framework using Unreal and AirSim to easily create pedestrian scenarios [3]. Our framework includes the three main components needed to simulate pedestrians. First, a trajectory plugin to define multiple paths to be followed in random or customized modes. Second, multiple pedestrian models that are ready to use by simply dragging and dropping them into the environment. Finally, a Python API to extract environment metadata such as pedestrian ground truth, sensor states, and timestamps. With these tools, you can easily create a photo-realistic scene with people as mobile targets.
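
For instance, the standard AirSim Python client can pull pedestrian ground truth from a running Unreal scene. This is a minimal sketch using the base AirSim API rather than the framework's own wrapper; the "Pedestrian.*" actor-name pattern is an assumption that depends on how the models were dropped into the environment.

    import airsim

    # Connect to the Unreal simulation through AirSim's RPC server.
    client = airsim.VehicleClient()
    client.confirmConnection()

    # List pedestrian actors; the name pattern depends on the scene setup.
    pedestrians = client.simListSceneObjects("Pedestrian.*")

    for name in pedestrians:
        pose = client.simGetObjectPose(name)  # ground-truth pose from the engine
        pos = pose.position
        print(name, pos.x_val, pos.y_val, pos.z_val)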


Static RGB multi-camera systems are widely used in real-world applications. However, their restricted coverage and fixed perspective limit how much knowledge they can acquire in certain scenarios. Only by letting some camera nodes move freely, informed by what the other sensors are currently observing, can the overall system maximize the information captured from the environment. Consequently, the collaboration of static cameras and autonomous mobile cameras (drones) is a compelling approach for monitoring large, open spaces. This "dream team" of cameras offers synergies that enhance the capabilities of surveillance systems.

Following this idea, we built a framework that combines static cameras with autonomous drones for monitoring applications [5]. Our novel hybrid multi-camera system performs distributed tracking of the people in the scene [1] to obtain a global understanding of the environment, and the mobile cameras use this information to decide where to go next to acquire more detailed knowledge about individuals [4]. To test the proposed framework safely, we leverage the photo-realistic simulator [3] to fly the drones virtually, with no risk to the physical integrity of real human beings. The details of the recorded sequences used for the evaluation, including static and mobile cameras working in collaboration, can be found here.
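
As a toy illustration of the "where to go next" decision, the sketch below scores tracked targets by their uncertainty and flight cost. This linear scoring rule is a simple stand-in for the learned control policies of [4]; every name and weight in it is an assumption.

    import numpy as np

    def select_next_target(tracks, drone_pos, w_unc=1.0, w_dist=0.1):
        """Choose which tracked person the mobile camera inspects next.

        tracks: dict mapping target id -> (position (2,), covariance (2, 2)).
        Prefers uncertain tracks that are cheap to reach from drone_pos.
        """
        def score(item):
            pos, cov = item[1]
            return w_unc * np.trace(cov) - w_dist * np.linalg.norm(pos - drone_pos)
        return max(tracks.items(), key=score)[0]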

References

[1] S. Casao, A. C. Murillo and E. Montijano, "Distributed Multi-Target Tracking in Camera Networks," International Conference on Robotics and Automation (ICRA), 2021.

[2] S. Casao, P. Azagra, A. C. Murillo and E. Montijano, "A Self-Adaptive Gallery Construction Method for Open-World Person Re-Identification," Sensors, 2023.

[3] S. Casao, A. Otero, A. Serra-Gómez, A. C. Murillo, J. Alonso-Mora and E. Montijano, "A Framework for Fast Prototyping of Photo-realistic Environments with Multiple Pedestrians," International Conference on Robotics and Automation (ICRA), 2023.

[4] Á. Serra-Gómez, E. Montijano, W. Böhmer and J. Alonso-Mora, "Active Classification of Moving Targets With Learned Control Policies," IEEE Robotics and Automation Letters (RAL), 2023.

[5] S. Casao, Á. Serra-Gómez, A. C. Murillo, W. Böhmer, J. Alonso-Mora and E. Montijano, "Distributed Multi-Target Tracking and Active Perception with Mobile Camera Networks," Computer Vision and Image Understanding (CVIU), 2024.

Source code

Framework for photo-realistic pedestrian scenarios: https://github.com/saracasao/Pedestrian_Environment