As the output of the canonical imaging sensor, RGB images have served as the input to much computer vision and robotics research. These works focus on extracting as much useful information as possible from RGB images alone so as to remain robust under varied conditions. While the vast majority of camera networks still employ traditional RGB sensors, recognizing objects under illumination variation, shadows, and low external light remains a challenging open issue. This raises the question of how a system could robustly perceive the world around the clock.
We believe the answer lies in alternatives to RGB sensors, such as depth or thermal imaging devices. Recently, two main streams of work have exploited multi-modal information. The first uses multi-modal information during both training and inference. These works have shown that multiple image modalities, used simultaneously, produce better recognition models than either modality alone in scenarios that are challenging for RGB images. The second uses multi-modal information only during training, to learn better features; at inference, the trained model performs well with RGB images alone as input, an approach known as cross-modal representation learning or hallucination. We regard the second stream as another way of mining more information from RGB images, but with the help of multi-modal supervision.
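To make the distinction between the two streams concrete, the following is a minimal sketch in a PyTorch-style setup. All names (Encoder, FusionModel, hallucination_loss), the toy backbone, and the mean-squared feature-mimicking loss are illustrative assumptions, not the specific architectures or objectives of the works discussed here.

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        """Tiny convolutional backbone; stands in for any feature extractor."""
        def __init__(self, in_channels):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            )
        def forward(self, x):
            return self.net(x)

    # Stream 1: fusion -- both modalities are required at train AND test time.
    class FusionModel(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            self.rgb_enc = Encoder(3)      # RGB branch
            self.thermal_enc = Encoder(1)  # thermal branch
            self.head = nn.Linear(128, num_classes)
        def forward(self, rgb, thermal):
            f = torch.cat([self.rgb_enc(rgb).mean(dim=(2, 3)),
                           self.thermal_enc(thermal).mean(dim=(2, 3))], dim=1)
            return self.head(f)

    # Stream 2: hallucination -- thermal is only a training signal. An
    # RGB branch is trained to mimic thermal features, so inference
    # needs RGB alone.
    def hallucination_loss(rgb, thermal, hall_enc, thermal_enc):
        with torch.no_grad():
            target = thermal_enc(thermal)  # thermal features as a fixed target
        mimic = hall_enc(rgb)              # predicted from RGB only
        return nn.functional.mse_loss(mimic, target)

    # Example usage with random tensors (hypothetical shapes):
    rgb = torch.randn(2, 3, 64, 64)
    thermal = torch.randn(2, 1, 64, 64)
    logits = FusionModel()(rgb, thermal)   # stream 1: both modalities at test time
    loss = hallucination_loss(rgb, thermal, Encoder(3), Encoder(1))  # stream 2

The key design difference is visible in the signatures: the fusion model's forward pass consumes both modalities, whereas the hallucination loss only ties the two encoders together during training, leaving an RGB-only model at inference time.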
For these purposes, the thermal imaging device offers a strong advantage for capturing images in the wild, since it is less affected by lighting changes in both brightly lit and dark conditions. For example, depth estimation from a single thermal image has outperformed RGB-based approaches at night and achieved reasonable performance even in daytime. Under low-light conditions, thermal images can boost RGB-based pedestrian detection, and they are superior for detecting drivable regions at night.
From these perspectives, we believe our dataset can support many tasks in autonomous systems, both through multi-modal data fusion approaches (the first stream) and through RGB-only approaches built on multi-modal feature learning (the second stream).