Goal: Distill Knowledge Learned by a Large Network Trained with Multi-Modal Data into a Compact Network Trained with a Single Data Modality
Modern data-driven algorithms have found compelling applications in healthcare and autonomous systems, with promising potential in many other sectors. These algorithms most often extract features from data collected by a single sensor. For instance, images in RGB color spaces are the de facto setting for object classification and detection in autonomous navigation, both because of imaging-hardware availability and because RGB reliably captures most colors perceptible to the human eye. However, different data modalities offer different benefits when a learning-oriented model exploits their features. IR (infrared) in medical imaging can provide useful information by using zonal temperature as part of the feature space. For navigation in self-driving vehicles, LiDAR can provide depth information with higher accuracy than RGB, since it relies on reflections from objects rather than semantic interpretation. To make full use of the available sensors on the fly, it would clearly help to combine the strengths of different modalities; doing so, however, leads to computationally expensive algorithms.
Traditional methods mostly fuse features extracted from different data modalities and accomplish tasks by further processing (detecting, classifying, etc.) the fused features. To learn meaningful features, these methods often rely on a separate network per modality, resulting in an end-to-end setup with a large memory footprint.
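As a rough illustration of that memory overhead, the sketch below counts parameters for one backbone versus two. The choice of ResNet-18 backbones is an assumption for illustration only; traditional fusion methods need not use this architecture.

```python
import torch.nn as nn
from torchvision.models import resnet18

def count_params(module: nn.Module) -> int:
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())

single = resnet18(weights=None)                  # one modality, one backbone
fused = nn.ModuleList([resnet18(weights=None),   # one full backbone per
                       resnet18(weights=None)])  # modality before fusion

print(f"single-modality:  {count_params(single) / 1e6:.1f}M parameters")
print(f"two-branch fused: {count_params(fused) / 1e6:.1f}M parameters")  # ~2x
```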
In this work, we exploit the strength of multiple (two) modalities while maintaining a compact network at the inference stage. The task is executed in two steps:
Learning the task by training a large network (teacher network) that combines features from multi-modal data
Distilling the knowledge into a compact network (student network) in an adversarial fashion so that it approaches the teacher's performance using data from a single modality during inference. This removes the need for extra sensors to capture additional modalities in real-time scenarios.
Teacher Network Training. First, a larger network, i.e., the teacher network, is trained to capture the data distribution of both modalities. In our experiments, images from the RGB and NIR (Near Infrared) domains are used. Two separate branches at the base of the network extract features from the two input modalities. The extracted features are concatenated in the middle of the network (mid-level fusion) and passed through the rest of the network (the classifier). A probability distribution for each image pair is obtained by applying softmax to the final logits. The network is trained with a cross-entropy loss between the predicted distribution (Y_pred) and the available ground-truth labels (Y). The objective L_CE is given in Fig. 1; a minimal sketch of this setup follows the figure.
Figure 1: Training of teacher network with data from both modalities
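The cross-entropy objective in Fig. 1 is the standard L_CE = -Σ_c Y_c log(Y_pred,c), summed over classes c. Below is a minimal PyTorch sketch of the teacher, assuming small convolutional branches, a 1-channel NIR input, and an illustrative class count; the exact architecture used in the project may differ.

```python
import torch
import torch.nn as nn

class TeacherNetwork(nn.Module):
    """Two-branch teacher with mid-level fusion (dimensions are assumptions)."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.rgb_branch = self._branch(in_ch=3)  # feature extractor for RGB
        self.nir_branch = self._branch(in_ch=1)  # feature extractor for NIR
        # Classifier operating on the concatenated (fused) feature maps.
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128 * 2, num_classes),
        )

    @staticmethod
    def _branch(in_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, rgb, nir):
        # Mid-level fusion: concatenate branch features along the channel axis.
        fused = torch.cat([self.rgb_branch(rgb), self.nir_branch(nir)], dim=1)
        return self.classifier(fused)  # logits; softmax is applied in the loss

teacher = TeacherNetwork()
criterion = nn.CrossEntropyLoss()  # log-softmax + NLL, i.e. L_CE(Y_pred, Y)
optimizer = torch.optim.Adam(teacher.parameters(), lr=1e-3)

rgb = torch.randn(8, 3, 64, 64)      # dummy RGB batch
nir = torch.randn(8, 1, 64, 64)      # dummy NIR batch
labels = torch.randint(0, 10, (8,))  # dummy ground-truth labels Y

loss = criterion(teacher(rgb, nir), labels)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```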
Student Network Training. Features from the trained teacher network are used to guide feature learning in the compact student, which operates on only one data modality and is intended for deployment in real-time settings. Features are extracted from the pre-trained teacher using both modalities (RGB, NIR) and from the student network (G) using the RGB image only. Treating the teacher's features (x) as real and the student's generated features (G(z)) as synthetic, the student network is trained in an adversarial fashion, as depicted in Figure 2. The student is also supervised by the ground-truth labels so that its learned features do not diverge. The adversarial objective L_GAN, the supervised loss L_CE, and the overall objective L are given in Fig. 2; a training-step sketch follows the figure.
Figure 2: Adversarial training of student network with additional supervision
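Below is a hedged sketch of one training step, under several assumptions: the teacher's fused features are pre-extracted per batch, a small MLP discriminator D distinguishes teacher features (real) from student features (synthetic), the non-saturating GAN loss is used for L_GAN, and the overall objective is L = L_GAN + λ·L_CE with a hypothetical balancing weight λ. The exact objectives are those given in Fig. 2.

```python
import torch
import torch.nn as nn

FEAT_DIM = 256  # assumed feature size, matching the teacher sketch above

class StudentNetwork(nn.Module):
    """Compact RGB-only student G returning features and class logits (sketch)."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, FEAT_DIM, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(FEAT_DIM, num_classes)

    def forward(self, rgb):
        f = self.features(rgb)
        return f, self.head(f)

# Discriminator D: teacher features x are "real", student features G(z) "fake".
D = nn.Sequential(nn.Linear(FEAT_DIM, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))
G = StudentNetwork()
bce = nn.BCEWithLogitsLoss()
ce = nn.CrossEntropyLoss()
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
lam = 1.0  # hypothetical weight balancing L_GAN and L_CE

def train_step(rgb, teacher_feat, labels):
    """One adversarial step; teacher_feat comes from the frozen teacher."""
    batch = rgb.size(0)
    real, fake = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Update D: separate teacher (real) from student (synthetic) features.
    feat, _ = G(rgb)
    d_loss = bce(D(teacher_feat), real) + bce(D(feat.detach()), fake)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Update G: fool D (L_GAN) while matching ground-truth labels (L_CE).
    feat, logits = G(rgb)
    g_loss = bce(D(feat), real) + lam * ce(logits, labels)  # L = L_GAN + λ·L_CE
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

Keeping the label supervision on the student's logits anchors the adversarially learned features to the task, matching the note above that the learned features should not diverge.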
The project is still in the development stage. We will report results as soon as they are available.