Our cross-modal pipeline consists of a pre-trained feature-extraction stage and a classification stage. First, we choose VGG Net and ResNet models for video and YamNet for audio. By removing the last layer of these convolutional neural networks and applying average pooling, we obtain the embeddings. We then feed the pre-trained audio and video embeddings into a translation layer built from a 2-layer multi-layer perceptron (MLP). The translation layer is trained with a contrastive loss that minimizes the distance between co-occurring embeddings, so the two modalities are mapped into a common space. We then apply PCA to reduce each embedding to 128 components. Finally, we split the data into an 80% training set and a 20% test set. The object count in each frame is classified into four levels: Free, Few, Medium, and Busy, and we report the accuracy of our cross-modal models on this task.
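To make the translation step concrete, here is a minimal sketch of how a 2-layer MLP translation layer and contrastive loss could be wired up in PyTorch. One possible arrangement (used below) gives each modality its own MLP; the embedding dimension, hidden width, margin, and batch size are illustrative assumptions rather than our exact settings, and random tensors stand in for the pooled CNN embeddings.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TranslationMLP(nn.Module):
        """2-layer MLP that maps one modality's embedding into the shared space."""
        def __init__(self, in_dim=1024, hidden_dim=512, out_dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, out_dim),
            )

        def forward(self, x):
            return self.net(x)

    def contrastive_loss(z_audio, z_video, same_scene, margin=1.0):
        # same_scene is 1 for co-occurring audio/video pairs and 0 otherwise.
        # Co-occurring pairs are pulled together; other pairs are pushed past the margin.
        dist = F.pairwise_distance(z_audio, z_video)
        pos = same_scene * dist.pow(2)
        neg = (1.0 - same_scene) * F.relu(margin - dist).pow(2)
        return (pos + neg).mean()

    # Toy forward pass: random tensors stand in for pooled CNN embeddings.
    audio_mlp, video_mlp = TranslationMLP(), TranslationMLP()
    a, v = torch.randn(32, 1024), torch.randn(32, 1024)
    y = torch.randint(0, 2, (32,)).float()
    loss = contrastive_loss(audio_mlp(a), video_mlp(v), y)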

You can explore the difference between the natural clusters formed by translated and untranslated embeddings in our t-SNE playground.
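If you want to reproduce a similar comparison offline, a short scikit-learn sketch like the one below would do; the file names and label encoding are placeholders for whatever embeddings and congestion labels you have saved.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # Placeholder file names; labels encode the four congestion levels
    # (0 = Free, 1 = Few, 2 = Medium, 3 = Busy).
    raw = np.load("embeddings_untranslated.npy")
    translated = np.load("embeddings_translated.npy")
    labels = np.load("labels.npy")

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, X, title in [(axes[0], raw, "untranslated"), (axes[1], translated, "translated")]:
        pts = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
        ax.scatter(pts[:, 0], pts[:, 1], c=labels, s=5)
        ax.set_title(title)
    plt.show()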

Our Strengths

  • Works with audio alone, images alone, or both together.

  • Mitigates the limitations of poor visibility or noisy environments.

  • Competitive classification performance.

Performance

The performance of the single-modal and multi-modal models is compared below.

We add different amounts of target-modality samples to the existing modality and compare the performance of multi-modal models with and without translated embeddings. The accuracy results are shown below.

You can read the figures as in this example: the data point (200, Audio-Random Forest) on the left plot shows the result when the model is trained with a Random Forest classifier on all untranslated video samples plus 200 untranslated target audio samples, and is tested on untranslated audio data.
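As a rough sketch of how one such data point could be produced (the function and variable names below are placeholders, and the classifier settings are assumptions rather than our exact configuration):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    def mixed_modality_accuracy(video_X, video_y, audio_X, audio_y,
                                audio_X_test, audio_y_test, n_target=200):
        # Train on all existing-modality (video) embeddings plus n_target
        # target-modality (audio) embeddings, then test on held-out audio data.
        idx = np.random.choice(len(audio_X), size=n_target, replace=False)
        X_train = np.vstack([video_X, audio_X[idx]])
        y_train = np.concatenate([video_y, audio_y[idx]])
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X_train, y_train)
        return accuracy_score(audio_y_test, clf.predict(audio_X_test))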

We can see that models trained on translated embeddings are more robust. Because translation projects co-occurring embeddings onto each other beforehand, it improves accuracy when the target modality is scarce (e.g., adding fewer than 1000 video embeddings). In conclusion, our cross-modal model performs better when one modality is lacking.

According to the plots, the more target-modality samples we feed into a model, the higher the performance. However, combining modalities can reach good performance even without the entire dataset: our full embedding set contains 5076 records, yet many classifiers achieve accuracy close to that obtained from all embeddings using only 3000 records. So the multi-modal models can reduce the amount of data required to a certain extent.