We propose an emotion-aware music-video cross-modal generative adversarial network (EMVGAN) model that builds an affective common embedding space to bridge the heterogeneity gap between data modalities. We facilitate the learning of a common emotional space for the music and video modalities by using pre-trained models, i.e., music emotion recognition (MER) and video emotion recognition (VER), to extract emotion-related features in the music and video domains, respectively.
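To make this setup concrete, the following is a minimal sketch of the dual-encoder, adversarially aligned embedding idea, assuming PyTorch; the layer sizes, feature dimensions, losses, and training schedule shown here are illustrative assumptions, not the actual EMVGAN configuration.

```python
# Sketch only: two encoders project pre-extracted MER/VER emotion features
# into a shared embedding space; a discriminator tries to tell the two
# modalities apart, and fooling it encourages modality-invariant embeddings.
import torch
import torch.nn as nn


class EmotionEncoder(nn.Module):
    """Projects pre-extracted emotion features (MER or VER) into a shared space."""

    def __init__(self, in_dim: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so music and video embeddings are directly comparable.
        return nn.functional.normalize(self.net(x), dim=-1)


class ModalityDiscriminator(nn.Module):
    """Adversary predicting whether an embedding came from music or video."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)  # raw logit: music vs. video


# Hypothetical feature dimensions for the pre-trained MER/VER extractors.
music_encoder = EmotionEncoder(in_dim=512)
video_encoder = EmotionEncoder(in_dim=1024)
discriminator = ModalityDiscriminator()

music_feat = torch.randn(8, 512)   # batch of MER features
video_feat = torch.randn(8, 1024)  # batch of VER features
z_music = music_encoder(music_feat)
z_video = video_encoder(video_feat)
print(z_music.shape, z_video.shape)  # both (8, 128) in the shared space
```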
The evaluation results revealed that the proposed EMVGAN model learns affective common representations effectively and outperforms existing models.
Furthermore, the satisfactory performance of the proposed network encouraged us to undertake the music-video bidirectional retrieval task. Subjective evaluations by 40 recruited participants indicated that the retrieved music videos exhibit consistency and emotional relevance comparable to the official music videos.
In this study, we address the emotion-oriented music-video bidirectional retrieval task in the wild.
Compared with content-based information (e.g., object motion or music tempo), which can be derived directly from visual or musical data, emotion-related features are significantly more challenging to extract.
The continuous two-dimensional (2D) valence-arousal (V-A) space is the most widely used representation for emotion mapping. Arousal refers to the intensity of an emotion, while valence indicates whether the affect is positive or negative. The V-A emotion model is used in this study.
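As a minimal illustration of how the V-A plane can be used, the sketch below places a few discrete emotion labels at hypothetical (valence, arousal) coordinates and performs a nearest-label lookup; the coordinates and labels are assumptions for illustration only, not values used in this study.

```python
# Illustrative V-A lookup: coordinates are hypothetical, in [-1, 1] for
# valence (negative..positive) and arousal (low..high intensity).
import math

VA_COORDS = {
    "happy":   (0.8, 0.6),    # positive valence, high arousal
    "angry":   (-0.7, 0.7),   # negative valence, high arousal
    "sad":     (-0.6, -0.5),  # negative valence, low arousal
    "relaxed": (0.6, -0.4),   # positive valence, low arousal
}


def nearest_emotion(valence: float, arousal: float) -> str:
    """Return the label whose (valence, arousal) point is closest."""
    return min(
        VA_COORDS,
        key=lambda label: math.dist((valence, arousal), VA_COORDS[label]),
    )


print(nearest_emotion(0.5, 0.5))    # -> "happy"
print(nearest_emotion(-0.4, -0.6))  # -> "sad"
```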
Proposed Architecture