MARIO: Modality-Aware Attention and Modality-Preserving
Decoders for Multimedia Recommendation

Abstract

We address the multimedia recommendation problem, in which items' multimodal features, such as visual and textual modalities, are utilized in addition to interaction information. While a number of multimedia recommender systems have been developed for this problem, we point out that none of them captures the influence of each modality individually at the interaction level. More importantly, we experimentally observe that the learning procedures of existing methods fail to preserve the intrinsic modality-specific properties of items. To address these limitations, we propose an accurate multimedia recommendation framework, named MARIO, based on modality-aware attention and modality-preserving decoders. MARIO predicts users' preferences by considering the individual influence of each modality on each interaction while obtaining item embeddings that preserve the intrinsic modality-specific properties. Experiments on four real-life datasets demonstrate that MARIO consistently and significantly outperforms seven competitors in terms of recommendation accuracy, yielding up to 14.61% higher accuracy than the best competitor.