Fig: Computation of four different objectives, Barlow twins (BT), graph optimal transport (GOT), masked language modeling (MLM), and image-text matching (ITM), by the proposed VoLTA framework.
VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment [Under Submission]
Shraman Pramanick, Li Jing, Sayan Nag, Jiachen Zhu, Hardik Shah, Yann LeCun, Rama Chellappa
Vision-language pre-training (VLP) has recently proven highly effective for various uni- and multi-modal downstream applications. However, most existing end-to-end VLP methods use high-resolution image-text-box data to perform well on fine-grained region-level tasks, such as object detection, segmentation, and referring expression comprehension. Unfortunately, such high-resolution images with accurate bounding box annotations are expensive to collect and use for supervision at scale. In this work, we propose VoLTA (Vision-Language Transformer with weakly-supervised local-feature Alignment), a new VLP paradigm that only utilizes image-caption data but achieves fine-grained region-level image understanding, eliminating the use of expensive box annotations. VoLTA adopts graph optimal transport-based weakly-supervised alignment on local image patches and text tokens to germinate an explicit, self-normalized, and interpretable low-level matching criterion. In addition, VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training and removes fusion-specific transformer layers, further reducing memory requirements. Extensive experiments on a wide range of vision- and vision-language downstream tasks demonstrate the effectiveness of VoLTA on fine-grained applications without compromising the coarse-grained downstream performance, often outperforming methods using significantly more caption and box annotations.
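To make the patch-token alignment idea concrete, below is a minimal sketch of entropic optimal-transport matching between local image-patch and text-token embeddings, using a cosine-distance cost and Sinkhorn iterations. The function name, feature sizes, and hyperparameters are illustrative assumptions; VoLTA's actual GOT objective combines Wasserstein and Gromov-Wasserstein terms and differs in detail.

```python
import torch
import torch.nn.functional as F

def sinkhorn_alignment(patch_emb, token_emb, eps=0.1, n_iters=50):
    """Entropic OT between image patches and text tokens (illustrative).

    patch_emb: (Np, D) image-patch features
    token_emb: (Nt, D) text-token features
    Returns the transport plan T of shape (Np, Nt) and the alignment cost.
    """
    # Cosine-distance cost matrix between every patch and every token.
    p = F.normalize(patch_emb, dim=-1)
    t = F.normalize(token_emb, dim=-1)
    cost = 1.0 - p @ t.T                       # (Np, Nt)

    # Uniform marginals over patches and tokens.
    a = torch.full((p.size(0),), 1.0 / p.size(0))
    b = torch.full((t.size(0),), 1.0 / t.size(0))

    # Sinkhorn iterations.
    K = torch.exp(-cost / eps)                 # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]            # self-normalized transport plan

    return T, (T * cost).sum()                 # plan and OT loss

# toy usage: 16 patches, 8 tokens, 256-d features
plan, loss = sinkhorn_alignment(torch.randn(16, 256), torch.randn(8, 256))
```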
Fig: The architecture of the proposed model, TransLocator.
Where in the World is this Image? Transformer-based Geo-localization in the Wild [ECCV, 2022] [Paper | Code + Data]
Shraman Pramanick, Ewa M. Nowara, Joshua Gleason, Carlos D. Castillo, Rama Chellappa
Predicting the geographic location (geo-localization) from a single ground-level RGB image taken anywhere in the world is a very challenging problem. The challenges include the huge diversity of images arising from different environmental scenarios; drastic variation in the appearance of the same location depending on the time of day, weather, and season; and, most importantly, the fact that the prediction must be made from a single image that may contain only a few geo-locating cues. For these reasons, most existing works are restricted to specific cities, imagery, or worldwide landmarks. In this work, we focus on developing an efficient solution to planet-scale single-image geo-localization. To this end, we propose TransLocator, a unified dual-branch transformer network that attends to tiny details over the entire image and produces robust feature representations under extreme appearance variations. TransLocator takes an RGB image and its semantic segmentation map as inputs, interacts between its two parallel branches after each transformer layer, and simultaneously performs geo-localization and scene recognition in a multi-task fashion. We evaluate TransLocator on four benchmark datasets - Im2GPS, Im2GPS3k, YFCC4k, and YFCC26k - and obtain 5.5%, 14.1%, 4.9%, and 9.9% continent-level accuracy improvements over the state-of-the-art. TransLocator is also validated on real-world test images and found to be more effective than previous methods.
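As a rough illustration of the dual-branch design, the sketch below runs one transformer layer per branch (RGB and segmentation) and then lets the branches exchange information; the class name, the use of cross-attention as the interaction, and the token/head sizes are assumptions for illustration rather than TransLocator's exact configuration.

```python
import torch
import torch.nn as nn

class DualBranchBlock(nn.Module):
    """One stage of a two-branch encoder: each branch (RGB, segmentation)
    runs its own transformer layer, then the branches exchange information.
    Cross-attention is used here as the interaction purely for illustration."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.rgb_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.seg_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.rgb_from_seg = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.seg_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb_tokens, seg_tokens):
        rgb = self.rgb_layer(rgb_tokens)
        seg = self.seg_layer(seg_tokens)
        # Cross-branch interaction: each branch attends to the other.
        rgb = rgb + self.rgb_from_seg(rgb, seg, seg)[0]
        seg = seg + self.seg_from_rgb(seg, rgb, rgb)[0]
        return rgb, seg

# toy usage: 2 images, 197 tokens (1 CLS + 196 patches), 768-d features;
# multi-task heads (geo-cell and scene classifiers) would sit on the fused CLS token
rgb, seg = DualBranchBlock()(torch.randn(2, 197, 768), torch.randn(2, 197, 768))
```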
Fig: The architecture of the proposed model, MuLOT.
Multimodal Learning using Optimal Transport for Sarcasm and Humor Detection [WACV, 2022] [Paper]
Shraman Pramanick*, Aniket Roy*, Vishal M. Patel
Multimodal learning is an emerging yet challenging research area. In this paper, we deal with multimodal sarcasm and humor detection from conversational videos and image-text pairs. Sarcasm is a fleeting action whose cues are spread across modalities, and its detection is challenging because large datasets are not available for this task in the literature. Therefore, we primarily focus on resource-constrained training, where the number of training samples is limited. To this end, we propose a novel multimodal learning system, MuLOT (Multimodal Learning using Optimal Transport), which utilizes self-attention to exploit intra-modal correspondence and optimal transport for cross-modal correspondence. Finally, the modalities are combined with multimodal attention fusion to capture the inter-dependencies across modalities. We test our approach for multimodal sarcasm and humor detection on three benchmark datasets - MUStARD (video, audio, text), UR-FUNNY (video, audio, text), and MST (image, text) - and obtain 2.1%, 1.54%, and 2.34% accuracy improvements over the state-of-the-art.
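The final fusion step can be pictured as a learned weighting over the per-modality representations. The sketch below is a hypothetical minimal version of such attention fusion; the module name, feature dimension, and binary classifier head are placeholders, and MuLOT's actual fusion module differs in detail.

```python
import torch
import torch.nn as nn

class MultimodalAttentionFusion(nn.Module):
    """Illustrative attention-based fusion of per-modality feature vectors
    (e.g., video, audio, text) followed by a binary classifier."""

    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)        # scalar attention score per modality
        self.classifier = nn.Linear(dim, 2)   # e.g., sarcastic vs. non-sarcastic

    def forward(self, modality_feats):
        # modality_feats: (batch, n_modalities, dim)
        weights = torch.softmax(self.score(modality_feats), dim=1)   # (B, M, 1)
        fused = (weights * modality_feats).sum(dim=1)                # (B, dim)
        return self.classifier(fused)

# toy usage: batch of 4 samples, 3 modalities, 256-d features each
logits = MultimodalAttentionFusion()(torch.randn(4, 3, 256))
```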
Fig: The architecture of our proposed model, MOMENTA.
MOMENTA: A Multimodal Framework for Detecting Harmful Memes and Their Targets [Findings of EMNLP, 2021] [Paper | Code + Data]
Shraman Pramanick*, Shivam Sharma*, Dimitar Dimitrov, Shad Aktar, Preslav Nakov, Tanmoy Chakraborty
Internet memes have become powerful means to transmit political, psychological, and socio-cultural ideas. Although memes are typically humorous, recent days have witnessed an escalation of harmful memes used for trolling, cyberbullying, and abuse. Detecting such memes is challenging as they can be highly satirical and cryptic. Moreover, while previous work has focused on specific aspects of memes such as hate speech and propaganda, there has been little work on harm in general. Here, we aim to bridge this gap. We focus on two tasks: (i) detecting harmful memes, and (ii) identifying the social entities they target. We further extend a recently released HarMeme dataset, which covered COVID-19, with additional memes and a new topic: US politics. To solve these tasks, we propose MOMENTA (MultimOdal framework for detecting harmful MemEs aNd Their tArgets), a novel multimodal deep neural network that uses global and local perspectives to detect harmful memes. MOMENTA systematically analyzes the local and the global perspective of the input meme (in both modalities) and relates it to the background context. MOMENTA is interpretable and generalizable, and our experiments show that it outperforms several strong rivaling approaches.
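A rough way to picture the global-local idea is to let a global feature of each modality attend over its local features (image regions, textual entities) and classify from the concatenation. The sketch below is purely illustrative; the module name, cross-attention choice, and three-way label set are assumptions, not MOMENTA's exact design.

```python
import torch
import torch.nn as nn

class GlobalLocalFusion(nn.Module):
    """Illustrative fusion of global and local views of a meme in both
    modalities: whole-image vs. ROI features, full caption vs. entity
    features, followed by a harmfulness classifier."""

    def __init__(self, dim=512, n_labels=3):
        super().__init__()
        self.img_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.classifier = nn.Linear(4 * dim, n_labels)  # e.g., harmless / partially / very harmful

    def forward(self, img_global, img_local, txt_global, txt_local):
        # *_global: (B, dim); *_local: (B, N, dim) region or entity features
        img_ctx, _ = self.img_attn(img_global.unsqueeze(1), img_local, img_local)
        txt_ctx, _ = self.txt_attn(txt_global.unsqueeze(1), txt_local, txt_local)
        fused = torch.cat([img_global, img_ctx.squeeze(1),
                           txt_global, txt_ctx.squeeze(1)], dim=-1)
        return self.classifier(fused)

# toy usage: batch of 2 memes, 5 image ROIs, 4 text entities
logits = GlobalLocalFusion()(torch.randn(2, 512), torch.randn(2, 5, 512),
                             torch.randn(2, 512), torch.randn(2, 4, 512))
```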
Fig: The complete architecture of FLORAL, our proposed Factorized Multimodal Transformer based decoder-only Language Model
See, Hear, Read: Leveraging Multimodality with Guided Attention for Abstractive Text Summarization [Knowledge-Based Systems, Elsevier]
Yash Atri*, Shraman Pramanick*, Vikram Goyal, Tanmoy Chakraborty
In recent years, abstractive text summarization with multimodal inputs has started drawing attention due to its ability to accumulate information from different source modalities and generate a fluent textual summary. However, existing methods use short videos as the visual modality and short summaries as the ground truth, and therefore perform poorly on lengthy videos and long ground-truth summaries. Additionally, there exists no benchmark dataset to generalize this task to videos of varying lengths.
In this paper, we introduce AVIATE, the first large-scale dataset for abstractive text summarization with videos of diverse duration, compiled from presentations in well-known academic conferences such as NDSS, ICML, NeurIPS, etc. We use the abstracts of the corresponding research papers as the reference summaries, which ensures adequate quality and uniformity of the ground truth. We then propose FLORAL, a factorized multimodal Transformer-based decoder-only language model, which inherently captures the intra-modal and inter-modal dynamics within the various input modalities for the text summarization task. FLORAL utilizes an increasing number of self-attentions to capture multimodality and performs significantly better than traditional encoder-decoder-based networks. Extensive experiments illustrate that FLORAL achieves significant improvements over the baselines in both qualitative and quantitative evaluations on the existing How2 dataset for short videos and the newly introduced AVIATE dataset for videos of diverse duration, beating the best baseline on the two datasets by 1.39 and 2.74 ROUGE-L points, respectively.
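The "increasing number of self-attentions" can be illustrated with a factorized self-attention block that assigns one attention to every non-empty subset of modalities, so intra-modal and inter-modal dynamics each get a dedicated factor (seven attentions for three modalities). The sketch below is a simplified, assumption-laden version; FLORAL's actual block (gating, decoder-only stacking, and so on) differs in detail.

```python
import itertools
import torch
import torch.nn as nn

class FactorizedMultimodalSelfAttention(nn.Module):
    """Illustrative factorized self-attention over three modalities:
    one attention per non-empty modality subset, so intra- and
    inter-modal dynamics each get a dedicated factor."""

    def __init__(self, dim=512, heads=8, modalities=("text", "video", "audio")):
        super().__init__()
        self.subsets = [s for r in range(1, len(modalities) + 1)
                        for s in itertools.combinations(modalities, r)]
        self.attns = nn.ModuleDict({
            "_".join(s): nn.MultiheadAttention(dim, heads, batch_first=True)
            for s in self.subsets
        })
        self.out = nn.Linear(dim, dim)

    def forward(self, feats):
        # feats: dict of modality name -> (batch, seq_len_m, dim)
        updated = {m: torch.zeros_like(x) for m, x in feats.items()}
        for s in self.subsets:
            x = torch.cat([feats[m] for m in s], dim=1)      # joint sequence for this factor
            y, _ = self.attns["_".join(s)](x, x, x)
            # Scatter the factor's output back to its constituent modalities.
            start = 0
            for m in s:
                L = feats[m].size(1)
                updated[m] = updated[m] + y[:, start:start + L]
                start += L
        return {m: self.out(v) for m, v in updated.items()}

# toy usage: 2 samples, different sequence lengths per modality
fa = FactorizedMultimodalSelfAttention()
out = fa({"text": torch.randn(2, 10, 512),
          "video": torch.randn(2, 6, 512),
          "audio": torch.randn(2, 8, 512)})
```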
Fig: Example of explanation by LIME on both visual and textual modalities and visualization of bias in V-BERT for both tasks.
Detecting Harmful Memes and Their Targets [Findings of ACL, 2021] [Paper | Code + Data]
Shraman Pramanick, Dimitar Dimitrov, Rituparna Mukherjee, Shivam Sharma, Shad Aktar, Preslav Nakov, Tanmoy Chakraborty
Among the various modes of communication in social media, the use of Internet memes has emerged as a powerful means to convey political, psychological, and socio-cultural opinions. Although memes are typically humorous in nature, recent days have witnessed a proliferation of harmful memes targeted to abuse various social entities. As most harmful memes are highly satirical and abstruse without appropriate contexts, off-the-shelf multimodal models may not be adequate to understand their underlying semantics. In this work, we propose two novel problem formulations: detecting harmful memes and the social entities that these harmful memes target. To this end, we present HarMeme, the first benchmark dataset, containing 3544 memes related to COVID-19. Each meme went through a rigorous two-stage annotation process. In the first stage, we labeled a meme as very harmful,
partially harmful, or harmless; in the second stage, we further annotated the type of target(s) that each harmful meme points to: an individual, an organization, a community, or society/the general public/other. The evaluation results using ten unimodal and multimodal models highlight the importance of using multimodal signals for both tasks. We further discuss the limitations of these models and argue that more research is needed to address these problems.
Fig: The architecture of the proposed model, MHA-Meme.
Exercise? I thought you said ‘Extra Fries’: Leveraging Sentence Demarcations and Multi-hop Attention for Meme Affect Analysis [ICWSM 2021] [Paper | Code]
Today’s Internet is awash in memes, as they are humorous, satirical, or ironic and make people laugh. According to a survey, 33% of social media users in the age bracket [13-35] send memes every day, whereas more than 50% send them every week. Some of these memes spread rapidly within a very short time frame, and their virality depends on the novelty of their (textual and visual) content. A few of them convey positive messages, such as funny or motivational quotes, while others are meant to mock or hurt someone’s feelings through sarcastic or offensive messages. Despite the appealing nature of memes and their rapid emergence on social media, effective analysis of memes has not been attempted to the extent it deserves. Recently, in SemEval’20, a pioneering attempt was made in this direction by organizing a shared task on ‘Memotion Analysis’ (meme emotion analysis). As expected, the competition attracted more than 500 participants, with final submissions of [23-32] systems across three sub-tasks. In this paper, we attempt to solve the same set of tasks suggested in the SemEval’20 Memotion Analysis competition. We propose a multi-hop attention-based deep neural network framework, called MHA-Meme, whose prime objective is to leverage the spatial-domain correspondence between the visual modality (an image) and various textual segments to extract fine-grained feature representations for classification. We evaluate MHA-Meme on the ‘Memotion Analysis’ dataset for all three sub-tasks - sentiment classification, affect classification, and affect class quantification. Our comparative study shows state-of-the-art performance of MHA-Meme on all three tasks compared to the top systems that participated in the competition. Unlike the baselines, which perform inconsistently across the three tasks, MHA-Meme outperforms them on all tasks on average. Moreover, we validate the generalization of MHA-Meme on another set of manually annotated test samples and observe it to be consistent. Finally, we establish the interpretability of MHA-Meme.
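To illustrate the multi-hop attention idea, the sketch below lets a textual-segment query repeatedly attend over image-region features and refine itself at each hop; the hop count, dimensions, and GRU-based update are illustrative assumptions rather than MHA-Meme's exact module.

```python
import torch
import torch.nn as nn

class MultiHopAttention(nn.Module):
    """Illustrative multi-hop attention: a text-segment query repeatedly
    attends over image-region features, refining its context vector at
    every hop."""

    def __init__(self, dim=512, hops=3):
        super().__init__()
        self.hops = hops
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, segment_feat, region_feats):
        # segment_feat: (B, dim) one textual segment; region_feats: (B, R, dim)
        query = segment_feat
        for _ in range(self.hops):
            ctx, _ = self.attn(query.unsqueeze(1), region_feats, region_feats)
            query = self.update(ctx.squeeze(1), query)   # refine the query with the new context
        return query                                      # fused segment-image representation

# toy usage: batch of 4 memes, 49 image regions, 512-d features
fused = MultiHopAttention()(torch.randn(4, 512), torch.randn(4, 49, 512))
```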
Fig: Complete overview of our proposed scheme.
Localizing and Grasping of 3-D Objects by a Vision-Actuated Robot Arm using Brain-Computer Interface [Under Review in Biomedical Signal Processing and Control, Elsevier]
Arnab Rakshit, Shraman Pramanick, Anurag Bagchi, Amit Konar
A major drawback of Brain-Computer Interface-based robotic manipulation is the complex trajectory planning of the robot arm that the user must carry out to reach and grasp an object. The grasping task, in particular, requires accurate alignment of the robot gripper, and it is challenging for the subject to precisely control the gripper's position and orientation with cognitive commands. The present paper proposes an intelligent solution to this problem by incorporating a novel CNN-based grasp detection network that enables the robot to reach and grasp the desired object autonomously. The scene is presented on an LCD monitor using the RGBD camera mounted on a robot link. The subject uses motor imagery brain signals to control the pan and tilt angles of the camera. Objects appearing on the screen are selected using the P300 brain pattern. The robot uses inverse kinematics along with the RGBD camera information to autonomously reach the selected object. The proposed CNN-based grasp detection network predicts accurate grasps for both overlapping and non-overlapping objects; it performs simultaneous object and grasp detection to associate each estimated grasp with its corresponding object. The overall BCI system significantly outperforms comparable systems involving manual trajectory planning. The overall accuracy, steady-state error, and settling time of the proposed system are 95.4%, 0.05%, and 15.92 s, respectively. The system also shows a significant reduction in the cognitive load of the operating subjects during the experiment. A comparison of the subjects' cognitive load under the proposed scheme and two other schemes shows that the proposed scheme imposes the least cognitive load.
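The simultaneous object-and-grasp prediction can be pictured as two heads sharing the same region feature, so every predicted grasp is tied to an object class. The sketch below is an illustrative placeholder; the layer sizes, class count, and the 5-D grasp parameterization (x, y, w, h, theta) are assumptions, not the paper's exact network.

```python
import torch
import torch.nn as nn

class GraspHead(nn.Module):
    """Illustrative joint object-and-grasp head on top of a CNN backbone:
    each pooled region feature yields an object class and a 5-D grasp
    rectangle (x, y, w, h, theta), so every grasp is affiliated with an object."""

    def __init__(self, feat_dim=2048, n_classes=20):
        super().__init__()
        self.cls_head = nn.Linear(feat_dim, n_classes)   # which object the region contains
        self.grasp_head = nn.Linear(feat_dim, 5)         # (x, y, w, h, theta)

    def forward(self, region_feats):
        # region_feats: (num_regions, feat_dim) pooled CNN features from the RGB-D view
        return self.cls_head(region_feats), self.grasp_head(region_feats)

# toy usage: 8 candidate regions
cls_logits, grasps = GraspHead()(torch.randn(8, 2048))
```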