Mission
To build intelligent multimodal agents that can retrieve, reason, and generate knowledge across modalities to solve real-world problems.
I am always looking to work with motivated research students/collaborators on topics around this mission. Please feel free to drop me an email at manishg.iitb@gmail.com if you would like to work with me on multimodal agents and reasoning systems.
Ongoing Projects
Sports commentary generation (with IITH)
Cross-Video reasoning (with IITP)
Scientific Multimodal Reasoning (with IITP)
Multimodal security (with IITH)
Multimodal reasoning chains (with folks at Microsoft)
Radiology image understanding (with IITD)
Multimodal neuroscience with videos (with IIITH)
Tutorials on YouTube
Publications
Multimodal retrieval
Full: Yogesh Kumar, Uday Agarwal, Manish Gupta, Anand Mishra. Aligning Moments in Time using Video Queries. [Slides] International Conference on Computer Vision 2025 (ICCV). [Also selected for the Vision India session at ICVGIP 2025] Oct 19-Oct 23, 2025, Honolulu, Hawaii.
Full: Uday Agarwal, Yogesh Kumar, Abu Shahid, Prajwal Gatti, Manish Gupta, Anand Mishra. ChapVidMR: Chapter-based Video Moment Retrieval using Natural Language Queries. [Slides] [Video] [Poster] Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP). Bangalore, India. Dec 13-15, 2024.
Full: Prajwal Gatti, Kshitij Parikh, Dhriti Paul, Manish Gupta, Anand Mishra. Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions. [Slides] [Video] [Poster] Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI). Feb 20-27, 2024. Vancouver, Canada.
Full: Abhirama Subramanyam Penamakuri, Manish Gupta, Mithun Gupta, Anand Mishra. Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering. [Slides] The 32nd International Joint Conference on Artificial Intelligence (IJCAI-23). Aug 19-25, 2023. Macao, S.A.R. Also accepted at the AI India track of AIML-Systems 2023 [Slides].
Domain-specific Multimodal Reasoning
Full (Findings): Sarmistha Das, Vaibhav Vishal, Syed Ibrahim Ahmad, Sriparna Saha, Manish Gupta. FIND: Towards Building the Financial Multimodal Reasoning Aware Q&A Framework for INDIC Languages [Poster] [Slides] [Video]. The 64th Annual Meeting of the Association for Computational Linguistics (ACL). July 2–7, 2026. San Diego, California, USA.
Short (Findings): Himanshu Wadhwa, T Karthikeyan, Mausam, Manish Gupta. Towards Multimodal Question Answering in Educational Domain. [Slides] [Poster] [Video] International Joint Conference on Natural Language Processing & Asia-Pacific Chapter of the Association for Computational Linguistics 2025 (AACL). Dec 20-24, 2025. Mumbai, India.
Full: Rahul Mehta, Bhavyajeet Singh, Vasudeva Varma, Manish Gupta. CircuitVQA: A Visual Question Answering Dataset for Electrical Circuit Images. [Slides] [Video] [Poster] The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD). Sep 9-13, 2024. Vilnius, Lithuania.
Full: Harsh Agrawal, Aditya M. Mishra, Manish Gupta and Mausam. Multimodal Persona Based Generation of Comic Dialogs. [Slides] [Poster] [Video] [Video by Mausam] The 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023). July 9-14, 2023. Toronto, Canada.
Full: Pranay Gupta, Manish Gupta. NewsKVQA: Knowledge-Aware News Video Question Answering. [Slides] The 26th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Chengdu, China. May 16-19, 2022.
Infographics
Full: Shivank Garg, Sankalp Mittal, Manish Gupta. Text2Arch: A Dataset for Generating Scientific Architecture Diagrams from Natural Language Descriptions [Slides] [Poster] [Video]. The Fourteenth International Conference on Learning Representations (ICLR). Apr 23-27, 2026. Rio de Janeiro, Brazil.
Full: Shreya Shukla, Nakul Sharma, Manish Gupta, Anand Mishra. PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures. [Poster] The 39th Annual AAAI Conference on Artificial Intelligence (AAAI). Feb 25 - Mar 4, 2025. Philadelphia, Pennsylvania, USA.
Multimodal Generation
Full (Industry): Sandeep Mishra, Devichand Budagam, Anubhab Mandal, Bishal Santra, Pawan Goyal, Manish Gupta. Router-Suggest: A Router-based Framework for Auto-Completions in Visually-Grounded Conversations. [Slides] [Video] 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Mar 24-29, 2026. Rabat, Morocco.
Full (Findings): Arpan Phukan, Manish Gupta, Asif Ekbal. Learning to Ask: Multi-Decoder Fine-Tuning for Multi-Hop Visual Question Generation with External Knowledge. [Slides] [Poster] [Video] 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Mar 24-29, 2026. Rabat, Morocco.
Full: Sarmistha Das, R E Zera Marveen Lyngkhoi, Kirtan Jain, Vinayak Goyal, Sriparna Saha and Manish Gupta. When Words Can't Capture It All: Towards Video-Based User Complaint Text Generation with Multimodal Video Complaint Dataset. [Slides] Conference on Information and Knowledge Management (CIKM) Applied Research Papers. Nov 10-14, 2025. Seoul, South Korea.
Full: Arpan Phukan, Manish Gupta, Asif Ekbal. ECIS-VQG: Generation of Entity-centric Information-seeking Questions from Videos. [Slides] [Poster] [Video] The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP). Nov 12-16, 2024. Miami, Florida, USA.
Full: Prajwal Gatti, Anand Mishra, Manish Gupta and Mithun Das Gupta. VisToT: Vision-Augmented Table-to-Text Generation. [Poster] [Slides] [Short Video] The 2022 Conference on Empirical Methods in Natural Language Processing. Dec 7–11, 2022. Abu Dhabi. Also accepted at the AI India track of AIML-Systems 2023. [Slides]
Multimodal Neuroscience
Tutorial: Subba Reddy Oota, Tanmoy Chakraborty, Manish Gupta, Raju S. Bapi. Brain-Inspired AI 2.0: Aligning Language Models Across Languages and Modalities. [Website] [Slides for Part 1] The 40th Annual AAAI Conference on Artificial Intelligence (AAAI). Jan 20-27, 2026. Singapore.
Full (Main): Padakanti Srijith, Khushbu Pahwa, Radhika Mamidi, Bapi Raju Surampudi, Manish Gupta, Subba Reddy Oota. Aligning Text/Speech Representations from Multimodal Models with MEG Brain Activity During Listening. [Senior Area Chair Highlight] [Slides] [Poster] The 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP). Nov 5-9, 2025. Suzhou, China.
Full: Subba Reddy Oota, Akshett Rai Jindal, Ishani Mondal, Khushbu Pahwa, Srinath Namburi, Manish Shrivastava, Maneesh Singh, Bapi S. Raju, Manish Gupta. Correlating instruction-tuning (in multimodal models) with vision-language processing (in the brain). [Slides] [Poster] The Thirteenth International Conference on Learning Representations (ICLR). Apr 24–28, 2025. Singapore.
Full: Subba Reddy Oota, Khushbu Pahwa, Mounika Marreddy, Maneesh Kumar Singh, Manish Gupta, Bapi Raju Surampudi. Multi-modal brain encoding models for multi-modal stimuli. [Slides] [Poster] The Thirteenth International Conference on Learning Representations (ICLR). Apr 24–28, 2025. Singapore.
Full: Subba Reddy Oota, Jashn Arora, Vijay Rowtula, Manish Gupta, Raju S. Bapi. Visio-Linguistic Brain Encoding. [Slides] [Video] The 29th International Conference on Computational Linguistics. Oct 12-17, 2022. Gyeongju, Republic of Korea.
Full: Subba Reddy Oota, Jashn Arora, Manish Gupta, Raju S. Bapi. Multi-view and Cross-view Brain Decoding. [Slides] [Video] The 29th International Conference on Computational Linguistics. Oct 12-17, 2022. Gyeongju, Republic of Korea.
Medical Imaging
Full: Subba Reddy Oota, Vijay Rowtula, Shahid Mohammed, Minghsun Liu, Manish Gupta. WSNet: Towards An Effective Method for Wound Image Segmentation. [Slides] IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). Jan 3-7, 2023. Waikoloa, Hawaii.
Full: Subba Reddy Oota, Vijay Rowtula, Shahid Mohammed, Jeffrey Galitz, Minghsun Liu, Manish Gupta. HealTech - A System for Predicting Patient Hospitalization Risk and Wound Progression in Old Patients. Winter Conference on Applications of Computer Vision. [Slides] [Video] Jan 5-9, 2021. Online.
Full: Sahil Chelaramani, Manish Gupta, Vipul Agarwal, Prashant Gupta, Ranya Habash. Multi-Task Knowledge Distillation for Eye Disease Prediction. Winter Conference on Applications of Computer Vision. [Slides] [Video] Jan 5-9, 2021. Online.
Full: Manish Gupta, Chetna Das, Arnab Roy, Prashant Gupta, G. Radhakrishna Pillai, Kamlakar Patole. Region of Interest Identification for Cervical Cancer Images [Slides] [Video]. IEEE International Symposium on Biomedical Imaging (ISBI 2020). Apr 3-7, 2020. Iowa City, Iowa, USA.
Workshop: Subba Reddy Oota, Vijay Rowtula, Shahid Mohammed, Jeffrey Galitz, Minghsun Liu, Manish Gupta. A Deep Multi-Modal Method for Patient Wound Healing Assessment [Slides]. Medical Imaging at NeurIPS 2019. Dec 14, 2019. Vancouver, Canada.
Full: Sahil Chelaramani, Manish Gupta, Vipul Agarwal, Prashant Gupta and Ranya Habash. Multi-Task Learning for Eye Disease Prediction. [Slides] [Poster] The 5th Asian Conference on Pattern Recognition (ACPR 2019). Nov 26-29, 2019. Auckland, New Zealand.
Multimodal safety
Short: Shrey Gupta, Pratyush Priyadarshi and Manish Gupta. Hateful Comment Detection and Hate Target Type Prediction for Video Comments. [Slides] [Video] [Poster] 32nd ACM International Conference on Information and Knowledge Management (CIKM). Oct 21-25, 2023. Birmingham, UK.
Full (Dataset Track): Mithun Das, Rohit Raj, Punyajoy Saha, Binny Mathew, Manish Gupta, Animesh Mukherjee. HateMM: A Multi-modal Dataset for Hate Video Classification. [Poster] The 17th International AAAI Conference on Web and Social Media (ICWSM). Jun 5-8, 2023. Limassol, Cyprus.
Full: Tanmay Sachan, Nikhil Pinnaparaju, Manish Gupta, Vasudeva Varma. SCATE: Shared Cross Attention Transformer Encoders for Multimodal Fake News Detection. [Slides] ASONAM 2021 Industry Track. Nov 8-11, 2021. Online.
Full: Mohit Chandra, Dheeraj Pailla, Himanshu Bhatia, Aadilmehdi Sanchawala, Manish Gupta, Manish Shrivastava and Ponnurangam Kumaraguru. "Subverting the Jewtocracy": Online Antisemitism Detection Using Multimodal Deep Learning. [Slides] Web Science 2021.
Short: Dhruv Khattar, Jaipal Singh Goud, Manish Gupta, Vasudeva Varma. MVAE: Multimodal Variational Autoencoder for Fake News Detection [Poster]. The Web Conference 2019. May 13-17, 2019. San Francisco.
Principal Investigator: Manish Gupta
Collaborators
IIT Jodhpur: Anand Mishra, Yogesh Kumar, Uday Agarwal, Prajwal Gatti, Abhirama Subramanyam Penamakuri, Shreya Shukla, Nakul Sharma, Kshitij Parikh, Dhriti Paul
IIIT Hyderabad: Raju S. Bapi, Subba Reddy Oota, Vasudeva Varma, Rahul Mehta, Bhavyajeet Singh, Pranay Gupta, Manish Shrivastava, Ponnurangam Kumaraguru, Jashn Arora, Vijay Rowtula, Padakanti Srijith, Radhika Mamidi, Mohit Chandra, Tanmay Sachan, Nikhil Pinnaparaju, Shrey Gupta, Pratyush Priyadarshi, Dhruv Khattar
IIT Delhi: Mausam, T Karthikeyan, Tanmoy Chakraborty, Himanshu Wadhwa, Harsh Agrawal, Aditya M. Mishra
Microsoft: Sahil Chelaramani, Prashant Gupta, Chetna Das, Mithun Das Gupta
IIT Patna: Asif Ekbal, Sriparna Saha, Arpan Phukan, Sarmistha Das, Vaibhav Vishal, Syed Ibrahim Ahmad
IIT Kharagpur: Pawan Goyal, Animesh Mukherjee, Sandeep Mishra, Devichand Budagam, Anubhab Mandal, Bishal Santra, Punyajoy Saha, Binny Mathew
Others: Maneesh Singh, Shivank Garg, Khushbu Pahwa, Shahid Mohammed, Minghsun Liu, Ranya Habash, Sankalp Mittal, Arnab Roy, G. Radhakrishna Pillai, Kamlakar Patole