A Vision Language Model for Radiology Report Generation from Medical Images
Debanjan Goswami, Ronast Subedi, Shayok Chakraborty
In this paper, we propose MediVLM, a vision language model (VLM) for radiology report generation from medical images. The proposed model consists of a pre-trained object detector to extract the salient anatomical regions from the images, an image encoder, a text encoder, a module to align the visual and text representations, a cross-attention layer to fuse the two representations, and a transformer-based decoder to generate the report. MediVLM can generate radiology reports even when no ground-truth reports are available for training; this is an extremely useful feature, as curating such reports is a labor-intensive task. Further, it computes a severity score (depicting the seriousness of a patient's medical condition) from the generated radiology reports, which can be used to prioritize patients who need immediate medical attention. Our extensive empirical analyses on three benchmark datasets corroborate the promise and potential of our method against competing baselines.
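To make the pipeline concrete, below is a minimal PyTorch sketch of the architecture described above. It is purely illustrative and not the authors' implementation: the class name MediVLMSketch, the region-feature dimension (2048, typical of a Faster R-CNN backbone), the hidden size, and the forward interface are all assumptions.

import torch
import torch.nn as nn

class MediVLMSketch(nn.Module):
    """Illustrative pipeline: detector region features -> image encoder,
    text encoder, shared-space alignment, cross-attention fusion, decoder.
    All dimensions and interfaces are assumptions, not the paper's code."""

    def __init__(self, vocab_size=30522, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        # Image encoder: embeds per-region features from a pre-trained detector
        self.image_encoder = nn.Sequential(nn.Linear(2048, d_model), nn.GELU())
        # Text encoder: token embeddings plus a small transformer encoder
        self.token_emb = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        # Alignment module: projects both modalities into a shared space
        self.img_proj = nn.Linear(d_model, d_model)
        self.txt_proj = nn.Linear(d_model, d_model)
        # Cross-attention: text representations attend over region features
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Transformer decoder generates the report autoregressively
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, region_feats, text_ids, report_ids):
        # region_feats: (B, R, 2048) features for R detected anatomical regions
        v = self.img_proj(self.image_encoder(region_feats))              # (B, R, d)
        t = self.txt_proj(self.text_encoder(self.token_emb(text_ids)))  # (B, T, d)
        fused, _ = self.cross_attn(query=t, key=v, value=v)              # (B, T, d)
        out = self.decoder(self.token_emb(report_ids), memory=fused)     # (B, L, d)
        return self.lm_head(out)                                         # (B, L, vocab)

# Toy forward pass with random inputs (batch of 2, 36 regions per image)
model = MediVLMSketch()
logits = model(torch.randn(2, 36, 2048),
               torch.randint(0, 30522, (2, 24)),
               torch.randint(0, 30522, (2, 64)))
print(logits.shape)  # torch.Size([2, 64, 30522])

Note that this sketch only covers the supervised generation path; how the aligned text encoder supports the paper's setting with no training reports, and how the severity score is derived from generated text, are left to the paper itself.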
Figure 1: Visual illustration of MediVLM vs. the baselines on the IU X-Ray dataset. Best viewed in color.
© MediVLM 2025 | EMNLP 2025