A Vision Language Model for Radiology Report Generation from Medical Images
Debanjan Goswami, Ronast Subedi, Shayok Chakraborty
In this paper, we propose MediVLM, a vision language model (VLM) for radiology report generation from medical images. The proposed model consists of a pre-trained object detector to extract the salient anatomical regions from the images, an image encoder, a text encoder, a module to align the visual and text representations, a cross-attention layer to fuse the two representations, and a transformer-based decoder to generate the report. MediVLM can generate radiology reports even when no ground-truth reports are available for training; this is an extremely useful feature, as curating such reports is a labor-intensive task. Further, it computes a severity score (depicting the seriousness of a patient's medical condition) from the generated radiology reports, which can be used to prioritize patients who need immediate medical attention. Our extensive empirical analyses on three benchmark datasets corroborate the promise and potential of our method against competing baselines.
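To make the pipeline concrete, below is a minimal PyTorch sketch of the architecture described above. It is purely illustrative and not the authors' implementation: the class name MediVLMSketch, the region-feature dimension (2048, typical of a Faster R-CNN backbone), the hidden size, and the forward interface are all assumptions.

import torch
import torch.nn as nn

class MediVLMSketch(nn.Module):
    """Illustrative pipeline: detector region features -> image encoder,
    text encoder, shared-space alignment, cross-attention fusion, decoder.
    All dimensions and interfaces are assumptions, not the paper's code."""

    def __init__(self, vocab_size=30522, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        # Image encoder: embeds per-region features from a pre-trained detector
        self.image_encoder = nn.Sequential(nn.Linear(2048, d_model), nn.GELU())
        # Text encoder: token embeddings plus a small transformer encoder
        self.token_emb = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        # Alignment module: projects both modalities into a shared space
        self.img_proj = nn.Linear(d_model, d_model)
        self.txt_proj = nn.Linear(d_model, d_model)
        # Cross-attention: text representations attend over region features
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Transformer decoder generates the report autoregressively
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, region_feats, text_ids, report_ids):
        # region_feats: (B, R, 2048) features for R detected anatomical regions
        v = self.img_proj(self.image_encoder(region_feats))              # (B, R, d)
        t = self.txt_proj(self.text_encoder(self.token_emb(text_ids)))  # (B, T, d)
        fused, _ = self.cross_attn(query=t, key=v, value=v)              # (B, T, d)
        out = self.decoder(self.token_emb(report_ids), memory=fused)     # (B, L, d)
        return self.lm_head(out)                                         # (B, L, vocab)

# Toy forward pass with random inputs (batch of 2, 36 regions per image)
model = MediVLMSketch()
logits = model(torch.randn(2, 36, 2048),
               torch.randint(0, 30522, (2, 24)),
               torch.randint(0, 30522, (2, 64)))
print(logits.shape)  # torch.Size([2, 64, 30522])

Note that this sketch only covers the supervised generation path; how the aligned text encoder supports the paper's setting with no training reports, and how the severity score is derived from generated text, are left to the paper itself.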
Figure 1: Visual illustration of MediVLM vs. the baselines on the IU X-Ray dataset. Best viewed in color.
© MediVLM 2025 | EMNLP 2025