Towards Reliable Medical Report Generation
Radiology report generation (R2Gen) seeks to automatically translate medical images into clinically meaningful reports. While recent models can produce fluent text, they often struggle to faithfully reflect image evidence, leading to hallucinated or unreliable statements. Our work addresses this challenge through a systematic line of research that improves how models see, reason, and are evaluated. We strengthen vision–language alignment to better ground reports in images, incorporate structured clinical knowledge to guide medically plausible generation, and develop human-aligned evaluation methods that reflect how radiologists judge report quality. In addition, inspired by multi-specialist clinical practice, we introduce the METransformer, which enables the model to attend to complementary image regions through coordinated expert reasoning.
Strategy I: Bridging Images and Reports for Reliable Radiology Report Generation
Z. Wang, L. Zhou*, L. Wang, and X. Li, "A Self-boosting Framework for Automated Radiographic Report Generation", CVPR, 2021
In this work, we proposed a self-boosting framework that learns highly correlated image and text features so that even finer visual changes can be narrated in the generated reports. This is achieved by explicitly aligning image and text features through an auxiliary task of image-text matching (ITM). ITM and report generation (RG) are built as two branches of a deep learning model and jointly trained through our proposed self-boosted triplet loss to boost mutual performance. The ITM branch learns strongly correlated visual and text features, and the text-correlated visual features are passed to the RG branch to help it generate high-quality reports. In turn, the improved reports from the RG branch are passed back to the ITM branch as harder samples. This forces the ITM branch to keep enhancing its feature learning so that even finer mismatches between the image and the generated report can be identified. These interactions last throughout the training procedure and let the model gradually improve itself towards better report generation.
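The core of the self-boosting interplay can be illustrated with a triplet-style margin loss in which the model's own generated report serves as the hard negative. The sketch below is a minimal simplification under assumed embeddings, not the paper's exact formulation; the function names are hypothetical.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def self_boosted_triplet_loss(img, gt_report, gen_report, margin=0.2):
    """Margin loss pushing the image embedding closer to its ground-truth
    report than to the model's own generated report (the harder negative).
    As RG improves, gen_report approaches gt_report and the ITM branch
    must learn finer distinctions to keep the loss low."""
    pos = cosine(img, gt_report)
    neg = cosine(img, gen_report)
    return max(0.0, margin + neg - pos)

rng = np.random.default_rng(0)
img = rng.standard_normal(16)
gt = img + 0.05 * rng.standard_normal(16)    # well-aligned ground-truth pair
gen = img + 0.5 * rng.standard_normal(16)    # generated report, still noisier
loss = self_boosted_triplet_loss(img, gt, gen)
```

Note that when the generated report becomes indistinguishable from the ground truth, the loss saturates at the margin, which is what keeps pressure on the ITM branch throughout training.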
Z. Wang, L. Wang, X. Li, and L. Zhou*, "Diagnostic Captioning by Cooperative Task Interactions and Sample-graph Consistency", IEEE TPAMI, 2025
We further extended our self-boosting framework by introducing a retrieval-based strategy that aligns the image-sample space and the report-sample space to achieve consistent image and text feature embeddings. To this end, both an image sample-graph and a report sample-graph are built. By requiring the two graphs to share the same structure, the sample-graph of the embedded ground-truth reports can serve as the target for training the sample-graph of the embedded images. In this way, two similar but different ground-truth reports will correspond to two close but different visual embeddings. This strategy is implemented as a sample-graph consistency loss imposed at batch level, and integrated into our self-boosting framework as additional regularization.
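One simple way to realise this idea is to build each sample-graph as a batch-wise cosine-similarity matrix and penalise the discrepancy between the two matrices. The following is a minimal sketch under assumed embedding shapes, not the published implementation:

```python
import numpy as np

def sample_graph(embs):
    # Pairwise cosine-similarity "graph" over a batch of embeddings (N, D).
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return normed @ normed.T

def graph_consistency_loss(img_embs, rep_embs):
    """MSE between the image sample-graph and the report sample-graph;
    the report graph acts as the (fixed) training target."""
    g_img = sample_graph(img_embs)
    g_rep = sample_graph(rep_embs)
    return float(np.mean((g_img - g_rep) ** 2))

rng = np.random.default_rng(1)
reports = rng.standard_normal((8, 32))
images_aligned = reports + 0.01 * rng.standard_normal((8, 32))
images_random = rng.standard_normal((8, 32))
```

Embeddings whose batch geometry mirrors the report space incur a near-zero loss, while unrelated embeddings are penalised, which is exactly the regularization effect described above.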
B. P. Voutharoja, L. Wang, and L. Zhou, "Automatic Radiology Report Generation by Learning with Increasingly Hard Negatives", ECAI 2023 (Oral)
In this work, we introduce a framework that learns discriminative image–report features by contrasting them with hard negatives. To enhance feature discrimination, the difficulty of negatives is progressively increased during training. We formulate this as a min–max alternating optimization: given current hard negatives, features are updated by minimizing report-generation losses, after which harder negatives are generated by maximizing a loss reflecting image–report alignment. This iterative process yields a model capable of producing more specific and accurate reports.
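The inner "max" step of such an alternating scheme can be sketched as a few ascent steps that make a negative report embedding align better with the image, so it becomes harder to tell apart from a true match. This is an illustrative toy version under assumed unit-norm embeddings, not the paper's optimization:

```python
import numpy as np

def alignment_score(img, rep):
    # Dot-product alignment between an image and a report embedding.
    return float(img @ rep)

def harden_negative(img, neg_rep, steps=5, lr=0.1):
    """Inner 'max' step: nudge a negative report embedding toward higher
    alignment with the image, producing a harder negative for the next
    'min' (feature-learning) step."""
    neg = neg_rep.copy()
    for _ in range(steps):
        neg += lr * img             # gradient of img @ neg w.r.t. neg is img
        neg /= np.linalg.norm(neg)  # keep the embedding on the unit sphere
    return neg

rng = np.random.default_rng(2)
img = rng.standard_normal(16); img /= np.linalg.norm(img)
neg = rng.standard_normal(16); neg /= np.linalg.norm(neg)
harder = harden_negative(img, neg)
```

Alternating this ascent with ordinary descent on the generation losses is what drives the "increasingly hard" curriculum of negatives.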
Strategy II: Embedding Clinical Knowledge to Enhance Report Trustworthiness
Z. Wang, H. Han, L. Wang, X. Li, and L. Zhou*, "Automated Radiographic Report Generation Purely on Transformer: A Multi-criteria Supervised Approach", IEEE Trans on Medical Imaging, 2022
In this work, we leverage the transformer model's ability to capture long-range dependencies in both image regions and sentence words for R2Gen. Our model adopts a pure transformer architecture, utilizing the vision transformer as the image encoder and a memory-driven transformer as the report decoder for effective long-text generation. Training involves multi-criteria supervision, incorporating three loss terms for visual-textual alignment, multi-disease classification to capture disease-related features, and word-importance weighting to capture less frequent but critical key words in the report, thereby improving overall report generation.
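The word-importance criterion can be illustrated as a per-token weighted cross-entropy in which rare but clinically critical words carry larger weights. The weights and tiny vocabulary below are hypothetical, chosen only to show the mechanism:

```python
import numpy as np

def weighted_token_nll(log_probs, targets, word_weights):
    """Cross-entropy over a report with per-token weights that up-weight
    rare but clinically important words (a simplified sketch).
    log_probs: (T, V) per-step log-probabilities; targets: length-T ids."""
    nll = np.array([-log_probs[t, targets[t]] for t in range(len(targets))])
    w = np.array([word_weights[tok] for tok in targets])
    return float(np.sum(w * nll) / np.sum(w))

# Hypothetical weights: token 0 = a frequent word, token 1 = a rare finding.
vocab_weights = {0: 0.5, 1: 2.0}
log_probs = np.log(np.array([[0.9, 0.1],
                             [0.9, 0.1]]))  # model always favours token 0
loss_common = weighted_token_nll(log_probs, [0, 0], vocab_weights)
loss_rare   = weighted_token_nll(log_probs, [0, 1], vocab_weights)
```

Missing the rare clinical word is penalised far more heavily than an error on a frequent word, steering the decoder toward the infrequent keywords that carry the diagnosis.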
Z. Wang, M. Tang, L. Wang, X. Li, and L. Zhou*, "A Medical Semantic-Assisted Transformer for Radiographic Report Generation", MICCAI, 2022
In this paper, we replaced the image-text matching loss with the pretrained CLIP model and proposed a memory-augmented sparse attention block that utilizes bilinear pooling to capture higher-order interactions among the fine-grained input image features while producing sparse attention. Moreover, we introduced a novel Medical Concepts Generation Network (MCGN) to predict fine-grained semantic concepts and incorporate them into the report generation process as guidance. For medical tag prediction, we proposed to use RadGraph to extract 768 medical concepts, which are finer-grained tags than the conventional dozens of medical tags used in existing R2Gen models.
Y. Li, Z. Wang, Y. Liu, L. Liu, L. Wang, and L. Zhou, "KARGEN: Knowledge-enhanced Automated Radiology Report Generation Using Large Language Models", MICCAI, 2024 (Early Accept)
Despite the wealth of knowledge within LLMs, efficiently triggering relevant knowledge within these large models for specific tasks like R2Gen poses a critical research challenge. This paper presents KARGEN, a Knowledge-enhanced Automated radiology Report GENeration framework based on LLMs. Utilizing a frozen LLM to generate reports, the framework integrates a knowledge graph to unlock chest disease-related knowledge within the LLM and enhance the clinical utility of generated reports. The knowledge graph is utilized to distill disease-related features, which are then fused with the regional image features via a mixture-of-experts module. The fused features attend to both disease and normal information and are used to better prompt the LLM for report generation.
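The fusion step can be pictured as a gated mixture over the two feature streams. The two-expert sketch below is a deliberately reduced illustration with hypothetical shapes, not KARGEN's actual module:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_fuse(disease_feat, region_feat, gate_logits):
    """Gate between knowledge-graph-distilled disease features and image
    regional features (a simplified two-expert mixture)."""
    w = softmax(gate_logits)
    return w[0] * disease_feat + w[1] * region_feat

disease = np.array([1.0, 0.0, 0.0])
region  = np.array([0.0, 1.0, 0.0])
fused = moe_fuse(disease, region, np.array([0.0, 0.0]))  # equal gating
```

In the full framework the gate is learned per sample, so the prompt fed to the frozen LLM can lean on disease knowledge or raw image evidence as each case requires.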
ZL. Chen, Y. Li, Z. Wang, P. Gao, J. Barthelemy, L. Zhou, and L. Wang, "Enhancing Radiology Report Generation via Multi-Phased Supervision", IEEE Transactions on Medical Imaging (IEEE-TMI), 2025
In this work, we proposed a multi-phased supervision method, which explicitly teaches the model clinical concepts at multiple semantic levels. Clinical phrases of varying semantic levels are mined and organized for model learning. The model is first trained with disease labels to recognize underlying conditions, then with entity–relation triples to capture clinical findings, and finally fine-tuned with whole-report supervision for rapid adaptation. Throughout the training process, the same R2GenGPT model is maintained and consistently operates in generation mode. Our method aligns with the concept of curriculum learning, which trains a model by arranging data in order of increasing difficulty; here, we arrange the supervision signals according to their level of semantics and granularity.
Strategy III: Developing Human-aligned Evaluation and Training
Y. Liu, Z. Wang, Y. Li, X. Liang, L. Liu, L. Wang, and L. Zhou*, "MRScore: Evaluating Radiology Report Generation with LLM-based Reward System", MICCAI, 2024
This paper introduces MRScore, an automatic evaluation metric tailored for radiology report generation by leveraging LLMs. Conventional NLG metrics like BLEU are inadequate for accurately assessing generated radiology reports, as our analysis systematically demonstrates. To address this challenge, we collaborated with radiologists to develop a framework that guides LLMs for radiology report evaluation, ensuring alignment with human analysis. Our framework includes two key components: i) utilizing GPT to generate large amounts of training data, i.e., reports of different quality, and ii) pairing GPT-generated reports as accepted and rejected samples and training LLMs to produce MRScore as the model reward. Our experiments demonstrate MRScore's higher correlation with human judgments and superior performance in model selection compared to traditional metrics.
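Training a reward model on accepted/rejected report pairs is commonly done with a Bradley–Terry style objective; the sketch below shows that standard formulation as an illustration of the idea, not MRScore's exact training loss:

```python
import numpy as np

def pairwise_reward_loss(r_accepted, r_rejected):
    """Bradley-Terry style objective: minimised when the reward model
    scores the accepted report well above the rejected one."""
    return float(-np.log(1.0 / (1.0 + np.exp(-(r_accepted - r_rejected)))))

# Correct ordering (accepted scored higher) gives a small loss;
# the reversed ordering is penalised.
good = pairwise_reward_loss(2.0, 0.0)
bad  = pairwise_reward_loss(0.0, 2.0)
```

Once trained, the scalar reward the model emits for a single report is used directly as its MRScore.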
Y. Liu, Y. Li, Z. Wang, X. Liang, L. Liu, L. Wang, and L. Zhou*, "ReFINE: A Reward-Based Framework for Interpretable and Nuanced Evaluation of Radiology Report Generation", AAAI 2026
In this paper, we introduce a metric called ReFINE, which not only scores reports according to user-specified criteria but also provides detailed sub-scores, enhancing interpretability and allowing users to adjust the weighting between different aspects of reports. Our reward-control loss enables the model to simultaneously output multiple individual rewards, one per evaluation criterion, with their summation yielding the final ReFINE score. Experiments show that ReFINE achieves stronger correlation with human ratings than traditional metrics, and generalizes well across three expert-annotated datasets, including chest X-rays and multimodal reports spanning nine imaging modalities, under two distinct scoring systems.
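The interpretability benefit of per-criterion sub-scores can be shown with a toy aggregation. The criteria names and weights below are hypothetical, and the normalised weighted sum is a simplification of the summation described above:

```python
import numpy as np

def refine_score(sub_scores, criterion_weights):
    """Combine per-criterion sub-scores into one overall score; users can
    re-weight criteria to emphasise different aspects of a report
    (a simplified sketch of the multi-reward aggregation)."""
    s = np.asarray(sub_scores, dtype=float)
    w = np.asarray(criterion_weights, dtype=float)
    return float(s @ (w / w.sum()))

# Hypothetical sub-scores: factual accuracy, completeness, clarity.
subs = [0.9, 0.6, 0.8]
balanced = refine_score(subs, [1, 1, 1])
accuracy_first = refine_score(subs, [3, 1, 1])  # user prioritises accuracy
```

Because the sub-scores are exposed, a user can see that this report is strong on accuracy but weak on completeness, rather than receiving a single opaque number.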
Strategy IV: Miscellaneous
Z. Wang, L. Liu, L. Wang, and L. Zhou*, "METransformer: Radiology Report Generation by Transformer with Multiple Expert Learners", CVPR, 2023
In this work, we tackle the challenge of capturing fine-grained visual differences from a different perspective. Due to the intricate nature of medical images, focusing on the correct image regions during report generation is challenging. Inspired by multi-specialist consultations in clinics for challenging diagnostic cases, we introduce the METransformer to emulate the "multi-expert joint diagnosis" scenario. It introduces multiple learnable "expert" tokens into both the transformer encoder and decoder. In the expert transformer encoder, each expert token collaborates with others to attend to different image regions, capturing complementary information via an orthogonal loss that minimizes overlap. In the expert transformer decoder, each expert token contributes to a candidate report, and the final report is determined through an expert voting strategy. Through this multi-expert design, our model enjoys the merits of an ensemble-based approach, but in a manner that is computationally more efficient and supports more sophisticated interactions among experts.
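The orthogonal loss that keeps expert tokens complementary can be sketched as a penalty on the off-diagonal of the Gram matrix of the normalised token embeddings. This is a minimal illustration under assumed shapes, not the paper's exact term:

```python
import numpy as np

def orthogonal_loss(expert_tokens):
    """Penalise overlap between expert-token embeddings (N, D): the Gram
    matrix of the L2-normalised tokens should be close to the identity,
    so each expert attends to distinct, complementary information."""
    t = expert_tokens / np.linalg.norm(expert_tokens, axis=1, keepdims=True)
    gram = t @ t.T
    off_diag = gram - np.eye(len(t))
    return float(np.sum(off_diag ** 2))

orthogonal_experts = np.eye(4, 16)  # 4 mutually orthogonal expert tokens
collapsed_experts = np.ones((4, 16))  # 4 identical tokens: maximal overlap
```

Fully orthogonal tokens incur zero loss, while experts that collapse onto the same direction are heavily penalised, which is what pushes each expert toward a different set of image regions.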