Towards Reliable Medical Report Generation
Radiology report generation (R2Gen) seeks to automatically translate medical images into clinically meaningful reports. While recent models can produce fluent text, they often struggle to faithfully reflect image evidence, leading to hallucinated or unreliable statements. Our work addresses this challenge through a systematic line of research that improves how models see, reason, and are evaluated. We strengthen vision–language alignment to better ground reports in images, incorporate structured clinical knowledge to guide medically plausible generation, and develop human-aligned evaluation methods that reflect how radiologists judge report quality. In addition, inspired by multi-specialist clinical practice, we introduce the METransformer, which enables the model to attend to complementary image regions through coordinated expert reasoning.
Strategy I: Bridging Images and Reports for Reliable Radiology Report Generation
Z. Wang, L. Zhou*, L. Wang, and X. Li, "A Self-boosting Framework for Automated Radiographic Report Generation", CVPR, 2021
In this work, we proposed a self-boosting framework that learns highly correlated image and text features so that even finer visual changes can be narrated in the generated reports. This is achieved by explicitly aligning image and text features through an auxiliary task of image-text matching (ITM). ITM and report generation (RG) are built as two branches of a deep learning model and jointly trained through our proposed self-boosted triplet loss to boost mutual performance. The ITM branch learns strongly correlated visual and text features, and the text-correlated visual features are passed to the RG branch to help it generate high-quality reports. In turn, the improved reports from the RG branch are passed back to the ITM branch as harder samples. This forces the ITM branch to keep enhancing its feature learning so that even finer mismatches between the image and the generated report can be identified. These interactions last throughout the training procedure and let the model gradually improve itself towards better report generation.
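The core of the self-boosting interplay can be illustrated with a triplet-style margin loss in which the model's own generated report serves as the hard negative. The sketch below is a minimal simplification under assumed embeddings, not the paper's exact formulation; the function names are hypothetical.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def self_boosted_triplet_loss(img, gt_report, gen_report, margin=0.2):
    """Margin loss pushing the image embedding closer to its ground-truth
    report than to the model's own generated report (the harder negative).
    As RG improves, gen_report approaches gt_report and the ITM branch
    must learn finer distinctions to keep the loss low."""
    pos = cosine(img, gt_report)
    neg = cosine(img, gen_report)
    return max(0.0, margin + neg - pos)

rng = np.random.default_rng(0)
img = rng.standard_normal(16)
gt = img + 0.05 * rng.standard_normal(16)    # well-aligned ground-truth pair
gen = img + 0.5 * rng.standard_normal(16)    # generated report, still noisier
loss = self_boosted_triplet_loss(img, gt, gen)
```

Note that when the generated report becomes indistinguishable from the ground truth, the loss saturates at the margin, which is what keeps pressure on the ITM branch throughout training.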
Z. Wang, L. Wang, X. Li, and L. Zhou*, "Diagnostic Captioning by Cooperative Task Interactions and Sample-graph Consistency", IEEE TPAMI, 2025
We further extended our self-boosting framework by introducing a retrieval-based strategy that aligns the image-sample space and the report-sample space to achieve consistent image and text feature embeddings. To this end, both an image sample-graph and a report sample-graph are built. By requiring the two graphs to share the same structure, the sample-graph of the embedded ground-truth reports can serve as the target for training the sample-graph of the embedded images. In this way, two similar but different ground-truth reports will correspond to two close but different visual embeddings. This strategy is implemented as a sample-graph consistency loss imposed at batch level, and integrated into our self-boosting framework as additional regularization.
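One simple way to realise this idea is to build each sample-graph as a batch-wise cosine-similarity matrix and penalise the discrepancy between the two matrices. The following is a minimal sketch under assumed embedding shapes, not the published implementation:

```python
import numpy as np

def sample_graph(embs):
    # Pairwise cosine-similarity "graph" over a batch of embeddings (N, D).
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return normed @ normed.T

def graph_consistency_loss(img_embs, rep_embs):
    """MSE between the image sample-graph and the report sample-graph;
    the report graph acts as the (fixed) training target."""
    g_img = sample_graph(img_embs)
    g_rep = sample_graph(rep_embs)
    return float(np.mean((g_img - g_rep) ** 2))

rng = np.random.default_rng(1)
reports = rng.standard_normal((8, 32))
images_aligned = reports + 0.01 * rng.standard_normal((8, 32))
images_random = rng.standard_normal((8, 32))
```

Embeddings whose batch geometry mirrors the report space incur a near-zero loss, while unrelated embeddings are penalised, which is exactly the regularization effect described above.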
B. P. Voutharoja, L. Wang, and L. Zhou, "Automatic Radiology Report Generation by Learning with Increasingly Hard Negatives", ECAI 2023 (Oral)
In this work, we introduce a framework that learns discriminative image–report features by contrasting them with hard negatives. To enhance feature discrimination, the difficulty of negatives is progressively increased during training. We formulate this as a min–max alternating optimization: given current hard negatives, features are updated by minimizing report-generation losses, after which harder negatives are generated by maximizing a loss reflecting image–report alignment. This iterative process yields a model capable of producing more specific and accurate reports.
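The inner "max" step of such an alternating scheme can be sketched as a few ascent steps that make a negative report embedding align better with the image, so it becomes harder to tell apart from a true match. This is an illustrative toy version under assumed unit-norm embeddings, not the paper's optimization:

```python
import numpy as np

def alignment_score(img, rep):
    # Dot-product alignment between an image and a report embedding.
    return float(img @ rep)

def harden_negative(img, neg_rep, steps=5, lr=0.1):
    """Inner 'max' step: nudge a negative report embedding toward higher
    alignment with the image, producing a harder negative for the next
    'min' (feature-learning) step."""
    neg = neg_rep.copy()
    for _ in range(steps):
        neg += lr * img             # gradient of img @ neg w.r.t. neg is img
        neg /= np.linalg.norm(neg)  # keep the embedding on the unit sphere
    return neg

rng = np.random.default_rng(2)
img = rng.standard_normal(16); img /= np.linalg.norm(img)
neg = rng.standard_normal(16); neg /= np.linalg.norm(neg)
harder = harden_negative(img, neg)
```

Alternating this ascent with ordinary descent on the generation losses is what drives the "increasingly hard" curriculum of negatives.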
Strategy II: Embedding Clinical Knowledge to Enhance Report Trustworthiness
Z. Wang, H. Han, L. Wang, X. Li, and L. Zhou*, "Automated Radiographic Report Generation Purely on Transformer: A Multi-criteria Supervised Approach", IEEE Trans on Medical Imaging, 2022
In this work, we leverage the transformer model's ability to capture long-range dependencies in both image regions and sentence words for R2Gen. Our model adopts a pure transformer architecture, utilizing the vision transformer as the image encoder and a memory-driven transformer as the report decoder for effective long-text generation. Training involves multi-criteria supervision, incorporating three loss terms for visual-textual alignment, multi-disease classification to capture disease-related features, and word-importance weighting to capture less frequent but critical key words in the report, thereby improving overall report generation.
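The word-importance criterion can be illustrated as a per-token weighted cross-entropy in which rare but clinically critical words carry larger weights. The weights and tiny vocabulary below are hypothetical, chosen only to show the mechanism:

```python
import numpy as np

def weighted_token_nll(log_probs, targets, word_weights):
    """Cross-entropy over a report with per-token weights that up-weight
    rare but clinically important words (a simplified sketch).
    log_probs: (T, V) per-step log-probabilities; targets: length-T ids."""
    nll = np.array([-log_probs[t, targets[t]] for t in range(len(targets))])
    w = np.array([word_weights[tok] for tok in targets])
    return float(np.sum(w * nll) / np.sum(w))

# Hypothetical weights: token 0 = a frequent word, token 1 = a rare finding.
vocab_weights = {0: 0.5, 1: 2.0}
log_probs = np.log(np.array([[0.9, 0.1],
                             [0.9, 0.1]]))  # model always favours token 0
loss_common = weighted_token_nll(log_probs, [0, 0], vocab_weights)
loss_rare   = weighted_token_nll(log_probs, [0, 1], vocab_weights)
```

Missing the rare clinical word is penalised far more heavily than an error on a frequent word, steering the decoder toward the infrequent keywords that carry the diagnosis.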
Z. Wang, M. Tang, L. Wang, X. Li, and L. Zhou*, "A Medical Semantic-Assisted Transformer for Radiographic Report Generation", MICCAI, 2022
In this paper, we replaced the image-text matching loss with the pretrained CLIP model and proposed a memory-augmented sparse attention block that utilizes bilinear pooling to capture higher-order interactions among the fine-grained input image features while producing sparse attention. Moreover, we introduced a novel Medical Concepts Generation Network (MCGN) to predict fine-grained semantic concepts and incorporate them into the report generation process as guidance. For medical tag prediction, we proposed to use RadGraph to extract 768 medical concepts, which are finer-grained tags than the conventional dozens of medical tags used in existing R2Gen models.
Y. Li, Z. Wang, Y. Liu, L. Liu, L. Wang, and L. Zhou, "KARGEN: Knowledge-enhanced Automated Radiology Report Generation Using Large Language Models", MICCAI, 2024 (Early Accept)
Despite the wealth of knowledge within LLMs, efficiently triggering relevant knowledge within these large models for specific tasks like R2Gen poses a critical research challenge. This paper presents KARGEN, a Knowledge-enhanced Automated radiology Report GENeration framework based on LLMs. Utilizing a frozen LLM to generate reports, the framework integrates a knowledge graph to unlock chest disease-related knowledge within the LLM and enhance the clinical utility of generated reports. The knowledge graph is utilized to distill disease-related features, which are then fused with the regional image features via a mixture-of-experts module. The fused features attend to both disease and normal information and are used to better prompt the LLM for report generation.
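The fusion step can be pictured as a gated mixture over the two feature streams. The two-expert sketch below is a deliberately reduced illustration with hypothetical shapes, not KARGEN's actual module:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_fuse(disease_feat, region_feat, gate_logits):
    """Gate between knowledge-graph-distilled disease features and image
    regional features (a simplified two-expert mixture)."""
    w = softmax(gate_logits)
    return w[0] * disease_feat + w[1] * region_feat

disease = np.array([1.0, 0.0, 0.0])
region  = np.array([0.0, 1.0, 0.0])
fused = moe_fuse(disease, region, np.array([0.0, 0.0]))  # equal gating
```

In the full framework the gate is learned per sample, so the prompt fed to the frozen LLM can lean on disease knowledge or raw image evidence as each case requires.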
ZL. Chen, Y. Li, Z. Wang, P. Gao, J. Barthelemy, L. Zhou, and L. Wang, "Enhancing Radiology Report Generation via Multi-Phased Supervision", IEEE Transactions on Medical Imaging (IEEE-TMI), 2025
In this work, we proposed a multi-phased supervision method, which explicitly teaches the model clinical concepts at multiple semantic levels. Clinical phrases of varying semantic levels are mined and organized for model learning. The model is first trained with disease labels to recognize underlying conditions, then with entity–relation triples to capture clinical findings, and finally fine-tuned with whole-report supervision for rapid adaptation. Throughout the training process, the same R2GenGPT model is maintained and consistently operates in generation mode. Our method aligns with the concept of curriculum learning, which trains a model by arranging data in order of increasing difficulty; here, we arrange the supervision signals according to their level of semantics and granularity.
Strategy III: Developing Human-aligned Evaluation and Training
Y. Liu, Z. Wang, Y. Li, X. Liang, L. Liu, L. Wang, and L. Zhou*, "MRScore: Evaluating Radiology Report Generation with LLM-based Reward System", MICCAI, 2024
This paper introduces MRScore, an automatic evaluation metric tailored for radiology report generation by leveraging LLMs. Conventional NLG metrics like BLEU are inadequate for accurately assessing generated radiology reports, as our analysis systematically demonstrates. To address this challenge, we collaborated with radiologists to develop a framework that guides LLMs for radiology report evaluation, ensuring alignment with human analysis. Our framework includes two key components: i) utilizing GPT to generate large amounts of training data, i.e., reports of different quality, and ii) pairing GPT-generated reports as accepted and rejected samples and training LLMs to produce MRScore as the model reward. Our experiments demonstrate MRScore's higher correlation with human judgments and superior performance in model selection compared to traditional metrics.
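Training a reward model on accepted/rejected report pairs is commonly done with a Bradley–Terry style objective; the sketch below shows that standard formulation as an illustration of the idea, not MRScore's exact training loss:

```python
import numpy as np

def pairwise_reward_loss(r_accepted, r_rejected):
    """Bradley-Terry style objective: minimised when the reward model
    scores the accepted report well above the rejected one."""
    return float(-np.log(1.0 / (1.0 + np.exp(-(r_accepted - r_rejected)))))

# Correct ordering (accepted scored higher) gives a small loss;
# the reversed ordering is penalised.
good = pairwise_reward_loss(2.0, 0.0)
bad  = pairwise_reward_loss(0.0, 2.0)
```

Once trained, the scalar reward the model emits for a single report is used directly as its MRScore.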
Y. Liu, Y. Li, Z. Wang, X. Liang, L. Liu, L. Wang, and L. Zhou*, "ReFINE: A Reward-Based Framework for Interpretable and Nuanced Evaluation of Radiology Report Generation", AAAI 2026
In this paper, we introduce a metric called ReFINE, which not only scores reports according to user-specified criteria but also provides detailed sub-scores, enhancing interpretability and allowing users to adjust the weighting between different aspects of reports. Our reward-control loss enables the model to simultaneously output multiple individual rewards, one per evaluation criterion, with their summation yielding the final ReFINE score. Experiments show that ReFINE achieves stronger correlation with human ratings than traditional metrics, and generalizes well across three expert-annotated datasets, including chest X-rays and multimodal reports spanning nine imaging modalities, under two distinct scoring systems.
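The interpretability benefit of per-criterion sub-scores can be shown with a toy aggregation. The criteria names and weights below are hypothetical, and the normalised weighted sum is a simplification of the summation described above:

```python
import numpy as np

def refine_score(sub_scores, criterion_weights):
    """Combine per-criterion sub-scores into one overall score; users can
    re-weight criteria to emphasise different aspects of a report
    (a simplified sketch of the multi-reward aggregation)."""
    s = np.asarray(sub_scores, dtype=float)
    w = np.asarray(criterion_weights, dtype=float)
    return float(s @ (w / w.sum()))

# Hypothetical sub-scores: factual accuracy, completeness, clarity.
subs = [0.9, 0.6, 0.8]
balanced = refine_score(subs, [1, 1, 1])
accuracy_first = refine_score(subs, [3, 1, 1])  # user prioritises accuracy
```

Because the sub-scores are exposed, a user can see that this report is strong on accuracy but weak on completeness, rather than receiving a single opaque number.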
Strategy IV: Miscellaneous
Z. Wang, L. Liu, L. Wang, and L. Zhou*, "METransformer: Radiology Report Generation by Transformer with Multiple Expert Learners", CVPR, 2023
In this work, we tackle the challenge of capturing fine-grained visual differences from a different perspective. Due to the intricate nature of medical images, focusing on the correct image regions during report generation is challenging. Inspired by multi-specialist consultations in clinics for challenging diagnostic cases, we introduce the METransformer to emulate the "multi-expert joint diagnosis" scenario. It introduces multiple learnable "expert" tokens into both the transformer encoder and decoder. In the expert transformer encoder, each expert token collaborates with others to attend to different image regions, capturing complementary information via an orthogonal loss that minimizes overlap. In the expert transformer decoder, each expert token contributes to a candidate report, and the final report is determined through an expert voting strategy. Through this multi-expert design, our model enjoys the merits of an ensemble-based approach, but in a manner that is computationally more efficient and supports more sophisticated interactions among experts.
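The orthogonal loss that keeps expert tokens complementary can be sketched as a penalty on the off-diagonal of the Gram matrix of the normalised token embeddings. This is a minimal illustration under assumed shapes, not the paper's exact term:

```python
import numpy as np

def orthogonal_loss(expert_tokens):
    """Penalise overlap between expert-token embeddings (N, D): the Gram
    matrix of the L2-normalised tokens should be close to the identity,
    so each expert attends to distinct, complementary information."""
    t = expert_tokens / np.linalg.norm(expert_tokens, axis=1, keepdims=True)
    gram = t @ t.T
    off_diag = gram - np.eye(len(t))
    return float(np.sum(off_diag ** 2))

orthogonal_experts = np.eye(4, 16)  # 4 mutually orthogonal expert tokens
collapsed_experts = np.ones((4, 16))  # 4 identical tokens: maximal overlap
```

Fully orthogonal tokens incur zero loss, while experts that collapse onto the same direction are heavily penalised, which is what pushes each expert toward a different set of image regions.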