RepsNet: Combining Vision with Language for Automated Medical Reports
Verily & Google Research, USA

[paper] [bibtex] [demo] [poster] [talk]


Writing reports from medical images is error-prone for inexperienced practitioners and time-consuming for experienced ones. In this work, we present RepsNet, which adapts pre-trained vision and language models to interpret medical images and generate automated reports in natural language. RepsNet is an encoder-decoder model: the encoder aligns images with natural language descriptions via contrastive learning, while the decoder predicts answers by conditioning on the encoded images and on prior context of descriptions retrieved by nearest neighbor search. We formulate the problem in a visual question answering setting to handle both categorical and descriptive natural language answers. We perform experiments on two challenging tasks on radiology image datasets: medical visual question answering (VQA-Rad) and report generation (IU-Xray). Results show that RepsNet outperforms state-of-the-art methods with 81.08% classification accuracy on VQA-Rad 2018 and a BLEU-1 score of 0.58 on IU-Xray.
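For readers unfamiliar with the BLEU-1 metric quoted above, the following is a minimal sentence-level sketch: clipped unigram precision multiplied by a brevity penalty. The function name `bleu1`, the whitespace tokenization, and the example strings are illustrative assumptions; published scores are normally computed at corpus level with a standard toolkit, so this is not the evaluation code behind the reported numbers.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """Sentence-level BLEU-1 against a single reference (illustrative only)."""
    cand = candidate.lower().split()  # assumption: simple whitespace tokens
    ref = reference.lower().split()
    if not cand:
        return 0.0

    # Clipped unigram precision: each candidate word is credited at most
    # as many times as it occurs in the reference.
    ref_counts = Counter(ref)
    clipped = sum(min(count, ref_counts[word])
                  for word, count in Counter(cand).items())
    precision = clipped / len(cand)

    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    return brevity_penalty * precision


# Made-up strings, not taken from IU-Xray.
score = bleu1("the lungs are clear", "the lungs are clear bilaterally")
```

A candidate that repeats one correct word gains nothing from the repetition, because the counts are clipped to the reference.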

RepsNet Fills Medical Reports by Classification or Text Generation via Visual Question Answering

RepsNet analyzes medical images and automates report writing by answering questions, either by classifying among known answer categories or by generating natural language descriptions. The radiology report generation example shows two categorical answers (top) and one descriptive natural language answer (bottom).

In RepsNet, encoded image and question features are fused via a bilinear attention network (BAN) before self-supervised contrastive alignment with natural language descriptions. The answer is either classified among fixed answer categories or generated by a language decoder conditioned on the image, the question, and prior context of answers retrieved by nearest neighbor search.
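The two mechanisms named above, contrastive alignment and nearest neighbor retrieval of prior context, can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the RepsNet implementation: the symmetric InfoNCE-style loss is one common formulation and may differ from the loss used in the paper, and the names `contrastive_loss`, `retrieve_context`, the `temperature` value, the 2-D embeddings, and the example sentences are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(fused_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE-style loss; the i-th fused (image, question)
    embedding is assumed to match the i-th text embedding."""
    n = len(fused_embs)
    sims = [[cosine(f, t) / temperature for t in text_embs]
            for f in fused_embs]

    def cross_entropy(rows):
        # Mean cross-entropy with the diagonal entry as the target.
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[k]
        return total / n

    transposed = [list(col) for col in zip(*sims)]
    return 0.5 * (cross_entropy(sims) + cross_entropy(transposed))


def retrieve_context(query_emb, memory, k=2):
    """Return the texts of the k stored descriptions closest to the query.
    `memory` is a list of (embedding, text) pairs."""
    ranked = sorted(memory,
                    key=lambda item: cosine(query_emb, item[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]


# Made-up embeddings and descriptions for illustration only.
memory = [
    ([1.0, 0.0], "no acute cardiopulmonary abnormality"),
    ([0.0, 1.0], "right lower lobe opacity"),
    ([0.9, 0.1], "the lungs are clear"),
]
context = retrieve_context([1.0, 0.05], memory, k=2)
```

In this sketch the retrieved texts would be passed to the decoder as prior context alongside the image and question features; a real system would use learned embeddings and an approximate nearest neighbor index rather than an exhaustive sort.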


Heatmap visualization and comparison between ground-truth (GT) and RepsNet-generated (RN) reports for: (left) a normal case, (right) an abnormal case. RepsNet shows strong alignment with the ground truth in describing medical findings. Text in blue marks abnormalities; text in red marks misalignment.

Contact Us

Authors: Ajay Tanwani, Joelle Barral, Daniel Freedman

For more information, please reach out to Ajay Tanwani:

Acknowledgment: We thank our partners from Verily Life Sciences and Google Research, in particular Scott Fleming, Bryce Evans, Simon Schlachter, Roman Goldenberg, Saeed Latif, Ella Stiematzky, Yizhaq Shmayahu and Ehud Rivlin for feedback and suggestions.