In the medical domain, interpreting and captioning medical images is costly and time-consuming and requires expert support. With the growing volume of medical images, it is difficult for radiologists to manage this workload alone. Automating medical image captioning reduces the cost and time involved while helping radiologists improve the reliability and accuracy of the generated captions, and it can raise caption accuracy even where expert support is unavailable, which is particularly valuable for newly qualified radiologists with limited experience. Although prior work has focused on automating medical image captioning, several issues remain unaddressed, such as generating long captions with unnecessary details, failing to identify abnormal regions in complex images, and the low accuracy and unreliability of some generated captions. We aim to develop a deep learning model for captioning medical images that extracts features from an image and outputs meaningful sentences describing the abnormalities identified in it with high accuracy. The proposed model will be trained on an Amazon EC2 instance using AWS Deep Learning Containers. In the future, the proposed model could serve clinicians as a second opinion, increasing confidence in their diagnoses.
Chest X-Ray images, which are widely used to identify symptoms, signs of injury, and diseases, are usually read by well-trained experts such as radiologists and physicians. However, with the increasing availability of Chest X-Ray images, radiologists and other physicians struggle to handle Chest X-Ray captioning on their own. The large amount of time the task requires and the unreliable captions produced by inexperienced physicians have become a bottleneck in the medical diagnostic and treatment pipeline. This has created a need for an effective and efficient method of captioning medical images. Although automated Chest X-Ray image captioning is not a new concept in the medical sector, it is not yet widely trusted. Researchers have proposed various methodologies, but multiple issues remain in automating the Chest X-Ray captioning process to produce results comparable to human-generated captions.
In this work, a Chest X-Ray captioning model is implemented using CheXNet, Faster R-CNN, and a memory-driven Transformer. CheXNet extracts features from the Chest X-Ray images, while Faster R-CNN produces object regions from those features. These objects are then fed into the memory-driven Transformer to generate captions. As novel contributions, we apply contrastive learning to CheXNet to improve feature extraction, and we feed the memory-driven Transformer image objects instead of raw features to improve caption generation.
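The pipeline can thus be summarised as three stages: a CheXNet (DenseNet-121) encoder pre-trained with a contrastive objective, a Faster R-CNN stage that turns the encoder's feature maps into object regions, and a memory-driven Transformer that decodes captions conditioned on those object features. The following PyTorch sketch illustrates only the last stage together with a simplified contrastive loss; the class and function names are hypothetical, a plain nn.TransformerDecoder stands in for the memory-driven Transformer, and object features are assumed to be precomputed by CheXNet and Faster R-CNN, so this is an illustrative sketch rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def contrastive_loss(z1, z2, temperature=0.1):
    """Simplified InfoNCE loss illustrating contrastive pre-training of the
    CheXNet encoder. z1, z2: (B, D) projections of two augmented views of the
    same X-ray; matching pairs sit on the diagonal of the similarity matrix
    and act as positives, all other pairs in the batch as negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)


class ObjectCaptionDecoder(nn.Module):
    """Caption decoder conditioned on object features (hypothetical sketch).
    Object features are assumed to come from CheXNet + Faster R-CNN; a standard
    TransformerDecoder stands in for the memory-driven Transformer, and
    positional encodings are omitted for brevity."""

    def __init__(self, vocab_size, obj_dim=1024, d_model=512):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, d_model)     # project region features
        self.embed = nn.Embedding(vocab_size, d_model)  # caption token embeddings
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, object_features, caption_tokens):
        # object_features: (B, N, obj_dim) detected regions; caption_tokens: (B, T)
        memory = self.obj_proj(object_features)
        tgt = self.embed(caption_tokens)
        T = caption_tokens.size(1)
        causal_mask = torch.triu(  # prevent attending to future tokens
            torch.full((T, T), float("-inf"), device=caption_tokens.device),
            diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.out(hidden)  # (B, T, vocab_size) next-token logits
```

In this sketch, the decoder attends over the detected object regions rather than a global feature map, which mirrors the second contribution above: conditioning caption generation on image objects instead of raw CheXNet features.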