Transformer is probably NOT All You Need for Needle Tracking in Ultrasound Images!
Wanwen Chen, Alex Hung, Muyu Ouyang
Video Presentation
Needle tracking in ultrasound images has many medical applications, such as image guidance for robotic needle intervention, brachytherapy, and biopsy. However, it is a difficult task because the needle is not always visible and ultrasound images are usually noisy. Most previous research uses classical line detectors, such as the Hough Transform [4] or RANSAC [6], to detect the needle, but these methods assume that the needle is clearly visible in the ultrasound images. In practice, the needle can bend in the images, and its visibility often changes over time. All of these factors reduce the accuracy of classical line detection algorithms. Recently, deep learning-based segmentation has become popular because of its potential to learn segmentation in these difficult situations.
The Transformer [16] was first used in NLP because of its ability to handle long-range dependencies in sequence-to-sequence tasks. Unlike in RNNs, each word in the sequential input has direct access to every other word through self-attention, so no information is lost through back-propagation through time. Ever since the Transformer was proposed, it has been the most popular and powerful framework in the NLP community [3, 1]. Recently, Transformers have also been applied to computer vision. [5] proposed the Vision Transformer, which beat the then state-of-the-art models in image classification. [2] introduced an end-to-end Detection Transformer that outperforms other algorithms in object detection. Transformer-based networks also achieve good results on image segmentation tasks [20]. For object tracking, TrackFormer [8] unified object detection and tracking with a Transformer, where the detection result from the previous frame guides the detection in the current frame. Inspired by this performance, we propose to apply Transformers to ultrasound needle tracking, for their ability to attend to specific parts of the image as well as their strong performance in computer vision.
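To make the mechanism we build on concrete, below is a minimal sketch of single-head scaled dot-product self-attention as introduced in [16]; the function name, shapes, and single-head simplification are ours for illustration and are not taken from any of the cited implementations.

import torch
import torch.nn.functional as F

# Minimal single-head scaled dot-product self-attention: every position in the
# input sequence attends directly to every other position (illustrative sketch).
def self_attention(x, w_q, w_k, w_v):
    # x: (batch, seq_len, d_model); w_q / w_k / w_v: (d_model, d_model) projections
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                     # attention over all positions
    return weights @ v                                      # weighted sum of values

For a sequence of length n, the attention matrix is n by n, so every token can influence every other token in a single layer rather than through many recurrent steps.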
Deep Learning for Needle Tracking: Most research uses CNN-based architectures for needle segmentation or localization. Mwikirize et al. [10] propose a model that combines a fully convolutional network with Fast R-CNN to localize the bounding box of the needle in 2D ultrasound. Zhang et al. [18] propose an attention U-Net to segment multiple needles in 2D ultrasound images, and Zhang et al. [19] apply Mask R-CNN to a similar task. Mukhopadhyay et al. [9] use an encoder and an FCN to find the bounding box of the needle. However, most previous research does not consider temporal information. To the best of our knowledge, only Mwikirize et al. [11] use time-series information, applying a CNN-based architecture to segment the needle tip on the difference of consecutive frames. Even there, the temporal information is not fully exploited, since it only enters through the frame subtraction. Another drawback is that they only detect the needle tip, which is an easier task than segmenting the entire needle shaft.
Transformer: In medical image analysis, a number of works have applied Transformer-based frameworks. [15] outperformed other medical image segmentation networks, such as U-Net [13], U-Net++ [21], and U-Net with axial attention [17], on ultrasound image segmentation. It uses gated axial attention as the basic building block and has a local branch and a global branch: the local branch takes image patches rather than the whole image, whereas the global branch takes the entire image, and the local branch is more complex, with more attention modules and layers than the global branch. [12] proposed a U-Transformer that uses self-attention and cross-attention to improve medical image segmentation. Its attention module combines feature maps from the decoder branch, which carry rich semantic information, with feature maps from the skip connections, which carry information from the high-resolution layers, allowing more accurate segmentation at the boundaries. For Transformer-based object tracking, [8] proposed a framework that takes CNN-encoded image features and the tracking result from the previous frame to produce the result for the current frame, and [14] used the objects of interest as the queries in the attention module.
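Since full attention over an H x W feature map scales with (HW)^2, the axial-attention variants above factor it into attention along the height axis followed by attention along the width axis. The toy module below illustrates only that factorization; it omits the gating and positional terms used in [15, 17], and the class name and head count are our own placeholders.

import torch
import torch.nn as nn

class AxialAttention2D(nn.Module):
    # Toy axial attention: attend along H, then along W, instead of over all H*W positions.
    def __init__(self, channels, num_heads=4):   # num_heads must divide channels
        super().__init__()
        self.col_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.row_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, feat):                      # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        # Height axis: each of the B*W columns is treated as a sequence of length H.
        cols = feat.permute(0, 3, 2, 1).reshape(b * w, h, c)
        cols, _ = self.col_attn(cols, cols, cols)
        feat = cols.reshape(b, w, h, c).permute(0, 3, 2, 1)
        # Width axis: each of the B*H rows is treated as a sequence of length W.
        rows = feat.permute(0, 2, 3, 1).reshape(b * h, w, c)
        rows, _ = self.row_attn(rows, rows, rows)
        return rows.reshape(b, h, w, c).permute(0, 3, 1, 2)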
Learning to track objects is hard because the algorithm can fall apart completely if errors accumulate from frame to frame. This is especially true in ultrasound needle tracking, where the needle is sometimes invisible, leading to poor tracking results on certain frames. It is therefore important to quantify the certainty of the tracking result in each frame and incorporate that information into the framework to avoid error accumulation. We model this certainty with a separate network alongside our tracking network.
In this paper, we model needle tracking as temporal segmentation, i.e. we consider the segmentation of the previous frame when segmenting the current frame, and we perform the tracking with a Transformer-based framework. Our proposed method makes two main contributions: (1) a confidence-driven tracking framework that models the certainty of the previous frame's tracking result, and (2) to the best of our knowledge, we are the first group to apply a Transformer-based method to ultrasound needle tracking.
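Below is a hedged sketch of the tracking-with-confidence mode of this framework. The exact way the inputs are fused is not detailed here, so the sketch simply assumes the current frame, the previous segmentation, and its confidence map are stacked as input channels; seg_net and conf_net are placeholder names for the tracking network (U-Net or U-Transformer) and the confidence U-Net.

import torch

def track_sequence(frames, seg_net, conf_net):
    # frames: (T, 1, H, W) ultrasound sequence; returns per-frame needle segmentations.
    prev_seg = torch.zeros_like(frames[:1])     # no needle before insertion starts
    prev_conf = torch.zeros_like(frames[:1])
    outputs = []
    for frame in frames:
        frame = frame.unsqueeze(0)                               # (1, 1, H, W)
        x = torch.cat([frame, prev_seg, prev_conf], dim=1)       # stack as 3 input channels (assumed fusion)
        seg = torch.sigmoid(seg_net(x))                          # current-frame segmentation
        conf = torch.sigmoid(conf_net(torch.cat([frame, seg], dim=1)))  # certainty of that result
        prev_seg, prev_conf = seg.detach(), conf.detach()        # feed forward, no gradient across frames
        outputs.append(seg)
    return torch.cat(outputs, dim=0)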
To study which modules improve accuracy, we run an ablation study in which the model is trained in three different modes, as shown in Table 1. For each mode, we test the tracking model with U-Net, U-Transformer with full attention (U-Trans), and U-Transformer with axial attention (Axial-U-Trans). The model that computes the confidence map is always a U-Net. In total, we have nine different models (3 tracking models × 3 modes).
U-Net-S, U-Net-T, and U-Net-C are trained on a Quadro P5000 with CUDA 10.2 and PyTorch; the Transformer-based networks are trained on a Titan RTX with CUDA 11 and PyTorch. For the model parameters, we use 5 downsampling layers on the encoder side, with 64, 128, 256, 512, and 1024 filters respectively; the decoder side follows the same setting in reverse order. We use the Adam optimizer [7] with a learning rate of 0.0001. The dataset includes 15 trials for training, 5 trials for validation, and 7 trials for testing, giving 1004 images for training, 302 for validation, and 427 for testing.
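For reference, here is a minimal sketch of the layer and optimizer configuration listed above (five encoder levels with 64 to 1024 filters, mirrored on the decoder side, and Adam with a learning rate of 1e-4). It is only a skeleton: pooling/upsampling layers, skip connections, and the output head are omitted, and conv_block is an illustrative helper rather than our actual code.

import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Illustrative building block; the real networks use more layers per level.
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True))

filters = [64, 128, 256, 512, 1024]                   # encoder filters, shallow to deep
encoder = nn.ModuleList(conv_block(c_in, c_out)       # single-channel input assumed here
                        for c_in, c_out in zip([1] + filters[:-1], filters))
decoder = nn.ModuleList(conv_block(c_in, c_out)       # decoder mirrors the encoder
                        for c_in, c_out in zip(filters[::-1], filters[::-1][1:]))
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)         # Adam with learning rate 0.0001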
Table 1. Ablation study: comparing different methods for needle tracking
We compare three quantitative metrics (a hedged sketch of how they might be computed follows the list):
• Shaft error: we first apply Otsu thresholding to the output of the model, then compute the weighted average distance from the positive pixels to the ground-truth pixels, where the distance of a pixel is its smallest distance to any ground-truth pixel.
• Tip error: we first apply Otsu thresholding to the output of the model. Since the needle is inserted from the right edge of the image in our dataset, we define the needle tip as the positive pixel with the smallest x-coordinate, taking the maximum y-coordinate in that column (that is, the bottom-left positive pixel).
• Dice coefficient: twice the intersection between the ground-truth label and the segmentation, divided by the sum of their areas, i.e. 2|A∩B| / (|A| + |B|).
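A hedged sketch of how these three metrics could be computed with NumPy, SciPy, and scikit-image is shown below. The exact weighting in the shaft error and the tip-to-tip distance used in our evaluation code may differ; this is only an illustration of the definitions above and assumes both masks are non-empty.

import numpy as np
from scipy.ndimage import distance_transform_edt
from skimage.filters import threshold_otsu

def evaluate(pred, gt):
    # pred: float prediction map in [0, 1]; gt: binary ground-truth mask, both (H, W).
    mask = pred > threshold_otsu(pred)                        # Otsu-thresholded prediction
    # Shaft error: distance of each positive pixel to its nearest ground-truth pixel,
    # averaged with the prediction values as weights (assumed weighting scheme).
    dist_to_gt = distance_transform_edt(~gt.astype(bool))     # distance map to the GT needle
    shaft_error = np.average(dist_to_gt[mask], weights=pred[mask])
    # Tip error: the needle enters from the right edge, so the tip is the positive pixel
    # with the smallest x-coordinate and the largest y-coordinate in that column.
    def tip(m):
        ys, xs = np.nonzero(m)
        x = xs.min()
        return np.array([x, ys[xs == x].max()])
    tip_error = np.linalg.norm(tip(mask) - tip(gt))           # Euclidean tip-to-tip distance (assumed)
    # Dice coefficient: 2 * |A ∩ B| / (|A| + |B|).
    dice = 2 * np.logical_and(mask, gt).sum() / (mask.sum() + gt.sum())
    return shaft_error, tip_error, dice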
The quantitative results in Table 2 show that the U-Net tracking model (U-Net-T) works best for shaft fitting and needle tip localization, while the U-Net tracking model with confidence (U-Net-C) works slightly better on the Dice coefficient. Among the Transformer family, the U-Transformer with axial attention and confidence (Axial-U-Trans-C) performs best.
From these results, we can see that the segmentation map from the previous frame improves the U-Net's performance, while the confidence slightly decreases its shaft fitting and tip localization accuracy but gives a higher Dice coefficient; it is possible that Otsu thresholding does not find the best threshold to convert the prediction into a segmentation map. For the U-Transformer models, the confidence improves segmentation performance, since Axial-U-Trans-C and U-Trans-C have lower shaft fitting and tip localization errors. Axial attention performs better than full attention, and the networks with attention modules perform worse than U-Net. These results show that a simpler model works better in this case; it is possible that our dataset is too small to train a good large model.
Table 2. Quantitative comparison between the nine models we trained. "S" means segmentation mode, "T" means tracking mode, and "C" means tracking with confidence mode. U-Net means the segmentation model is a U-Net, Axial-U-Trans means it is a U-Transformer with axial attention, and U-Trans means it is a U-Transformer with full attention.
Here is a video to visualize the quality of different models on the test set.
In this project, we test different models for needle tracking in ultrasound images. Our results show that, for this task, smaller models perform better than larger models. They also show that information from the previous segmentation can improve segmentation accuracy, while the effect of the generated confidence remains unclear, since it affects the U-Net and the Transformers differently. Future work includes exploring more ideas for U-Net-based tracking models, such as introducing an LSTM into the model to better preserve time-series information.
[1] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[2] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
[3] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[4] M. Ding and A. Fenster. A real-time biopsy needle segmentation technique using hough transform. Medical physics, 30(8):2222–2233, 2003.
[5] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[6] M. Kaya and O. Bebek. Needle localization using gabor filtering in 2d ultrasound images. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 4881–4886. IEEE, 2014.
[7] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[8] T. Meinhardt, A. Kirillov, L. Leal-Taixe, and C. Feichtenhofer. Trackformer: Multi-object tracking with transformers. arXiv preprint arXiv:2101.02702, 2021.
[9] S. Mukhopadhyay, P. Mathur, A. Bharadwaj, Y. Son, J.-S. Park, S. R. Kudavelly, S. Song, and H. Kang. Deep learning based needle tracking in prostate fusion biopsy. In Medical Imaging 2021: Image-Guided Procedures, Robotic Interventions, and Modeling, volume 11598, page 115982A. International Society for Optics and Photonics, 2021.
[10] C. Mwikirize, J. L. Nosher, and I. Hacihaliloglu. Convolution neural networks for real-time needle detection and localization in 2d ultrasound. International journal of computer assisted radiology and surgery, 13(5):647–657, 2018.
[11] C. Mwikirize, J. L. Nosher, and I. Hacihaliloglu. Single shot needle tip localization in 2d ultrasound. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 637–645. Springer, 2019.
[12] O. Petit, N. Thome, C. Rambour, and L. Soler. U-net transformer: Self and cross attention for medical image segmentation. arXiv preprint arXiv:2103.06104, 2021.
[13] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
[14] P. Sun, Y. Jiang, R. Zhang, E. Xie, J. Cao, X. Hu, T. Kong, Z. Yuan, C. Wang, and P. Luo. Transtrack: Multiple-object tracking with transformer. arXiv preprint arXiv:2012.15460, 2020.
[15] J. M. J. Valanarasu, P. Oza, I. Hacihaliloglu, and V. M. Patel. Medical transformer: Gated axial-attention for medical image segmentation. arXiv preprint arXiv:2102.10662, 2021.
[16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
[17] H. Wang, Y. Zhu, B. Green, H. Adam, A. Yuille, and L.-C. Chen. Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. In European Conference on Computer Vision, pages 108–126. Springer, 2020.
[18] Y. Zhang, Y. Lei, R. L. Qiu, T. Wang, H. Wang, A. B. Jani, W. J. Curran, P. Patel, T. Liu, and X. Yang. Multi-needle localization with attention u-net in us-guided hdr prostate brachytherapy. Medical physics, 47(7):2735–2745, 2020.
[19] Y. Zhang, Z. Tian, Y. Lei, T. Wang, P. Patel, A. B. Jani, W. J. Curran, T. Liu, and X. Yang. Automatic multi-needle localization in ultrasound images using large margin mask rcnn for ultrasound-guided prostate brachytherapy. Physics in Medicine & Biology, 65(20):205003, 2020.
[20] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. arXiv preprint arXiv:2012.15840, 2020.
[21] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang. UNet++: A nested u-net architecture for medical image segmentation. In Deep learning in medical image analysis and multimodal learning for clinical decision support, pages 3–11. Springer, 2018.