Eformer: Edge Enhancement based Transformer for Medical Image Denoising (2021)
Authors - Achleshwar Luthra*, Harsh Sulakhe*, Tanish Mittal*, Abhishek Iyer, and Santosh Yadav
In Computer Vision for Automated Medical Diagnosis, ICCV'21 Workshop [accepted (abstract)]
Abstract: We present a novel architecture that builds an encoder-decoder network using transformer blocks for medical image denoising. Non-overlapping window-based self-attention is used in the transformer blocks to reduce computational requirements. We further incorporate learnable Sobel-Feldman operators to enhance edges in the image and propose an effective way to concatenate them in the intermediate layers of our architecture. The experimental analysis compares deterministic learning and residual learning for the task of medical image denoising. To demonstrate the effectiveness of our approach, we evaluate our model on the AAPM-Mayo Clinic Low-Dose CT Grand Challenge Dataset, where it achieves state-of-the-art performance: a PSNR of 43.487, an RMSE of 0.0067, and an SSIM of 0.9861. We believe that our work will encourage more research in transformer-based architectures for medical image denoising using residual learning.
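To make the edge-enhancement idea concrete, here is a minimal PyTorch sketch of a learnable Sobel-Feldman layer: depthwise kernels initialized with the classic operator and refined during training, with the edge responses concatenated back onto the features. The class name and wiring are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableSobel(nn.Module):
    """Depthwise convolution initialized with Sobel-Feldman kernels.
    Registered as a learnable parameter so training can refine the
    classic edge operator (illustrative sketch, not the paper's code)."""

    def __init__(self, channels):
        super().__init__()
        gx = torch.tensor([[-1., 0., 1.],
                           [-2., 0., 2.],
                           [-1., 0., 1.]])
        # One horizontal and one vertical 3x3 kernel per input channel.
        weight = torch.stack([gx, gx.t()]).repeat(channels, 1, 1)
        self.weight = nn.Parameter(weight.unsqueeze(1))  # (2C, 1, 3, 3)
        self.channels = channels

    def forward(self, x):
        edges = F.conv2d(x, self.weight, padding=1, groups=self.channels)
        # Concatenate edge responses with the incoming features, echoing
        # how the paper feeds enhanced edges into intermediate layers.
        return torch.cat([x, edges], dim=1)

feats = torch.randn(1, 16, 64, 64)
print(LearnableSobel(16)(feats).shape)  # torch.Size([1, 48, 64, 64])
```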
MedSkip: Medical Report Generation Using Skip Connections and Integrated Attention (2021)
Authors - Esha Pahwa, Dwij, Devansh, Sanjeet Kapadia, and Achleshwar Luthra.
In Computer Vision for Automated Medical Diagnosis, ICCV'21 Workshop [accepted (proceedings)]
Abstract: Medical scans are extremely important for accurate diagnosis and treatment. To assist staff members in such crucial tasks, a computer vision model that efficiently processes a medical image and generates a report can be highly beneficial. Such a robust system can not only act as a helping hand for professionals but also reduce the chance of error that might arise with inexperienced staff members. However, previous studies lack focus on experimenting with the visual extractor, which is of paramount importance. Keeping this in mind, we propose a novel architecture based on a modified HRNet, which includes added skip connections along with convolutional block attention modules (CBAM). The architecture can be divided into two components. The first is the visual extractor, where the pre-processed image is fed into the HRNet convolutional layers and the outputs of each down-sampled layer are concatenated after passing through the attention modules. The second is a memory-driven transformer that generates the report. We evaluate our model on two publicly available datasets, PEIR Gross and IU X-Ray, establishing a new state of the art for PEIR Gross while giving competitive results on IU X-Ray.
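For reference, CBAM (Woo et al., 2018) applies channel attention followed by spatial attention; a compact PyTorch sketch is below. How the modules attach to each HRNet branch is not reproduced here, and the reduction ratio and kernel size shown are the common defaults, not necessarily the paper's choices.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention
    followed by spatial attention (common-default sketch)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention: shared MLP over avg- and max-pooled features.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: 7x7 conv over channel-wise avg and max maps.
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], 1)
        return x * torch.sigmoid(self.spatial(s))

print(CBAM(32)(torch.randn(2, 32, 56, 56)).shape)  # torch.Size([2, 32, 56, 56])
```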
SWTA: Sparse Weighted Temporal Attention for Drone-Based Activity Recognition (2021)
Authors - Achleshwar Luthra*, Esha Pahwa*, Santosh Yadav*, Kamlesh Tiwari, Soumendu Sinha, Rajanie Prabha, and Hari Mohan Pandey.
In Neural Networks, 2022. [accepted]
Abstract: Drone-camera based human activity recognition (HAR) has received significant attention from the computer vision research community in the past few years. A robust and efficient HAR system plays a pivotal role in fields like video surveillance, crowd behavior analysis, sports analysis, and human-computer interaction. What makes it challenging are complex poses, varying viewpoints, and the environmental scenarios in which the action takes place. To address such complexities, in this paper, we propose a novel Sparse Weighted Temporal Attention (SWTA) module that uses sparsely sampled video frames to obtain global weighted temporal attention. The proposed SWTA has two components. The first, a temporal segment network, sparsely samples a given set of frames. The second, weighted temporal attention, fuses attention maps derived from optical flow with raw RGB images. This is followed by a base network comprising a convolutional neural network (CNN) module and fully connected layers that perform activity recognition. The SWTA network can be used as a plug-in module with existing deep CNN architectures, optimizing them to learn temporal information and eliminating the need for a separate temporal stream. It has been evaluated on three publicly available benchmark datasets, namely Okutama-Action, MOD20, and Drone-Action. The proposed model achieves accuracies of 72.76%, 92.56%, and 78.86% on the respective datasets, surpassing the previous state-of-the-art performance by margins of 25.26%, 18.56%, and 2.94%, respectively.
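As a rough illustration of the first component, the sketch below shows TSN-style sparse sampling: the video is split into equal segments and one frame index is drawn per segment (random offset during training, segment center at inference). The function name and defaults are assumptions for illustration, not the paper's code.

```python
import numpy as np

def sample_sparse_frames(num_frames, num_segments=8, training=True, rng=None):
    """TSN-style sparse sampling: split the video into equal segments and
    pick one frame per segment (random offset while training, the segment
    center at inference). Names and defaults are illustrative."""
    rng = rng or np.random.default_rng()
    edges = np.linspace(0, num_frames, num_segments + 1)
    indices = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        lo, hi = int(lo), max(int(lo) + 1, int(hi))
        indices.append(int(rng.integers(lo, hi)) if training
                       else (lo + hi - 1) // 2)
    return indices

print(sample_sparse_frames(120, num_segments=8, training=False))
# [7, 22, 37, 52, 67, 82, 97, 112]
```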
Conditional RGB-T Fusion for Effective Crowd Counting (2022)
Authors - Esha Pahwa*, Sanjeet Kapadia*, Achleshwar Luthra*, Shreyas Sheeranali*.
In IEEE ICIP 2022. [accepted]
Abstract: A robust and efficient crowd counting framework forms a vital step towards the analysis of a crowded scene and finds applications in social distance monitoring, traffic management, and video surveillance. We argue that visible RGB images fail to yield high-quality density maps owing to poor lighting conditions in the dark, when an anomaly is more likely to occur. To tackle this scenario, we introduce a novel architecture, the Toggle-Fusion Network (TFNet), that effectively utilizes a multimodal dataset, RGBT-CC, containing pairs of thermal and RGB images. Our approach eliminates the need for a separate branch for each modality, as proposed in the baseline, by conditionally fusing the thermal and RGB images with the help of a toggle, saliency maps, and a rolling guidance filter. We conduct extensive experiments and delineate the importance of the individual components of our method in the ablation study. TFNet, evaluated on RGBT-CC, achieves an RMSE of 6.11, establishing a new state of the art for the dataset. The code is publicly available at https://github.com/ShreyasSR/TFNet
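The fusion rule can be sketched as follows: a hypothetical brightness toggle decides per image whether the scene is dark, and dark scenes blend in the thermal modality weighted by a saliency map. All names and the threshold are illustrative assumptions, and the rolling guidance filtering step is not reproduced; see the repository above for the actual method.

```python
import torch

def toggle_fuse(rgb, thermal, saliency, threshold=0.35):
    """Hypothetical sketch of conditional RGB-T fusion: a per-image
    brightness toggle selects well-lit scenes; dark scenes weight the
    thermal modality by its saliency map. Illustrative only."""
    brightness = rgb.mean(dim=(1, 2, 3), keepdim=True)   # (B, 1, 1, 1)
    dark = (brightness < threshold).float()              # the toggle
    thermal3 = thermal.expand_as(rgb)                    # (B, 3, H, W)
    fused_dark = saliency * thermal3 + (1 - saliency) * rgb
    return dark * fused_dark + (1 - dark) * rgb

rgb = torch.rand(2, 3, 64, 64)
thermal = torch.rand(2, 1, 64, 64)
saliency = torch.rand(2, 1, 64, 64)
print(toggle_fuse(rgb, thermal, saliency).shape)  # torch.Size([2, 3, 64, 64])
```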
NFResNet: Multi-scale and U-shaped Networks for Deblurring (2023)
Authors - Tanish Mittal*, Preyansh Agrawal*, Esha Pahwa*.
In IEEE Transactions on Image Processing 2023. [under review]
Abstract: Multi-scale and U-shaped networks are widely used in various image restoration problems, including deblurring. Keeping in mind their wide range of applications, we present a comparison of these architectures and their effects on image deblurring. We also introduce a new block, called NFResblock, which consists of a Fast Fourier Transform layer and a series of modified Non-Linear Activation Free Blocks. Based on these architectures and additions, we introduce NFResnet and NFResnet+, which are modified multi-scale and U-Net architectures, respectively. We use three different loss functions to train these architectures: Charbonnier Loss, Edge Loss, and Frequency Reconstruction Loss. Extensive experiments on the Deep Video Deblurring dataset, along with ablation studies for each component, are presented in this paper. The proposed architectures achieve a considerable increase in Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM).
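Of the three losses, the first two are easy to state; the sketch below gives one common formulation: Charbonnier as a smooth, outlier-robust L1 variant, and Edge loss as the same penalty on Laplacian-filtered images. The eps value and the edge kernel are assumptions; the paper's exact choices may differ.

```python
import torch
import torch.nn.functional as F

def charbonnier_loss(pred, target, eps=1e-3):
    """Charbonnier loss: a smooth, outlier-robust variant of L1,
    sqrt((x - y)^2 + eps^2), averaged over all elements."""
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

def edge_loss(pred, target, eps=1e-3):
    """Edge loss as a Charbonnier penalty between Laplacian-filtered
    images (one common formulation; the exact kernel may differ)."""
    k = torch.tensor([[0., 1., 0.],
                      [1., -4., 1.],
                      [0., 1., 0.]])
    k = k.view(1, 1, 3, 3).repeat(pred.shape[1], 1, 1, 1)
    lap = lambda img: F.conv2d(img, k.to(img), padding=1,
                               groups=img.shape[1])
    return charbonnier_loss(lap(pred), lap(target), eps)

pred, target = torch.rand(2, 3, 32, 32), torch.rand(2, 3, 32, 32)
print(charbonnier_loss(pred, target).item(), edge_loss(pred, target).item())
```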