Tanvir Mahmud
PhD Student, The University of Texas at Austin
I am a third-year Ph.D. student in the EnyAC research group at The University of Texas at Austin, advised by Professor Diana Marculescu. My core research focus is on computer vision, multi-modal learning, and data-efficient learning. In particular, I have strong working experience in object detection, video understanding, semantic segmentation, semi-supervised learning, long-tail learning, and multi-modal fusion. Previously, I spent two wonderful summers as a research intern at Microsoft and the Bosch Center for Artificial Intelligence (BCAI). I completed my M.Sc. and B.Sc. in Electrical Engineering at BUET, Bangladesh, and have also served as a Lecturer in the Department of EEE at BUET.
Recent News
July 2024: PaPr is accepted in ECCV 2024! See you in Milan!
May 2024: Started Summer'24 internship at Microsoft!
April 2024: MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers got accepted to ECV'24 workshop at CVPR 2024 (oral)!
March 2024: Training-Free One-step Patch Pruning is on arXiv now (Link).
February 2024: Text-Conditional Visual Sounding Source Localization from Mixtures got accepted in CVPR-2024.
January 2024: Weakly supervised language-conditional audio separation got accepted in ICLR-2024 (Paper).
October 2023: Instance Aware Repeat Factor Sampling got accepted in NeurIPS Workshop on Heavy Tails in ML, 2023 (Paper).
October 2023: SSVOD got accepted in WACV 2024 (Paper).
May 2023: Joined as a Research Scientist Intern at Microsoft Research.
February 2023: One paper (Clip4VideoCap) got accepted in ICASSP 2023 (Paper).
January 2023: I am going to join Microsoft Research for Summer '23 Internship.
August 2022: One paper (AVE-CLIP) got accepted in WACV 2023 (Paper).
July 2022: One paper was presented at the ICML 2022 Workshop on Pre-training (Paper).
May 2022: I will join as a summer intern at BCAI, Sunnyvale, CA.
September 2021: One paper got published in IEEE Transactions on Industrial Informatics (Paper).
August 2021: Started my Ph.D. at The University of Texas at Austin.
July 2021: One paper got published in IEEE ACCESS (Paper).
July 2021: Received Cockrell School of Engineering Fellowship at UT Austin.
June 2021: Successfully defended my M.Sc. thesis at BUET (Thesis).
June 2021: One paper got published in IEEE Transactions on Artificial Intelligence (Paper).
April 2020: I will join as a Lecturer at the Dept. of EEE, BUET.
Experience
Research Scientist Intern (May 2023 - August 2023)
Microsoft Research, Redmond, WA, 98052, USA
Supervisor: Kazuhito Koushida
My responsibilities included:
Proposed a weakly supervised framework for separating single sources from multi-source audio mixtures, targeting music and audio processing applications.
Achieved 97.5% of single-source training performance while training with audio mixtures only.
The resulting paper was accepted to ICLR 2024.
Deep Learning Research Intern - Computer Vision (May 2022 - August 2022)
BOSCH Center for Artificial Intelligence, Sunnyvale, CA, USA
Supervisor: Chun-Hao Liu
Proposed a novel semi-supervised video object detection framework for autonomous driving applications that achieves around 98% of the supervised performance obtained with 15 annotated frames per video while using only a single annotated frame per video. Filed a patent; the manuscript was published at WACV 2024 (SSVOD).
Worked on long-tail learning for handling data imbalance in autonomous driving applications. Proposed strategic approaches to handle large class imbalance via multi-phase training. Filed a patent.
Graduate Research Assistant (August 2021 - Present)
Electrical and Computer Engineering Department, The University of Texas at Austin, Austin, USA
Supervisor: Diana Marculescu
Developed a semi-supervised video object detection framework to leverage unlabeled frames with state-of-the-art detectors.
Developed a framework for audio-visual event localization based on AudioCLIP and multi-window transformer fusion.
Worked on reinforcement learning-based fine-tuning for effective cross-domain knowledge transfer.
Currently working on semi-supervised long-tail recognition for critical object detection.
Publications
SSVOD: Semi-Supervised Video Object Detection with Sparse Annotations (Paper)
Tanvir Mahmud, Chun-Hao Liu, Burhaneddin Yaman, and Diana Marculescu
IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, Hawaii, 2024
Despite significant progress in semi-supervised learning for image object detection, several key issues are yet to be addressed for video object detection: (1) Achieving good performance for supervised video object detection greatly depends on the availability of annotated frames. (2) Despite having large inter-frame correlations in a video, collecting annotations for a large number of frames per video is expensive, time-consuming, and often redundant. (3) Existing semi-supervised techniques on static images can hardly exploit the temporal motion dynamics inherently present in videos. In this paper, we introduce SSVOD, an end-to-end semi-supervised video object detection framework that exploits motion dynamics of videos to utilize large-scale unlabeled frames with sparse annotations. To selectively assemble robust pseudo-labels across groups of frames, we introduce flow-warped predictions from nearby frames for temporal-consistency estimation. In particular, we introduce cross-IoU and cross-divergence based selection methods over a set of estimated predictions to include robust pseudo-labels for bounding boxes and class labels, respectively. To strike a balance between confirmation bias and uncertainty noise in pseudo-labels, we propose a confidence threshold-based combination of hard and soft pseudo-labels. Our method achieves significant performance improvements over existing methods on the ImageNet-VID, Epic-KITCHENS, and YouTube-VIS datasets.
CLIP4VideoCap: Rethinking CLIP for Video Captioning with Multiscale Temporal Fusion and Commonsense Knowledge (Paper)
Tanvir Mahmud, Feng Liang, Yaling Qing, and Diana Marculescu
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 2023
In this paper, we propose CLIP4VideoCap for video captioning based on large-scale pre-trained CLIP image and text encoders together with multi-scale temporal reasoning and commonsense knowledge. In addition to the CLIP-image encoder operating on successive video frames, we introduce a knowledge distillation-based learning scheme that aims to exploit the CLIP-text encoder to generate rich textual knowledge from the image features. For improved temporal reasoning over the video, we propose a multi-scale temporal fusion scheme that accumulates temporal features from different temporal windows. In addition, we integrate various commonsense aspects in the caption generation which greatly enhances the caption quality by extracting the commonsense features from the video in the intermediate phase. Combining these strategies, we achieve state-of-the-art performance on the benchmark MSR-VTT dataset confirming that our framework significantly outperforms existing approaches.
AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization (Paper)
Tanvir Mahmud, and Diana Marculescu
IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, Hawaii, 2023
We introduce AVE-CLIP to exploit AudioCLIP, pre-trained on large-scale audio-image pairs, for improving inter-modal feature correspondence in video audio-visual events (AVEs). We propose a multi-window temporal transformer-based fusion scheme that operates on different timescales of AVE frames to extract local and global variations of multi-modal features. We introduce a temporal feature refinement scheme to increase contrast with the background. Our method achieves state-of-the-art performance on the publicly available AVE dataset with a 5.9% mean accuracy improvement over existing approaches.
RL-Tune: A Deep Reinforcement Learning Assisted Layer-wise Fine-Tuning Approach for Transfer Learning (Paper)
Tanvir Mahmud, Natalia A. Frumkin, and Diana Marculescu
International Conference on Machine Learning (ICML) Pretraining Workshop, Baltimore, Maryland, 2022
We address two critical challenges with transfer learning via fine-tuning: (1) The required amount of fine-tuning greatly depends on the distribution shift from source to target dataset. (2) This distribution shift greatly varies by layer, thereby requiring layer-wise adjustments in fine-tuning to adapt to this distribution shift while preserving the pre-trained network’s feature representation. To overcome these challenges, we propose RL-Tune, a layer-wise fine-tuning framework for transfer learning which leverages reinforcement learning to adjust learning rates as a function of the target data shift. RL-Tune outperforms other state-of-the-art approaches on standard transfer learning benchmarks by a large margin, e.g., 6% mean accuracy improvement on CUB-200-2011 with 15% data.
CovTANet: A Hybrid Tri-level Attention Based Network for Lesion Segmentation, Diagnosis, and Severity Prediction of COVID-19 Chest CT Scans (Paper)
Tanvir Mahmud, Md. Jahin Alam, Sakib Chowdhury, Shams Nafisa Ali, Md. Maisoon Rahman, Shaikh Anowarul Fattah, Mohammad Saquib
IEEE Transactions on Industrial Informatics, Vol. 17, Issue 9, Sep 2021
We propose a hybrid neural network, named CovTANet, that provides an end-to-end clinical diagnostic tool for early diagnosis, lesion segmentation, and severity prediction of COVID-19 from chest computed tomography (CT) scans. A novel tri-level attention mechanism is introduced and repeatedly utilized throughout the network, combining channel, spatial, and pixel attention schemes for faster and more efficient generalization of the contextual information embedded in the feature maps through feature recalibration and enhancement operations. Outstanding performance is achieved in all three tasks on a large publicly available dataset containing 1,110 chest CT volumes.
Sleep Apnea Detection From Variational Mode Decomposed EEG Signal Using a Hybrid CNN-BiLSTM (Paper)
Tanvir Mahmud, Ishtiaque Ahmed Khan, Talha Ibn Mahmud, Shaikh Anowarul Fattah, Wei-Ping Zhu, and M. Omair Ahmad
IEEE ACCESS, Vol. 9, July 2021
We propose an automated deep learning-based approach for the detection of sleep apnea frames from electroencephalogram (EEG) signals. Unlike conventional methods that extract features directly from EEG signals, the proposed method utilizes the variational mode decomposition (VMD) algorithm to decompose the EEG signals into a number of modes. Afterward, a fully convolutional neural network (FCNN) separately extracts temporal features from each VMD mode in parallel while maintaining their temporal dependencies. The study is carried out in a subject-independent manner, where separate subjects are used for training and testing. Extensive experiments on three publicly available datasets yield average accuracies of 93.22%, 93.25%, and 89.41% under the subject-independent cross-validation scheme.
CovSegNet: A Multi Encoder–Decoder Architecture for Improved Lesion Segmentation of COVID-19 Chest CT Scans (Paper)
Tanvir Mahmud, Md Awsafur Rahman, and Shaikh Anowarul Fattah
IEEE Transactions on Artificial Intelligence, Vol. 2, June 2021
We propose an efficient two-phase training scheme in which a deeper 2D network generates a region-of-interest (ROI)-enhanced CT volume, followed by a shallower 3D network. Along with the traditional vertical expansion of U-Net, we introduce horizontal expansion with multi-stage encoder-decoder modules to achieve optimum performance. Additionally, multi-scale feature maps are integrated into the scale transition process to overcome the loss of contextual information. Outstanding performance is achieved on three publicly available datasets.
A Novel Multi-Stage Training Approach for Human Activity Recognition From Multimodal Wearable Sensor Data Using Deep Neural Network (Paper)
Tanvir Mahmud, A. Q. M. Sazzad Sayyed, Shaikh Anowarul Fattah, and Sun-Yuan Kung
IEEE Sensors Journal, Vol. 21, Issue 2, Jan 2021
We propose a novel multi-stage training approach that increases diversity in the feature extraction process to accurately recognize actions by combining a variety of features extracted from diverse perspectives. Initially, instead of using a single type of transformation, numerous transformations are applied to the time-series data to obtain varied representations of the features encoded in the raw data. We achieve state-of-the-art accuracies of 99.29% on the UCI HAR database, 99.02% on the USC HAR database, and 97.21% on the SKODA database.
CovXNet: A multi-dilation convolutional neural network for automatic COVID-19 and other pneumonia detection from chest X-ray images with transferable multi-receptive feature optimization (Paper/Code)
Tanvir Mahmud, Md Awsafur Rahman, and Shaikh Anowarul Fattah
Computers in Biology and Medicine, July 2020
We propose a deep learning-aided automated scheme for detecting COVID-19 and other pneumonia using a small number of COVID-19 chest X-rays. A deep convolutional neural network (CNN) architecture, named CovXNet, is proposed that utilizes depthwise convolution with varying dilation rates to efficiently extract diversified features from chest X-rays. Learning from the initial training phase is transferred through additional fine-tuning layers that are further trained with a smaller number of COVID-19 chest X-rays. We achieve very satisfactory detection performance with an accuracy of 97.4%.
DeepArrNet: An Efficient Deep CNN Architecture for Automatic Arrhythmia Detection and Classification From Denoised ECG Beats (Paper)
Tanvir Mahmud, Shaikh Anowarul Fattah, and Mohammad Saquib
IEEE ACCESS, June 2020
We propose an efficient deep convolutional neural network (CNN) architecture based on depthwise temporal convolution, along with a robust end-to-end scheme, to automatically detect and classify arrhythmia from denoised electrocardiogram (ECG) signals. A structural unit, namely the PTP (Pointwise-Temporal-Pointwise Convolution) unit, is designed with its variants, where depthwise temporal convolutions with varying kernel sizes are combined with preceding and following pointwise convolutions. We achieve state-of-the-art performance on two publicly available datasets across all traditional evaluation metrics.
Education
Ph.D. in Electrical & Computer Engineering, The University of Texas at Austin, Aug 2021 - June 2025 (expected).
M.Sc. in Communication & Electronic Engineering, BUET, Oct 2018 - June 2021.
B.Sc. in Electrical & Electronic Engineering, BUET, Oct 2018.
Academic Service
Reviewer: WACV 2024, IEEE Sensors Journal, IEEE ACCESS, Computers in Biology and Medicine, IEEE Journal of Translational Engineering in Health and Medicine
Member, IEEE Electron Devices Society and IEEE Circuits and Systems Society, Bangladesh Joint Chapter (January 2020 - June 2021)
Lecturer, Department of EEE, BUET, Bangladesh (April 2018 - June 2021, On study leave)
Contact
Mobile: +1 512-720-2450
Email: tanvirmahmud@utexas.edu
Address: 3357 Lake Austin, Austin, TX 78703, USA