Tanvir Mahmud 

PhD Student, The University of Texas at Austin

I am a third-year Ph.D. student in the EnyAC Research Group at The University of Texas at Austin, under the supervision of Professor Diana Marculescu. My core research focus is on computer vision, multi-modal learning, and data-efficient learning. In particular, I have strong working experience in object detection, video understanding, semantic segmentation, semi-supervised learning, long-tail learning, and multi-modal fusion. Previously, I spent two wonderful summers as a research intern at Microsoft and the BOSCH Center for Artificial Intelligence (BCAI). I completed my M.Sc. and B.Sc. in Electrical Engineering at BUET, Bangladesh, and have also served as a Lecturer in the Department of EEE at BUET.

Recent News

Experience

Research Scientist Intern (May 2023 - August 2023)

Microsoft Research, Redmond, WA, 98052, USA

Supervisor: Kazuhito Koishida

Deep Learning Research Intern - Computer Vision (May 2022 - August 2022)

BOSCH Center for Artificial Intelligence, Sunnyvale, CA, USA

Supervisor: Chun-Hao Liu

Graduate Research Assistant (August 2021 - Present)

Electrical and Computer Engineering Department,  The University of Texas at Austin, Austin, USA

Supervisor: Diana Marculescu

Publications

SSVOD: Semi-Supervised Video Object Detection with Sparse Annotations (Paper)

Tanvir Mahmud, Chun-Hao Liu, Burhaneddin Yaman, and Diana Marculescu

IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, Hawaii, 2024

Despite significant progress in semi-supervised learning for image object detection, several key issues are yet to be addressed for video object detection: (1) Achieving good performance for supervised video object detection greatly depends on the availability of annotated frames. (2) Despite having large inter-frame correlations in a video, collecting annotations for a large number of frames per video is expensive, time-consuming, and often redundant. (3) Existing semi-supervised techniques on static images can hardly exploit the temporal motion dynamics inherently present in videos. In this paper, we introduce SSVOD, an end-to-end semi-supervised video object detection framework that exploits motion dynamics of videos to utilize large-scale unlabeled frames with sparse annotations. To selectively assemble robust pseudo-labels across groups of frames, we introduce flow-warped predictions from nearby frames for temporal-consistency estimation. In particular, we introduce cross-IoU and cross-divergence based selection methods over a set of estimated predictions to include robust pseudo-labels for bounding boxes and class labels, respectively. To strike a balance between confirmation bias and uncertainty noise in pseudo-labels, we propose confidence threshold based combination of hard and soft pseudo-labels. Our method achieves significant performance improvements over existing methods on ImageNet-VID, Epic-KITCHENS, and YouTube-VIS datasets.
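To give a concrete feel for the cross-IoU selection idea described above, here is a minimal, hypothetical sketch (not the paper's implementation): it keeps a pseudo-box only when it is both confident and consistent with flow-warped predictions from a nearby frame. The function name and thresholds are illustrative assumptions.

```python
# Illustrative sketch only: cross-IoU-style agreement filtering between a
# frame's detections and flow-warped detections from a neighboring frame.
import torch
from torchvision.ops import box_iou

def select_pseudo_boxes(pred_boxes, pred_scores, warped_boxes,
                        iou_thresh=0.7, score_thresh=0.8):
    """Keep boxes that are confident AND agree with the flow-warped set.

    pred_boxes:   (N, 4) boxes predicted on the current unlabeled frame
    pred_scores:  (N,)   confidence scores for those boxes
    warped_boxes: (M, 4) predictions from a nearby frame, warped by optical flow
    Thresholds are hypothetical values chosen for illustration.
    """
    if pred_boxes.numel() == 0 or warped_boxes.numel() == 0:
        return pred_boxes.new_zeros((0, 4))
    ious = box_iou(pred_boxes, warped_boxes)      # (N, M) cross-IoU matrix
    best_iou, _ = ious.max(dim=1)                 # agreement with warped predictions
    keep = (best_iou >= iou_thresh) & (pred_scores >= score_thresh)
    return pred_boxes[keep]

boxes = torch.tensor([[10., 10., 50., 50.], [60., 60., 90., 90.]])
scores = torch.tensor([0.95, 0.60])
warped = torch.tensor([[12., 11., 49., 52.]])
print(select_pseudo_boxes(boxes, scores, warped))  # keeps only the first box
```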

CLIP4VideoCap: Rethinking CLIP for Video Captioning with Multiscale Temporal Fusion and Commonsense Knowledge (Paper)

Tanvir Mahmud, Feng Liang, Yaling Qing, and Diana Marculescu

2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 2023

In this paper, we propose CLIP4VideoCap for video captioning based on large-scale pre-trained CLIP image and text encoders together with multi-scale temporal reasoning and commonsense knowledge. In addition to the CLIP image encoder operating on successive video frames, we introduce a knowledge distillation-based learning scheme that exploits the CLIP text encoder to generate rich textual knowledge from the image features. For improved temporal reasoning over the video, we propose a multi-scale temporal fusion scheme that accumulates temporal features from different temporal windows. In addition, we integrate various commonsense aspects into the caption generation, which greatly enhances caption quality by extracting commonsense features from the video in an intermediate phase. Combining these strategies, we achieve state-of-the-art performance on the benchmark MSR-VTT dataset, confirming that our framework significantly outperforms existing approaches.
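As a rough illustration of the multi-scale temporal fusion idea (a sketch under assumed layer sizes and window lengths, not the paper's architecture), one can pool CLIP frame features over several temporal window lengths and mix the resulting scales:

```python
# Illustrative sketch only: pooling CLIP frame features over several temporal
# window lengths and mixing the scales. Dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleTemporalFusion(nn.Module):
    def __init__(self, dim=512, window_sizes=(2, 4, 8)):
        super().__init__()
        self.window_sizes = window_sizes
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in window_sizes])
        self.mixer = nn.Linear(dim * len(window_sizes), dim)

    def forward(self, frame_feats):                      # (B, T, D) CLIP image features
        pooled = []
        for w, proj in zip(self.window_sizes, self.proj):
            x = frame_feats.transpose(1, 2)              # (B, D, T)
            x = F.avg_pool1d(x, kernel_size=w, stride=w, ceil_mode=True)
            x = F.interpolate(x, size=frame_feats.size(1), mode="nearest")
            pooled.append(proj(x.transpose(1, 2)))       # back to (B, T, D)
        return self.mixer(torch.cat(pooled, dim=-1))     # fused (B, T, D)

feats = torch.randn(2, 16, 512)                          # 2 clips, 16 frames, dim 512
fused = MultiScaleTemporalFusion()(feats)                # torch.Size([2, 16, 512])
```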

AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization (Paper)

Tanvir Mahmud, and Diana Marculescu

IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, Hawaii, 2023

We introduce AVE-CLIP to exploit AudioCLIP, pre-trained on large-scale audio-image pairs, for improving inter-modal feature correspondence on video AVEs. We propose a multi-window temporal transformer-based fusion scheme that operates on different timescales of AVE frames to extract local and global variations of multi-modal features. We introduce a temporal feature refinement scheme to increase contrast with the background. Our method achieves state-of-the-art performance on the publicly available AVE dataset with a 5.9% mean accuracy improvement, demonstrating its superiority over existing approaches.
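The multi-window fusion idea can be sketched roughly as follows (a hypothetical simplification, not the paper's model): a shared transformer encoder is run over temporal windows of several lengths and the per-scale outputs are averaged.

```python
# Illustrative sketch only: a shared transformer applied over temporal windows
# of different lengths; window lengths and dimensions are assumptions.
import torch
import torch.nn as nn

class MultiWindowTransformer(nn.Module):
    def __init__(self, dim=256, window_sizes=(2, 5, 10), nhead=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.window_sizes = window_sizes

    def forward(self, x):                                # (B, T, D) audio-visual features
        scale_outputs = []
        for w in self.window_sizes:
            chunks = [self.encoder(x[:, s:s + w]) for s in range(0, x.size(1), w)]
            scale_outputs.append(torch.cat(chunks, dim=1))   # (B, T, D) at this scale
        return torch.stack(scale_outputs).mean(dim=0)        # average across scales

x = torch.randn(2, 10, 256)                              # 2 clips, 10 one-second segments
out = MultiWindowTransformer()(x)                        # torch.Size([2, 10, 256])
```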

RL-Tune: A Deep Reinforcement Learning Assisted Layer-wise Fine-Tuning Approach for Transfer Learning (Paper)

Tanvir Mahmud, Natalia A. Frumkin, and Diana Marculescu

International Conference on Machine Learning (ICML) Pre-training Workshop, Baltimore, Maryland, 2022

We address two critical challenges with transfer learning via fine-tuning: (1) The required amount of fine-tuning greatly depends on the distribution shift from source to target dataset. (2) This distribution shift greatly varies by layer, thereby requiring layer-wise adjustments in fine-tuning to adapt to this distribution shift while preserving the pre-trained network’s feature representation. To overcome these challenges, we propose RL-Tune, a layer-wise fine-tuning framework for transfer learning which leverages reinforcement learning to adjust learning rates as a function of the target data shift. RL-Tune outperforms other state-of-the-art approaches on standard transfer learning benchmarks by a large margin, e.g., 6% mean accuracy improvement on CUB-200-2011 with 15% data.
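As a loose illustration of layer-wise learning-rate control (a sketch with made-up numbers, not RL-Tune's actual policy or training loop), a per-layer action vector can be mapped to optimizer parameter groups:

```python
# Illustrative sketch only: turning a per-layer action vector (which in RL-Tune
# would come from a learned policy) into layer-wise learning rates.
import torch
import torchvision

model = torchvision.models.resnet18(weights=None)        # stand-in pre-trained backbone
layer_groups = [model.layer1, model.layer2, model.layer3, model.layer4, model.fc]

# Hypothetical actions in [0, 1], one per layer group; hard-coded purely for
# illustration instead of being predicted by an RL agent.
action = torch.tensor([0.1, 0.2, 0.4, 0.8, 1.0])
base_lr = 1e-3

param_groups = [
    {"params": g.parameters(), "lr": float(a) * base_lr}
    for g, a in zip(layer_groups, action)
]
optimizer = torch.optim.SGD(param_groups, lr=base_lr, momentum=0.9)
for pg in optimizer.param_groups:
    print(pg["lr"])              # approximately 1e-4, 2e-4, 4e-4, 8e-4, 1e-3
```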

CovTANet: A Hybrid Tri-level Attention Based Network for Lesion Segmentation, Diagnosis, and Severity Prediction of COVID-19 Chest CT Scans (Paper)

Tanvir Mahmud, Md. Jahin Alam, Sakib Chowdhury, Shams Nafisa Ali, Md. Maisoon Rahman, Shaikh Anowarul Fattah, Mohammad Saquib

IEEE Transactions on Industrial Informatics, Vol. 17, Issue 9, Sep 2021

We propose a hybrid neural network, named CovTANet, that provides an end-to-end clinical diagnostic tool for early diagnosis, lesion segmentation, and severity prediction of COVID-19 using chest computed tomography (CT) scans. A novel tri-level attention mechanism is introduced and repeatedly utilized over the network, combining channel, spatial, and pixel attention schemes for faster and more efficient generalization of the contextual information embedded in the feature map through feature recalibration and enhancement operations. Outstanding performance is achieved in all three tasks on a large publicly available dataset containing 1110 chest CT volumes.
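A rough sketch of how channel, spatial, and pixel attention might be combined in one recalibration block (an assumption-laden simplification, not CovTANet's actual module):

```python
# Illustrative sketch only: combining channel, spatial, and pixel attention for
# feature recalibration. Layer choices and sizes are assumptions.
import torch
import torch.nn as nn

class TriLevelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel = nn.Sequential(                    # channel attention (SE-style)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(                    # spatial attention over H x W
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid())
        self.pixel = nn.Sequential(                      # per-pixel, per-channel gating
            nn.Conv2d(channels, channels, kernel_size=1), nn.Sigmoid())

    def forward(self, x):                                # (B, C, H, W) feature map
        x = x * self.channel(x)
        x = x * self.spatial(x)
        return x * self.pixel(x)

feat = torch.randn(1, 32, 64, 64)
out = TriLevelAttention(32)(feat)                        # torch.Size([1, 32, 64, 64])
```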

Sleep Apnea Detection From Variational Mode Decomposed EEG Signal Using a Hybrid CNN-BiLSTM (Paper)

Tanvir Mahmud, Ishtiaque Ahmed Khan, Talha Ibn Mahmud, Shaikh Anowarul Fattah, Wei-Ping Zhu, and M. Omair Ahmad

IEEE ACCESS, Vol. 9, July 2021

We propose an automated deep learning-based approach for the detection of sleep apnea frames from electroencephalogram (EEG) signals. Unlike conventional methods that extract features directly from EEG signals, the proposed method uses the variational mode decomposition (VMD) algorithm to decompose the EEG signals into a number of modes. Afterward, a fully convolutional neural network (FCNN) separately extracts temporal features from each VMD mode in parallel while maintaining their temporal dependencies. The study is carried out in a subject-independent manner, with separate subjects used for training and testing. Extensive experiments on three publicly available datasets yield average accuracies of 93.22%, 93.25%, and 89.41% in the subject-independent cross-validation scheme.
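A compact sketch of the per-mode, parallel feature-extraction idea (a hypothetical simplification with assumed shapes and layer sizes; the VMD decomposition itself is assumed to be done upstream):

```python
# Illustrative sketch only: per-mode 1-D CNN branches applied in parallel to VMD
# modes of an EEG segment, followed by a BiLSTM. Shapes are assumptions.
import torch
import torch.nn as nn

class PerModeCNNBiLSTM(nn.Module):
    def __init__(self, n_modes=5, hidden=64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(),
                nn.Conv1d(16, 32, kernel_size=7, padding=3), nn.ReLU(),
                nn.AdaptiveAvgPool1d(50))                # keep temporal order
            for _ in range(n_modes)])
        self.bilstm = nn.LSTM(32 * n_modes, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)             # apnea vs. normal frame

    def forward(self, modes):                            # (B, n_modes, L) decomposed EEG
        feats = [b(modes[:, i:i + 1]) for i, b in enumerate(self.branches)]
        x = torch.cat(feats, dim=1).transpose(1, 2)      # (B, 50, 32 * n_modes)
        out, _ = self.bilstm(x)
        return self.head(out[:, -1])                     # per-segment logits

eeg_modes = torch.randn(4, 5, 3000)          # 4 segments, 5 VMD modes, 3000 samples
logits = PerModeCNNBiLSTM()(eeg_modes)       # torch.Size([4, 2])
```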

CovSegNet: A Multi Encoder–Decoder Architecture for Improved Lesion Segmentation of COVID-19 Chest CT Scans (Paper)

Tanvir Mahmud, Md Awsafur Rahman, and Shaikh Anowarul Fattah

IEEE Transactions on Artificial Intelligence, Vol. 2, June 2021

We propose an efficient two-phase training scheme in which a deeper 2-D network generates a region-of-interest (ROI)-enhanced CT volume that is then processed by a shallower 3-D network. Along with the traditional vertical expansion of U-Net, we introduce horizontal expansion with multi-stage encoder-decoder modules for achieving optimum performance. Additionally, multi-scale feature maps are integrated into the scale-transition process to overcome the loss of contextual information. Outstanding performance is achieved on three publicly available datasets.
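The two-phase pipeline can be sketched very roughly as follows (stand-in toy networks only, not CovSegNet itself): a 2-D network produces a slice-wise ROI mask that enhances the CT volume before a shallower 3-D network segments lesions.

```python
# Illustrative sketch only: 2-D ROI enhancement followed by a shallow 3-D
# segmentation network. Both networks below are toy stand-ins.
import torch
import torch.nn as nn

roi_net_2d = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                           nn.Conv2d(8, 1, 1), nn.Sigmoid())     # slice-wise ROI mask
seg_net_3d = nn.Sequential(nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(),
                           nn.Conv3d(8, 1, 1))                   # volume-wise lesion logits

volume = torch.randn(1, 1, 16, 64, 64)          # (B, C, D, H, W) chest CT volume
slices = volume.squeeze(0).permute(1, 0, 2, 3)  # (D, C, H, W): treat depth as batch
roi = roi_net_2d(slices).permute(1, 0, 2, 3).unsqueeze(0)   # back to (B, C, D, H, W)
enhanced = volume * roi                         # phase 1: ROI-enhanced volume
lesion_logits = seg_net_3d(enhanced)            # phase 2: shallow 3-D segmentation
```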

A Novel Multi-Stage Training Approach for Human Activity Recognition From Multimodal Wearable Sensor Data Using Deep Neural Network (Paper)

Tanvir Mahmud, A. Q. M. Sazzad Sayyed, Shaikh Anowarul Fattah, and Sun-Yuan Kung

IEEE Sensors Journal, Vol. 21, Issue 2, Jan 2021

We propose a novel multi-stage training approach that increases diversity in the feature extraction process to enable accurate recognition of actions by combining a variety of features extracted from diverse perspectives. Initially, instead of using a single type of transformation, numerous transformations are employed on the time-series data to obtain variegated representations of the features encoded in the raw data. We achieve state-of-the-art accuracy of 99.29% on the UCI HAR database, 99.02% on the USC HAR database, and 97.21% on the SKODA database.
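As a toy illustration of feeding multiple transformed views of the same sensor window to separate feature extractors (the specific transforms below are assumptions, not necessarily those used in the paper):

```python
# Illustrative sketch only: producing several representations of one raw sensor
# window so that separate extractors can learn from each view.
import numpy as np

def transform_views(window: np.ndarray) -> dict:
    """window: (T, C) array of accelerometer/gyroscope samples."""
    return {
        "raw": window,
        "fft_mag": np.abs(np.fft.rfft(window, axis=0)),  # frequency-domain view
        "diff": np.diff(window, axis=0),                 # first-order differences
    }

window = np.random.randn(128, 6)                # 128 samples, 6 sensor channels
views = transform_views(window)
print({k: v.shape for k, v in views.items()})
```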

CovXNet: A multi-dilation convolutional neural network for automatic COVID-19 and other pneumonia detection from chest X-ray images with transferable multi-receptive feature optimization (Paper/Code)

Tanvir Mahmud, Md Awsafur Rahman, and Shaikh Anowarul Fattah

Computers in Biology and Medicine, July 2020

We propose deep learning-aided automated schemes for COVID-19 and other pneumonia detection that utilize a small amount of COVID-19 chest X-rays. A deep convolutional neural network (CNN) based architecture, named CovXNet, is proposed that utilizes depthwise convolution with varying dilation rates for efficiently extracting diversified features from chest X-rays. Learning from the initial training phase is transferred through additional fine-tuning layers that are further trained with a smaller number of COVID-19 chest X-rays. We achieve very satisfactory detection performance with 97.4% accuracy.
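A minimal sketch of a multi-dilation depthwise block in the spirit described above (channel sizes and dilation rates are assumptions, not CovXNet's exact configuration):

```python
# Illustrative sketch only: depthwise convolutions with several dilation rates,
# fused by a pointwise convolution, to gather multi-receptive-field features.
import torch
import torch.nn as nn

class MultiDilationDepthwise(nn.Module):
    def __init__(self, channels=32, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=d,
                          dilation=d, groups=channels),  # depthwise, dilated
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
            for d in dilations])
        self.pointwise = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x):                                # (B, C, H, W)
        return self.pointwise(torch.cat([b(x) for b in self.branches], dim=1))

x = torch.randn(1, 32, 56, 56)
y = MultiDilationDepthwise()(x)                          # torch.Size([1, 32, 56, 56])
```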

DeepArrNet: An Efficient Deep CNN Architecture for Automatic Arrhythmia Detection and Classification From Denoised ECG Beats (Paper)

Tanvir Mahmud, Shaikh Anowarul Fattah, and Mohammad Saquib

IEEE ACCESS, June 2020

We propose an efficient deep convolutional neural network (CNN) architecture based on depthwise temporal convolution, along with a robust end-to-end scheme, to automatically detect and classify arrhythmia from denoised electrocardiogram (ECG) signals. A structural unit, namely the PTP (Pointwise-Temporal-Pointwise Convolution) unit, is designed along with variants in which depthwise temporal convolutions with varying kernel sizes are combined with preceding and following pointwise convolutions. We achieve state-of-the-art performance on two publicly available datasets across all traditional evaluation metrics.
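A rough sketch of a pointwise-temporal-pointwise unit for 1-D signals (channel widths and kernel size are assumptions, not DeepArrNet's exact settings):

```python
# Illustrative sketch only: pointwise conv -> depthwise temporal conv ->
# pointwise conv on 1-D ECG beat features.
import torch
import torch.nn as nn

class PTPUnit(nn.Module):
    def __init__(self, in_ch=32, mid_ch=64, out_ch=32, kernel_size=9):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_ch, mid_ch, kernel_size=1),                 # pointwise
            nn.BatchNorm1d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv1d(mid_ch, mid_ch, kernel_size=kernel_size,
                      padding=kernel_size // 2, groups=mid_ch),      # depthwise temporal
            nn.BatchNorm1d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv1d(mid_ch, out_ch, kernel_size=1))                # pointwise

    def forward(self, x):                                # (B, C, L) beat features
        return self.block(x)

beat = torch.randn(8, 32, 280)                           # 8 denoised ECG beats
out = PTPUnit()(beat)                                    # torch.Size([8, 32, 280])
```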

Education

Academic Service

Contact