I am currently working at Google Research in Mountain View, CA, as a Software Engineer, contributing across Research, Google DeepMind, and Product verticals. Prior to this, I completed my Ph.D. in Electrical and Computer Engineering at The University of Texas at Austin, advised by Prof. Diana Marculescu.
My research focuses on multimodal learning, generative AI, and speech/audio processing. I have led several high-impact projects, resulting in first-author publications at top-tier conferences including EMNLP, CVPR, ECCV, ICLR, and ICASSP.
Previously, I interned at Google, Microsoft, and BOSCH Center for AI, contributing to state-of-the-art advancements in audio, NLP, and video understanding.
I hold both M.Sc. and B.Sc. degrees in Electrical and Electronic Engineering from Bangladesh University of Engineering and Technology (BUET), where I also served as a Lecturer, teaching undergraduate courses and mentoring junior researchers.
In my free time, I enjoy traveling with family, catching up on sleep, and reading about the latest developments in AI.
LinkedIn / Google Scholar / Github / Email
Recent News
Feb 2025: Joined Google Research as a Software Engineer!
Feb 2025: Successfully defended my PhD thesis!
Oct 2024: Ada-VE got accepted in WACV 2025 (Paper)!
Sep 2024: OpenSep got accepted in EMNLP 2024 (Main) (Paper)!
August 2024: Started Fall'24 internship at Google!
August 2024: Finished Summer'24 internship at Microsoft.
July 2024: PaPr is accepted in ECCV 2024! See you in Milan!
May 2024: Started Summer'24 internship at Microsoft!
April 2024: One paper got accepted to ECV'24 workshop at CVPR 2024 (oral)!
March 2024: PaPr: Training-Free One-step Patch Pruning is on arXiv now (Link).
February 2024: T-VSL got accepted in CVPR-2024 (Paper).
January 2024: Weakly supervised audio separation got accepted in ICLR-2024 (Paper).
October 2023: Instance Aware Sampling got accepted in NeurIPS Workshop, 2023 (Paper).
October 2023: SSVOD got accepted in WACV 2024 (Paper).
May 2023: Joined as a Research Scientist Intern at Microsoft Research.
February 2023: One paper (Clip4VideoCap) got accepted in ICASSP 2023 (Paper).
January 2023: I am going to join Microsoft Research for Summer '23 Internship.
August 2022: One paper (AVE-CLIP) got accepted in WACV 2023 (Paper).
July 2022: One paper presented at the ICML 2022 Workshop on Pre-training (Paper).
May 2022: I will join as a summer intern at BCAI, Sunnyvale, CA.
August 2021: Started my Ph.D. at the University of Texas at Austin.
July 2021: Received Cockrell School of Engineering Fellowship at UT Austin.
June 2021: Successfully defended my M.Sc. thesis at BUET (Thesis).
April 2020: I will join as a Lecturer at the Dept. of EEE, BUET.
Software Engineer, Google Research (February 2025 – Current)
Mountain View, CA, 94043, USA
Working on multimodal generative AI and LLM integration across research and product verticals.
Research (PhD) Intern, Google (August 2024 – November 2024)
Mountain View, CA, 94043, USA
Developed a neural-network-based beamforming architecture that improves ASR performance by up to 6 dB/WER on binaural recordings from wearable earbuds, optimizing ASR outputs to support downstream multimodal LLM (Gemini) applications.
Research Intern, Microsoft (May 2024 – August 2024)
Redmond, WA, 98052, USA
Implemented precise editing of musical instruments (addition, replacement, removal, and transfer) using generative diffusion models. Achieved over 85% preference in user trials compared to state-of-the-art methods.
Research Intern, Microsoft (May 2023 – August 2023)
Redmond, WA, 98052, USA
Proposed a weakly supervised framework for separating single-source sounds from multi-source audio mixtures for music and audio processing applications, leveraging bi-modal audio-language models. Achieved 97.5% of single-source supervised training performance while training on audio mixtures only.
Research Intern, BOSCH Center for Artificial Intelligence (May 2022 – August 2022)
Sunnyvale, CA, USA
Proposed a novel semi-supervised video object detection framework for autonomous driving that reaches roughly 98% of the supervised performance obtained with 15 annotated frames per video, while using only a single annotated frame per video.
Lecturer, Bangladesh University of Engineering and Technology (March 2019 – August 2021)
Dhaka, Bangladesh
Taught undergraduate courses, mentored research students, and organized conferences and workshops.
Ph.D. in Electrical and Computer Engineering (2021 – 2025)
The University of Texas at Austin, Austin, TX, USA (CGPA: 4.00/4.00)
Cockrell School of Engineering Fellowship (2021-2025)
M.Sc. in Electrical and Electronic Engineering (2018 – 2021)
Bangladesh University of Engineering and Technology, Dhaka, Bangladesh (CGPA: 4.00/4.00)
B.Sc. in Electrical and Electronic Engineering (2014 – 2018)
Bangladesh University of Engineering and Technology, Dhaka, Bangladesh (CGPA: 3.98/4.00)
Ada-VE: Training-Free Consistent Video Editing Using Adaptive Motion Prior (Paper)
Tanvir Mahmud, Mustafa Munir, Radu Marculescu, and Diana Marculescu
The 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Arizona, USA
Video-to-video synthesis models struggle with consistent character generation, smooth temporal transitions, and quality during fast motion. Joint fully cross-frame self-attention mechanisms improve character consistency but add significant computational complexity, limiting frame count and introducing redundancy, which affects temporal consistency and quality. To address these issues, we propose an adaptive motion-guided cross-frame attention that reduces complexity while preserving semantic details and consistency. By selectively incorporating moving regions based on optical flow sampling, we enable editing more frames jointly without extra computational cost. For longer videos, we introduce KV-caching of edited frames to enhance intermediate frame quality and temporal consistency, allowing our approach, Ada-VE, to achieve 3x more keyframes and up to 4x speed-up across 40 frames without quality loss.
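To make the idea concrete, here is a minimal sketch of the flow-guided token selection behind the adaptive cross-frame attention. The tensor shapes, the sampling ratio, and the select_motion_tokens helper are illustrative assumptions, not the released implementation.

```python
# Rough sketch: only tokens from high-motion regions of other frames are added
# to each frame's key/value set for cross-frame attention. Shapes and the
# sampling ratio are assumptions, not Ada-VE's actual configuration.
import torch

def select_motion_tokens(frame_tokens, flow_mag, ratio=0.25):
    """frame_tokens: (F, N, D) per-frame patch tokens; flow_mag: (F, N) per-patch flow magnitude."""
    F_, N, D = frame_tokens.shape
    k = max(1, int(ratio * N))
    idx = flow_mag.topk(k, dim=1).indices                       # moving patches per frame
    moving = torch.gather(frame_tokens, 1, idx[..., None].expand(-1, -1, D))
    return moving.reshape(1, F_ * k, D)                         # shared extra keys/values

tokens = torch.randn(8, 196, 64)         # 8 frames, 14x14 patches, dim 64
flow_mag = torch.rand(8, 196)
extra_kv = select_motion_tokens(tokens, flow_mag)
queries = tokens[0:1]                     # frame 0 attends to itself plus moving tokens
kv = torch.cat([queries, extra_kv], dim=1)
out = torch.nn.functional.scaled_dot_product_attention(queries, kv, kv)
print(out.shape)                          # (1, 196, 64)
```

Restricting the extra keys/values to high-motion patches is what keeps the joint attention cost roughly constant as more frames are edited together.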
OpenSep: Leveraging Large Language Models with Textual Inversion for Open World Audio Separation (Paper)
Tanvir Mahmud, Diana Marculescu
The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP-Main), Florida, USA
Audio separation in real-world scenarios, where mixtures contain a variable number of sources, presents significant challenges due to limitations of existing models, such as over-separation, under-separation, and dependence on predefined training sources. We propose OpenSep, a novel framework that leverages large language models (LLMs) for automated audio separation, eliminating the need for manual intervention and overcoming source limitations. OpenSep uses textual inversion to generate captions from audio mixtures with off-the-shelf audio captioning models, effectively parsing the sound sources present. It then employs few-shot LLM prompting to extract detailed audio properties of each parsed source, facilitating separation in unseen mixtures. Additionally, we introduce a multi-level extension of the mix-and-separate training framework to enhance modality alignment by separating single source sounds and mixtures simultaneously. Extensive experiments demonstrate OpenSep's superiority in precisely separating new, unseen, and variable sources in challenging mixtures, outperforming SOTA baseline methods.
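A rough sketch of the pipeline shape only: caption the mixture, ask an LLM to parse and enrich each source, then condition a text-queried separator on each description. Every function below is a placeholder, not OpenSep's actual components or prompts.

```python
# Pipeline-shape sketch only; all three components are hypothetical stand-ins.
def caption_mixture(audio):
    return "a dog barks while an acoustic guitar plays"          # off-the-shelf captioner

def parse_sources_with_llm(caption):
    prompt = (
        "List each distinct sound source in the caption and describe its "
        f"acoustic properties (pitch, timbre, duration).\nCaption: {caption}"
    )
    # placeholder for a few-shot LLM call using the prompt above
    return ["a dog barking: short, repetitive, mid-pitched bursts",
            "an acoustic guitar: sustained, harmonic, string timbre"]

def separate(audio, text_query):
    return f"waveform conditioned on '{text_query}'"             # text-queried separator

mixture = "mixture.wav"
for query in parse_sources_with_llm(caption_mixture(mixture)):
    print(separate(mixture, query))
```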
PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference (Paper)
Tanvir Mahmud, Burhaneddin Yaman, Chun-Hao Liu, Diana Marculescu
The European Conference on Computer Vision (ECCV), Milan, 2024
As deep neural networks evolve from ConvNets to advanced vision transformers (ViTs), eliminating redundant data for faster processing without losing accuracy becomes crucial. Current methods are often architecture-specific or require re-training, limiting their flexibility. We address this by revealing a new property of lightweight ConvNets: their capacity to identify key discriminative patches in images independently of model accuracy or size. We show that suppressing fully-connected layers through simple weight recalibration improves patch localization. Building on this, we introduce PaPr, a patch pruning method for reducing redundancy across various architectures—ConvNets, ViTs, and hybrids—without re-training. PaPr achieves up to 70% patch reduction in videos with less than 0.8% accuracy loss and up to 3.7x FLOPs reduction, outperforming existing methods by 15% in reduction with 2.5% higher accuracy.
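A minimal sketch of the core idea, assuming a MobileNetV3 feature extractor as the lightweight proposal ConvNet and a made-up keep ratio; this is not the paper's released code.

```python
# Score 16x16 patches with a lightweight ConvNet and keep only the top fraction
# before running the main model. The proposal network and keep_ratio are assumptions.
import torch
import torch.nn.functional as F
import torchvision

def patch_keep_indices(images, keep_ratio=0.3, patch=16):
    """Return indices of the most discriminative patches for each image."""
    proposal = torchvision.models.mobilenet_v3_small(weights=None).features.eval()
    with torch.no_grad():
        fmap = proposal(images)                     # (B, C, h, w) low-res feature map
        saliency = fmap.mean(dim=1, keepdim=True)   # class-agnostic spatial score
        grid = images.shape[-1] // patch
        saliency = F.interpolate(saliency, size=(grid, grid), mode="bilinear")
        scores = saliency.flatten(1)                # one score per patch
    k = max(1, int(keep_ratio * scores.shape[1]))
    return scores.topk(k, dim=1).indices            # patch tokens passed to the ViT

images = torch.randn(2, 3, 224, 224)
print(patch_keep_indices(images).shape)             # (2, k) retained patch indices
```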
T-VSL: Text-Guided Visual Sound Source Localization in Mixtures (Paper)
Tanvir Mahmud, Yapeng Tian, Diana Marculescu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, 2024
Visual sound source localization is challenging, especially in identifying each sound source's region in multi-source videos. Current self- and weakly supervised methods struggle in complex multi-source scenarios due to limited audio-visual correspondence. To address this, we introduce T-VSL, which incorporates text as an intermediate guide using tri-modal embeddings (e.g., AudioCLIP) to disentangle audio-visual correspondence. T-VSL predicts sound classes and uses text representations to refine source localization, allowing flexible source handling and strong zero-shot transfer to unseen classes. Experiments on MUSIC, VGGSound, and VGGSound-Instruments show significant improvements over current methods.
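An illustrative sketch of the text-as-bridge mechanism, with random stand-in embeddings in place of a real tri-modal encoder such as AudioCLIP; the dimensions and the assumed number of sounding sources are not from the paper.

```python
# Text embeddings that best match the audio are used to score visual patches
# into a localization map. All embeddings here are random stand-ins.
import torch
import torch.nn.functional as F

D = 128
text_emb = F.normalize(torch.randn(10, D), dim=-1)        # 10 candidate class names
audio_emb = F.normalize(torch.randn(1, D), dim=-1)        # embedding of the audio mixture
patch_emb = F.normalize(torch.randn(196, D), dim=-1)      # 14x14 visual patch embeddings

# 1) predict which classes are sounding from audio-text similarity
cls_scores = (audio_emb @ text_emb.T).squeeze(0)
active = cls_scores.topk(2).indices                       # assume two sounding sources

# 2) localize each predicted class via text-visual patch similarity
for c in active:
    heatmap = (patch_emb @ text_emb[c]).reshape(14, 14)   # per-patch relevance map
    print(c.item(), heatmap.argmax().item())
```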
Weakly-supervised Audio Separation via Bi-modal Semantic Similarity (Paper)
Tanvir Mahmud, Saeed Amizadeh, Kazuhito Koishida, Diana Marculescu
The Twelfth International Conference on Learning Representations (ICLR), Vienna, 2024
Separating single-source sounds in multi-source audio mixtures without single-source training data is a long-standing challenge. Existing methods struggle with multi-source mixtures due to the lack of single-source supervision. However, in language-conditional audio separation, we have text descriptions for each mixture in the training data, which serve as rough representations of the audio. In this paper, we propose a bi-modal separation framework that enhances unsupervised separation by leveraging signals in the conditioning modality (language) to separate single-source audio without single-source supervision. Using a pretrained joint embedding model (CLAP), we show that our method improves unsupervised baselines by reducing the distribution shift between training and test data, achieving a 71% SDR boost—reaching 97.5% of supervised performance. Additionally, we demonstrate a 17% improvement in supervised learning when augmented with our weakly-supervised framework, enabling a robust semi-supervised approach for audio separation.
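A toy sketch of the bi-modal weak-supervision signal, with tiny stand-in encoders where the paper uses CLAP; the part being illustrated is the contrastive alignment between separated sources and their text prompts, not the real model or loss weights.

```python
# Predicted sources are scored against their text prompts in a joint
# audio-language embedding space. The encoders below are stand-ins, not CLAP.
import torch
import torch.nn as nn
import torch.nn.functional as F

audio_encoder = nn.Sequential(nn.Flatten(), nn.Linear(16000, 128))  # stand-in audio tower
text_encoder = nn.Embedding(1000, 128)                               # stand-in text tower

def bimodal_separation_loss(pred_sources, text_ids):
    """pred_sources: (B, S, T) separated waveforms; text_ids: (B, S) prompt ids."""
    a = F.normalize(audio_encoder(pred_sources.flatten(0, 1)), dim=-1)  # (B*S, D)
    t = F.normalize(text_encoder(text_ids.flatten()), dim=-1)           # (B*S, D)
    logits = a @ t.T / 0.07                        # cosine similarities as logits
    targets = torch.arange(logits.shape[0])        # each source should match its own prompt
    return F.cross_entropy(logits, targets)        # contrastive alignment loss

loss = bimodal_separation_loss(torch.randn(2, 3, 16000), torch.randint(0, 1000, (2, 3)))
print(loss.item())
```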
SSVOD: Semi-Supervised Video Object Detection with Sparse Annotations (Paper)
Tanvir Mahmud, Chun-Hao Liu, Burhaneddin Yaman, and Diana Marculescu
IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, Hawaii, 2024
Despite progress in semi-supervised image object detection, video object detection still faces challenges: (1) strong performance relies on annotated frames, (2) annotating multiple video frames is costly and often redundant, and (3) current techniques for static images don’t leverage video motion dynamics. We introduce SSVOD, an end-to-end semi-supervised video object detection framework that uses motion dynamics to make use of unlabeled frames with sparse annotations. By using flow-warped predictions for temporal consistency, we assemble robust pseudo-labels with cross-IoU and cross-divergence for bounding boxes and class labels. To balance confirmation bias and noise, we combine hard and soft pseudo-labels with confidence thresholds. SSVOD shows notable improvements over existing methods on ImageNet-VID, Epic-KITCHENS, and YouTube-VIS datasets.
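A simplified sketch of the flow-consistent pseudo-label selection; the box-warping heuristic and the threshold are illustrative assumptions rather than the exact cross-IoU/cross-divergence procedure from the paper.

```python
# Teacher boxes are warped to a nearby unlabeled frame with optical flow and
# kept only if they agree with detections on that frame (IoU check).
import torch
from torchvision.ops import box_iou

def flow_warp_boxes(boxes, flow):
    """boxes: (N, 4) xyxy; flow: (2, H, W) dense flow. Shift each box by its mean flow."""
    shifted = boxes.clone()
    for i, (x1, y1, x2, y2) in enumerate(boxes.long()):
        dx = flow[0, y1:y2, x1:x2].mean()
        dy = flow[1, y1:y2, x1:x2].mean()
        shifted[i] += torch.stack([dx, dy, dx, dy])
    return shifted

def select_pseudo_labels(teacher_boxes, flow, student_boxes, iou_thr=0.5):
    warped = flow_warp_boxes(teacher_boxes, flow)
    iou = box_iou(warped, student_boxes)            # (N_teacher, N_student)
    keep = iou.max(dim=1).values > iou_thr          # temporally consistent boxes only
    return warped[keep]

teacher = torch.tensor([[10., 10., 60., 60.], [100., 80., 160., 140.]])
student = torch.tensor([[12., 11., 63., 62.]])
flow = torch.zeros(2, 240, 320)
print(select_pseudo_labels(teacher, flow, student))
```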
CLIP4VideoCap: Rethinking CLIP for Video Captioning with Multiscale Temporal Fusion and Commonsense Knowledge (Paper)
Tanvir Mahmud, Feng Liang, Yaling Qing, and Diana Marculescu
2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 2023
AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization (Paper)
Tanvir Mahmud, and Diana Marculescu
IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, Hawaii, 2023
An audio-visual event (AVE) aligns visual and auditory signals in a video segment, but precise AVE localization is challenging due to the need for multi-modal feature correspondence across short and long temporal interactions. Existing methods struggle with capturing these scales due to limited multi-modal training strategies. To address this, we introduce AVE-CLIP, a framework combining pre-trained AudioCLIP with a multi-window temporal transformer to handle different temporal scales. Our contributions: (1) a multi-stage training framework to integrate AudioCLIP with AVE localization via contrastive fine-tuning and multi-scale training, (2) a multi-domain attention mechanism for local-global feature fusion, and (3) a temporal refining scheme with event-guided attention and post-processing for handling background variation. AVE-CLIP achieves state-of-the-art performance on the AVE dataset with a 5.9% accuracy improvement.
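A toy sketch of the multi-window temporal fusion idea: the same fused per-frame features are attended within short and long temporal windows and averaged. Window sizes, dimensions, and the averaging rule are assumptions, not the paper's configuration.

```python
# Attend within several temporal window sizes and fuse by averaging.
import torch
import torch.nn as nn

class MultiWindowTemporal(nn.Module):
    def __init__(self, dim=128, windows=(2, 5, 10)):
        super().__init__()
        self.windows = windows
        self.attn = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True) for _ in windows
        )

    def forward(self, x):                          # x: (B, T, D) fused per-frame features
        B, T, D = x.shape
        outs = []
        for w, layer in zip(self.windows, self.attn):
            pad = (-T) % w
            xp = nn.functional.pad(x, (0, 0, 0, pad))            # pad time to a multiple of w
            chunks = xp.reshape(B * ((T + pad) // w), w, D)      # attention within each window
            outs.append(layer(chunks).reshape(B, T + pad, D)[:, :T])
        return torch.stack(outs).mean(0)           # fuse local and global context

feats = torch.randn(2, 10, 128)
print(MultiWindowTemporal()(feats).shape)          # (2, 10, 128)
```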
RL-Tune: A Deep Reinforcement Learning Assisted Layer-wise Fine-Tuning for Transfer Learning (Paper)
Tanvir Mahmud, Natalia A. Frumkin, and Diana Marculescu
International Conference on Machine Learning (ICML) Pre-training Workshop, Baltimore, Maryland, 2022
We address two critical challenges with transfer learning via fine-tuning: (1) The required amount of fine-tuning greatly depends on the distribution shift from source to target dataset. (2) This distribution shift greatly varies by layer, thereby requiring layer-wise adjustments in fine-tuning to adapt to this distribution shift while preserving the pre-trained network’s feature representation. To overcome these challenges, we propose RL-Tune, a layer-wise fine-tuning framework for transfer learning which leverages reinforcement learning to adjust learning rates as a function of the target data shift. RL-Tune outperforms other state-of-the-art approaches on standard transfer learning benchmarks by a large margin, e.g., 6% mean accuracy improvement on CUB-200-2011 with 15% data.
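A minimal sketch of the actuation side of such a framework: per-layer learning-rate multipliers applied through optimizer parameter groups. The random actions stand in for the RL agent, and the reward loop is omitted.

```python
# One learning-rate multiplier per layer, applied via optimizer param groups.
# The random 'actions' are a placeholder for the policy; the RL loop is omitted.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
layers = [m for m in model if isinstance(m, nn.Linear)]

param_groups = [{"params": layer.parameters(), "lr": 1e-3} for layer in layers]
optimizer = torch.optim.SGD(param_groups, lr=1e-3)

def apply_actions(optimizer, actions, base_lr=1e-3):
    """actions: per-layer multipliers chosen by the agent at each step."""
    for group, a in zip(optimizer.param_groups, actions):
        group["lr"] = base_lr * float(a)

actions = torch.rand(len(layers)) * 2              # placeholder policy output in (0, 2)
apply_actions(optimizer, actions)
print([g["lr"] for g in optimizer.param_groups])
```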
Systems and Methods for Training Video Object Detection Machine Learning Model with Teacher and Student Framework
Filed November 2022
Inventors: Tanvir Mahmud, Chun-Hao Liu, Burhaneddin Yaman
Systems and Methods for Multi-Teacher Group-Distillation for Long-tail Classification
Filed September 2022
Inventors: Tanvir Mahmud, Burhaneddin Yaman, Chun-Hao Liu
EB-1A Green Card Recipient, for Outstanding Research Contributions
Best Paper Award, IEEE BECITHCON 2019
Reviewer: ICLR, NeurIPS, CVPR, ECCV, WACV