Edge Multimodal Vision Research Track
Overview
This page is for students who are curious about modern AI that combines vision, language, and real-world deployment.
You do not need to already be an expert in multimodal AI, large models, or edge devices. This track is designed for students who want to start step by step, build confidence, and gradually grow into research.
The goal of this track is not to begin with the biggest models or the hardest papers. Instead, the goal is to help students explore questions such as:
How can AI understand images better with the help of language?
How can we make modern AI models smaller, faster, and more practical?
How can vision-language models remain useful in realistic environments?
How can we study modern AI topics without needing massive computing resources?
If you are interested in these kinds of questions, this track may be a good place to start.
What to avoid at the beginning
Do not begin with:
huge multimodal LLM fine-tuning
projects requiring very large GPU budgets
papers with no public code unless you are already experienced
complicated robotics or distributed systems before understanding the core perception problem
When to contact me
If you read some of the papers on this page and feel that you would like to study this area together, feel free to contact me.
You do not need to understand everything before reaching out.
Interest and a willingness to learn step by step are enough; curiosity and steady effort matter more than perfection.
Part I. A simple starting path
A good starting path is the following:
Start with CLIP to understand the basic idea of vision-language learning.
Read MobileCLIP to see how this direction becomes practical for smaller and faster models.
Read MaPLe or TDA to understand how a pretrained model can be adapted efficiently.
Reproduce one public baseline on a small dataset.
Write a short memo about what worked, what was difficult, and what you want to try next.
You do not need to do everything at once.
Part II. Core Vision-Language / Efficient VLM
These papers are the minimum background. Read them first.
1. CLIP (ICML 2021)
Paper: Learning Transferable Visual Models From Natural Language Supervision
Code: https://github.com/openai/CLIP
Why read it: the starting point of modern vision-language transfer learning
Focus on: image-text alignment, zero-shot classification, why language supervision helps generalization
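The zero-shot classification idea in CLIP can be sketched without any framework: embed the image and one text prompt per class, then pick the class whose text embedding has the highest cosine similarity with the image embedding. The toy embeddings below are made up for illustration; a real setup would use the image and text encoders from the openai/CLIP repo.

```python
import math

def cosine(u, v):
    # cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def zero_shot_classify(image_emb, text_embs):
    # text_embs: {class_name: embedding of "a photo of a {class_name}"}
    scores = {name: cosine(image_emb, emb) for name, emb in text_embs.items()}
    return max(scores, key=scores.get)

# Toy embeddings standing in for CLIP encoder outputs.
image_emb = [0.9, 0.1, 0.2]
text_embs = {
    "dog": [0.8, 0.2, 0.1],   # closest to the image embedding
    "cat": [0.1, 0.9, 0.3],
    "car": [0.2, 0.1, 0.9],
}
print(zero_shot_classify(image_emb, text_embs))  # → dog
```

No training happens here: classification works for any class name you can write a prompt for, which is exactly why language supervision helps generalization.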
2. MobileCLIP (CVPR 2024)
Paper: MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Code: https://github.com/apple/ml-mobileclip
Why read it: a strong recent paper showing that vision-language models can be made small and fast enough for mobile / edge settings
Focus on: accuracy-latency trade-off, model size, why efficient image-text models matter on devices
3. MaPLe (CVPR 2023)
Paper: MaPLe: Multi-Modal Prompt Learning
Code: https://github.com/muzairkhattak/multimodal-prompt-learning
Why read it: one of the clearest entry points for vision-language adaptation
Focus on: why adapting both vision and language branches can be better than prompting only one branch
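The "prompt both branches" idea can be sketched structurally: learnable prompt vectors are prepended to the input token sequences of both the text encoder and the image encoder, and only those prompts are trained while the encoders stay frozen. The sketch below shows only the sequence manipulation, not training; all names and values are illustrative.

```python
def prepend_prompts(tokens, prompts):
    # learnable prompt vectors go in front of the frozen encoder's input tokens
    return prompts + tokens

# Toy 2-D "embeddings": one small list per token.
text_tokens = [[0.1, 0.2], [0.3, 0.4]]   # e.g. embedded class-name tokens
image_tokens = [[0.5, 0.5], [0.6, 0.1]]  # e.g. embedded image patches

text_prompts = [[0.0, 0.0]]              # learnable text-branch prompt
image_prompts = [[0.0, 0.0]]             # learnable vision-branch prompt
                                         # (in MaPLe, coupled to the text prompt)

text_in = prepend_prompts(text_tokens, text_prompts)
image_in = prepend_prompts(image_tokens, image_prompts)
print(len(text_in), len(image_in))  # → 3 3
```

Prompting only one branch leaves the other branch's representation fixed; conditioning both lets the two modalities shift together, which is the core claim to look for while reading.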
Part III. Track A — Efficient Adaptation of Vision-Language Models
Typical question:
How can we adapt a pretrained vision-language model to a new task with limited data and limited computation?
Why this track is good
This track is suitable for students who want:
a modern topic with manageable experiments
strong connections to few-shot learning and parameter-efficient tuning
a practical path toward publication
Recommended order
A1. MaPLe (CVPR 2023)
Paper: MaPLe: Multi-Modal Prompt Learning
Code: https://github.com/muzairkhattak/multimodal-prompt-learning
Why read it: a strong and widely used multimodal prompt-learning baseline
Good for students because: conceptually clear and easy to position as a starting point
A2. PromptKD (CVPR 2024)
Paper: PromptKD: Unsupervised Prompt Distillation for Vision-Language Models
Code: https://github.com/zhengli97/PromptKD
Why read it: a practical recent method for distilling prompt-based knowledge from larger models
Good for students because: public code is available and the story is easy to motivate
A3. R-MMA (WACV 2026)
Code: https://github.com/farhanishmam/R-MMA
Why read it: a recent paper on lightweight adapters for few-shot and cross-domain generalization
Good for students because: it extends the adapter idea in a way that is still understandable after MaPLe and PromptKD
What students should reproduce first in this track
Choose one:
MaPLe
PromptKD
Then move to:
R-MMA
Good starter datasets for this track
CIFAR-10
CIFAR-100
Oxford Flowers
EuroSAT
DTD
Avoid huge datasets at the beginning.
Part IV. Track B — Robustness / Real-World Shift in Edge Multimodal Vision
Typical question:
How can vision-language models remain useful when the input distribution shifts because of device changes, resolution changes, blur, noise, or test-time mismatch?
Why this track is good
This track helps students build intuition for:
why deployment is harder than clean benchmark evaluation
why test-time adaptation matters
why efficient adaptation is especially important on devices
Recommended order
B1. TDA (CVPR 2024)
Paper: Efficient Test-Time Adaptation of Vision-Language Models
Code: https://github.com/kdiAAA/TDA
Why read it: a strong recent paper on efficient test-time adaptation for vision-language models
Main idea: training-free dynamic adapter with a lightweight cache
Good for students because: the setup is easy to explain and directly tied to real-world shift
B2. EdgeVL (ECCV 2024)
Paper: EdgeVL: Adapting Visual-Language Models to Edge Devices across Visual Modalities
Code: https://github.com/ramdrop/edgevl
Why read it: combines robustness, multimodal adaptation, and edge-device constraints
Main idea: dual-modality knowledge distillation and quantization-aware learning
B3. ArGue (CVPR 2024)
Paper: ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models
Code: https://github.com/xyu-tian/ArGue
Why read it: shows how attribute-guided prompts can help under shift
Good for students because: it is useful for reading and idea generation even if you do not reproduce it first
What students should reproduce first in this track
Choose one:
TDA
EdgeVL
Then use ArGue as follow-up reading for extension ideas.
Good starter experiments for this track
low-resolution corruption
blur / noise corruption
few-shot target-domain adaptation
RGB vs non-RGB or modality-shift comparison
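The first experiment above can be simulated without any imaging library: downsample an image by a factor, then upsample back with nearest-neighbor so the pixel count is unchanged but detail is lost. The 4x4 grayscale "image" below is a toy grid; in practice you would apply the same idea (or torchvision transforms) to real dataset images.

```python
def lowres_corrupt(img, factor):
    """Simulate low-resolution input: keep every `factor`-th pixel,
    then repeat it so the output keeps the original size (nearest-neighbor)."""
    h, w = len(img), len(img[0])
    return [[img[(r // factor) * factor][(c // factor) * factor]
             for c in range(w)]
            for r in range(h)]

img = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
    [130, 140, 150, 160],
]
corrupted = lowres_corrupt(img, 2)
# each 2x2 block now repeats its top-left value
print(corrupted[0])  # → [10, 10, 30, 30]
```

Sweeping the factor gives a controlled severity axis, so you can plot accuracy versus corruption strength for each method you compare.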
Part V. Track C — On-Device / Edge Deployment of Multimodal Vision
Typical question:
How can we build multimodal visual systems that are actually fast and deployable on edge devices?
Why this track is attractive
strong current relevance
good fit for small-GPU labs
naturally connected to latency, compression, and hardware-aware deployment
valuable for both publications and student careers
Recommended order
C1. MobileCLIP (CVPR 2024)
Paper: MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Code: https://github.com/apple/ml-mobileclip
Why read it: one of the clearest papers for efficient image-text modeling on mobile / edge devices
Main idea: compact image-text models with strong speed-accuracy trade-offs
C2. EdgeVL (ECCV 2024)
Paper: EdgeVL: Adapting Visual-Language Models to Edge Devices across Visual Modalities
Code: https://github.com/ramdrop/edgevl
Why read it: introduces adaptation to edge devices across multiple visual modalities
Main idea: knowledge distillation + quantization-aware learning for edge deployment
C3. FastVLM (CVPR 2025)
Paper: FastVLM: Efficient Vision Encoding for Vision Language Models
Code: https://github.com/apple/ml-fastvlm
Why read it: a recent CVPR paper focused on reducing vision encoding latency and visual token count
Main idea: efficient vision encoder for vision-language models
Good for students because: very relevant to modern multimodal systems but still measurable with latency-focused experiments
What students should reproduce first in this track
Choose one:
MobileCLIP
FastVLM
Then compare against one standard CLIP-style baseline using:
top-1 accuracy
inference latency
parameter count
memory usage
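Two of these metrics are easy to script. The sketch below times a stand-in inference function with time.perf_counter and counts parameters from plain weight matrices; the "model" here is a placeholder you would replace with a real forward pass and real weights.

```python
import time

def count_params(weights):
    # weights: list of 2-D weight matrices (plain nested lists here)
    return sum(len(w) * len(w[0]) for w in weights)

def measure_latency(fn, x, warmup=3, runs=10):
    # median wall-clock latency in milliseconds over `runs` calls
    for _ in range(warmup):
        fn(x)  # warm up caches before timing
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(x)
        times.append((time.perf_counter() - t0) * 1000.0)
    return sorted(times)[len(times) // 2]

# Stand-in "model": two tiny weight matrices and a dummy forward pass.
weights = [[[0.0] * 512 for _ in range(512)],   # 512x512 layer
           [[0.0] * 10 for _ in range(512)]]    # 512x10 head
forward = lambda x: sum(x)                      # placeholder computation

print("params:", count_params(weights))         # 512*512 + 512*10 = 267264
print("latency (ms):", measure_latency(forward, list(range(1000))))
```

Report the median rather than the mean: single-run timings on shared machines are noisy, and a warmup pass avoids counting one-time setup cost as inference latency.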
Good starter benchmarks for this track
ImageNet subset
CIFAR-100
Oxford Flowers
text-rich image benchmarks only after basic setup is stable
More recent papers to read later (including 2026)
These are useful after finishing the main path above. None of the papers below duplicate the main Track A/B/C reading lists.
DePT (CVPR 2024)
Paper: DePT: Decoupled Prompt Tuning
Code: https://github.com/Koorye/DePT
DPC (CVPR 2025)
Paper: DPC: Dual-Prompt Collaboration for Tuning Vision-Language Models
Code: https://github.com/JREion/DPC
Skip Tuning (CVPR 2025)
Paper: Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters
Code: https://github.com/Koorye/SkipTuning
Two-Stage VLM Adaptation (CVPR 2025)
Paper: Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages
Code: https://github.com/FarinaMatteo/rethinking_fewshot_vlms
TAPT (CVPR 2025)
Paper: TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models
Code: https://github.com/xinwong/TAPT
Split Adaptation (CVPR 2025)
Paper: Split Adaptation for Pre-trained Vision Transformers
Code: https://github.com/conditionWang/Split_Adaptation
ProLIP (WACV 2026)
Paper: CLIP's Visual Embedding Projector is a Few-shot Cornucopia
Code: https://github.com/astra-vision/ProLIP
Suggested first mini-project
Compare:
a frozen CLIP-style baseline
one efficient adaptation method such as MaPLe or TDA
one efficient model such as MobileCLIP
on a small image benchmark, measuring:
accuracy
latency
model size
This is a good first project because it teaches the basic trade-off of edge multimodal vision: performance vs efficiency.
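A minimal harness for this mini-project is just a loop over named methods that records the three numbers per method into one table. The evaluator functions and numbers below are placeholders; each would be replaced by a real evaluation run of the corresponding method.

```python
def run_comparison(methods):
    # methods: {name: callable returning (accuracy, latency_ms, size_mb)}
    rows = []
    for name, evaluate in methods.items():
        acc, latency, size = evaluate()
        rows.append((name, acc, latency, size))
    return rows

def print_table(rows):
    # fixed-width text table: method, accuracy, latency, model size
    print(f"{'method':<14}{'acc':>8}{'ms':>8}{'MB':>8}")
    for name, acc, latency, size in rows:
        print(f"{name:<14}{acc:>8.3f}{latency:>8.1f}{size:>8.1f}")

# Placeholder evaluators with made-up numbers; replace with real runs.
methods = {
    "frozen-clip": lambda: (0.60, 12.0, 350.0),
    "maple": lambda: (0.68, 13.5, 355.0),
    "mobileclip": lambda: (0.63, 4.0, 80.0),
}
print_table(run_comparison(methods))
```

Keeping all three numbers in one table forces the trade-off into view: an adaptation method may win on accuracy while an efficient model wins on latency and size, and explaining that tension is the point of the project.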