Edge Multimodal Vision Research Track

Overview

This page is for students who are curious about modern AI that combines vision, language, and real-world deployment.

You do not need to already be an expert in multimodal AI, large models, or edge devices. This track is designed for students who want to start step by step, build confidence, and gradually grow into research.

The goal of this track is not to begin with the biggest models or the hardest papers. Instead, the goal is to help students explore questions such as:

- Can a pretrained vision-language model be adapted to a new task with limited data and computation?
- Can it remain useful when the real-world input distribution shifts?
- Can it run fast enough to be deployed on an edge device?

If you are interested in these kinds of questions, this track may be a good place to start.


What to avoid at the beginning

Do not begin with:


When to contact me

If you read some of the papers on this page and feel that you would like to study this area together, feel free to contact me.


Part I. A simple starting path

A good starting path is the following:

1. Read the three core papers in Part II.
2. Choose one of Track A, Track B, or Track C below.
3. Reproduce one small result from that track.

You do not need to do everything at once.


Part II. Core Vision-Language / Efficient VLM

These papers are the minimum background. Read them first.

1. CLIP (ICML 2021)

Paper: Learning Transferable Visual Models From Natural Language Supervision

Code: https://github.com/openai/CLIP

Why read it: the starting point of modern vision-language transfer learning
Focus on: image-text alignment, zero-shot classification, why language supervision helps generalization
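To make the zero-shot mechanism concrete, here is a minimal sketch of CLIP's classification step, assuming you already have image and text embeddings (in the linked repository they come from the model's `encode_image` / `encode_text` calls). The toy vectors below are made up for illustration:

```python
import math

def normalize(v):
    """Scale a vector to unit length, as CLIP does before comparing embeddings."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def zero_shot_classify(image_emb, text_embs, labels):
    """Pick the label whose text-prompt embedding is most similar to the image."""
    img = normalize(image_emb)
    scores = []
    for t in text_embs:
        t = normalize(t)
        scores.append(sum(a * b for a, b in zip(img, t)))  # cosine similarity
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best], scores

# Toy 3-dimensional embeddings (real CLIP embeddings have hundreds of dims).
image_emb = [0.9, 0.1, 0.0]
text_embs = [[1.0, 0.0, 0.0],   # embedding of "a photo of a dog"
             [0.0, 1.0, 0.0]]   # embedding of "a photo of a cat"
label, scores = zero_shot_classify(image_emb, text_embs, ["dog", "cat"])
print(label)  # → dog
```

The key point to notice is that no classifier is trained: the "classifier weights" are just the text embeddings of the label prompts, which is why new classes can be added by writing new prompts.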

2. MobileCLIP (CVPR 2024)

Paper: MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Code: https://github.com/apple/ml-mobileclip

Why read it: a strong recent paper showing that vision-language models can be made small and fast enough for mobile / edge settings
Focus on: accuracy-latency trade-off, model size, why efficient image-text models matter on devices
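One way to ground the accuracy-latency trade-off is to actually measure latency. Below is a small timing sketch; `dummy_encoder` is a placeholder for a real forward pass (e.g. an image encoder), and the warmup/run counts are arbitrary choices:

```python
import statistics
import time

def measure_latency_ms(fn, *args, warmup=3, runs=20):
    """Time repeated calls to fn and report the median latency in milliseconds."""
    for _ in range(warmup):          # warm caches before measuring
        fn(*args)
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        times.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(times)

# Stand-in for a real model forward pass (e.g. a MobileCLIP image encoder).
def dummy_encoder(n):
    return sum(i * i for i in range(n))

latency = measure_latency_ms(dummy_encoder, 10_000)
print(f"median latency: {latency:.3f} ms")
```

Reporting the median (rather than the mean) is a deliberate choice: it is less sensitive to one-off scheduling spikes, which matter a lot on edge devices.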

3. MaPLe (CVPR 2023)

Paper: MaPLe: Multi-Modal Prompt Learning
Code: https://github.com/muzairkhattak/multimodal-prompt-learning

Why read it: one of the clearest entry points for vision-language adaptation
Focus on: why adapting both vision and language branches can be better than prompting only one branch
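The core mechanism is simpler than it sounds: prompt learning prepends a few trainable vectors to an encoder's token sequence, and multimodal prompt learning does this for both encoders. The sketch below shows only that shape-level idea with toy embeddings; MaPLe additionally generates the vision prompts from the language prompts through a learned projection, which is omitted here:

```python
def prepend_prompts(tokens, prompts):
    """Prepend learnable prompt vectors to a sequence of token embeddings."""
    return prompts + tokens

# Toy 4-dimensional embeddings; real models use hundreds of dimensions.
text_tokens  = [[0.1] * 4, [0.2] * 4]   # embedded words of the class prompt
image_tokens = [[0.3] * 4, [0.4] * 4]   # embedded image patches

# Multimodal prompt learning trains prompts for BOTH branches (zeros here as
# placeholders); single-branch prompting would only touch one of the two.
text_prompts  = [[0.0] * 4, [0.0] * 4]
image_prompts = [[0.0] * 4]

text_in  = prepend_prompts(text_tokens, text_prompts)
image_in = prepend_prompts(image_tokens, image_prompts)
print(len(text_in), len(image_in))  # → 4 3
```

During adaptation only the prompt vectors receive gradients; the pretrained encoders stay frozen, which is what makes the method cheap.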


Part III. Track A — Efficient Adaptation of Vision-Language Models

Typical question:

How can we adapt a pretrained vision-language model to a new task with limited data and limited computation?

Why this track is good

This track is suitable for students who want:

Recommended order

A1. MaPLe (CVPR 2023)

Paper: MaPLe: Multi-Modal Prompt Learning
Code: https://github.com/muzairkhattak/multimodal-prompt-learning

Why read it: a strong and widely used multimodal prompt-learning baseline
Good for students because: conceptually clear and easy to position as a starting point

A2. PromptKD (CVPR 2024)

Paper: PromptKD: Unsupervised Prompt Distillation for Vision-Language Models
Code: https://github.com/zhengli97/PromptKD

Why read it: a practical recent method for distilling prompt-based knowledge from larger models
Good for students because: public code is available and the story is easy to motivate
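PromptKD's teacher-student setup rests on the standard distillation objective: match the student's softened output distribution to the teacher's. The sketch below shows only that generic loss with made-up logits; PromptKD's specific contribution (distilling through learned prompts on unlabeled data) is not reproduced here:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the softened teacher to the softened student."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 0.5, -1.0]   # hypothetical logits from the large teacher VLM
student = [1.5, 0.8, -0.5]   # hypothetical logits from the small student
loss = distillation_loss(student, teacher)
print(f"distillation loss: {loss:.4f}")
```

The temperature is what makes distillation work: softening both distributions exposes the teacher's relative preferences among wrong classes, which carries more signal than a hard label.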

A3. R-MMA (WACV 2026)

Paper: R-MMA: Enhancing Vision-Language Models with Recurrent Adapters for Few-Shot and Cross-Domain Generalization

Code: https://github.com/farhanishmam/R-MMA

Why read it: a newer 2026 paper on lightweight adapters for few-shot and cross-domain generalization
Good for students because: it extends the adapter idea in a way that is still understandable after MaPLe and PromptKD

What students should reproduce first in this track

Choose one:

- MaPLe (A1)
- PromptKD (A2)

Then move to:

- R-MMA (A3)

Good starter datasets for this track

Avoid huge datasets at the beginning.


Part IV. Track B — Robustness / Real-World Shift in Edge Multimodal Vision

Typical question:

How can vision-language models remain useful when the input distribution shifts because of device changes, resolution changes, blur, noise, or test-time mismatch?

Why this track is good

This track helps students build intuition for:

Recommended order

B1. TDA (CVPR 2024)

Paper: Efficient Test-Time Adaptation of Vision-Language Models
Code: https://github.com/kdiAAA/TDA

Why read it: a strong recent paper on efficient test-time adaptation for vision-language models
Main idea: training-free dynamic adapter with a lightweight cache
Good for students because: the setup is easy to explain and directly tied to real-world shift
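The cache idea can be sketched in a few lines: keep a small store of test-time features per pseudo-label and blend their similarity scores into the frozen model's zero-shot logits. This is only a simplified illustration of the general mechanism; TDA's actual method adds details such as entropy-based filtering and a negative cache, and the vectors and `alpha` below are toy values:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

class FeatureCache:
    def __init__(self, capacity=3):
        self.capacity = capacity      # max stored features per class
        self.store = {}               # pseudo-label -> list of features

    def add(self, label, feature):
        items = self.store.setdefault(label, [])
        items.append(feature)
        del items[:-self.capacity]    # keep only the newest entries

    def logits(self, feature, num_classes):
        """Per-class score: best similarity to any cached feature of that class."""
        return [max((cosine(feature, f) for f in self.store.get(c, [])), default=0.0)
                for c in range(num_classes)]

def adapted_logits(zero_shot, cache_scores, alpha=0.5):
    """Blend frozen zero-shot logits with cache evidence; no training involved."""
    return [z + alpha * c for z, c in zip(zero_shot, cache_scores)]

cache = FeatureCache()
cache.add(0, [1.0, 0.0])              # a confident earlier test sample, class 0
zs = [0.25, 0.5]                      # zero-shot logits slightly favor class 1
out = adapted_logits(zs, cache.logits([1.0, 0.0], 2))
print(out)  # → [0.75, 0.5]
```

Note that nothing is trained: the cache only accumulates features during the test stream, which is why this style of adaptation is cheap enough for edge settings.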

B2. EdgeVL (ECCV 2024)

Paper: EdgeVL: Adapting Visual-Language Models to Edge Devices across Visual Modalities
Code: https://github.com/ramdrop/edgevl

Why read it: combines robustness, multimodal adaptation, and edge-device constraints
Main idea: dual-modality knowledge distillation and quantization-aware learning
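To get intuition for the quantization-aware part, it helps to see what quantization does to weights. The toy sketch below rounds float weights to an 8-bit grid and back, which is the operation quantization-aware training simulates during learning so the model adapts to the reduced precision it will have on the device (the weight values are made up):

```python
def quantize_dequantize(weights, num_bits=8):
    """Round weights to a num_bits integer grid and map them back to floats."""
    lo, hi = min(weights), max(weights)
    levels = (1 << num_bits) - 1                 # 255 grid steps for 8 bits
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((w - lo) / scale) * scale for w in weights]

weights = [-0.73, -0.12, 0.0, 0.4, 0.81]
approx = quantize_dequantize(weights)
max_err = max(abs(a - w) for a, w in zip(approx, weights))
print(f"max quantization error: {max_err:.6f}")
```

The rounding error is bounded by half a grid step, and quantization-aware training lets the network route around that error instead of discovering it only after deployment.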

B3. ArGue (CVPR 2024)

Paper: ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models

Code: https://github.com/xyu-tian/ArGue

Why read it: shows how attribute-guided prompts can help under shift
Good for students because: it is useful for reading and idea generation even if you do not reproduce it first

What students should reproduce first in this track

Choose one:

- TDA (B1)
- EdgeVL (B2)

Then use ArGue as follow-up reading for extension ideas.

Good starter experiments for this track


Part V. Track C — On-Device / Edge Deployment of Multimodal Vision

Typical question:

How can we build multimodal visual systems that are actually fast and deployable on edge devices?

Why this track is attractive

Recommended order

C1. MobileCLIP (CVPR 2024)

Paper: MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Code: https://github.com/apple/ml-mobileclip

Why read it: one of the clearest papers for efficient image-text modeling on mobile / edge devices
Main idea: compact image-text models with strong speed-accuracy trade-offs

C2. EdgeVL (ECCV 2024)

Paper: EdgeVL: Adapting Visual-Language Models to Edge Devices across Visual Modalities
Code: https://github.com/ramdrop/edgevl

Why read it: introduces adaptation to edge devices across multiple visual modalities
Main idea: knowledge distillation + quantization-aware learning for edge deployment

C3. FastVLM (CVPR 2025)

Paper: FastVLM: Efficient Vision Encoding for Vision Language Models
Code: https://github.com/apple/ml-fastvlm

Why read it: a recent CVPR paper focused on reducing vision encoding latency and visual token count
Main idea: efficient vision encoder for vision-language models
Good for students because: very relevant to modern multimodal systems but still measurable with latency-focused experiments
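A quick back-of-the-envelope calculation shows why reducing the visual token count matters: self-attention cost grows quadratically with sequence length, so feeding the language model fewer visual tokens cuts compute sharply. The token counts and hidden dimension below are illustrative, not FastVLM's actual numbers:

```python
def attention_cost(num_tokens, dim):
    """Rough FLOP count for one self-attention layer: O(n^2 * d)."""
    return 2 * num_tokens * num_tokens * dim

full    = attention_cost(576, 4096)   # e.g. a 24x24 grid of patch tokens
reduced = attention_cost(144, 4096)   # 4x fewer visual tokens
print(f"cost ratio: {full / reduced:.0f}x")  # → 16x
```

This quadratic relationship is why "how many visual tokens reach the language model" is a natural axis for latency-focused student experiments.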

What students should reproduce first in this track

Choose one:

- MobileCLIP (C1)
- EdgeVL (C2)
- FastVLM (C3)

Then compare against one standard CLIP-style baseline using:

Good starter benchmarks for this track


More recent papers to read later, including 2026

These are useful after finishing the main path above. None of the papers below duplicate the main Track A/B/C reading lists.


Suggested first mini-project

Compare:

on a small image benchmark, and compare:

This is a good first project because it teaches the basic trade-off of edge multimodal vision: performance vs efficiency.
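The evaluation loop for this mini-project can be as simple as the following skeleton, which measures accuracy and per-example latency for each model. The two lambda "models" and the toy dataset are hypothetical stand-ins; in the real project they would be calls into the two systems being compared:

```python
import time

def evaluate(model, dataset):
    """Return (accuracy, median per-example latency in ms) on a labeled set."""
    correct, times = 0, []
    for x, label in dataset:
        start = time.perf_counter()
        pred = model(x)
        times.append((time.perf_counter() - start) * 1000.0)
        correct += (pred == label)
    times.sort()
    return correct / len(dataset), times[len(times) // 2]

dataset = [(x, x % 2) for x in range(100)]    # toy inputs with toy labels
models = {
    "baseline":  lambda x: x % 2,             # always right: the "big" model
    "efficient": lambda x: 0,                 # faster but only half right
}
for name, model in models.items():
    acc, lat_ms = evaluate(model, dataset)
    print(f"{name:9s} accuracy={acc:.2f} median latency={lat_ms:.4f} ms")
```

Reporting both numbers side by side, rather than accuracy alone, is the whole point of the exercise: it forces an explicit statement of how much accuracy you are willing to trade for speed.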