Edge Multimodal Vision Research Track

Overview

This page is for students who are curious about modern AI that combines vision, language, and real-world deployment.

You do not need to already be an expert in multimodal AI, large models, or edge devices. This track is designed for students who want to start step by step, build confidence, and gradually grow into research.

The goal of this track is not to begin with the biggest models or the hardest papers. Instead, the goal is to help students explore questions such as:

- Can a pretrained vision-language model be adapted to a new task with limited data and computation?
- Can it remain useful when the real-world input distribution shifts?
- Can it run fast enough to be deployed on an edge device?

If you are interested in these kinds of questions, this track may be a good place to start.


What to avoid at the beginning

Do not begin with:


When to contact me

If you read some of the papers on this page and feel that you would like to study this area together, feel free to contact me.


Part I. A simple starting path

A good starting path is the following:

1. Read the three core papers in Part II.
2. Choose one of Track A, Track B, or Track C below.
3. Reproduce one small result from that track.

You do not need to do everything at once.


Part II. Core Vision-Language / Efficient VLM

These papers are the minimum background. Read them first.

1. CLIP (ICML 2021)

Paper: Learning Transferable Visual Models From Natural Language Supervision

Code: https://github.com/openai/CLIP

Why read it: the starting point of modern vision-language transfer learning
Focus on: image-text alignment, zero-shot classification, why language supervision helps generalization
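To make the zero-shot mechanism concrete, here is a minimal sketch of CLIP's classification step, assuming you already have image and text embeddings (in the linked repository they come from the model's `encode_image` / `encode_text` calls). The toy vectors below are made up for illustration:

```python
import math

def normalize(v):
    """Scale a vector to unit length, as CLIP does before comparing embeddings."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def zero_shot_classify(image_emb, text_embs, labels):
    """Pick the label whose text-prompt embedding is most similar to the image."""
    img = normalize(image_emb)
    scores = []
    for t in text_embs:
        t = normalize(t)
        scores.append(sum(a * b for a, b in zip(img, t)))  # cosine similarity
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best], scores

# Toy 3-dimensional embeddings (real CLIP embeddings have hundreds of dims).
image_emb = [0.9, 0.1, 0.0]
text_embs = [[1.0, 0.0, 0.0],   # embedding of "a photo of a dog"
             [0.0, 1.0, 0.0]]   # embedding of "a photo of a cat"
label, scores = zero_shot_classify(image_emb, text_embs, ["dog", "cat"])
print(label)  # → dog
```

The key point to notice is that no classifier is trained: the "classifier weights" are just the text embeddings of the label prompts, which is why new classes can be added by writing new prompts.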

2. MobileCLIP (CVPR 2024)

Paper: MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Code: https://github.com/apple/ml-mobileclip

Why read it: a strong recent paper showing that vision-language models can be made small and fast enough for mobile / edge settings
Focus on: accuracy-latency trade-off, model size, why efficient image-text models matter on devices
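One way to ground the accuracy-latency trade-off is to actually measure latency. Below is a small timing sketch; `dummy_encoder` is a placeholder for a real forward pass (e.g. an image encoder), and the warmup/run counts are arbitrary choices:

```python
import statistics
import time

def measure_latency_ms(fn, *args, warmup=3, runs=20):
    """Time repeated calls to fn and report the median latency in milliseconds."""
    for _ in range(warmup):          # warm caches before measuring
        fn(*args)
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        times.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(times)

# Stand-in for a real model forward pass (e.g. a MobileCLIP image encoder).
def dummy_encoder(n):
    return sum(i * i for i in range(n))

latency = measure_latency_ms(dummy_encoder, 10_000)
print(f"median latency: {latency:.3f} ms")
```

Reporting the median (rather than the mean) is a deliberate choice: it is less sensitive to one-off scheduling spikes, which matter a lot on edge devices.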

3. MaPLe (CVPR 2023)

Paper: MaPLe: Multi-Modal Prompt Learning
Code: https://github.com/muzairkhattak/multimodal-prompt-learning

Why read it: one of the clearest entry points for vision-language adaptation
Focus on: why adapting both vision and language branches can be better than prompting only one branch
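The core mechanism is simpler than it sounds: prompt learning prepends a few trainable vectors to an encoder's token sequence, and multimodal prompt learning does this for both encoders. The sketch below shows only that shape-level idea with toy embeddings; MaPLe additionally generates the vision prompts from the language prompts through a learned projection, which is omitted here:

```python
def prepend_prompts(tokens, prompts):
    """Prepend learnable prompt vectors to a sequence of token embeddings."""
    return prompts + tokens

# Toy 4-dimensional embeddings; real models use hundreds of dimensions.
text_tokens  = [[0.1] * 4, [0.2] * 4]   # embedded words of the class prompt
image_tokens = [[0.3] * 4, [0.4] * 4]   # embedded image patches

# Multimodal prompt learning trains prompts for BOTH branches (zeros here as
# placeholders); single-branch prompting would only touch one of the two.
text_prompts  = [[0.0] * 4, [0.0] * 4]
image_prompts = [[0.0] * 4]

text_in  = prepend_prompts(text_tokens, text_prompts)
image_in = prepend_prompts(image_tokens, image_prompts)
print(len(text_in), len(image_in))  # → 4 3
```

During adaptation only the prompt vectors receive gradients; the pretrained encoders stay frozen, which is what makes the method cheap.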


Part III. Track A — Efficient Adaptation of Vision-Language Models

Typical question:

How can we adapt a pretrained vision-language model to a new task with limited data and limited computation?

Why this track is good

This track is suitable for students who want:

Recommended order

A1. MaPLe (CVPR 2023)

Paper: MaPLe: Multi-Modal Prompt Learning
Code: https://github.com/muzairkhattak/multimodal-prompt-learning

Why read it: a strong and widely used multimodal prompt-learning baseline
Good for students because: conceptually clear and easy to position as a starting point

A2. PromptKD (CVPR 2024)

Paper: PromptKD: Unsupervised Prompt Distillation for Vision-Language Models
Code: https://github.com/zhengli97/PromptKD

Why read it: a practical recent method for distilling prompt-based knowledge from larger models
Good for students because: public code is available and the story is easy to motivate
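PromptKD's teacher-student setup rests on the standard distillation objective: match the student's softened output distribution to the teacher's. The sketch below shows only that generic loss with made-up logits; PromptKD's specific contribution (distilling through learned prompts on unlabeled data) is not reproduced here:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the softened teacher to the softened student."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 0.5, -1.0]   # hypothetical logits from the large teacher VLM
student = [1.5, 0.8, -0.5]   # hypothetical logits from the small student
loss = distillation_loss(student, teacher)
print(f"distillation loss: {loss:.4f}")
```

The temperature is what makes distillation work: softening both distributions exposes the teacher's relative preferences among wrong classes, which carries more signal than a hard label.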

A3. R-MMA (WACV 2026)

Paper: R-MMA: Enhancing Vision-Language Models with Recurrent Adapters for Few-Shot and Cross-Domain Generalization

Code: https://github.com/farhanishmam/R-MMA

Why read it: a newer 2026 paper on lightweight adapters for few-shot and cross-domain generalization
Good for students because: it extends the adapter idea in a way that is still understandable after MaPLe and PromptKD

What students should reproduce first in this track

Choose one:

- MaPLe (A1)
- PromptKD (A2)

Then move to:

- R-MMA (A3)

Good starter datasets for this track

Avoid huge datasets at the beginning.


Part IV. Track B — Robustness / Real-World Shift in Edge Multimodal Vision

Typical question:

How can vision-language models remain useful when the input distribution shifts because of device changes, resolution changes, blur, noise, or test-time mismatch?

Why this track is good

This track helps students build intuition for:

Recommended order

B1. TDA (CVPR 2024)

Paper: Efficient Test-Time Adaptation of Vision-Language Models
Code: https://github.com/kdiAAA/TDA

Why read it: a strong recent paper on efficient test-time adaptation for vision-language models
Main idea: training-free dynamic adapter with a lightweight cache
Good for students because: the setup is easy to explain and directly tied to real-world shift
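The cache idea can be sketched in a few lines: keep a small store of test-time features per pseudo-label and blend their similarity scores into the frozen model's zero-shot logits. This is only a simplified illustration of the general mechanism; TDA's actual method adds details such as entropy-based filtering and a negative cache, and the vectors and `alpha` below are toy values:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

class FeatureCache:
    def __init__(self, capacity=3):
        self.capacity = capacity      # max stored features per class
        self.store = {}               # pseudo-label -> list of features

    def add(self, label, feature):
        items = self.store.setdefault(label, [])
        items.append(feature)
        del items[:-self.capacity]    # keep only the newest entries

    def logits(self, feature, num_classes):
        """Per-class score: best similarity to any cached feature of that class."""
        return [max((cosine(feature, f) for f in self.store.get(c, [])), default=0.0)
                for c in range(num_classes)]

def adapted_logits(zero_shot, cache_scores, alpha=0.5):
    """Blend frozen zero-shot logits with cache evidence; no training involved."""
    return [z + alpha * c for z, c in zip(zero_shot, cache_scores)]

cache = FeatureCache()
cache.add(0, [1.0, 0.0])              # a confident earlier test sample, class 0
zs = [0.25, 0.5]                      # zero-shot logits slightly favor class 1
out = adapted_logits(zs, cache.logits([1.0, 0.0], 2))
print(out)  # → [0.75, 0.5]
```

Note that nothing is trained: the cache only accumulates features during the test stream, which is why this style of adaptation is cheap enough for edge settings.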

B2. EdgeVL (ECCV 2024)

Paper: EdgeVL: Adapting Visual-Language Models to Edge Devices across Visual Modalities
Code: https://github.com/ramdrop/edgevl

Why read it: combines robustness, multimodal adaptation, and edge-device constraints
Main idea: dual-modality knowledge distillation and quantization-aware learning
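To get intuition for the quantization-aware part, it helps to see what quantization does to weights. The toy sketch below rounds float weights to an 8-bit grid and back, which is the operation quantization-aware training simulates during learning so the model adapts to the reduced precision it will have on the device (the weight values are made up):

```python
def quantize_dequantize(weights, num_bits=8):
    """Round weights to a num_bits integer grid and map them back to floats."""
    lo, hi = min(weights), max(weights)
    levels = (1 << num_bits) - 1                 # 255 grid steps for 8 bits
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((w - lo) / scale) * scale for w in weights]

weights = [-0.73, -0.12, 0.0, 0.4, 0.81]
approx = quantize_dequantize(weights)
max_err = max(abs(a - w) for a, w in zip(approx, weights))
print(f"max quantization error: {max_err:.6f}")
```

The rounding error is bounded by half a grid step, and quantization-aware training lets the network route around that error instead of discovering it only after deployment.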

B3. ArGue (CVPR 2024)

Paper: ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models

Code: https://github.com/xyu-tian/ArGue

Why read it: shows how attribute-guided prompts can help under shift
Good for students because: it is useful for reading and idea generation even if you do not reproduce it first

What students should reproduce first in this track

Choose one:

- TDA (B1)
- EdgeVL (B2)

Then use ArGue as follow-up reading for extension ideas.

Good starter experiments for this track


Part V. Track C — On-Device / Edge Deployment of Multimodal Vision

Typical question:

How can we build multimodal visual systems that are actually fast and deployable on edge devices?

Why this track is attractive

Recommended order

C1. MobileCLIP (CVPR 2024)

Paper: MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Code: https://github.com/apple/ml-mobileclip

Why read it: one of the clearest papers for efficient image-text modeling on mobile / edge devices
Main idea: compact image-text models with strong speed-accuracy trade-offs

C2. EdgeVL (ECCV 2024)

Paper: EdgeVL: Adapting Visual-Language Models to Edge Devices across Visual Modalities
Code: https://github.com/ramdrop/edgevl

Why read it: introduces adaptation to edge devices across multiple visual modalities
Main idea: knowledge distillation + quantization-aware learning for edge deployment

C3. FastVLM (CVPR 2025)

Paper: FastVLM: Efficient Vision Encoding for Vision Language Models
Code: https://github.com/apple/ml-fastvlm

Why read it: a recent CVPR paper focused on reducing vision encoding latency and visual token count
Main idea: efficient vision encoder for vision-language models
Good for students because: very relevant to modern multimodal systems but still measurable with latency-focused experiments
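A quick back-of-the-envelope calculation shows why reducing the visual token count matters: self-attention cost grows quadratically with sequence length, so feeding the language model fewer visual tokens cuts compute sharply. The token counts and hidden dimension below are illustrative, not FastVLM's actual numbers:

```python
def attention_cost(num_tokens, dim):
    """Rough FLOP count for one self-attention layer: O(n^2 * d)."""
    return 2 * num_tokens * num_tokens * dim

full    = attention_cost(576, 4096)   # e.g. a 24x24 grid of patch tokens
reduced = attention_cost(144, 4096)   # 4x fewer visual tokens
print(f"cost ratio: {full / reduced:.0f}x")  # → 16x
```

This quadratic relationship is why "how many visual tokens reach the language model" is a natural axis for latency-focused student experiments.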

What students should reproduce first in this track

Choose one:

- MobileCLIP (C1)
- EdgeVL (C2)
- FastVLM (C3)

Then compare against one standard CLIP-style baseline using:

Good starter benchmarks for this track


More recent papers to read later, including 2026

These are useful after finishing the main path above. None of the papers below duplicate the main Track A/B/C reading lists.


Suggested first mini-project

Compare:

on a small image benchmark, and compare:

This is a good first project because it teaches the basic trade-off of edge multimodal vision: performance vs efficiency.
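The evaluation loop for this mini-project can be as simple as the following skeleton, which measures accuracy and per-example latency for each model. The two lambda "models" and the toy dataset are hypothetical stand-ins; in the real project they would be calls into the two systems being compared:

```python
import time

def evaluate(model, dataset):
    """Return (accuracy, median per-example latency in ms) on a labeled set."""
    correct, times = 0, []
    for x, label in dataset:
        start = time.perf_counter()
        pred = model(x)
        times.append((time.perf_counter() - start) * 1000.0)
        correct += (pred == label)
    times.sort()
    return correct / len(dataset), times[len(times) // 2]

dataset = [(x, x % 2) for x in range(100)]    # toy inputs with toy labels
models = {
    "baseline":  lambda x: x % 2,             # always right: the "big" model
    "efficient": lambda x: 0,                 # faster but only half right
}
for name, model in models.items():
    acc, lat_ms = evaluate(model, dataset)
    print(f"{name:9s} accuracy={acc:.2f} median latency={lat_ms:.4f} ms")
```

Reporting both numbers side by side, rather than accuracy alone, is the whole point of the exercise: it forces an explicit statement of how much accuracy you are willing to trade for speed.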