Vision-Language-Action Research Track

Overview

This page is for students who want to progress from the basics of vision-language models and robot learning to vision-language-action (VLA) models, and eventually work on research topics such as:

This page is designed for self-study. The goal is not to start from the largest models, but to build enough intuition and experimental experience to understand the current VLA literature and eventually propose practical research ideas.

How to use this page

What I am looking for

I am looking for students who can gradually become capable of reading, reproducing, and eventually extending recent VLA papers.

Good candidates usually:

Part I. What I recommend students do first

Option 1. Easiest and most practical path

Option 2. Best path for publishable ideas

Option 3. Best path for students interested in current trends

Part II. Suggested first mini-projects

Students should not begin with a huge project. Start with a small project that can realistically be finished.

Project idea A — efficient VLA

Compare:

on a small benchmark such as LIBERO or one small real-robot setting. Then study speed, success rate, and data efficiency.
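A minimal evaluation harness makes the comparison concrete. The sketch below is hypothetical: `policy_step`, `toy_policy`, and the dictionary observations are placeholders, not the LIBERO API; the point is simply to measure success rate and per-step inference latency in one loop, with a real VLA checkpoint swapped in later.

```python
import time
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    success: bool
    latency_s: float  # mean per-step inference latency for this episode

def evaluate(policy_step, tasks, episodes_per_task=2, max_steps=10):
    """Run rollouts and report success rate and mean per-step latency.

    `policy_step(obs) -> (action, done)` is a stand-in for a VLA forward pass.
    """
    results = []
    for task in tasks:
        for _ in range(episodes_per_task):
            latencies, done = [], False
            obs = {"task": task, "step": 0}
            for t in range(max_steps):
                t0 = time.perf_counter()
                action, done = policy_step(obs)
                latencies.append(time.perf_counter() - t0)
                obs["step"] = t + 1
                if done:
                    break
            results.append(EpisodeResult(done, sum(latencies) / len(latencies)))
    success_rate = sum(r.success for r in results) / len(results)
    mean_latency = sum(r.latency_s for r in results) / len(results)
    return success_rate, mean_latency

# Toy policy that "succeeds" after three steps, standing in for a checkpoint.
def toy_policy(obs):
    return [0.0] * 7, obs["step"] >= 2

sr, lat = evaluate(toy_policy, tasks=["pick_cube", "open_drawer"])
print(f"success rate = {sr:.2f}, mean step latency = {lat * 1e3:.3f} ms")
```

Data efficiency can be probed with the same harness by fine-tuning on progressively smaller subsets of demonstrations and re-running `evaluate`.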

Project idea B — reasoning-centric VLA

Compare:

and study whether explicit reasoning or hierarchical decomposition helps with long-horizon or ambiguous instructions.
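The hierarchical variant of this comparison can be sketched in a few lines. Everything here is a stub under stated assumptions: `plan` stands in for a VLM planner (it just splits on ", then ") and `execute_subgoal` stands in for a low-level VLA policy; neither reflects a specific paper's implementation.

```python
def plan(instruction: str) -> list[str]:
    """Stub planner: decompose a long-horizon instruction into subgoals.

    A real version would query a VLM; here we split on ", then ".
    """
    return [s.strip() for s in instruction.split(", then ")]

def execute_subgoal(subgoal: str) -> bool:
    """Stub low-level controller: pretend every subgoal succeeds."""
    return True

def hierarchical_rollout(instruction: str) -> bool:
    """Reason first (decompose), then act on each subgoal in order."""
    for subgoal in plan(instruction):
        if not execute_subgoal(subgoal):
            return False  # abort on the first failed subgoal
    return True

task = "pick up the mug, then place it on the shelf"
print(plan(task))              # ['pick up the mug', 'place it on the shelf']
print(hierarchical_rollout(task))
```

The flat baseline is then the same rollout with `plan` replaced by the identity function, which isolates the effect of decomposition on long-horizon instructions.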

Project idea C — post-training / adaptation

Compare:

and study which adaptation recipe works best under limited data or limited reward signals.
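One reason parameter-efficient recipes dominate in the limited-data regime is visible from a simple parameter count. The sketch below illustrates a LoRA-style low-rank adapter in plain NumPy; it is a toy illustration, not the API of any fine-tuning library, and the dimensions are invented for the example.

```python
import numpy as np

# Toy LoRA-style adapter: the frozen weight W is augmented by a low-rank
# update (alpha / r) * B @ A, and only B and A are trained.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 512, 512, 8, 16

W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))                 # zero-init so training starts at W

def adapted_forward(x):
    """Forward pass through the frozen layer plus the low-rank update."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B = 0, the adapter is initially a no-op.
assert np.allclose(adapted_forward(x), W @ x)

full = W.size             # trainable params under full fine-tuning
lora = A.size + B.size    # trainable params under the adapter
print(f"full: {full}, lora: {lora}, ratio: {lora / full:.3%}")
# → full: 262144, lora: 8192, ratio: 3.125%
```

With roughly 3% of the parameters trainable per layer, the adapter both fits in small-GPU memory and regularizes against overfitting on small demonstration sets, which is exactly the regime this project idea targets.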

Part III. What to avoid at the beginning

Do not begin with:

The goal is to build depth, not to fail because the setup is too large.

Part IV. When to contact me

Contact me after you satisfy some of the following.

Part V. Core VLA

These papers are the minimum background. Read them first.

1. VLA Survey (arXiv 2024)

2. OpenVLA (CoRL 2024)

3. TinyVLA (RA-L 2025)

Part VI. Track A — Efficient / Small-Scale VLA

This track is the most recommended starting point for students who want to work on VLA without immediately depending on huge compute.

Typical question:

How can we make VLA models faster, more data-efficient, and easier to adapt to small-scale settings?

Why start with this track

This track helps students build intuition for:

Recommended order

A1. TinyVLA (RA-L 2025)

A2. OpenVLA-OFT (RSS 2025)

A3. LLaRA (ICLR 2025)
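A recurring efficiency lever in this track (e.g. the parallel decoding with action chunking used by OpenVLA-OFT) can be motivated with back-of-envelope arithmetic: if one forward pass predicts a chunk of actions instead of a single action, the number of expensive model calls per episode drops proportionally. The numbers below are illustrative, not measured.

```python
import math

def model_calls(episode_steps: int, chunk_size: int) -> int:
    """One forward pass predicts `chunk_size` actions at once."""
    return math.ceil(episode_steps / chunk_size)

steps = 300                      # control steps in one episode
for k in (1, 8, 16):             # k=1 ~ per-step decoding; k>1 ~ chunking
    print(f"chunk={k:2d}: {model_calls(steps, k):3d} forward passes")
# chunk= 1: 300 / chunk= 8:  38 / chunk=16:  19
```

Real speedups depend on control frequency and how stale a chunk can get before replanning, which is a good first thing to measure when reproducing these papers.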

What students should reproduce first in this track

Choose one:

Then use one paper as follow-up reading:

Good starter benchmarks for this track

Avoid pretraining a VLA from scratch at the beginning.

Part VII. Track B — Reasoning-Centric VLA

This track is for students who are interested in making VLA models better at long-horizon, ambiguous, or feedback-driven tasks.

Typical question:

Can we make a VLA reason before acting, instead of directly predicting low-level actions from input tokens?

Why this track is good

This track is suitable for students who want:

Recommended order

B1. Hi Robot (ICML 2025)

B2. CoT-VLA (CVPR 2025)

B3. RIPT-VLA (arXiv 2025)

What students should reproduce first in this track

Choose one:

Then use one follow-up reading:

Good starter benchmarks for this track

Avoid starting from the largest reasoning-heavy models.

Part VIII. Track C — Adaptation / Post-Training for VLA

This track is for students interested in the practical question of how to adapt an existing VLA to new tasks, robots, or low-data settings.

Typical question:

If I already have a pretrained VLA, what is the most practical way to make it work better on my setting?

Why this track is attractive

Recommended order

C1. OpenVLA-OFT (RSS 2025)

C2. LLaRA (ICLR 2025)

C3. RIPT-VLA (arXiv 2025)

What students should reproduce first in this track

Choose one:

Then move to:

Good starter benchmarks for this track

Avoid overly complex multi-robot adaptation projects at the beginning.

Part IX. Recommended code / benchmark libraries

These libraries are useful starting points; students should not waste time building everything from scratch.

1. OpenVLA

2. OpenVLA-OFT

3. TinyVLA

4. LLaRA

5. RIPT-VLA

6. Awesome-VLA