Vision-Language-Action Research Track
Overview
This page is for students who want to progress from vision-language models and robot-learning basics to vision-language-action (VLA) models, and eventually work on research topics such as:
Efficient / Small-Scale VLA
Reasoning-Centric VLA
Adaptation / Post-Training for VLA
This page is designed for self-study. The goal is not to start from the largest models, but to build enough intuition and experimental experience to understand the current VLA literature and eventually propose practical research ideas.
How to use this page
Start from Core VLA.
Choose one track among Efficient / Reasoning / Adaptation.
Read papers in order.
Reproduce at least one baseline using the provided code.
What I am looking for
I am looking for students who can gradually become capable of reading, reproducing, and eventually extending recent VLA papers.
Good candidates usually:
know basic PyTorch training
can read experiments carefully
can compare methods instead of only summarizing them
can reproduce at least one public codebase
can propose a small but concrete research question
Part I. What I recommend students do first
Option 1. Easiest and most practical path
Read the VLA survey
Read OpenVLA
Read TinyVLA
Reproduce TinyVLA or OpenVLA-OFT on a small benchmark
Then move to SGPT-style adaptation ideas or a small reasoning paper
Option 2. Best path for publishable ideas
Read the VLA survey
Read TinyVLA
Read LLaRA
Read OpenVLA-OFT
Choose one among Hi Robot / CoT-VLA / RIPT-VLA
Option 3. Best path for students interested in current trends
TinyVLA
OpenVLA-OFT
Hi Robot
CoT-VLA
RIPT-VLA
Part II. Suggested first mini-projects
Students should not begin with a huge project. Start with a small project that can realistically be finished.
Project idea A — efficient VLA
Compare:
OpenVLA
TinyVLA
OpenVLA-OFT
on a small benchmark such as LIBERO or a single small real-robot setting, then compare inference speed, task success rate, and data efficiency.
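A comparison like this needs a consistent measurement harness. Below is a minimal sketch of one; the `policy` and `env` interfaces (`env.reset()`, `env.step(action)` returning observation / success flag / done flag) are placeholder assumptions, not the actual API of LIBERO or any specific codebase.

```python
import time
from statistics import mean

def evaluate_policy(policy, env, n_episodes=10, max_steps=200):
    """Roll out a policy and record per-episode success and per-step latency.

    `policy` and `env` are placeholders for whatever interfaces your
    codebase exposes; this harness only assumes `env.reset()`,
    `env.step(action) -> (obs, success, done)`, and `policy(obs)`.
    """
    successes, latencies = [], []
    for _ in range(n_episodes):
        obs, success = env.reset(), False
        for _ in range(max_steps):
            t0 = time.perf_counter()
            action = policy(obs)                          # one forward pass
            latencies.append(time.perf_counter() - t0)    # inference latency
            obs, success, done = env.step(action)
            if done:
                break
        successes.append(float(success))
    return {"success_rate": mean(successes),
            "mean_latency_s": mean(latencies)}
```

Running the same harness for all three models, with the same episode count and seeds, keeps the speed and success-rate comparison fair.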
Project idea B — reasoning-centric VLA
Compare:
OpenVLA
Hi Robot
CoT-VLA
and study whether explicit reasoning or hierarchical decomposition helps with long-horizon or ambiguous instructions.
Project idea C — post-training / adaptation
Compare:
OpenVLA-OFT
LLaRA
RIPT-VLA
and study which adaptation recipe works best under limited data or limited reward signals.
Part III. What to avoid at the beginning
Do not begin with:
the largest closed-source VLA systems
projects that require large-scale pretraining from scratch
papers that mainly scale by using huge robot datasets and huge GPU budgets
methods with no public code unless you are already experienced
long-horizon real-robot projects before you can reproduce a benchmark result
The goal is to build depth, not to fail because the setup is too large.
Part IV. When to contact me
Contact me once you satisfy most of the following.
I understand the basic VLA pipeline at a conceptual level.
I selected one track among efficient / reasoning / adaptation.
I read at least 3 papers in that track.
I ran at least one public codebase successfully.
I can explain the strengths and weaknesses of two methods.
I can suggest one concrete research question.
I prepared a 1–2 page memo.
Part V. Core VLA
These papers are the minimum background. Read them first.
1. VLA Survey (arXiv 2024)
Paper: A Survey on Vision-Language-Action Models for Embodied AI
Why read it: the best broad overview of what VLA means in embodied AI
Focus on: taxonomy, data sources, model design, action prediction, evaluation
2. OpenVLA (CoRL 2024)
Project / Code: https://github.com/openvla/openvla
Why read it: the most important open-source VLA starting point
Focus on: how a pretrained VLM is adapted into a robot policy, action tokenization, cross-embodiment evaluation
Note: important as a reference paper, but too heavy to be the first reproduction target
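To make "action tokenization" concrete: policies in this family discretize each continuous action dimension into a fixed number of bins and emit the bin ids as language-model tokens. The sketch below uses uniform binning with 256 bins as an illustration; the exact bin boundaries and token-vocabulary mapping in OpenVLA differ.

```python
import numpy as np

def actions_to_tokens(actions, low, high, n_bins=256):
    """Discretize continuous actions into integer bin ids (one per dim).

    Uniform binning over [low, high]; illustrative, not the paper's
    exact recipe.
    """
    actions = np.clip(actions, low, high)
    scaled = (actions - low) / (high - low)          # map to [0, 1]
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)

def tokens_to_actions(tokens, low, high, n_bins=256):
    """Map bin ids back to bin-center continuous values."""
    return low + (tokens + 0.5) / n_bins * (high - low)
```

The round trip loses at most half a bin width per dimension, which is why the bin count matters for fine-grained manipulation.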
3. TinyVLA (RA-L 2025)
Paper: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
Project / Code: https://tiny-vla.github.io/ | https://github.com/JayceWen/tinyvla
Why read it: a very good entry paper if you care about practical, lighter-weight VLA
Focus on: faster inference, better data efficiency, and avoiding large pretraining
Good for students because: the motivation is concrete and the experiments are practical
Part VI. Track A — Efficient / Small-Scale VLA
This track is the most recommended starting point for students who want to work on VLA without immediately depending on huge compute.
Typical question:
How can we make VLA models faster, more data-efficient, and easier to adapt to small-scale settings?
Why start with this track
This track helps students build intuition for:
why current VLA models are often expensive
which parts of the pipeline dominate compute and latency
how to get strong results without training the biggest model
Recommended order
A1. TinyVLA (RA-L 2025)
Paper: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
Project / Code: https://tiny-vla.github.io/ | https://github.com/JayceWen/tinyvla
Why read it: one of the clearest papers if you care about lighter-weight VLA
Main idea: start from a strong multimodal backbone and use a diffusion-policy decoder during fine-tuning
Good for students because: practical, concrete, and explicitly motivated by speed and data efficiency
A2. OpenVLA-OFT (RSS 2025)
Paper: Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Why read it: a very useful engineering paper for understanding how to actually adapt a VLA effectively
Main idea: improve OpenVLA through better fine-tuning design choices such as parallel decoding, action chunking, and continuous action prediction
Good for students because: strong empirical story, less abstract than many architecture papers
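Of the design choices above, action chunking is easy to illustrate: the model predicts several future actions per forward pass, and the controller replays them open-loop before querying the model again. The sketch below assumes a hypothetical `policy_chunk(obs) -> (chunk, action_dim)` interface, not the actual OpenVLA-OFT API.

```python
import numpy as np

def run_with_chunking(policy_chunk, env, horizon=12, chunk=4):
    """Execute a policy that predicts a chunk of future actions at once.

    One model call yields `chunk` actions, which are executed open-loop;
    the model is queried again only when the chunk is exhausted, cutting
    inference calls by roughly a factor of `chunk`.
    """
    obs = env.reset()
    executed = []
    while len(executed) < horizon:
        actions = policy_chunk(obs)          # one call, several actions
        for a in actions[:chunk]:
            obs = env.step(a)
            executed.append(a)
            if len(executed) >= horizon:
                break
    return np.array(executed)
```

With `chunk=4` and a 12-step horizon, the model runs 3 times instead of 12, which is where much of the speedup comes from.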
A3. LLaRA (ICLR 2025)
Paper: LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Why read it: useful if you care about data efficiency and instruction-style supervision
Main idea: convert robot demonstrations into conversation-style instruction tuning data
Good for students because: clear idea, relatively easy to connect to practical follow-up projects
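The data-conversion idea can be sketched in a few lines: each (frame, action) pair of a demonstration becomes one question-answer turn, with the action serialized as text. The field names and prompt template below are illustrative; LLaRA's actual templates differ.

```python
def demo_to_conversation(instruction, frames, actions):
    """Convert one robot demonstration into conversation-style
    instruction-tuning data, in the spirit of LLaRA's data recipe.

    `frames` are image references, `actions` are per-step continuous
    action vectors; the prompt wording here is a made-up placeholder.
    """
    turns = []
    for frame, action in zip(frames, actions):
        turns.append({
            "from": "human",
            "value": f"<image> Instruction: {instruction} "
                     "What action should the robot take?",
            "image": frame,
        })
        turns.append({
            "from": "assistant",
            "value": " ".join(f"{x:.3f}" for x in action),  # action as text
        })
    return {"conversations": turns}
```

The resulting records can be fed to a standard visual-instruction-tuning pipeline, which is what makes the recipe data-efficient to implement.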
What students should reproduce first in this track
Choose one:
TinyVLA
OpenVLA-OFT
Then use one paper as follow-up reading:
LLaRA
Good starter benchmarks for this track
LIBERO
one small tabletop manipulation setup
one-task or few-task imitation settings
Avoid pretraining a VLA from scratch at the beginning.
Part VII. Track B — Reasoning-Centric VLA
This track is for students who are interested in making VLA models better at long-horizon, ambiguous, or feedback-driven tasks.
Typical question:
Can we make a VLA reason before acting, instead of directly predicting low-level actions from input tokens?
Why this track is good
This track is suitable for students who want:
a clear high-level research story
visually intuitive algorithmic ideas
a bridge between VLM reasoning and robot action generation
Recommended order
B1. Hi Robot (ICML 2025)
Paper: Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models
Why read it: one of the clearest recent examples of hierarchical reasoning in VLA
Main idea: use a hierarchical structure where the system first reasons about the next step and then executes it with low-level actions
Good for students because: easy to motivate and easy to explain conceptually
B2. CoT-VLA (CVPR 2025)
Paper: CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
Why read it: a strong recent paper on explicit visual reasoning before action generation
Main idea: generate future image frames as visual subgoals before predicting actions
Good for students because: highly intuitive, visual, and easier to discuss than more implicit reasoning methods
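Structurally, inference becomes a two-stage call: imagine a future frame first, then condition the action prediction on it. The sketch below is pure glue code with placeholder callables, not the paper's architecture or API.

```python
def act_with_visual_subgoal(subgoal_model, action_head, obs, instruction):
    """Two-stage inference in the style of CoT-VLA.

    `subgoal_model(obs, instruction)` imagines a future frame as a
    visual subgoal; `action_head(obs, subgoal)` predicts actions toward
    it. Both callables are placeholders for model components.
    """
    subgoal_frame = subgoal_model(obs, instruction)   # imagined future image
    actions = action_head(obs, subgoal_frame)         # actions toward subgoal
    return subgoal_frame, actions
```

The intermediate frame is what makes the reasoning inspectable: you can look at the imagined subgoal and see whether the model misread the instruction before it ever acts.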
B3. RIPT-VLA (arXiv 2025)
Paper: Interactive Post-Training for Vision-Language-Action Models
Why read it: useful if you want a bridge from pure offline imitation to interactive improvement
Main idea: post-train a VLA using sparse binary success rewards
Good for students because: very practical and strongly focused on low-data adaptation
What students should reproduce first in this track
Choose one:
Hi Robot
CoT-VLA
Then use one follow-up reading:
RIPT-VLA
Good starter benchmarks for this track
LIBERO-Goal / Spatial / Long
one instruction-heavy tabletop task
small long-horizon simulation tasks
Avoid starting from the largest reasoning-heavy models.
Part VIII. Track C — Adaptation / Post-Training for VLA
This track is for students interested in the practical question of how to adapt an existing VLA to new tasks, robots, or low-data settings.
Typical question:
If I already have a pretrained VLA, what is the most practical way to make it work better on my setting?
Why this track is attractive
directly useful for real deployment
easy to connect with smaller experiments
good room for publishable empirical papers
often less dependent on training a huge model from scratch
Recommended order
C1. OpenVLA-OFT (RSS 2025)
Paper: Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Why read it: one of the best papers for understanding practical fine-tuning design choices
Main idea: optimized fine-tuning recipe for faster and stronger OpenVLA adaptation
Good for students because: very concrete and experimentally grounded
C2. LLaRA (ICLR 2025)
Paper: LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Why read it: useful if you want to adapt a model through better data formatting and supervision
Main idea: robot demonstrations are converted into conversation-style instruction tuning data
Good for students because: practical and easy to extend
C3. RIPT-VLA (arXiv 2025)
Paper: Interactive Post-Training for Vision-Language-Action Models
Why read it: useful if you want to adapt a pretrained VLA with sparse interaction signals
Main idea: reinforcement-learning-based interactive post-training with only task-success rewards
Good for students because: low-data and strongly practical
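The basic ingredient of this kind of post-training is turning binary episode outcomes into policy-gradient weights. The sketch below shows only that ingredient, with a simple mean baseline; RIPT-VLA's actual algorithm is more involved, and both function names are illustrative.

```python
import numpy as np

def sparse_reward_weights(successes):
    """Per-episode advantage from binary outcomes (1.0 = success).

    Subtracting the batch-mean baseline gives positive weight to
    successful episodes and negative weight to failures.
    """
    r = np.asarray(successes, dtype=float)
    return r - r.mean()

def weighted_logprob_loss(logprobs, successes):
    """Advantage-weighted negative log-likelihood over episodes.

    Minimizing this raises the likelihood of actions from successful
    episodes and lowers it for failed ones.
    """
    w = sparse_reward_weights(successes)
    return float(-(w * np.asarray(logprobs, dtype=float)).mean())
```

Because only a scalar success bit per episode is needed, this style of objective fits settings where dense rewards are unavailable.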
What students should reproduce first in this track
Choose one:
OpenVLA-OFT
LLaRA
Then move to:
RIPT-VLA
Good starter benchmarks for this track
LIBERO
one-task adaptation settings
few-demonstration fine-tuning setups
Avoid overly complex multi-robot adaptation projects at the beginning.
Part IX. Recommended code / benchmark libraries
These libraries matter because students should not waste time rebuilding infrastructure from scratch.
1. OpenVLA
Why use it: the main open-source VLA reference implementation
2. OpenVLA-OFT
Why use it: better starting point if you specifically care about practical fine-tuning recipes
3. TinyVLA
Why use it: good entry point for smaller and faster VLA experiments
4. LLaRA
Why use it: useful for studying data-centric adaptation strategies
5. RIPT-VLA
Why use it: useful for low-data interactive post-training
6. Awesome-VLA
Why use it: broad paper navigation after finishing the core list