Vision-Language-Action Research Track
Overview
This page is for students who want to progress from vision-language models and robot-learning basics to vision-language-action (VLA) models, and eventually work on research topics such as:
Efficient / Small-Scale VLA
Reasoning-Centric VLA
Adaptation / Post-Training for VLA
This page is designed for self-study. The goal is not to start from the largest models, but to build enough intuition and experimental experience to understand the current VLA literature and eventually propose practical research ideas.
How to use this page
Start from Core VLA.
Choose one track among Efficient / Reasoning / Adaptation.
Read papers in order.
Reproduce at least one baseline using the provided code.
What I am looking for
I am looking for students who can gradually become capable of reading, reproducing, and eventually extending recent VLA papers.
Good candidates usually:
know basic PyTorch training
can read experiments carefully
can compare methods instead of only summarizing them
can reproduce at least one public codebase
can propose a small but concrete research question
Part I. What I recommend students do first
Option 1. Easiest and most practical path
Read the VLA survey
Read OpenVLA
Read TinyVLA
Reproduce TinyVLA or OpenVLA-OFT on a small benchmark
Then move to SGPT-style adaptation ideas or a small reasoning paper
Option 2. Best path for publishable ideas
Read the VLA survey
Read TinyVLA
Read LLaRA
Read OpenVLA-OFT
Choose one among Hi Robot / CoT-VLA / RIPT-VLA
Option 3. Best path for students interested in current trends
TinyVLA
OpenVLA-OFT
Hi Robot
CoT-VLA
RIPT-VLA
Part II. Suggested first mini-projects
Students should not begin with a huge project. Start with a small project that can realistically be finished.
Project idea A — efficient VLA
Compare:
OpenVLA
TinyVLA
OpenVLA-OFT
on a small benchmark such as LIBERO or a single small real-robot setting, then compare inference speed, task success rate, and data efficiency.
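A comparison like this needs a consistent measurement harness. Below is a minimal sketch of one; the `policy` and `env` interfaces (`env.reset()`, `env.step(action)` returning observation / success flag / done flag) are placeholder assumptions, not the actual API of LIBERO or any specific codebase.

```python
import time
from statistics import mean

def evaluate_policy(policy, env, n_episodes=10, max_steps=200):
    """Roll out a policy and record per-episode success and per-step latency.

    `policy` and `env` are placeholders for whatever interfaces your
    codebase exposes; this harness only assumes `env.reset()`,
    `env.step(action) -> (obs, success, done)`, and `policy(obs)`.
    """
    successes, latencies = [], []
    for _ in range(n_episodes):
        obs, success = env.reset(), False
        for _ in range(max_steps):
            t0 = time.perf_counter()
            action = policy(obs)                          # one forward pass
            latencies.append(time.perf_counter() - t0)    # inference latency
            obs, success, done = env.step(action)
            if done:
                break
        successes.append(float(success))
    return {"success_rate": mean(successes),
            "mean_latency_s": mean(latencies)}
```

Running the same harness for all three models, with the same episode count and seeds, keeps the speed and success-rate comparison fair.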
Project idea B — reasoning-centric VLA
Compare:
OpenVLA
Hi Robot
CoT-VLA
and study whether explicit reasoning or hierarchical decomposition helps with long-horizon or ambiguous instructions.
Project idea C — post-training / adaptation
Compare:
OpenVLA-OFT
LLaRA
RIPT-VLA
and study which adaptation recipe works best under limited data or limited reward signals.
Part III. What to avoid at the beginning
Do not begin with:
the largest closed-source VLA systems
projects that require large-scale pretraining from scratch
papers that mainly scale by using huge robot datasets and huge GPU budgets
methods with no public code unless you are already experienced
long-horizon real-robot projects before you can reproduce a benchmark result
The goal is to build depth, not to fail because the setup is too large.
Part IV. When to contact me
Contact me once you satisfy most of the following.
I understand the basic VLA pipeline at a conceptual level.
I selected one track among efficient / reasoning / adaptation.
I read at least 3 papers in that track.
I ran at least one public codebase successfully.
I can explain the strengths and weaknesses of two methods.
I can suggest one concrete research question.
I prepared a 1–2 page memo.
Part V. Core VLA
These papers are the minimum background. Read them first.
1. VLA Survey (arXiv 2024)
Paper: A Survey on Vision-Language-Action Models for Embodied AI
Why read it: the best broad overview of what VLA means in embodied AI
Focus on: taxonomy, data sources, model design, action prediction, evaluation
2. OpenVLA (CoRL 2024)
Project / Code: https://github.com/openvla/openvla
Why read it: the most important open-source VLA starting point
Focus on: how a pretrained VLM is adapted into a robot policy, action tokenization, cross-embodiment evaluation
Note: important as a reference paper, but too heavy to be the first reproduction target
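To make "action tokenization" concrete: policies in this family discretize each continuous action dimension into a fixed number of bins and emit the bin ids as language-model tokens. The sketch below uses uniform binning with 256 bins as an illustration; the exact bin boundaries and token-vocabulary mapping in OpenVLA differ.

```python
import numpy as np

def actions_to_tokens(actions, low, high, n_bins=256):
    """Discretize continuous actions into integer bin ids (one per dim).

    Uniform binning over [low, high]; illustrative, not the paper's
    exact recipe.
    """
    actions = np.clip(actions, low, high)
    scaled = (actions - low) / (high - low)          # map to [0, 1]
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)

def tokens_to_actions(tokens, low, high, n_bins=256):
    """Map bin ids back to bin-center continuous values."""
    return low + (tokens + 0.5) / n_bins * (high - low)
```

The round trip loses at most half a bin width per dimension, which is why the bin count matters for fine-grained manipulation.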
3. TinyVLA (RA-L 2025)
Paper: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
Project / Code: https://tiny-vla.github.io/ | https://github.com/JayceWen/tinyvla
Why read it: a very good entry paper if you care about practical, lighter-weight VLA
Focus on: faster inference, better data efficiency, and avoiding large pretraining
Good for students because: the motivation is concrete and the experiments are practical
Part VI. Track A — Efficient / Small-Scale VLA
This track is the most recommended starting point for students who want to work on VLA without immediately depending on huge compute.
Typical question:
How can we make VLA models faster, more data-efficient, and easier to adapt to small-scale settings?
Why start with this track
This track helps students build intuition for:
why current VLA models are often expensive
which parts of the pipeline dominate compute and latency
how to get strong results without training the biggest model
Recommended order
A1. TinyVLA (RA-L 2025)
Paper: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
Project / Code: https://tiny-vla.github.io/ | https://github.com/JayceWen/tinyvla
Why read it: one of the clearest papers if you care about lighter-weight VLA
Main idea: start from a strong multimodal backbone and use a diffusion-policy decoder during fine-tuning
Good for students because: practical, concrete, and explicitly motivated by speed and data efficiency
A2. OpenVLA-OFT (RSS 2025)
Paper: Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Why read it: a very useful engineering paper for understanding how to actually adapt a VLA effectively
Main idea: improve OpenVLA through better fine-tuning design choices such as parallel decoding, action chunking, and continuous action prediction
Good for students because: strong empirical story, less abstract than many architecture papers
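Of the design choices above, action chunking is easy to illustrate: the model predicts several future actions per forward pass, and the controller replays them open-loop before querying the model again. The sketch below assumes a hypothetical `policy_chunk(obs) -> (chunk, action_dim)` interface, not the actual OpenVLA-OFT API.

```python
import numpy as np

def run_with_chunking(policy_chunk, env, horizon=12, chunk=4):
    """Execute a policy that predicts a chunk of future actions at once.

    One model call yields `chunk` actions, which are executed open-loop;
    the model is queried again only when the chunk is exhausted, cutting
    inference calls by roughly a factor of `chunk`.
    """
    obs = env.reset()
    executed = []
    while len(executed) < horizon:
        actions = policy_chunk(obs)          # one call, several actions
        for a in actions[:chunk]:
            obs = env.step(a)
            executed.append(a)
            if len(executed) >= horizon:
                break
    return np.array(executed)
```

With `chunk=4` and a 12-step horizon, the model runs 3 times instead of 12, which is where much of the speedup comes from.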
A3. LLaRA (ICLR 2025)
Paper: LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Why read it: useful if you care about data efficiency and instruction-style supervision
Main idea: convert robot demonstrations into conversation-style instruction tuning data
Good for students because: clear idea, relatively easy to connect to practical follow-up projects
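The data-conversion idea can be sketched in a few lines: each (frame, action) pair of a demonstration becomes one question-answer turn, with the action serialized as text. The field names and prompt template below are illustrative; LLaRA's actual templates differ.

```python
def demo_to_conversation(instruction, frames, actions):
    """Convert one robot demonstration into conversation-style
    instruction-tuning data, in the spirit of LLaRA's data recipe.

    `frames` are image references, `actions` are per-step continuous
    action vectors; the prompt wording here is a made-up placeholder.
    """
    turns = []
    for frame, action in zip(frames, actions):
        turns.append({
            "from": "human",
            "value": f"<image> Instruction: {instruction} "
                     "What action should the robot take?",
            "image": frame,
        })
        turns.append({
            "from": "assistant",
            "value": " ".join(f"{x:.3f}" for x in action),  # action as text
        })
    return {"conversations": turns}
```

The resulting records can be fed to a standard visual-instruction-tuning pipeline, which is what makes the recipe data-efficient to implement.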
What students should reproduce first in this track
Choose one:
TinyVLA
OpenVLA-OFT
Then use one paper as follow-up reading:
LLaRA
Good starter benchmarks for this track
LIBERO
one small tabletop manipulation setup
one-task or few-task imitation settings
Avoid pretraining a VLA from scratch at the beginning.
Part VII. Track B — Reasoning-Centric VLA
This track is for students who are interested in making VLA models better at long-horizon, ambiguous, or feedback-driven tasks.
Typical question:
Can we make a VLA reason before acting, instead of directly predicting low-level actions from input tokens?
Why this track is good
This track is suitable for students who want:
a clear high-level research story
visually intuitive algorithmic ideas
a bridge between VLM reasoning and robot action generation
Recommended order
B1. Hi Robot (ICML 2025)
Paper: Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models
Why read it: one of the clearest recent examples of hierarchical reasoning in VLA
Main idea: use a hierarchical structure where the system first reasons about the next step and then executes it with low-level actions
Good for students because: easy to motivate and easy to explain conceptually
B2. CoT-VLA (CVPR 2025)
Paper: CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
Why read it: a strong recent paper on explicit visual reasoning before action generation
Main idea: generate future image frames as visual subgoals before predicting actions
Good for students because: highly intuitive, visual, and easier to discuss than more implicit reasoning methods
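Structurally, inference becomes a two-stage call: imagine a future frame first, then condition the action prediction on it. The sketch below is pure glue code with placeholder callables, not the paper's architecture or API.

```python
def act_with_visual_subgoal(subgoal_model, action_head, obs, instruction):
    """Two-stage inference in the style of CoT-VLA.

    `subgoal_model(obs, instruction)` imagines a future frame as a
    visual subgoal; `action_head(obs, subgoal)` predicts actions toward
    it. Both callables are placeholders for model components.
    """
    subgoal_frame = subgoal_model(obs, instruction)   # imagined future image
    actions = action_head(obs, subgoal_frame)         # actions toward subgoal
    return subgoal_frame, actions
```

The intermediate frame is what makes the reasoning inspectable: you can look at the imagined subgoal and see whether the model misread the instruction before it ever acts.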
B3. RIPT-VLA (arXiv 2025)
Paper: Interactive Post-Training for Vision-Language-Action Models
Why read it: useful if you want a bridge from pure offline imitation to interactive improvement
Main idea: post-train a VLA using sparse binary success rewards
Good for students because: very practical and strongly focused on low-data adaptation
What students should reproduce first in this track
Choose one:
Hi Robot
CoT-VLA
Then use one follow-up reading:
RIPT-VLA
Good starter benchmarks for this track
LIBERO-Goal / Spatial / Long
one instruction-heavy tabletop task
small long-horizon simulation tasks
Avoid starting from the largest reasoning-heavy models.
Part VIII. Track C — Adaptation / Post-Training for VLA
This track is for students interested in the practical question of how to adapt an existing VLA to new tasks, robots, or low-data settings.
Typical question:
If I already have a pretrained VLA, what is the most practical way to make it work better on my setting?
Why this track is attractive
directly useful for real deployment
easy to connect with smaller experiments
good room for publishable empirical papers
often less dependent on training a huge model from scratch
Recommended order
C1. OpenVLA-OFT (RSS 2025)
Paper: Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Why read it: one of the best papers for understanding practical fine-tuning design choices
Main idea: optimized fine-tuning recipe for faster and stronger OpenVLA adaptation
Good for students because: very concrete and experimentally grounded
C2. LLaRA (ICLR 2025)
Paper: LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Why read it: useful if you want to adapt a model through better data formatting and supervision
Main idea: robot demonstrations are converted into conversation-style instruction tuning data
Good for students because: practical and easy to extend
C3. RIPT-VLA (arXiv 2025)
Paper: Interactive Post-Training for Vision-Language-Action Models
Why read it: useful if you want to adapt a pretrained VLA with sparse interaction signals
Main idea: reinforcement-learning-based interactive post-training with only task-success rewards
Good for students because: low-data and strongly practical
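The basic ingredient of this kind of post-training is turning binary episode outcomes into policy-gradient weights. The sketch below shows only that ingredient, with a simple mean baseline; RIPT-VLA's actual algorithm is more involved, and both function names are illustrative.

```python
import numpy as np

def sparse_reward_weights(successes):
    """Per-episode advantage from binary outcomes (1.0 = success).

    Subtracting the batch-mean baseline gives positive weight to
    successful episodes and negative weight to failures.
    """
    r = np.asarray(successes, dtype=float)
    return r - r.mean()

def weighted_logprob_loss(logprobs, successes):
    """Advantage-weighted negative log-likelihood over episodes.

    Minimizing this raises the likelihood of actions from successful
    episodes and lowers it for failed ones.
    """
    w = sparse_reward_weights(successes)
    return float(-(w * np.asarray(logprobs, dtype=float)).mean())
```

Because only a scalar success bit per episode is needed, this style of objective fits settings where dense rewards are unavailable.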
What students should reproduce first in this track
Choose one:
OpenVLA-OFT
LLaRA
Then move to:
RIPT-VLA
Good starter benchmarks for this track
LIBERO
one-task adaptation settings
few-demonstration fine-tuning setups
Avoid overly complex multi-robot adaptation projects at the beginning.
Part IX. Recommended code / benchmark libraries
These libraries matter because students should not waste time rebuilding infrastructure from scratch.
1. OpenVLA
Why use it: the main open-source VLA reference implementation
2. OpenVLA-OFT
Why use it: better starting point if you specifically care about practical fine-tuning recipes
3. TinyVLA
Why use it: good entry point for smaller and faster VLA experiments
4. LLaRA
Why use it: useful for studying data-centric adaptation strategies
5. RIPT-VLA
Why use it: useful for low-data interactive post-training
6. Awesome-VLA
Why use it: broad paper navigation after finishing the core list