Eric Hannus1, Miika Malin2, Tran Nguyen Le3, Ville Kyrki1
1Aalto University, 2University of Oulu, 3Technical University of Denmark
Preprint | Code (coming soon) | Dataset (coming soon)
Vision-language-action models (VLAs) must generate actions in real time, which limits practical model sizes and, consequently, their ability to handle semantically complex tasks.
We propose IA-VLA, a framework that leverages the extensive language understanding of a large vision-language model (VLM) as a pre-processing stage to generate improved context that augments the input of a VLA.
We study tasks involving visual duplicates, i.e., visually indistinguishable objects, which require complex instructions using relative positional relationships to identify the target object instance.
We use semantic segmentation to label image regions, which a VLM then uses to identify the masks of the task-relevant objects. These objects are highlighted in the VLA input, together with the language instruction, which can optionally be simplified. We refer to the framework variant with simplified instructions as IA-VLA-relabeled.
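To make the pre-processing stage concrete, the following is a minimal Python sketch, assuming user-supplied segment and query_vlm callables; the function names, prompt wording, and JSON reply format are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch of the IA-VLA pre-processing stage (not the released code).
# `segment` and `query_vlm` are placeholders for whichever segmentation model
# and large VLM endpoint are plugged in.
import json
import numpy as np


def augment_vla_input(image, instruction, segment, query_vlm, relabel=False):
    """Return a (highlighted_image, instruction) pair to feed to the VLA.

    image:       HxWx3 uint8 array, the current camera observation.
    instruction: original natural-language command.
    segment:     callable image -> list of boolean HxW masks.
    query_vlm:   callable (image, prompt) -> str, a large VLM endpoint.
    relabel:     if True, use the VLM's simplified instruction
                 (the IA-VLA-relabeled variant).
    """
    masks = segment(image)

    # Ask the VLM which labeled regions the instruction refers to, and for an
    # optional simplified instruction. In the full pipeline the regions would be
    # drawn onto the image; here they are only described in the prompt.
    prompt = (
        f"The image contains {len(masks)} labeled regions, numbered 0 to {len(masks) - 1}. "
        f"Instruction: '{instruction}'. "
        'Reply with JSON: {"relevant": [<region indices>], "simple": "<simplified instruction>"}'
    )
    reply = json.loads(query_vlm(image, prompt))

    # Highlight the task-relevant masks in the image passed to the VLA
    # by blending them with a solid color.
    highlighted = image.copy()
    for idx in reply["relevant"]:
        highlighted[masks[idx]] = (
            0.5 * highlighted[masks[idx]] + 0.5 * np.array([255, 0, 0])
        ).astype(np.uint8)

    final_instruction = reply["simple"] if relabel else instruction
    return highlighted, final_instruction
```

The augmented image and instruction then replace the raw observation and original command in the VLA's input.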
Seen language instruction, potentially unseen configuration:
lift the second green block from the right
put the cucumber in the third pot from the left
open the second drawer from the right on the top row
When evaluated on seen instructions, the baseline VLA often succeeds, but in task settings with a large space of unseen configurations, such as the block lifting setting, the input augmentation is beneficial.
Unseen language instruction, combines seen concepts:
lift the third green block from the left
put the carrots in the rightmost pot
open the third drawer from the right on the top row
When evaluated on more difficult instructions that contain unseen combinations of seen concepts (such as object appearances and relative positions), the input augmentation improves success rates. In the drawers setting, IA-VLA-relabeled performs more robustly than IA-VLA with the original instructions.
Unseen language instruction, extrapolates from seen concepts:
lift the fifth blue block from the right
put the tomato in the fourth pot from the right
open the fourth drawer from the left on the middle row
When evaluated on the most difficult instructions, which require extrapolation from seen concepts (e.g., the 2nd and 3rd from the left appear in the demonstrations but the 4th from the left does not), the benefits of input augmentation are most apparent. In the drawers setting, the performance gap between IA-VLA-relabeled and IA-VLA widens further (see the paper for discussion).
To cite this work, please use the following BibTeX entry:
@misc{hannus2025iavla,
title={IA-VLA: Input Augmentation for Vision-Language-Action models in settings with semantically complex tasks},
author={Eric Hannus and Miika Malin and Tran Nguyen Le and Ville Kyrki},
year={2025},
eprint={2509.24768},
archivePrefix={arXiv},
primaryClass={cs.RO},
}