Eric Hannus1, Miika Malin2, Tran Nguyen Le3, Ville Kyrki1
1Aalto University, 2University of Oulu, 3Technical University of Denmark
Preprint | Code (coming soon) | Dataset (coming soon)
Vision-language-action models (VLAs) must generate actions in real time, which limits practical model sizes and, consequently, their ability to handle semantically complex tasks.
We propose IA-VLA, a framework that leverages the extensive language understanding of a large vision-language model (VLM) as a pre-processing stage to generate improved context that augments the input of a VLA.
We study tasks involving visual duplicates, i.e., visually indistinguishable objects, which require complex instructions using relative positional relationships to identify the target object instance.
We use semantic segmentation to label image regions, which a VLM then uses to identify the masks of the task-relevant objects. These objects are highlighted in the VLA input, together with the language instruction, which can optionally be simplified. We refer to the framework variant with simplified instructions as IA-VLA-relabeled.
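To make the pre-processing stage concrete, the following is a minimal Python sketch, assuming user-supplied segment and query_vlm callables; the function names, prompt wording, and JSON reply format are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch of the IA-VLA pre-processing stage (not the released code).
# `segment` and `query_vlm` are placeholders for whichever segmentation model
# and large VLM endpoint are plugged in.
import json
import numpy as np


def augment_vla_input(image, instruction, segment, query_vlm, relabel=False):
    """Return a (highlighted_image, instruction) pair to feed to the VLA.

    image:       HxWx3 uint8 array, the current camera observation.
    instruction: original natural-language command.
    segment:     callable image -> list of boolean HxW masks.
    query_vlm:   callable (image, prompt) -> str, a large VLM endpoint.
    relabel:     if True, use the VLM's simplified instruction
                 (the IA-VLA-relabeled variant).
    """
    masks = segment(image)

    # Ask the VLM which labeled regions the instruction refers to, and for an
    # optional simplified instruction. In the full pipeline the regions would be
    # drawn onto the image; here they are only described in the prompt.
    prompt = (
        f"The image contains {len(masks)} labeled regions, numbered 0 to {len(masks) - 1}. "
        f"Instruction: '{instruction}'. "
        'Reply with JSON: {"relevant": [<region indices>], "simple": "<simplified instruction>"}'
    )
    reply = json.loads(query_vlm(image, prompt))

    # Highlight the task-relevant masks in the image passed to the VLA
    # by blending them with a solid color.
    highlighted = image.copy()
    for idx in reply["relevant"]:
        highlighted[masks[idx]] = (
            0.5 * highlighted[masks[idx]] + 0.5 * np.array([255, 0, 0])
        ).astype(np.uint8)

    final_instruction = reply["simple"] if relabel else instruction
    return highlighted, final_instruction
```

The augmented image and instruction then replace the raw observation and original command in the VLA's input.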
Seen language instruction, potentially unseen configuration:
lift the second green block from the right
put the cucumber in the third pot from the left
open the second drawer from the right on the top row
When evaluated on seen instructions, the baseline VLA often succeeds, but in task settings with a large space of unseen configurations, such as the block lifting setting, the input augmentation is beneficial.
Unseen language instruction, combines seen concepts:
lift the third green block from the left
put the carrots in the rightmost pot
open the third drawer from the right on the top row
When evaluated on more difficult instructions that contain unseen combinations of seen concepts (such as object appearances and relative positions), the input augmentation improves success rates. In the drawers setting, IA-VLA-relabeled performs more robustly than IA-VLA with the original instructions.
Unseen language instruction, extrapolates from seen concepts:
lift the fifth blue block from the right
put the tomato in the fourth pot from the right
open the fourth drawer from the left on the middle row
When evaluated on the most difficult instructions, which require extrapolation from seen concepts (e.g., the 2nd and 3rd from the left appear in the demonstrations but the 4th from the left does not), the benefits of input augmentation are most apparent. In the drawers setting, the performance gap between IA-VLA-relabeled and IA-VLA widens further (see the paper for discussion).
To cite this work, please use the following BibTeX entry:
@misc{hannus2025iavla,
title={IA-VLA: Input Augmentation for Vision-Language-Action models in settings with semantically complex tasks},
author={Eric Hannus and Miika Malin and Tran Nguyen Le and Ville Kyrki},
year={2025},
eprint={2509.24768},
archivePrefix={arXiv},
primaryClass={cs.RO},
}