VLM Can Be a Good Assistant: Enhancing Embodied Visual Tracking with Self-Improving Vision-Language Models
IROS25
Kui Wu, Shuhang Xu, Hao Chen, Churan Wang, Zhoujun Li, Yizhou Wang, Fangwei Zhong
<Discrepancy Analysis>:
The provided context analysis indicates that the target was last seen moving toward the right side, near a doorway, partially obscured by concrete blocks. The robot's action sequence focused on moving forward with slight turns, but the images show that the robot encountered additional obstacles and never gained a clear line of sight to the doorway. The executed action sequence did not account for the concrete blocks that the robot needed to maneuver around.
<Adjustment Suggestion>:
[Turn Right, Move Forward, Move Forward, Turn Right, Move Forward, Turn Left]
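The transcript above illustrates the structured response format: a `<Discrepancy Analysis>:` paragraph followed by an `<Adjustment Suggestion>:` line carrying a bracketed action list. A minimal sketch of how a tracker could parse such a response into discrete actions is shown below; the tag names are taken from the example, while the parser function and the action vocabulary are assumptions for illustration, not the paper's implementation.

```python
import re

# Assumed action vocabulary, inferred from the example transcript above.
ACTIONS = {"Move Forward", "Turn Left", "Turn Right"}

def parse_adjustment(response: str) -> list[str]:
    """Extract the suggested action sequence from a VLM response.

    Looks for the '<Adjustment Suggestion>:' tag followed by a
    bracketed, comma-separated action list, and returns only the
    actions the controller understands.
    """
    match = re.search(r"<Adjustment Suggestion>:\s*\[(.*?)\]", response, re.S)
    if not match:
        return []
    actions = [a.strip().title() for a in match.group(1).split(",")]
    return [a for a in actions if a in ACTIONS]

response = """<Discrepancy Analysis>:
...analysis text...
<Adjustment Suggestion>:
[Turn Right, Move Forward, Move Forward, Turn Right, Move Forward, Turn Left]"""

print(parse_adjustment(response))
# → ['Turn Right', 'Move Forward', 'Move Forward', 'Turn Right', 'Move Forward', 'Turn Left']
```

Normalizing case with `.title()` also tolerates minor formatting drift in the VLM output, such as the mixed capitalization ("Turn right" vs. "Turn Right") seen in raw responses.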
@misc{wu2025vlmgoodassistantenhancing,
title={VLM Can Be a Good Assistant: Enhancing Embodied Visual Tracking with Self-Improving Vision-Language Models},
author={Kui Wu and Shuhang Xu and Hao Chen and Churan Wang and Zhoujun Li and Yizhou Wang and Fangwei Zhong},
year={2025},
eprint={2505.20718},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.20718},
}