Imitation Learning (IL)


Robot imitation learning enables robots to acquire task behaviors directly from expert demonstrations, making it a data-efficient approach to learning complex manipulation skills. Our lab studies imitation learning in the context of vision-based robot control and Vision-Language-Action (VLA) models. In particular, we propose RetoVLA, an architecture that reuses the register tokens normally discarded by Vision Transformers (ViTs), injecting them into the action generation module to improve spatial reasoning. This approach has shown strong performance gains on real robot tasks that require complex spatial understanding. We also develop Depth-ACT, a framework that combines RGB and depth encoders to provide imitation learning policies with richer 3D scene information.
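As a rough illustration of the register-token idea, the sketch below splits a ViT output sequence into CLS, register, and patch tokens, then fuses pooled register features with the usual visual feature before a toy action head. All shapes, names, and the linear head are hypothetical choices for illustration; this is not the actual RetoVLA implementation.

```python
import numpy as np

# Hypothetical token layout: [CLS | register tokens | patch tokens].
NUM_REG = 4     # number of register tokens (assumed)
EMBED_DIM = 8   # ViT embedding dimension (toy size)
ACTION_DIM = 7  # e.g. 6-DoF end-effector pose + gripper (assumed)

rng = np.random.default_rng(0)

def split_vit_tokens(tokens, num_reg=NUM_REG):
    """Split ViT output of shape [1 + num_reg + P, D] into its parts."""
    cls_tok = tokens[0]
    reg_toks = tokens[1 : 1 + num_reg]
    patch_toks = tokens[1 + num_reg :]
    return cls_tok, reg_toks, patch_toks

def action_head(features, w):
    """Toy linear action head mapping fused features to an action vector."""
    return features @ w

# Toy ViT output: 1 CLS + 4 register + 16 patch tokens.
tokens = rng.standard_normal((1 + NUM_REG + 16, EMBED_DIM))
cls_tok, reg_toks, patch_toks = split_vit_tokens(tokens)

# Instead of discarding reg_toks, pool them and concatenate with the
# usual visual feature (here: mean-pooled patch tokens) so the action
# head also sees the register-token information.
visual_feat = patch_toks.mean(axis=0)
reg_feat = reg_toks.mean(axis=0)
fused = np.concatenate([visual_feat, reg_feat])  # shape (2 * EMBED_DIM,)

w = rng.standard_normal((2 * EMBED_DIM, ACTION_DIM))
action = action_head(fused, w)
print(action.shape)  # (7,)
```

The same concatenation pattern can sketch the Depth-ACT direction as well: replace `reg_feat` with a pooled feature from a separate depth encoder, and the policy head receives fused RGB and depth information.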

Through these efforts, our research aims to improve spatial awareness, data efficiency, and real-world applicability in robot imitation learning.