Hierarchical Instruction-aware Embodied Visual Tracking

Code

Virtual Environment

Training - Goal-Conditioned trajectory collection

Multi-goal trajectory collection

Recalculate IoU-based reward

Evaluation - Tracking in Virtual environments

FlexibleRoom

ContainerYard

Suburb

ChemicalFactory

Comparison - Different Methods & the Impact of Time Latency

RL-based agent

A visual tracking agent trained with reinforcement learning can track a target person in real time in complex environments. However, the agent cannot actively adjust its distance or angle relative to the target: it maintains only a fixed optimal distance and a zero-degree relative angle throughout the tracking process.

OpenVLA

The agent employs our fine-tuned OpenVLA model with the prompt "Track the person" for real-time inference. While OpenVLA achieves basic real-time performance, it is only capable of maintaining tracking at a limited distance in complex environments. Additionally, the model experiences frequent viewpoint jitter during the initial tracking phase, indicating challenges in both accuracy and visual robustness.

VLM-based agent with Real-time Mode

The environment and target keep running while GPT is responding to the instruction. Because of the response latency, the agent easily loses the target from view.

The left image is used for GPT reasoning; the right image is the tracker's observation.

VLM-based agent with Step-wise Pause Mode

The environment and target are paused while waiting for the GPT response. The agent can keep tracking the target for a while; however, this setting is unrealistic in the real world.

The left image is used for GPT reasoning; the right image is the tracker's observation.
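The difference between the two modes can be illustrated with a toy simulation. This is only a sketch under invented assumptions, not the experimental setup: the target drifts sideways by a fixed amount per environment tick, the VLM takes a fixed number of ticks to answer, and the target counts as lost once its offset exceeds half the field of view. The function name `rounds_in_view` and all parameter values are hypothetical.

```python
def rounds_in_view(pause_env: bool,
                   rounds: int = 20,
                   latency_ticks: int = 5,
                   drift_per_tick: float = 1.0,
                   half_fov: float = 3.0) -> int:
    """Count decision rounds completed before the target leaves the view.

    All quantities are hypothetical, unitless values chosen for illustration.
    """
    offset = 0.0  # lateral offset of the target from the image center
    for r in range(rounds):
        observed = offset  # frame sent to the VLM for reasoning
        if not pause_env:
            # Real-time mode: the world keeps running while we wait.
            offset += drift_per_tick * latency_ticks
            if abs(offset) > half_fov:
                return r
        # Apply the correction computed from the (possibly stale) frame.
        offset -= observed
        # One environment tick passes while the action is executed.
        offset += drift_per_tick
        if abs(offset) > half_fov:
            return r
    return rounds


if __name__ == "__main__":
    print("step-wise pause:", rounds_in_view(pause_env=True))
    print("real-time:", rounds_in_view(pause_env=False))
```

In this toy model the paused agent never falls behind, while the real-time agent loses the target as soon as the drift accumulated during one response exceeds the field of view. Lowering `latency_ticks` lets the real-time agent keep up.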

Real World Deployment

IA-EVT was deployed on a wheeled robot and tested in real-world scenarios, demonstrating its performance and adaptability in dynamic environments.

We use the RoboMaster EP, a four-wheeled robot manufactured by DJI, as the robot platform.
The system was run on a laptop equipped with an NVIDIA RTX A3000 GPU, serving as both the computing platform and communication base station for real-time processing and control.

Adapting to highly dynamic movement

The robot can adapt to highly dynamic movement of the target while simultaneously understanding and executing the given instructions, maintaining effective tracking throughout.

"Move closer to the target"
The robot accurately approaches a stationary, standing person and then maintains the desired distance and relative position, continuously tracking the target while preserving its spatial alignment.

"Get Closer to the target and keep him in the left"
The robot interprets both the closer-distance and the left-positioning parts of the instruction while moving, generating a larger bounding box on the left and swiftly adjusting its position relative to the target in real time.

"Move Further from the target and keep him in the left"
The robot interprets both the farther-distance and the left-positioning parts of the instruction while moving, generating a smaller bounding box on the left and swiftly adjusting its position relative to the target in real time.
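The relationship between an instruction, its goal bounding box, and an IoU-based reward can be sketched as follows. This is a minimal illustration, not the project's implementation: the box format (normalized `x_min, y_min, x_max, y_max`), the two example goal boxes, and the function names are assumptions made here. Only the intersection-over-union formula itself is standard.

```python
from typing import Tuple

# (x_min, y_min, x_max, y_max) in normalized image coordinates (assumed format)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical goal boxes: a larger box on the left for "closer, left",
# a smaller box on the left for "further, left".
GOAL_CLOSER_LEFT: Box = (0.10, 0.10, 0.50, 0.90)
GOAL_FURTHER_LEFT: Box = (0.20, 0.35, 0.35, 0.65)


def iou_reward(observed: Box, goal: Box) -> float:
    """Reward is higher the better the observed target box matches the goal."""
    return iou(observed, goal)


if __name__ == "__main__":
    observed = (0.15, 0.15, 0.50, 0.85)  # made-up detection of the target
    print(iou_reward(observed, GOAL_CLOSER_LEFT))
    print(iou_reward(observed, GOAL_FURTHER_LEFT))
```

Under these assumptions, a target that appears large and on the left scores higher against the "closer, left" goal than against the "further, left" goal, which is the behavior the two instructions above ask for.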

Additional Environment Screenshots

Old Factory

A deteriorated factory environment featuring numerous steel pillars and scattered wooden crates on the ground, which create potential visual occlusions and tracking challenges.

Container Yard

A scenario with stacked containers and dynamic lighting, assessing tracking robustness under changing light conditions.

Desert Ruins

A historical ruins environment set in the desert, featuring scattered walls and pillars. These elements create a complex layout distribution challenge, requiring the agent to adapt to diverse spatial structures and occlusions.

Brass Gardens

A palace-style environment featuring unique narrow corridors and passageways, along with multi-level platforms connected by staircases. This scenario highlights movement patterns different from planar walking, challenging the agent’s ability to navigate and track in a multi-level structured space.

Modular Old Town

A European-style village built on a hillside, featuring interconnected indoor and outdoor corridors, as well as narrow, undulating stairways. This environment presents challenges in navigating complex architectural structures.

Roof City

A rooftop cityscape containing low obstacles such as protruding air ducts and scattered debris, requiring the agent to adapt to constrained spaces and varying elevation levels.
