Abstract
The rapid progress of vision–language models (VLMs) has sparked growing interest in robotic control, where natural language expresses operation goals while visual feedback links perception to action. However, directly deploying VLM-driven policies on aerial manipulators remains unsafe and unreliable, since the generated actions are often inconsistent, hallucination-prone, and dynamically infeasible for flight.
In this work, we present AERMANI-VLM, the first framework to adapt pretrained VLMs for aerial manipulation by separating high-level reasoning from low-level control, without any task-specific fine-tuning. Our framework encodes natural language instructions, task context, and safety constraints into a structured prompt that guides the model to generate a step-by-step reasoning trace in natural language. This reasoning output is used to select from a predefined library of discrete, flight-safe skills, ensuring interpretable and temporally consistent execution.
By decoupling symbolic reasoning from physical action, AERMANI-VLM mitigates hallucinated commands and prevents unsafe behavior, enabling robust task completion. We validate the framework in both simulation and hardware on diverse multi-step pick-and-place tasks, demonstrating strong generalization to previously unseen commands, objects, and environments.
Fig. Overview of the AERMANI-VLM pipeline for vision-language-guided aerial manipulation.
(1) Input: a user provides a natural language command.
(2) Prompt compilation: the command is compiled into a structured prompt containing a preamble, reasoning history, skill definitions, and safety rules.
(3) VLM inference: together with the current RGB observation, the prompt is processed by a pretrained VLM, which outputs an image description, task summary, explicit reasoning trace, and a discrete skill to execute.
(4) Skill execution: the selected motion primitive or perception-driven routine is executed by deterministic low-level controllers, ensuring repeatability under flight dynamics.
This reasoning–action loop continues until task completion, enabling the VLM to focus on semantic reasoning while delegating precise execution to robust controllers.
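A minimal sketch of this reasoning–action loop in Python; compile_prompt, query_vlm, and the skills registry are hypothetical stand-ins for the prompt compiler, pretrained VLM, and deterministic low-level controllers described above:

# Sketch of the reasoning-action loop in steps (1)-(4); all helpers are
# hypothetical placeholders, not the actual AERMANI-VLM interfaces.
def run_task(command: str, camera, skills: dict, max_steps: int = 50) -> bool:
    history = []  # past reasoning traces and executed skills
    for _ in range(max_steps):
        rgb = camera.get_rgb()                     # current RGB observation
        prompt = compile_prompt(command, history)  # (2) structured prompt
        drt, skill_name = query_vlm(rgb, prompt)   # (3) reasoning trace + skill
        if skill_name == "done":                   # VLM reports completion
            return True
        skills[skill_name]()                       # (4) deterministic execution
        history.append((drt, skill_name))          # maintain temporal continuity
    return False                                   # step budget exhausted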
Fig. Coordinate frames and spatial grounding in AERMANI-VLM.
(i) The global world frame W anchors all transformations, defining poses for the aerial manipulator (${}^{W}T_{AM}$), target object (${}^{W}T_{O}$), and placement location (${}^{W}T_{TP}$).
(ii) Onboard frames for the camera (${}^{AM}T_{C}$) and gripper (${}^{AM}T_{G}$) are expressed relative to the manipulator body, maintaining consistency between perception and control.
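These frames compose by chaining transforms. For example, assuming the perception pipeline estimates the object pose in the camera frame (a hypothetical ${}^{C}T_{O}$), its world-frame pose follows as

$${}^{W}T_{O} = {}^{W}T_{AM}\,{}^{AM}T_{C}\,{}^{C}T_{O},$$

which is what allows a grasp target detected onboard to be expressed in the same frame as the placement location.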
Fig. The open-vocabulary perception pipeline for object localization (top row) and placement localization (bottom row). Given a natural language query (e.g., “Purple Cup”), CLIPSeg performs zero-shot segmentation on the input RGB image. This 2D mask is then used to extract a filtered 3D point cloud from the depth data, allowing for the precise calculation of a grasp or placement pose.
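A minimal sketch of this perception pipeline, assuming the public CLIPSeg checkpoint from HuggingFace transformers and known pinhole intrinsics (fx, fy, cx, cy); the 0.5 threshold and the centroid grasp point are illustrative choices, not necessarily the paper's exact procedure:

import numpy as np
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

def localize(query: str, rgb: Image.Image, depth: np.ndarray,
             fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Return a 3D point (camera frame) for a natural language query."""
    inputs = processor(text=[query], images=[rgb], return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits            # zero-shot segmentation logits
    probs = torch.sigmoid(logits).squeeze().numpy()
    # Resize the 2D mask to the depth resolution and threshold it.
    mask = np.array(Image.fromarray(probs).resize(depth.shape[::-1])) > 0.5
    v, u = np.nonzero(mask & (depth > 0))          # valid masked pixels
    z = depth[v, u]
    # Back-project masked pixels into a filtered 3D point cloud (pinhole model).
    pts = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=1)
    return pts.mean(axis=0)                        # centroid as grasp/placement point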
AERMANI-VLM
We propose AERMANI-VLM, the first system that integrates pretrained vision–language reasoning with aerial manipulation. The core idea is to decouple what to do from how to do it: the VLM handles high-level reasoning under structured prompting, while a predefined library of flight-safe skills executes the resulting actions. This design retains the generalization ability of large models while ensuring stable and repeatable control. The main contributions are:
1) To our knowledge, this is the first formalization of vision–language-guided aerial manipulation as sequential decision-making under partial observability, unifying natural-language reasoning, onboard perception, and task-level skill selection without any task-specific training or fine-tuning.
2) Structured input prompting that encodes task intent, reasoning history, skill library, and safety constraints into a unified query for the pretrained VLM, enabling temporally consistent, context-aware, and physically valid reasoning while reducing hallucinations and maintaining task grounding.
3) Structured reasoning output via Descriptive Reasoning Trace (DRT) that constrains the pretrained VLM to produce interpretable, temporally grounded decision traces instead of free-text commands, enforcing reasoning-before-action execution for consistent skill selection.
Structured Prompting for Reasoning
The effectiveness of AERMANI-VLM depends not only on the pretrained VLM but also on how it is queried. At each timestep t, the structured prompt Pₜ is defined as:
Pₜ = { L, Preamble, History, Skill Library, Rules }
Each component organizes task context and operational constraints into a consistent input format.
Task Command (L): Specifies the overall objective in natural language.
Preamble: Defines the VLM’s role, output format, and safety constraints.
History: Stores past reasoning traces and executed skills to maintain temporal continuity.
Skill Library: Lists all executable routines, expressed in natural language for interpretability.
Rules: Enforce a structured response containing a Descriptive Reasoning Trace (DRT) and one Skill To Execute (STE).
This structure grounds the VLM’s reasoning in both semantic context and executable constraints, ensuring temporally consistent and physically valid decisions. Unlike unstructured prompting, it enforces verifiable reasoning sequences that link high-level inference to safety-critical aerial control.
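A minimal sketch of how such a prompt might be assembled; the section texts and formatting below are illustrative, not the exact prompts used by AERMANI-VLM:

def compile_prompt(command: str, history: list, skill_library: dict, rules: str) -> str:
    """Assemble the structured prompt P_t = {L, Preamble, History, Skills, Rules}."""
    preamble = (
        "You are the reasoning module of an aerial manipulator. Respond with a "
        "Descriptive Reasoning Trace (DRT) and exactly one Skill To Execute (STE)."
    )
    skills = "\n".join(f"- {name}: {desc}" for name, desc in skill_library.items())
    past = "\n".join(f"[step {i}] {drt} -> {ste}"
                     for i, (drt, ste) in enumerate(history)) or "(none)"
    return (f"TASK COMMAND: {command}\n\nPREAMBLE: {preamble}\n\n"
            f"HISTORY:\n{past}\n\nSKILL LIBRARY:\n{skills}\n\nRULES:\n{rules}")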
Given Pₜ and the current observation oₜ, the VLM generates:
γₜ = π(oₜ, Pₜ) = { DRTₜ, STEₜ }
Here, the DRT provides the reasoning trace and the STE identifies the executable skill. The DRT is composed of four interpretable fields:
DRTₜ = { Image Description, Summary, Action Prediction, Reasoning }
Each field serves a distinct functional role:
Image Description: Grounds reasoning in the current visual scene for spatial awareness.
Summary: Records progress and past actions, maintaining temporal coherence under partial observability.
Action Prediction: Compels the model to evaluate all available skills before selection, reducing hallucinated or inconsistent outputs.
Reasoning: Synthesizes perception, memory, and predicted actions into a coherent justification.
Together, these components enforce a verifiable chain-of-thought, compelling the VLM to produce reasoning traces that can be inspected and validated by a human operator.
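As a concrete illustration of this verifiability, the structured output can be checked before any skill is executed; a minimal sketch, assuming the VLM returns one labeled field per line and that unknown skills are rejected rather than flown:

from dataclasses import dataclass

REQUIRED_FIELDS = ("Image Description", "Summary", "Action Prediction", "Reasoning")

@dataclass
class DRT:
    image_description: str
    summary: str
    action_prediction: str
    reasoning: str

def parse_response(text: str, skill_library: set) -> tuple:
    """Extract (DRT_t, STE_t) from the VLM response; raise if malformed."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    missing = [f for f in REQUIRED_FIELDS if f not in fields]
    if missing:
        raise ValueError(f"DRT missing fields: {missing}")
    ste = fields.get("Skill To Execute", "")
    if ste not in skill_library:
        raise ValueError(f"Hallucinated or unknown skill: {ste!r}")
    drt = DRT(fields["Image Description"], fields["Summary"],
              fields["Action Prediction"], fields["Reasoning"])
    return drt, ste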
Results
Fig. Qualitative results from a real-world hardware experiment for the command: “Pick up the purple cup next to the coffee machine and place it on the wooden table.” Each panel shows the first-person onboard view (large bottom image) and two static third-person views (small top images). The numbered sequence illustrates the complete, autonomous execution of the task. (1–3) The aerial manipulator (AM) performs an active search to find the target object. (4–5) It executes a visually guided approach and grasp. (6–8) After securing the cup, it searches for the destination table. (9–10) Finally, it approaches the table and places the object.
Fig. Example output from the pretrained VLM under AERMANI-VLM prompting. The model generates a structured reasoning trace (DRT) consisting of image description, task summary, candidate action predictions with justifications, and step-by-step reasoning. From this trace, the system selects a discrete, flight-safe skill (STE) to execute.