The multimodal bidirectional interaction interface serves as the user’s primary access point to our framework. Here, the user’s task instructions are detected across input modalities, processed, and, in the case of vocal or audio instructions, transcribed into textual representations suitable for further linguistic processing.
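The sketch below illustrates this modality-normalisation step under the assumption that an off-the-shelf speech-to-text model (here, openai-whisper) handles transcription; the transcriber, the `UserInstruction` container, and the `to_text` helper are illustrative placeholders rather than ReLI's actual interface.

```python
# Minimal sketch: route any input modality to a plain-text instruction.
# Assumes openai-whisper (pip install openai-whisper) purely for illustration.
from dataclasses import dataclass

import whisper  # assumed transcription back-end, not necessarily ReLI's


@dataclass
class UserInstruction:
    modality: str   # "text" or "audio"
    payload: str    # raw text, or a path to an audio file


def to_text(instruction: UserInstruction, asr_model=None) -> str:
    """Normalise a user instruction into text suitable for linguistic processing."""
    if instruction.modality == "text":
        return instruction.payload.strip()
    if instruction.modality == "audio":
        asr_model = asr_model or whisper.load_model("base")
        result = asr_model.transcribe(instruction.payload)  # returns {"text": ...}
        return result["text"].strip()
    raise ValueError(f"Unsupported modality: {instruction.modality}")


# Example: both modalities yield the same kind of textual instruction.
print(to_text(UserInstruction("text", "Go to the kitchen and fetch the red cup")))
# print(to_text(UserInstruction("audio", "command.wav")))
```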
In the action decision and command parsing pipeline, we leverage the reasoning capabilities of a pre-trained large language model (LLM) to interpret the high-level natural-language instructions from (a) and parse them into robot-actionable commands.
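As a hedged illustration of this parsing step, the sketch below prompts an OpenAI-style chat endpoint to decompose an instruction into a JSON list of primitive commands; the model name, prompt, and action schema are assumptions for the example and not ReLI's actual configuration.

```python
# Sketch: LLM-based decomposition of a natural-language instruction into commands.
# Assumes the openai Python client (>=1.0) and an OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

SYSTEM_PROMPT = (
    "You convert natural-language robot instructions into a JSON list of primitive "
    'commands, e.g. [{"action": "navigate", "target": "kitchen"}, '
    '{"action": "pick", "target": "red cup"}]. Output JSON only.'
)


def parse_instruction(instruction: str, model: str = "gpt-4o-mini") -> list[dict]:
    """Ask the LLM to reason over the instruction and emit robot-actionable commands."""
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": instruction},
        ],
        temperature=0.0,  # deterministic parsing
    )
    return json.loads(resp.choices[0].message.content)


# Example: "Go to the kitchen and fetch the red cup" ->
# [{"action": "navigate", "target": "kitchen"}, {"action": "pick", "target": "red cup"}]
```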
ReLI’s visuo-lingual pipeline relies on open-vocabulary vision-language models (VLMs), e.g., CLIP, and zero-shot computer vision models, e.g., the Segment Anything Model (SAM). We further augment these models with geometric depth fusion and uncertainty-aware classification to ground linguistic references to spatially localised entities within the agent's operational environment.
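A minimal sketch of this grounding step is given below, assuming the segment-anything and HuggingFace transformers packages; the depth fusion (median depth inside the winning mask) and the fixed confidence threshold are simplified stand-ins for the full geometric and uncertainty-aware treatment, and the checkpoint path is assumed.

```python
# Sketch: SAM proposes class-agnostic regions, CLIP scores them against the query,
# a confidence threshold rejects uncertain matches, and depth is fused per mask.
import numpy as np
import torch
from PIL import Image
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # assumed checkpoint path
mask_gen = SamAutomaticMaskGenerator(sam)


def ground(query: str, rgb: np.ndarray, depth: np.ndarray, conf_thresh: float = 0.5):
    """Return the (u, v, z) location best matching the language query, or None."""
    masks = mask_gen.generate(rgb)                    # class-agnostic region proposals
    crops = []
    for m in masks:
        x, y, w, h = (int(v) for v in m["bbox"])
        crops.append(Image.fromarray(rgb[y:y + h, x:x + w]))

    inputs = proc(text=[query], images=crops, return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = clip(**inputs).logits_per_text.softmax(dim=-1)[0]  # one score per region

    best = int(sims.argmax())
    if float(sims[best]) < conf_thresh:               # uncertainty-aware rejection
        return None

    seg = masks[best]["segmentation"]                 # boolean mask, H x W
    ys, xs = np.nonzero(seg)
    z = float(np.median(depth[ys, xs]))               # fuse depth inside the mask
    return np.array([float(xs.mean()), float(ys.mean()), z])
```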
We operationalise the high-level intents derived from the action decision pipeline into physical robot actions through the action execution mechanism (AEM). The AEM manages all navigation tasks, including path planning, obstacle avoidance, sensor-based information retrieval, and safety measures.
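The sketch below shows one way such a dispatch loop could be structured; the `Robot` interface is a hypothetical placeholder for the underlying navigation and manipulation stack (e.g., a ROS action client), not ReLI's actual API.

```python
# Sketch: sequential execution of parsed commands with a simple safety fallback.
from typing import Protocol


class Robot(Protocol):
    # Hypothetical low-level interface; stands in for the real navigation stack.
    def navigate_to(self, target: str) -> bool: ...
    def pick(self, target: str) -> bool: ...
    def stop(self) -> None: ...


def execute(commands: list[dict], robot: Robot) -> bool:
    """Run parsed commands in order; stop the robot and abort on any failure."""
    handlers = {
        "navigate": lambda c: robot.navigate_to(c["target"]),
        "pick": lambda c: robot.pick(c["target"]),
    }
    for cmd in commands:
        handler = handlers.get(cmd["action"])
        if handler is None or not handler(cmd):  # unknown action or execution failure
            robot.stop()                         # safety measure: halt motion
            return False
    return True
```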