Autonomous Navigation with Visual-Language Models and Conformal Prediction
This project addresses a core challenge in embodied AI: enabling autonomous agents to navigate complex, unfamiliar environments safely, efficiently, and interpretably. It integrates visual-language models (VLMs) with conformal prediction so that agents can reason at a high semantic level while retaining statistical coverage guarantees on their decisions under uncertainty.
By leveraging VLMs for contextual understanding and goal grounding, the agent can interpret and act on open-ended instructions in visually rich environments. Conformal prediction calibrates the confidence of the model's decisions, allowing the system to detect uncertainty, avoid unsafe actions, and selectively backtrack when confidence thresholds are not met. A novel tree-style exploration strategy lets the agent retry promising alternatives without re-planning from scratch, improving robustness in long-horizon navigation tasks.
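As a rough illustration of the calibration step described above, the following minimal sketch uses standard split conformal prediction over a discrete set of candidate actions. It is not the project's implementation: the function names (`conformal_threshold`, `prediction_set`, `decide`), the nonconformity score `1 - p(true action)`, and the rule "act only when the set contains exactly one action" are all assumptions made for the example. The per-action probabilities are assumed to come from some VLM-based scorer that is not shown.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal threshold from held-out calibration data.

    cal_probs:  list of per-action probability lists (hypothetical VLM scores).
    cal_labels: index of the correct action for each calibration example.
    alpha:      target miscoverage rate (0.1 aims for about 90% coverage).
    """
    # Nonconformity score: one minus the probability given to the true action.
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: admit every action.
        return 1.0
    return scores[rank - 1]


def prediction_set(probs, qhat):
    """Indices of all actions whose nonconformity score is within the threshold."""
    return [a for a, p in enumerate(probs) if 1.0 - p <= qhat]


def decide(probs, qhat):
    """Assumed policy: act only when exactly one action survives calibration."""
    actions = prediction_set(probs, qhat)
    if len(actions) == 1:
        return "act", actions
    # An empty or ambiguous set is treated as a signal to pause or backtrack.
    return "uncertain", actions
```

Under the usual exchangeability assumption between calibration and test examples, the prediction set contains the correct action with probability at least 1 - alpha. A larger set therefore signals higher uncertainty, which is one plausible trigger for the backtracking behaviour described above.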
The project makes three key contributions: an integrated framework for perception, reasoning, and action under uncertainty; confidence-calibrated exploration policies; and a prototype agent capable of real-time, uncertainty-aware navigation in simulated 3D environments. The system is designed to generalize across tasks and adapt to diverse scenes without retraining.
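One way to picture a tree-style exploration policy with selective backtracking is a depth-first search that remembers the untried alternatives at every node, so a failed branch resumes from the nearest ancestor instead of re-planning from the root. The sketch below is only an assumption about how such a policy could look. `explore_with_backtracking`, `propose`, and `is_goal` are hypothetical names, and `propose` stands in for whatever component returns the confidence-filtered candidate moves at a location.

```python
def explore_with_backtracking(start, is_goal, propose, max_steps=100):
    """Depth-first exploration that retries stored alternatives on failure.

    propose(node) is a hypothetical callback returning candidate next nodes,
    best first; an empty list means no candidate met the confidence threshold.
    Returns the path from start to a goal node, or None if none is found.
    """
    path = [start]                        # current branch, root to agent position
    untried = {start: list(propose(start))}
    visited = {start}
    for _ in range(max_steps):
        node = path[-1]
        if is_goal(node):
            return path
        # Skip alternatives that were already reached through another branch.
        options = [c for c in untried[node] if c not in visited]
        if not options:
            path.pop()                    # dead end: backtrack to the parent
            if not path:
                return None               # every branch has been exhausted
            continue
        nxt = options[0]
        untried[node] = options[1:]       # keep the rest for a later retry
        visited.add(nxt)
        untried[nxt] = list(propose(nxt))
        path.append(nxt)
    return None


if __name__ == "__main__":
    # Toy scene graph (illustrative only): B leads to a dead end, C to the goal.
    graph = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
    route = explore_with_backtracking("A", lambda n: n == "E", lambda n: graph[n])
    print(route)
```

Because each node's remaining alternatives are kept in `untried`, abandoning a dead end costs one pop from `path` rather than a fresh plan, which matches the stated aim of retrying promising options without starting over.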
Anticipated outcomes include improved safety, interpretability, and generalization in embodied decision-making, along with modular tools that combine foundation-model capabilities with statistical learning. By advancing the reliability of AI-driven agents in real-world environments, the project contributes toward scalable, trustworthy embodied AI systems capable of acting intelligently and safely in open-ended tasks.