Ensuring accessible pedestrian navigation requires understanding both what is present in a scene and how objects are arranged in space. However, many existing vision–language AI systems describe images fluently but do not reliably connect their responses to specific image regions, which can lead to incorrect or unsafe guidance. This limitation reduces their usefulness for real-world accessibility support.
We introduce WalkGPT, an AI system designed to provide grounded, accessibility-aware navigation guidance from pedestrian-view images. Given a street image and a user’s question, WalkGPT identifies walkable areas, highlights potential hazards, estimates relative distances, and generates clear, conversational responses grounded directly in the scene. By tightly linking language understanding with pixel-level scene interpretation, WalkGPT produces guidance that is both spatially aware and visually grounded.
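Since WalkGPT is described here only at the interface level, the following minimal Python sketch illustrates that input/output contract; the class, function, and field names are assumptions for illustration and do not correspond to the system's actual code or API.

```python
# Illustrative sketch only: the names and fields below are assumptions meant to
# show the kind of grounded input/output described above, not WalkGPT's real API.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

BBox = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates


@dataclass
class GroundedGuidance:
    """Hypothetical response structure pairing language with scene regions."""
    text: str                                                    # conversational answer
    walkable_regions: List[BBox] = field(default_factory=list)   # walkable areas
    hazards: Dict[str, BBox] = field(default_factory=dict)       # e.g. {"pothole": (...)}
    relative_distances_m: Dict[str, float] = field(default_factory=dict)


def ask_walkgpt(image_path: str, question: str) -> GroundedGuidance:
    """Placeholder for inference: a pedestrian-view image plus a question in,
    spatially grounded guidance out (no real model is invoked in this sketch)."""
    raise NotImplementedError("inference backend not included in this sketch")


# Intended interaction pattern:
# guidance = ask_walkgpt("sidewalk.jpg", "Is it safe to keep walking straight?")
# print(guidance.text)     # e.g. "Yes, the sidewalk ahead is clear for about 5 m ..."
# print(guidance.hazards)  # regions backing any hazards mentioned in the answer
```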
We also introduce PAVE, a large-scale benchmark of 41K pedestrian-view images paired with accessibility-focused questions and grounded answers. Experiments demonstrate that WalkGPT delivers more reliable and spatially consistent navigation guidance than existing approaches.
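For concreteness, a single PAVE example might resemble the record sketched below; the schema and all values are invented for illustration and are not the benchmark's released format.

```python
# Hypothetical PAVE-style record (field names and values are assumptions): each
# pedestrian-view image is paired with an accessibility-focused question and an
# answer whose key phrases are grounded in image regions.
example_record = {
    "image": "images/000123.jpg",
    "question": "Can I reach the crosswalk ahead with a wheelchair?",
    "answer": "Yes, there is a curb ramp about two meters to your right "
              "leading to the crosswalk.",
    "grounding": [
        {"phrase": "curb ramp", "box": [412, 305, 498, 360]},
        {"phrase": "crosswalk", "box": [260, 280, 640, 340]},
    ],
}
```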