Ensuring accessible pedestrian navigation requires understanding both what is present in a scene and how objects are arranged in space. However, many existing vision–language AI systems describe images fluently but do not reliably connect their responses to specific image regions, which can lead to incorrect or unsafe guidance. This limitation reduces their usefulness for real-world accessibility support.
We introduce WalkGPT, an AI system designed to provide grounded, accessibility-aware navigation guidance from pedestrian-view images. Given a street image and a user’s question, WalkGPT identifies walkable areas, highlights potential hazards, estimates relative distances, and generates clear, conversational responses grounded directly in the scene. By tightly linking language understanding with pixel-level scene interpretation, WalkGPT produces guidance that is both spatially aware and visually grounded.
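Since WalkGPT is described here only at the interface level, the following minimal Python sketch illustrates that input/output contract; the class, function, and field names are assumptions for illustration and do not correspond to the system's actual code or API.

```python
# Illustrative sketch only: the names and fields below are assumptions meant to
# show the kind of grounded input/output described above, not WalkGPT's real API.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

BBox = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates


@dataclass
class GroundedGuidance:
    """Hypothetical response structure pairing language with scene regions."""
    text: str                                                    # conversational answer
    walkable_regions: List[BBox] = field(default_factory=list)   # walkable areas
    hazards: Dict[str, BBox] = field(default_factory=dict)       # e.g. {"pothole": (...)}
    relative_distances_m: Dict[str, float] = field(default_factory=dict)


def ask_walkgpt(image_path: str, question: str) -> GroundedGuidance:
    """Placeholder for inference: a pedestrian-view image plus a question in,
    spatially grounded guidance out (no real model is invoked in this sketch)."""
    raise NotImplementedError("inference backend not included in this sketch")


# Intended interaction pattern:
# guidance = ask_walkgpt("sidewalk.jpg", "Is it safe to keep walking straight?")
# print(guidance.text)     # e.g. "Yes, the sidewalk ahead is clear for about 5 m ..."
# print(guidance.hazards)  # regions backing any hazards mentioned in the answer
```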
We also introduce PAVE, a large-scale benchmark of 41K pedestrian-view images paired with accessibility-focused questions and grounded answers. Experiments demonstrate that WalkGPT delivers more reliable and spatially consistent navigation guidance than existing approaches.
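For concreteness, a single PAVE example might resemble the record sketched below; the schema and all values are invented for illustration and are not the benchmark's released format.

```python
# Hypothetical PAVE-style record (field names and values are assumptions): each
# pedestrian-view image is paired with an accessibility-focused question and an
# answer whose key phrases are grounded in image regions.
example_record = {
    "image": "images/000123.jpg",
    "question": "Can I reach the crosswalk ahead with a wheelchair?",
    "answer": "Yes, there is a curb ramp about two meters to your right "
              "leading to the crosswalk.",
    "grounding": [
        {"phrase": "curb ramp", "box": [412, 305, 498, 360]},
        {"phrase": "crosswalk", "box": [260, 280, 640, 340]},
    ],
}
```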