Chang Chen, Liang Lu*, Lei Yang*, Yinqiang Zhang, Yizhou Chen, Ruixing Jia, Jia Pan*
The University of Hong Kong & Centre for Transformative Garment Production
*Corresponding authors
Current exploration methods struggle to search for shops or restaurants in unknown open-world environments due to a lack of prior knowledge. Humans can leverage venue maps, which offer valuable scene priors for exploration planning, by correlating signage in the scene with landmark names on the map. However, the arbitrary shapes and styles of texts on signage, along with multi-view inconsistencies, pose significant challenges for robots to recognize them accurately. Additionally, discrepancies between real-world environments and venue maps hinder the integration of text-level information into planners. This paper introduces a novel signage-aware exploration system that addresses these challenges, enabling robots to utilize venue maps effectively. We propose a signage understanding method that accurately detects and recognizes texts on signage using a diffusion-based text instance retrieval method combined with a 2D-to-3D semantic fusion strategy. Furthermore, we design a venue map-guided exploration-exploitation planner that balances exploration of unknown regions, guided by directional heuristics derived from the venue map, with exploitation, in which the robot approaches signage and adjusts its orientation for better recognition. Experiments in large-scale shopping malls demonstrate that our method surpasses state-of-the-art text spotting methods and traditional exploration approaches in both signage recognition performance and search efficiency.
We propose to leverage the textual information on a venue map to facilitate shop searching in unknown open-world environments. The robot localizes itself in the environment by recognizing the text on a shop sign and matching it to the venue map. It then plans a direction toward the next landmark, "Briketenia".
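To make the localization step concrete, below is a minimal sketch of matching a detected sign-text embedding against the landmark names on the venue map via cosine similarity in a shared latent space. The function name `match_sign_to_venue_map`, the similarity threshold, and the source of the embeddings are illustrative assumptions, not the exact implementation used in the paper.

```python
import numpy as np

def match_sign_to_venue_map(sign_embedding, landmark_names, landmark_embeddings,
                            min_similarity=0.7):
    """Match one detected sign-text embedding (D,) against the embeddings
    (N, D) of the N landmark names on the venue map via cosine similarity."""
    sims = landmark_embeddings @ sign_embedding
    sims = sims / (np.linalg.norm(landmark_embeddings, axis=1)
                   * np.linalg.norm(sign_embedding) + 1e-8)
    best = int(np.argmax(sims))
    if sims[best] < min_similarity:
        return None, float(sims[best])   # no confident match: remain unlocalized
    return landmark_names[best], float(sims[best])

# Illustrative usage with random vectors standing in for real latent features.
names = ["Briketenia", "Cafe Aroma", "Book Corner"]
bank = np.random.randn(3, 256)
query = bank[0] + 0.05 * np.random.randn(256)   # a noisy view of "Briketenia"
print(match_sign_to_venue_map(query, names, bank))
```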
Our method first constructs a topological graph on the given venue map. Then, given an RGB-D image, the proposed signage understanding method recognizes the text on the signage and correlates it with the text set of the venue map. Once the robot is localized on the venue map, the next landmark goal is inferred to guide the selection of frontiers. Throughout the process, our system balances exploration and exploitation to improve both signage recognition accuracy and coverage efficiency.
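The sketch below illustrates one way such a balance could be scored per frontier: an exploration term rewarding alignment with the direction toward the next landmark inferred from the venue map's topological graph, plus an exploitation term rewarding proximity to a sign whose text is still uncertain and would benefit from a closer, re-oriented view. The weights and the exact form of the score are assumptions for illustration, not the planner's actual cost function.

```python
import numpy as np

def score_frontier(frontier_xy, robot_xy, landmark_dir, uncertain_sign_xy=None,
                   w_explore=1.0, w_exploit=0.5):
    """Score a candidate frontier by combining
    (i)  exploration: alignment with the unit direction toward the next
         landmark, as hinted by the venue map, and
    (ii) exploitation: proximity to a sign whose recognition is still
         uncertain and would benefit from a closer view."""
    to_frontier = frontier_xy - robot_xy
    to_frontier = to_frontier / (np.linalg.norm(to_frontier) + 1e-8)
    explore = float(np.dot(to_frontier, landmark_dir))

    exploit = 0.0
    if uncertain_sign_xy is not None:
        exploit = 1.0 / (1.0 + float(np.linalg.norm(frontier_xy - uncertain_sign_xy)))

    return w_explore * explore + w_exploit * exploit

# Illustrative usage: pick the best frontier among candidates.
robot = np.array([0.0, 0.0])
landmark_dir = np.array([1.0, 0.0])   # direction hint from the venue map
frontiers = [np.array([5.0, 1.0]), np.array([-3.0, 4.0]), np.array([2.0, -2.0])]
best = max(frontiers, key=lambda f: score_frontier(f, robot, landmark_dir))
print(best)
```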
Below are some examples of signage recognition during exploration. Red boxes highlight the texts of interest.
Below is a supplementary figure for signage recognition (Fig. 5 in the original paper). Due to limited space, only two rows are shown in the paper.
Limitations of the current signage understanding: Our method is bounded by the performance of the off-the-shelf text-diffusion model and the text recognition model. The ESTextSpotter model (the current state of the art) has two limitations: 1) its performance relies on a large input image size during preprocessing, which incurs high computational overhead; 2) it is ill-suited for aggregating the latent features of long texts. The reason is that it predicts a character sequence of fixed length, with each slot carrying a latent feature before the classification head, and most slots are classified as empty and thus provide no valid embeddings. To ensure that texts of varying valid lengths yield embeddings of the same shape for the 2D-to-3D fusion, we first average the embeddings of the non-empty characters, which may blur the features. We also tried some recent Chinese VLMs and found that they recognize such texts well. However, our insight is that, without fine-tuning on any training set, a better way to recognize text in the open world is to compare appearance-based similarity against a pre-defined text set in the latent space, regardless of which model is used. Although a large model performs well in some cases, it cannot guarantee correct recognition in the open world. That is why we need the 2D-to-3D fusion mechanism as a filter to enhance robustness.
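As a minimal sketch of the aggregation step described above (the embedding shapes and the id of the "empty" character class are assumed for illustration):

```python
import numpy as np

EMPTY_CLASS_ID = 0   # assumed id of the "empty" character class

def aggregate_text_embedding(char_embeddings, char_class_ids):
    """Collapse a fixed-length character sequence into one text-level embedding.

    char_embeddings: (L, D) per-character latent features before the
                     classification head (L is the fixed sequence length)
    char_class_ids:  (L,)   predicted class id of each character slot
    returns:         (D,)   mean embedding over the non-empty characters
    """
    valid = char_class_ids != EMPTY_CLASS_ID
    if not np.any(valid):
        return char_embeddings.mean(axis=0)   # fallback: no valid character slots
    return char_embeddings[valid].mean(axis=0)

# Illustrative usage: 25 character slots, only the first 4 are non-empty.
feats = np.random.randn(25, 256)
classes = np.zeros(25, dtype=int)
classes[:4] = np.array([12, 7, 30, 5])
text_embedding = aggregate_text_embedding(feats, classes)   # shape (256,)
```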
The authors would like to thank Zhongxuan Li, Yipeng Pan, and Yupu Lu for supporting the real-world experiments. The authors would also like to thank Yuecheng Liu for his useful comments on this work.
@ARTICLE{chen2025signage,
author={Chen, Chang and Lu, Liang and Yang, Lei and Zhang, Yinqiang and Chen, Yizhou and Jia, Ruixing and Pan, Jia},
journal={IEEE Robotics and Automation Letters},
title={Signage-Aware Exploration in Open World Using Venue Maps},
year={2025},
volume={10},
number={4},
pages={3414-3421},
keywords={Text recognition;Robots;Planning;Semantics;Feature extraction;Navigation;Three-dimensional displays;Shape;Location awareness;Image recognition;Autonomous agents;semantic scene understanding;mapping;planning under uncertainty},
doi={10.1109/LRA.2025.3540390}
}