Learning to navigate from vision alone
Indoor navigation is a difficult task: GPS access is generally poor indoors, forcing solutions to rely on other sources of information. While significant progress continues to be made in this area, deployment to production applications is still lacking, given the complexity and additional requirements of current solutions. Here, we introduce an efficient, real-time and easily deployable deep learning approach, based on visual input only, that predicts the direction towards a target from images captured by a mobile device. Our technical approach combines a novel graph-based path generation method with explainable data augmentation and curriculum learning, and includes contributions that make data collection, annotation and training as automatic, efficient and robust as possible. On the practical side, we introduce a novel large-scale dataset with video footage recorded inside a relatively large shopping mall, in which each frame is annotated with the correct next direction towards several specific target destinations. Unlike current methods, ours relies solely on vision, avoiding the need for special sensors, additional markers placed along the path, knowledge of the scene map, or internet access.
Below is a systematic overview of our approach:
Automatic computation of instantaneous direction and processing
We extract camera motion from recorded video, using depth models, optical flow models, and intrinsic camera parameters.
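As a rough illustration of this step, the sketch below recovers camera translation and rotation from dense optical flow, per-pixel depth and the camera intrinsics with a single least-squares fit of the standard instantaneous motion-field model (Longuet-Higgins & Prazdny). Function names, variable names and sign conventions are illustrative and not taken from the released code.

import numpy as np

def estimate_camera_motion(flow, depth, K):
    """Least-squares camera motion from dense optical flow and per-pixel depth.

    flow  : (H, W, 2) optical flow in pixels
    depth : (H, W) monocular depth estimates
    K     : (3, 3) camera intrinsics
    Returns (t, omega); omega[1] is the yaw rate (left/right turning).
    """
    H, W = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]

    # Normalised image coordinates and normalised flow.
    px, py = np.meshgrid(np.arange(W), np.arange(H))
    x = (px - cx) / fx
    y = (py - cy) / fy
    u = flow[..., 0] / fx
    v = flow[..., 1] / fy

    x, y, u, v, Z = [a.reshape(-1) for a in (x, y, u, v, depth)]
    keep = Z > 0
    x, y, u, v, Z = x[keep], y[keep], u[keep], v[keep], Z[keep]

    # Instantaneous motion-field model:
    #   u = (-tx + x*tz)/Z + wx*x*y       - wy*(1 + x^2) + wz*y
    #   v = (-ty + y*tz)/Z + wx*(1 + y^2) - wy*x*y       - wz*x
    zeros = np.zeros_like(x)
    A_u = np.stack([-1 / Z, zeros, x / Z, x * y, -(1 + x ** 2), y], axis=1)
    A_v = np.stack([zeros, -1 / Z, y / Z, 1 + y ** 2, -x * y, -x], axis=1)
    A = np.concatenate([A_u, A_v], axis=0)
    b = np.concatenate([u, v], axis=0)

    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return params[:3], params[3:]  # translation, rotation rates per frame

The yaw component of the recovered rotation (turning left or right) is the signal we rely on downstream.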
Model training
Next, we use the extracted camera motion to train a network that associates image frames with directions towards different targets.
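To make this concrete, here is a minimal, hypothetical sketch of such a network and one training step in PyTorch: a small CNN backbone with one direction head per target destination, trained with cross-entropy against the automatically extracted direction labels (discretized into 8 classes, as described below). The layer sizes, the number of targets and the head layout are our own illustrative choices, not the architecture from the released code.

import torch
import torch.nn as nn

NUM_DIRECTIONS = 8   # discrete direction classes (see below)
NUM_TARGETS = 10     # hypothetical number of destinations in the mall

class DirectionNet(nn.Module):
    """Small CNN with one direction head per target destination."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.heads = nn.Linear(128, NUM_TARGETS * NUM_DIRECTIONS)

    def forward(self, x):
        return self.heads(self.backbone(x)).view(-1, NUM_TARGETS, NUM_DIRECTIONS)

model = DirectionNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One training step on a dummy batch: frames, a chosen target destination,
# and the direction label extracted automatically from camera motion.
frames = torch.randn(4, 3, 224, 224)
target_id = torch.randint(0, NUM_TARGETS, (4,))
direction = torch.randint(0, NUM_DIRECTIONS, (4,))

optimizer.zero_grad()
logits = model(frames)[torch.arange(4), target_id]  # (batch, NUM_DIRECTIONS)
loss = criterion(logits, direction)
loss.backward()
optimizer.step()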
Instead of filming every possible path, we recorded each corridor segment and intersection path once, then automatically recombined them using a graph-based algorithm. This created a rich set of synthetic navigation paths, ensuring coverage of all start-to-destination pairs.
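The sketch below illustrates the idea behind this recombination: the mall is represented as a directed graph whose edges carry the recorded corridor or intersection clips, and routes between every start and destination are assembled by concatenating the clips along a graph path. Node names, clip files and the use of shortest paths are illustrative assumptions; the actual path-generation algorithm may enumerate routes differently.

import itertools
import networkx as nx

# Each recorded clip covers one corridor segment or intersection turn,
# stored as a directed edge between two junction nodes. Node names and
# clip files below are made up for illustration.
G = nx.DiGraph()
G.add_edge("entrance", "junction_A", clip="seg_entrance_A.mp4")
G.add_edge("junction_A", "junction_B", clip="seg_A_B.mp4")
G.add_edge("junction_A", "store_X", clip="seg_A_storeX.mp4")
G.add_edge("junction_B", "store_Y", clip="seg_B_storeY.mp4")

def synthetic_paths(graph, starts, destinations):
    """Recombine recorded segments into full start-to-destination routes."""
    for s, d in itertools.product(starts, destinations):
        if s == d or not nx.has_path(graph, s, d):
            continue
        nodes = nx.shortest_path(graph, s, d)
        # Concatenate the clips attached to consecutive edges on the route.
        clips = [graph[u][v]["clip"] for u, v in zip(nodes, nodes[1:])]
        yield s, d, clips

for start, dest, clips in synthetic_paths(G, ["entrance"], ["store_X", "store_Y"]):
    print(start, "->", dest, ":", clips)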
Visual representations and navigation demos
We filmed real video footage inside a large shopping mall using a standard smartphone, held naturally at shoulder height while walking. This gave us realistic, slightly shaky input data that resembles what an end-user would capture. To establish ground truth directions, we estimated the phone’s 3D motion using only vision.
We combined optical flow and monocular depth and solved for camera rotation with a least-squares method. The key parameter we extract is the yaw rotation (turning left or right). Since the raw motion signal is noisy, we smoothed it and converted it into 8 discrete classes, giving us clean labels for training.

We then trained a lightweight convolutional neural network on image sequences, using data augmentation, explainability-guided masking, and curriculum learning to make the model robust in crowded scenes.

The trained model predicts directions in real time from the phone’s camera feed. We implemented it with a view toward a mobile application with a simple compass-like interface. The resulting compass guidance is reliable throughout the mall, even in moderately crowded areas and after scene modifications.
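For illustration, the snippet below shows one plausible way to turn the noisy per-frame yaw estimates into the 8 discrete direction classes: moving-average smoothing, accumulation of the heading change over a short look-ahead window, and uniform binning into 45-degree sectors. The exact window lengths and binning scheme used in our pipeline may differ.

import numpy as np

def direction_labels(yaw_per_frame, smooth_window=5, lookahead=30, num_classes=8):
    """Turn noisy per-frame yaw estimates (degrees) into discrete direction labels."""
    yaw = np.asarray(yaw_per_frame, dtype=float)

    # Moving-average smoothing of the raw, noisy yaw signal.
    kernel = np.ones(smooth_window) / smooth_window
    yaw = np.convolve(yaw, kernel, mode="same")

    # Heading change accumulated over the next `lookahead` frames.
    cum = np.concatenate([[0.0], np.cumsum(yaw)])
    future = np.minimum(np.arange(len(yaw)) + lookahead, len(yaw))
    heading_change = cum[future] - cum[:-1]

    # Uniform binning into sectors of 360/num_classes degrees; class 0 is
    # centred on "straight ahead", classes increase counter-clockwise.
    sector = 360.0 / num_classes
    return np.floor(((heading_change + sector / 2.0) % 360.0) / sector).astype(int)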
Most navigation tools fail the moment you step indoors. GPS signals are weak or unavailable inside shopping malls, airports, or subway stations, leaving users without reliable guidance. Traditional indoor navigation solutions often require extra infrastructure like WiFi beacons, Bluetooth sensors, or custom markers - making them costly, complex, and difficult to deploy at scale.
Our approach avoids all of that. In the demo on the left, you can see reliable, real-time guidance powered solely by the phone’s camera. No GPS, no special sensors, no internet connection - just vision. This makes our solution lightweight, affordable, and instantly deployable in any large indoor environment.
Below, we make our entire Mall Navigation dataset and code publicly available:
Inside Knowledge Pipeline Codebase - link here
Mall Navigation dataset - link here
Contact
For any questions regarding the usage of the code and the dataset, please contact us:
Daniel Airinei - airineidaniel23@gmail.com
Elena Burceanu - eburceanu@bitdefender.com
Marius Leordeanu - leordeanu@gmail.com
Acknowledgements
This work is supported in part by the projects “Romanian Hub for Artificial Intelligence - HRIA”, Smart Growth, Digitization and Financial Instruments Program, 2021-2027 (MySMIS no. 334906), and “European Lighthouse of AI for Sustainability - ELIAS”, Horizon Europe program (Grant No. 101120237).