In the age of ubiquitous smartphone use and widespread image sharing on social platforms, geolocation poses a critical privacy concern. Images often carry sensitive spatial and temporal details—such as street signs, architectural styles, or timestamps—that can inadvertently disclose the precise whereabouts of individuals and organizations. Recent advances in large vision-language models (LVLMs) amplify these risks by enabling any user, regardless of technical expertise, to extract and exploit location cues from seemingly benign photos. Although a number of AI-driven geolocation solutions currently exist, most focus on narrow datasets or specialized application contexts, leaving critical questions unanswered about generalizable performance, security threats, and privacy implications in real-world settings.
In this work, we first evaluate the geolocation capabilities of state-of-the-art LVLMs, benchmarking them against established solutions on a newly compiled, comprehensive dataset of 50,000 images. We then introduce ETHAN, a novel framework that integrates chain-of-thought (CoT) reasoning to mimic expert human geoguessing. ETHAN systematically identifies salient elements (architectural motifs, cultural markers, and environmental cues) to achieve best-in-class accuracy, e.g., 28.7% at the 1 km threshold. It also performs robustly against human players on the *GeoGuessr* platform, achieving an 85.4% win rate. Our findings highlight both the potent privacy threat posed by LVLM-based geolocation and the potential for responsible, more transparent AI-driven frameworks that safeguard sensitive location data.
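Throughout, accuracy at a distance threshold means the fraction of test images whose predicted coordinates fall within that great-circle distance of the ground truth. A minimal sketch of the metric (function names and data layout are ours, not taken from the evaluation code):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def threshold_accuracy(preds, truths, threshold_km=1.0):
    """Fraction of (lat, lon) predictions within threshold_km of ground truth."""
    hits = sum(haversine_km(*p, *t) <= threshold_km for p, t in zip(preds, truths))
    return hits / len(preds)
```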
At the heart of our concerns are two main assets: users' personal location data and the visual content of their photos. Personal location data includes sensitive places such as homes, workplaces, and private gathering venues. Visual content encompasses identifiable elements such as landmarks, street signs, and unique features, any of which can disclose where a photo was taken.
These vulnerabilities attract a range of actors, from malicious individuals to automated systems, who deploy LVLMs to analyze the visual content of photos, infer locations, and use those insights for harmful purposes.
The primary method employed by these threat actors is sophisticated image analysis. LVLMs, trained on diverse image datasets, excel at recognizing and interpreting visual cues such as architectural styles, signage, and natural landscapes to pinpoint locations. Predictions become more accurate still when LVLM inferences are cross-referenced with publicly available geographic data.
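To illustrate how low the technical barrier has become, the sketch below shows the kind of single query a non-expert could issue against a commodity LVLM endpoint. We use the OpenAI Python SDK purely as an example; the model name and prompt wording are illustrative, not drawn from any observed attack:

```python
import base64
from openai import OpenAI  # assumes the `openai` Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def guess_location(image_path: str, model: str = "gpt-4o") -> str:
    """Ask a vision-capable LVLM to infer where a photo was taken."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Identify visual cues (signage, architecture, "
                         "vegetation) and estimate where this photo was taken."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```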
Our new dataset is designed to overcome the biases and data-leakage issues found in prior empirical studies, with an emphasis on fairness and comprehensive geographic representation. Images are collected from diverse global locations and verified against multiple credible sources to ensure accuracy and relevance. Sampling proportional to each country's land area yields balanced coverage across regions, spanning urban and remote areas alike. To maintain dataset integrity, indoor images are filtered out with a novel comparison across different rotational viewpoints of the same scene: viewpoint sets whose cosine similarity exceeds 0.8 are classified as indoor and discarded, since an enclosed scene looks much the same from every heading while outdoor panoramas change substantially as the view rotates. By restricting the dataset to outdoor images and adjusting for geographic distribution, we equip LVLMs with the diverse visual inputs necessary for accurate geolocation across varied environmental settings and cultural contexts.
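A minimal sketch of the indoor filter, assuming CLIP image embeddings and treating the 0.8 threshold as a mean pairwise cosine similarity over the set of rotated views (the choice of encoder and the pairwise aggregation are our assumptions, not specified above):

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Any image encoder works; CLIP is an illustrative choice.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(image: Image.Image) -> np.ndarray:
    """Unit-normalized image embedding, so dot products are cosine similarities."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    v = feats[0].numpy()
    return v / np.linalg.norm(v)

def is_indoor(view_paths: list[str], threshold: float = 0.8) -> bool:
    """Flag a set of rotational viewpoints (>= 2 views) as indoor when the
    views all look alike, i.e. mean pairwise cosine similarity > threshold."""
    vecs = [embed(Image.open(p).convert("RGB")) for p in view_paths]
    sims = [float(vecs[i] @ vecs[j])
            for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return float(np.mean(sims)) > threshold
```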
We introduce ETHAN, a framework that leverages LVLMs for automated geolocation, combining fine-tuning on real-world images with chain-of-thought (CoT) prompting that emulates the problem-solving strategies of expert geoguessers. Unlike traditional methods, ETHAN relies on the LVLM's reasoning capabilities to deduce locations, learning from human experts who analyze environmental cues such as vegetation, architecture, and signage. The CoT prompting guides the LVLM through a structured reasoning process, improving both the accuracy and the interpretability of its predictions. Our approach fine-tunes LVLMs on a dataset curated for geolocation, generating detailed image descriptions as training prompts. ETHAN applies specific geolocation strategies, including analysis of vehicle characteristics, infrastructure elements, natural features, and cultural indicators, and integrates these diverse cues with geographical data to pinpoint locations accurately. This systematic and transparent approach yields high precision in diverse and complex scenarios, making ETHAN a robust solution for automated geolocation tasks.
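The exact prompt is not reproduced here; the template below is our own illustration of the structured, expert-style reasoning that the CoT prompting imposes, covering the four cue families named above:

```python
# Illustrative CoT prompt template; all wording is ours, not ETHAN's actual prompt.
COT_GEOLOCATION_PROMPT = """You are an expert geoguesser. Reason step by step:
1. Vehicles: driving side, license-plate shape and color, vehicle models.
2. Infrastructure: road markings, bollards, utility poles, signage language.
3. Nature: vegetation type, terrain, soil color, sun position.
4. Culture: architectural style, scripts, clothing, business names.
Combine the cues above, name the most likely country and region,
then output your best-guess coordinates as `lat, lon`.
"""
```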