General Geospatial Inference with a Population Dynamics Foundation Model

Abstract:
Supporting the health and well-being of dynamic populations around the world requires governmental agencies, organizations and researchers to understand and reason over complex relationships between human behavior and local contexts in order to identify high-risk groups and strategically allocate limited resources. Traditional approaches to these classes of problems often entail developing manually curated, task-specific features and models to represent human behavior and the natural and built environment, which can be challenging to adapt to new, or even related, tasks. To address this, we introduce a Population Dynamics Foundation Model (PDFM) that aims to capture the relationships between diverse data modalities and is applicable to a broad range of geospatial tasks. We first construct a geo-indexed dataset for postal codes and counties across the United States, capturing rich aggregated information on human behavior from maps, busyness, and aggregated search trends, and environmental factors such as weather and air quality. We then model this data and the complex relationships between locations using a graph neural network, producing embeddings that can be adapted to a wide range of downstream tasks using relatively simple models. We evaluate the effectiveness of our approach by benchmarking it on 27 downstream tasks spanning three distinct domains: health indicators, socioeconomic factors, and environmental measurements. The approach achieves state-of-the-art performance on all 27 geospatial interpolation tasks, and on 25 out of the 27 extrapolation and super-resolution tasks. We combined the PDFM with a state-of-the-art forecasting foundation model, TimesFM, to predict unemployment and poverty, achieving performance that surpasses fully supervised forecasting. The full set of embeddings and sample code are publicly available for researchers.

Bio:

Dr. Gautam Prasad is a Software Engineer in Google Research working on geospatial machine learning, including the Population Dynamics Foundation Model, and on other work related to factuality in LLMs. His focus is on addressing health, socioeconomic, environmental, and commercial problems using novel techniques that leverage unique data sources. Previously, he worked on human-related computer vision, including emotion recognition, eye tracking, and gesture recognition. Prior to Google, he studied brain connectivity patterns in health and disease using MRI and machine learning.

Summary:

Google’s work on population modeling
- 2011: research on predicting societal metrics using Google Trends data
  - E.g. Flu Trends, economic metrics
- Challenge: the way people search changes routinely
  - Models need to be retrained
  - E.g. Flu Trends model stopped being useful after a few years since it was not refreshed
- 2023: introduced the symptom search dataset
  - 300 symptoms that affect people globally
  - Important signal for COVID and Flu tracking globally
  - Continually re-trained on current search patterns
Current work: broadening work across human behavior domains
- WHO people are: demographics, health, wellbeing
- WHAT they do: economic, social, consumption
- WHY they do it: beliefs, values
- WHERE people are: distribution, migration, forced displacement
- HOW: environmental interactions, power dynamics
- PDFM: Population Dynamics Foundation Model
  - https://github.com/google-research/population-dynamics
  - https://research.google/blog/insights-into-population-dynamics-a-foundation-model-for-geospatial-inference/
Example: Diabetes prevalence super-resolution
- Given
  - county-level diabetes prevalence
  - spatially fine-grained embeddings of population (zip-code)
- Train a model that maps embeddings to county-level diabetes prevalence
- Apply it to the zip-code embeddings to infer prevalence at finer resolution, constrained so the zip-code values aggregate correctly to the county value
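The super-resolution recipe above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the PDFM authors' implementation: the ridge regressor, the population-weighted aggregation, the additive correction, and all names (`super_resolve`, `zip_emb`, `zip_pop`, and so on) are hypothetical, and the data is synthetic.

```python
import numpy as np

def weighted_group_mean(values, weights, groups, n_groups):
    """Population-weighted mean of `values` within each group (county)."""
    totals = np.bincount(groups, weights=weights, minlength=n_groups)
    if values.ndim == 1:
        sums = np.bincount(groups, weights=values * weights, minlength=n_groups)
        return sums / totals
    sums = np.zeros((n_groups, values.shape[1]))
    np.add.at(sums, groups, values * weights[:, None])
    return sums / totals[:, None]

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w

def super_resolve(zip_emb, zip_pop, zip_county, county_rate, alpha=1.0):
    """Assumed recipe: fit at county level, predict at zip level, then correct."""
    n_counties = county_rate.shape[0]
    # 1. Aggregate zip-code embeddings up to counties (assumption: population weights).
    county_emb = weighted_group_mean(zip_emb, zip_pop, zip_county, n_counties)
    # 2. Train on the coarse labels that are actually available.
    w, b = fit_ridge(county_emb, county_rate, alpha)
    # 3. Apply the same model to the fine-grained embeddings.
    zip_pred = zip_emb @ w + b
    # 4. Shift each county's zip predictions so their weighted mean matches the county label.
    county_pred = weighted_group_mean(zip_pred, zip_pop, zip_county, n_counties)
    return zip_pred + (county_rate - county_pred)[zip_county]

# Synthetic toy data: 20 counties, 5 zip codes each, 8-dimensional embeddings.
rng = np.random.default_rng(0)
n_counties, zips_per_county, dim = 20, 5, 8
zip_county = np.repeat(np.arange(n_counties), zips_per_county)
zip_emb = rng.normal(size=(n_counties * zips_per_county, dim))
zip_pop = rng.uniform(500, 5000, size=zip_county.shape[0])
county_rate = rng.uniform(0.05, 0.15, size=n_counties)

zip_rate = super_resolve(zip_emb, zip_pop, zip_county, county_rate)
```

The additive shift in step 4 is one simple way to enforce consistency with the county totals; a multiplicative rescaling would be an alternative when predictions must stay positive.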
PDFM structure
- Relevant population facts
- Train a graph neural network
- Produces a 300-dimensional feature vector
- Used for: interpolation, extrapolation, super-resolution, nowcasting, forecasting
Datasets:
- Aggregated search trends:
  - Top 1,000 US national search trends in July 2022
  - Balanced to ensure these are searched across many zip codes
  - Ignore query text, focus on histogram of counts
  - Observation: the most popular queries already capture most of the dynamics of more niche topics, such as health symptoms
- Aggregated maps places
  - Top 1,192 point-of-interest categories from Google Maps in each location in 2024
  - Represented in >= 5% of zip codes
  - Aggregated place busyness: 683 metrics
- Weather & Air Quality: 45 statistics in July 2022
Trained an autoencoder-style graph neural network
- Nodes: spatial regions
- Edges: distance, correlation data
- Loss function: reconstruct the original data from the node's state and the states of its neighboring nodes in the graph
  - Embedded vector: 313 dims
  - Split into separate sub-losses: Search Trends, Maps & Busyness, Weather/AQI
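To make the training objective above concrete, here is a minimal forward-pass sketch of a graph autoencoder with one reconstruction loss per modality. It is an assumption-laden illustration, not the PDFM architecture: the single mean-aggregation layer, the `tanh` nonlinearity, the tiny feature and embedding sizes, and every name in the code are hypothetical, and no training loop is shown.

```python
import numpy as np

def normalize_adjacency(adj):
    """Add self-loops and row-normalize, so each node averages itself and its neighbors."""
    a = adj + np.eye(adj.shape[0])
    return a / a.sum(axis=1, keepdims=True)

def encode(x, a_norm, w_enc):
    """One message-passing step: aggregate neighbor features, then project to an embedding."""
    return np.tanh(a_norm @ x @ w_enc)

def decode(z, a_norm, w_dec):
    """Reconstruct features from a node's embedding and its neighbors' embeddings."""
    return a_norm @ z @ w_dec

def modality_losses(x, x_hat, slices):
    """Mean squared reconstruction error, computed separately per modality."""
    return {name: float(np.mean((x[:, s] - x_hat[:, s]) ** 2))
            for name, s in slices.items()}

rng = np.random.default_rng(0)
n_nodes, n_features, emb_dim = 5, 12, 6  # toy sizes, not the real ones

# Hypothetical column layout of the input features.
slices = {
    "search_trends": slice(0, 5),
    "maps_busyness": slice(5, 9),
    "weather_aqi": slice(9, 12),
}

x = rng.normal(size=(n_nodes, n_features))
# Toy ring graph over five regions; real edges would come from distance or correlation.
adj = np.array([
    [0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0],
], dtype=float)

a_norm = normalize_adjacency(adj)
w_enc = rng.normal(scale=0.1, size=(n_features, emb_dim))
w_dec = rng.normal(scale=0.1, size=(emb_dim, n_features))

z = encode(x, a_norm, w_enc)          # one embedding row per region
x_hat = decode(z, a_norm, w_dec)      # reconstruction of the input features
losses = modality_losses(x, x_hat, slices)
total_loss = sum(losses.values())     # the quantity a training loop would minimize
```

Keeping the sub-losses separate makes it possible to monitor, or reweight, how well each data source is being reconstructed.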
Forecasting: TimesFM
- Transformer-based model trained on many time series
- Can effectively predict the future trends of univariate time series
  - Doesn’t incorporate geospatial reasoning
- PDFM+TimesFM:
  - Learned an adapter model on top of TimesFM
  - Take TimesFM prediction for a given zip code
  - Then learn a model that adjusts the prediction
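The adapter idea above can be sketched as a residual correction. The notes do not say what form the adapter takes, so the ridge regression here is an assumption, as are all names. No TimesFM API is called: `base_forecast` is a synthetic stand-in for whatever the forecasting model returns per zip code.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w

def fit_adapter(embeddings, base_forecast, truth, alpha=1.0):
    """Learn to predict the base model's error from location embeddings."""
    return fit_ridge(embeddings, truth - base_forecast, alpha)

def apply_adapter(adapter, embeddings, base_forecast):
    """Adjust the base forecast with the predicted, location-specific residual."""
    w, b = adapter
    return base_forecast + embeddings @ w + b

# Synthetic setup: the base forecast misses a component that depends on location.
rng = np.random.default_rng(0)
n_zips, dim = 200, 8
emb = rng.normal(size=(n_zips, dim))          # stand-in for PDFM embeddings
base_forecast = rng.normal(size=n_zips)       # stand-in for per-zip forecasts
true_w = rng.normal(size=dim)
truth = base_forecast + emb @ true_w + rng.normal(scale=0.01, size=n_zips)

train, test = slice(0, 150), slice(150, None)
adapter = fit_adapter(emb[train], base_forecast[train], truth[train])
adjusted = apply_adapter(adapter, emb[test], base_forecast[test])

base_mae = float(np.mean(np.abs(truth[test] - base_forecast[test])))
adjusted_mae = float(np.mean(np.abs(truth[test] - adjusted)))
```

In this toy setup the residual is linear in the embedding by construction, so the adapter recovers it almost exactly; real gains depend on how much of the forecasting error is actually explained by location.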
Evaluation
- Benchmarks: health, socioeconomic, environment
- Earth Engine geospatial data: nighttime lights, tree coverage
- Data Commons: aggregates census statistics across the world
- Comparison:
  - Inverse distance weighting: estimate the value at a point as a weighted average of nearby points, with weights that decrease with distance
  - SatCLIP: Neural embeddings of satellite data
  - GeoCLIP: Neural embeddings of geo-tagged personal photography
- PDFM-based prediction is best for social metrics and at the state of the art for environmental metrics
- Augmenting PDFM with SatCLIP generally improves performance but in some cases extrapolation performance drops when using both
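For reference, the inverse distance weighting baseline listed in the comparison can be written in a few lines. This is a generic sketch of the technique, not the exact baseline configuration used in the evaluation: the function name, the `power` and `eps` parameters, and the use of planar Euclidean distance (rather than geodesic distance on latitude and longitude) are assumptions.

```python
import numpy as np

def idw_interpolate(known_xy, known_values, query_xy, power=2.0, eps=1e-12):
    """Estimate values at query points as distance-weighted averages of known points."""
    # Pairwise distances, shape (n_query, n_known).
    dists = np.linalg.norm(query_xy[:, None, :] - known_xy[None, :, :], axis=2)
    # Clamp tiny distances so a query that coincides with a known point
    # gets an overwhelmingly large (but finite) weight.
    weights = 1.0 / np.maximum(dists, eps) ** power
    return (weights @ known_values) / weights.sum(axis=1)

known_xy = np.array([[0.0, 0.0], [2.0, 0.0]])
known_values = np.array([0.0, 10.0])
query_xy = np.array([[1.0, 0.0], [0.0, 0.0]])

estimates = idw_interpolate(known_xy, known_values, query_xy)
```

The first query point is equidistant from both known points, so its estimate is their plain average; the second coincides with a known point and effectively returns that point's value.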
Applications:
- Sust Global: Populous: Trying to predict insurance premiums using AI
- CARTO: Cloud-Based Location Intelligence Platform
- GroupM: Model to help understand media performance insights
- Cooper/Smith: Disease tracking in low-resource environments
- UN AI4Good: Housing Prices + Night Time Lights Tutorial
- Geospatial reasoning