Abstract:
Remote sensing foundation models must learn from instruments that differ in physics, spatial scale, coverage, temporal cadence, and information content. These characteristics challenge assumptions underlying mainstream natural image and language models, demanding new architectural and training strategies. In this talk, I will discuss foundation models designed for the unique opportunities and challenges of Earth and Mars remote sensing data. These models adopt different approaches to multimodal pretraining shaped by their distinct data regimes and downstream application objectives. Through this comparison, I will discuss how data heterogeneity impacts representation learning approaches and suggest new directions for multimodal foundation models that go beyond natural images and language.
Bio:
Hannah Kerner is an Assistant Professor in the School of Computing and Augmented Intelligence at Arizona State University. Her research focuses on advancing the foundations and applications of machine learning to foster a more sustainable, responsible, and fair future for all. Her lab is conducting research projects in machine learning for remote sensing, algorithmic bias, and machine learning theory. She translates research advances to real-world impact through her roles as the AI/Machine Learning Lead for NASA Harvest and NASA Acres, Center Faculty for the ASU Center for Global Discovery and Conservation Science (GDCS), and Research Director for Taylor Geospatial. She has been recognized by multiple prestigious research awards including NSF CAREER (2025), Schmidt Sciences AI2050 Early Career Fellowship (2025), and Forbes 30 Under 30 in Science (2021).
Summary:
Focus: Geospatial foundation models on Earth and Mars
Data required to support decision-making: agriculture, ecosystems, disasters
Common problem format:
Input:
Sparse observational data of surface properties (on-ground measurements, survey data, manual annotations)
Wall-to-wall remote-sensing data
Output: wall-to-wall maps of inferred surface features
Challenge: creating end-to-end pipelines to do this inference for many features of interest to stakeholders
Foundation models make these processing tasks much simpler
Compress multi-modal/multi-sensor observations into a compact latent space
Can create predictive models of surface features given these latent vectors
Very expensive to create, very cheap to use
Need to be as flexible as possible to accommodate diverse use cases
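The "expensive to create, cheap to use" pattern above can be sketched as a frozen encoder plus a small task head. This is a minimal illustration, not any specific model's API: the encoder weights, dimensions, and the ridge-regression head are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen foundation-model encoder: compresses multi-sensor
# observations (here 32 raw channels) into a compact 8-dim latent vector.
W_enc = rng.normal(size=(32, 8))

def encode(x):
    # x: (n_pixels, 32) multi-sensor observations -> (n_pixels, 8) latents
    return np.tanh(x @ W_enc)

# Downstream tasks only fit a cheap head on the frozen latents, e.g. a
# closed-form ridge regression against sparse ground measurements.
x_obs = rng.normal(size=(100, 32))
y = rng.normal(size=(100,))  # sparse surface-property labels
z = encode(x_obs)
head = np.linalg.solve(z.T @ z + 1e-2 * np.eye(8), z.T @ y)

# Applying encoder + head everywhere yields a wall-to-wall map.
pred = encode(x_obs) @ head
print(pred.shape)
```

The same frozen latents can back many such heads, one per surface feature of interest, which is what makes the pipeline cheap to reuse.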
Have developed family of Earth Foundation models
Presto: Pre-trained remote sensing transformer
Up to 5 sensors/data sources: 15 dynamic channels + 5 static variables
Location, elevation, Dynamic World, precipitation, Sentinel-1, Sentinel-2 RGB
Globally diverse data
Flexible to missing data points via random masking
Fairly small model: 0.4M params
Easy to routinely run for individual teams
Fast to fine-tune for individual tasks
Challenge: does not incorporate spatial inputs
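The random-masking idea behind Presto's robustness to missing data can be sketched as follows; the shapes, mask ratio, and fill value are illustrative assumptions, not Presto's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pixel time series: 12 timesteps x 15 dynamic channels.
x = rng.normal(size=(12, 15))

# Random masking during pretraining: hide a fraction of (timestep, channel)
# entries so the model learns to predict from arbitrary partial inputs.
mask_ratio = 0.3
mask = rng.random(x.shape) < mask_ratio
x_masked = np.where(mask, 0.0, x)  # masked entries replaced by a fill value

# At inference, the same mechanism absorbs genuinely missing observations
# (e.g. a cloudy acquisition) by marking them as masked.
print(x_masked.shape, float(mask.mean()))
```

Because the model never sees a fixed input pattern during pretraining, any subset of sensors or timesteps can be dropped at inference without retraining.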
Galileo: Global and Local Flexible Earth Observation models
Natively handles different shapes
Up to 9 sensors/data sources
Flexible input shape
Combination of masked reconstruction and contrastive learning
Good performance across all scales, e.g. regional (coastline, forest) and local (tree, cow)
Global loss with variable exit encoder (model can use latent features from earlier levels of the encoding stack; 2 exit points)
Local loss with shallow encodings
Smaller, computationally efficient models than others with comparable performance
Challenges: unstable training, hard to use
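The combined objective described above (masked reconstruction plus contrastive learning) can be sketched like this. The loss weighting, temperature, and InfoNCE formulation are generic assumptions for illustration, not Galileo's exact losses.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_recon_loss(pred, target, mask):
    # Reconstruction term: only masked positions contribute.
    return float(np.mean((pred[mask] - target[mask]) ** 2))

def info_nce_loss(z_a, z_b, tau=0.1):
    # Generic contrastive term between two views of the same locations;
    # matching pairs sit on the diagonal of the similarity matrix.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy batch: reconstruction targets plus two latent views per location.
target = rng.normal(size=(16, 8))
pred = target + 0.1 * rng.normal(size=(16, 8))
mask = rng.random((16, 8)) < 0.5
z_a = rng.normal(size=(16, 4))
z_b = rng.normal(size=(16, 4))

# Combined objective; the 0.5 weight is an assumed hyperparameter.
loss = masked_recon_loss(pred, target, mask) + 0.5 * info_nce_loss(z_a, z_b)
print(loss)
```

The reconstruction term supplies a local (pixel-level) signal while the contrastive term supplies a global one, matching the local/global loss split in the notes.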
OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation
Modeling innovation
Extensive improvement in testing and analysis
Up to 9 sensors/data sources, including maps and derived products in addition to raw observations
More stable training by replacing a learned projection with a random projection
Contrastive loss focuses on hard negatives to ensure model can’t leak information across modalities
Accessible platform: olmoearth.allenai.org
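The stability trick of replacing a learned projection with a random one can be sketched as below. The dimensions and scaling are assumptions; the key point from the talk is that the projection is fixed, so training targets cannot drift or collapse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen random projection: sampled once at initialization and never
# updated, unlike a learned projection head.
d_in, d_latent = 64, 16
P_random = rng.normal(size=(d_in, d_latent)) / np.sqrt(d_in)

def latent_target(features):
    # Targets come from the fixed projection; no gradients ever flow
    # through P_random, so the target distribution stays stationary
    # throughout training.
    return features @ P_random

feats = rng.normal(size=(8, d_in))
targets = latent_target(feats)
print(targets.shape)
```

With learned projections, the targets and the predictor co-adapt and can collapse; freezing the projection removes that feedback loop at the cost of a less tailored latent space.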
Mars Foundation Models
Far less data: modalities, variable resolution, time period, spatial coverage
Approach: task arithmetic
Independent model for each sensor
Combine the outputs of the models via adding
MOMO: Mars Orbital Model
Novel strategy:
Train sub-models until they have similar loss values
Then take those model checkpoints and combine them into single model
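The checkpoint-combination step above can be sketched as arithmetic on weight dictionaries. This is a generic task-arithmetic-style merge under assumed shapes, not MOMO's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-sensor checkpoints: each sensor gets its own sub-model,
# represented here as a dict of weight arrays with shared shapes.
def make_checkpoint():
    return {"w": rng.normal(size=(4, 4)), "b": rng.normal(size=(4,))}

ckpt_sensor_a = make_checkpoint()
ckpt_sensor_b = make_checkpoint()

# Once the sub-models reach similar loss values, their checkpoints are
# combined element-wise into a single model; here a simple average.
merged = {k: (ckpt_sensor_a[k] + ckpt_sensor_b[k]) / 2
          for k in ckpt_sensor_a}

print(merged["w"].shape)
```

Training each sensor's model to a similar loss first keeps the weights at comparable scales, which is what makes naive element-wise combination viable.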
Observations:
Multi-modal learning is a lot more complex than just adding more modalities
Simple model architectures lead to naive solutions
Pre-training objectives should incorporate data structure and complementarity
Local vs global complementarity
Multi-modal: different views of same location
Multi-modal pretraining strategy depends on the missingness of the data