Abstract:
Remote sensing foundation models must learn from instruments that differ in physics, spatial scale, coverage, temporal cadence, and information content. These characteristics challenge assumptions underlying mainstream natural image and language models, demanding new architectural and training strategies. In this talk, I will discuss foundation models designed for the unique opportunities and challenges of Earth and Mars remote sensing data. These models adopt different approaches to multimodal pretraining shaped by their distinct data regimes and downstream application objectives. Through this comparison, I will discuss how data heterogeneity impacts representation learning approaches and suggest new directions for multimodal foundation models that go beyond natural images and language.
Bio:
Hannah Kerner is an Assistant Professor in the School of Computing and Augmented Intelligence at Arizona State University. Her research focuses on advancing the foundations and applications of machine learning to foster a more sustainable, responsible, and fair future for all. Her lab is conducting research projects in machine learning for remote sensing, algorithmic bias, and machine learning theory. She translates research advances to real-world impact through her roles as the AI/Machine Learning Lead for NASA Harvest and NASA Acres, Center Faculty for the ASU Center for Global Discovery and Conservation Science (GDCS), and Research Director for Taylor Geospatial. She has been recognized by multiple prestigious research awards including NSF CAREER (2025), Schmidt Sciences AI2050 Early Career Fellowship (2025), and Forbes 30 Under 30 in Science (2021).
Summary:
Focus: Geospatial foundation models on Earth and Mars
Data required to support decision-making: agriculture, ecosystems, disasters
Common problem format:
Input:
Sparse observational data of surface properties (on-ground measurements, survey data, manual annotations)
Wall-to-wall remote-sensing data
Output: wall-to-wall maps of inferred surface features
Challenge: creating end-to-end pipelines to do this inference for many features of interest to stakeholders
Foundation models make these processing tasks much simpler
Compress multi-modal/multi-sensor observations into a compact latent space
Can create predictive models of surface features given these latent vectors
Very expensive to create, very cheap to use
Need to be as flexible as possible to accommodate diverse use cases
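The "expensive to create, cheap to use" pattern above can be sketched as a frozen encoder plus a small task head. This is a minimal illustration, not any specific model's API: the encoder weights, dimensions, and the ridge-regression head are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen foundation-model encoder: compresses multi-sensor
# observations (here 32 raw channels) into a compact 8-dim latent vector.
W_enc = rng.normal(size=(32, 8))

def encode(x):
    # x: (n_pixels, 32) multi-sensor observations -> (n_pixels, 8) latents
    return np.tanh(x @ W_enc)

# Downstream tasks only fit a cheap head on the frozen latents, e.g. a
# closed-form ridge regression against sparse ground measurements.
x_obs = rng.normal(size=(100, 32))
y = rng.normal(size=(100,))  # sparse surface-property labels
z = encode(x_obs)
head = np.linalg.solve(z.T @ z + 1e-2 * np.eye(8), z.T @ y)

# Applying encoder + head everywhere yields a wall-to-wall map.
pred = encode(x_obs) @ head
print(pred.shape)
```

The same frozen latents can back many such heads, one per surface feature of interest, which is what makes the pipeline cheap to reuse.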
Have developed family of Earth Foundation models
Presto: Pre-trained remote sensing transformer
Up to 5 sensors/data sources: 15 dynamic channels + 5 static variables
Location, elevation, Dynamic World, precipitation, Sentinel-1, Sentinel-2 RGB
Globally diverse data
Flexible to missing data points via random masking
Fairly small model: 0.4M params
Easy to routinely run for individual teams
Fast to fine-tune for individual tasks
Challenge: does not incorporate spatial inputs
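The random-masking idea behind Presto's robustness to missing data can be sketched as follows; the shapes, mask ratio, and fill value are illustrative assumptions, not Presto's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pixel time series: 12 timesteps x 15 dynamic channels.
x = rng.normal(size=(12, 15))

# Random masking during pretraining: hide a fraction of (timestep, channel)
# entries so the model learns to predict from arbitrary partial inputs.
mask_ratio = 0.3
mask = rng.random(x.shape) < mask_ratio
x_masked = np.where(mask, 0.0, x)  # masked entries replaced by a fill value

# At inference, the same mechanism absorbs genuinely missing observations
# (e.g. a cloudy acquisition) by marking them as masked.
print(x_masked.shape, float(mask.mean()))
```

Because the model never sees a fixed input pattern during pretraining, any subset of sensors or timesteps can be dropped at inference without retraining.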
Galileo: Global and Local Flexible Earth Observation models
Natively handles different shapes
Up to 9 sensors/data sources
Flexible input shape
Combination of masked reconstruction and contrastive learning
Good performance across all scales, e.g. regional (coastline, forest) and local (tree, cow)
Global loss with variable exit encoder (model can use latent features from earlier levels of the encoding stack; 2 exit points)
Local loss with shallow encodings
Smaller, computationally efficient models than others with comparable performance
Challenges: unstable training, hard to use
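The combined objective described above (masked reconstruction plus contrastive learning) can be sketched like this. The loss weighting, temperature, and InfoNCE formulation are generic assumptions for illustration, not Galileo's exact losses.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_recon_loss(pred, target, mask):
    # Reconstruction term: only masked positions contribute.
    return float(np.mean((pred[mask] - target[mask]) ** 2))

def info_nce_loss(z_a, z_b, tau=0.1):
    # Generic contrastive term between two views of the same locations;
    # matching pairs sit on the diagonal of the similarity matrix.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy batch: reconstruction targets plus two latent views per location.
target = rng.normal(size=(16, 8))
pred = target + 0.1 * rng.normal(size=(16, 8))
mask = rng.random((16, 8)) < 0.5
z_a = rng.normal(size=(16, 4))
z_b = rng.normal(size=(16, 4))

# Combined objective; the 0.5 weight is an assumed hyperparameter.
loss = masked_recon_loss(pred, target, mask) + 0.5 * info_nce_loss(z_a, z_b)
print(loss)
```

The reconstruction term supplies a local (pixel-level) signal while the contrastive term supplies a global one, matching the local/global loss split in the notes.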
OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation
Modeling innovation
Extensive improvement in testing and analysis
Up to 9 sensors/data sources, including maps and derived products in addition to raw observations
More stable training by replacing a learned projection with a random projection
Contrastive loss focuses on hard negatives to ensure model can’t leak information across modalities
Accessible platform: olmoearth.allenai.org
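The stability trick of replacing a learned projection with a random one can be sketched as below. The dimensions and scaling are assumptions; the key point from the talk is that the projection is fixed, so training targets cannot drift or collapse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen random projection: sampled once at initialization and never
# updated, unlike a learned projection head.
d_in, d_latent = 64, 16
P_random = rng.normal(size=(d_in, d_latent)) / np.sqrt(d_in)

def latent_target(features):
    # Targets come from the fixed projection; no gradients ever flow
    # through P_random, so the target distribution stays stationary
    # throughout training.
    return features @ P_random

feats = rng.normal(size=(8, d_in))
targets = latent_target(feats)
print(targets.shape)
```

With learned projections, the targets and the predictor co-adapt and can collapse; freezing the projection removes that feedback loop at the cost of a less tailored latent space.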
Mars Foundation Models
Far less data: modalities, variable resolution, time period, spatial coverage
Approach: task arithmetic
Independent model for each sensor
Combine the outputs of the models via adding
MOMO: Mars Orbital Model
Novel strategy:
Train sub-models until they have similar loss values
Then take those model checkpoints and combine them into single model
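The checkpoint-combination step above can be sketched as arithmetic on weight dictionaries. This is a generic task-arithmetic-style merge under assumed shapes, not MOMO's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-sensor checkpoints: each sensor gets its own sub-model,
# represented here as a dict of weight arrays with shared shapes.
def make_checkpoint():
    return {"w": rng.normal(size=(4, 4)), "b": rng.normal(size=(4,))}

ckpt_sensor_a = make_checkpoint()
ckpt_sensor_b = make_checkpoint()

# Once the sub-models reach similar loss values, their checkpoints are
# combined element-wise into a single model; here a simple average.
merged = {k: (ckpt_sensor_a[k] + ckpt_sensor_b[k]) / 2
          for k in ckpt_sensor_a}

print(merged["w"].shape)
```

Training each sensor's model to a similar loss first keeps the weights at comparable scales, which is what makes naive element-wise combination viable.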
Observations:
Multi-modal learning is a lot more complex than just adding more modalities
Simple model architectures lead to naive solutions
Pre-training objectives should incorporate data structure and complementarity
Local vs global complementarity
Multi-modal: different views of same location
Multi-modal pretraining strategy depends on the missingness of the data