My research New Year resolution is to read more papers. To hold myself to it, I decided to post a few lines here, covering (at least) one paper per week.
I will look for interesting (and possibly not so widespread) papers tackling AI4EO challenges, and I will highlight what I appreciate the most.
My target is to cover 50 papers this year!
I am not an expert on Graph Neural Networks (GNNs), but I found this paper very interesting, especially as an entry point.
The paper explores the application of GNNs to Earth Observation (EO) data, emphasizing their suitability for irregular, heterogeneous, and multi-source data (e.g. point clouds).
It provides a comprehensive review of GNNs across various EO applications, such as weather forecasting, disaster management, and environmental monitoring, while highlighting the methodological innovations and challenges unique to the EO domain.
⬆️: modeling non-Euclidean spatial structures
⬇️: domain-specific graph construction strategies?
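To get a feel for why graphs fit irregular EO data, here is a minimal sketch (not from the paper) of one mean-aggregation message-passing layer over a k-NN graph built from scattered observation points; the layer, the toy features, and k are all my own choices.

```python
import torch
import torch.nn as nn

class MeanGNNLayer(nn.Module):
    """One message-passing step: each node averages its neighbours' features
    and mixes them with its own through a small MLP."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())

    def forward(self, x, adj):
        # x: (N, in_dim) node features; adj: (N, N) 0/1 adjacency (no self-loops)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neigh = adj @ x / deg                              # mean over neighbours
        return self.mlp(torch.cat([x, neigh], dim=-1))

# Toy irregular "EO" graph: 200 scattered stations with 8 features each,
# connected to their k nearest neighbours in space.
coords = torch.rand(200, 2)                                # lon/lat-like positions
feats = torch.randn(200, 8)                                # e.g. per-station measurements
k = 5
dists = torch.cdist(coords, coords)
knn = dists.topk(k + 1, largest=False).indices[:, 1:]      # skip self
adj = torch.zeros(200, 200)
adj.scatter_(1, knn, 1.0)
adj = torch.maximum(adj, adj.T)                            # symmetrize

layer = MeanGNNLayer(8, 16)
out = layer(feats, adj)                                    # (200, 16) node embeddings
```

The point is simply that nothing here assumes a regular pixel grid: the graph carries the spatial structure.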
Are geospatial foundation models really impactful?
Check it in our new pre-print!
Welcome to PANGAEA: a global and inclusive benchmark for geospatial foundation models
https://arxiv.org/abs/2412.04204
Also check the public GitHub repo (more news/updates soon):
https://github.com/VMarsocci/pangaea-bench/
We collected 11 datasets to create an inclusive, diverse benchmark, based on these criteria:
application domain
geographical distribution
type of task
modality
temporality
Spoiler: no patch-level classification tasks are included!
With this benchmark (PANGAEA), we tried to address the following research challenges:
provide a robust evaluation protocol to benchmark GFMs (a generic sketch of such a protocol is at the end of this post)
investigate GFM capabilities, with a focus on
a) domain generalization,
b) comparison to supervised baselines,
c) performance with limited labels
We observed interesting insights, such as:
generally speaking, GFMs don't really excel when compared to supervised baselines
for some specific scenarios (e.g. HR data), it makes sense to use GFMs
the value of multi-temporal data is still underestimated
Check many other insights in the paper!
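On the evaluation-protocol point, a generic frozen-encoder probing loop might look like the sketch below. This is my own simplification, not the code from pangaea-bench: `encoder`, the dataloaders, and `feat_dim` are placeholders, and `encoder(images)` is assumed to return dense features of shape (B, feat_dim, h, w).

```python
import torch
import torch.nn as nn

def probe_gfm(encoder, train_loader, val_loader, n_classes, feat_dim, epochs=10, device="cuda"):
    """Freeze a pretrained GFM encoder and train only a light segmentation head."""
    encoder.eval().to(device)
    for p in encoder.parameters():
        p.requires_grad_(False)                            # frozen backbone

    head = nn.Conv2d(feat_dim, n_classes, kernel_size=1).to(device)   # per-pixel linear probe
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
    ce = nn.CrossEntropyLoss(ignore_index=-1)

    for _ in range(epochs):
        for images, masks in train_loader:
            images, masks = images.to(device), masks.to(device)
            with torch.no_grad():
                feats = encoder(images)                    # (B, feat_dim, h, w)
            logits = nn.functional.interpolate(head(feats), size=masks.shape[-2:], mode="bilinear")
            loss = ce(logits, masks)
            opt.zero_grad(); loss.backward(); opt.step()

    # ... then evaluate e.g. mIoU on val_loader with the frozen encoder + trained head ...
    return head
```

Keeping the backbone frozen is what makes the comparison about the pretrained representations rather than about fine-tuning budgets.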
This paper investigates the performance trade-offs between global and local machine learning models for geospatial tasks, using tree canopy height (TCH) mapping in the Karingani Game Reserve, Mozambique, as a case study. The findings reveal that models trained exclusively on local data outperform global models and even globally pre-trained models fine-tuned with local data.
⬆️: interesting set of research questions
⬇️: what about "generalist" geospatial foundation models?
Paper: https://lnkd.in/d73db-JB
This paper investigates the effectiveness of specialized foundation models (FMs) in genomics, satellite imaging, and time series domains, showing that traditional supervised learning pipelines—when well-tuned—often outperform or match the performance of these FMs despite their reliance on massive pretraining datasets and extensive computational resources.
In the Figure, you can see the focus on satellite experiments.
⬆️: very relevant work
⬇️: just classification, which limits real-world applicability
The paper presents ALISE (ALigned SITS Encoder), a novel model for processing irregular and unaligned Satellite Image Time Series (SITS).
It produces aligned, fixed-size representations while preserving spatial resolution, enabling multi-task applications such as land cover segmentation, crop monitoring, and change detection with minimal labeled data.
⬆️: great handling of sparse data and labels
⬇️: would be great to extend the domains (geographical and sensor-related)
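I am not reproducing ALISE here; the sketch below only illustrates the general idea of mapping an irregular, variable-length SITS to a fixed number of "aligned" tokens via learned queries and cross-attention. The day-of-year encoding, the sizes, and the padding handling are my own choices, not the paper's.

```python
import torch
import torch.nn as nn

class FixedSizeSITSAggregator(nn.Module):
    """Cross-attention from a fixed set of learned queries to a variable-length,
    irregularly sampled time series (here one feature vector per acquisition date)."""
    def __init__(self, dim=128, n_queries=10, n_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.day_proj = nn.Linear(1, dim)                  # crude encoding of acquisition day

    def forward(self, seq, days, pad_mask):
        # seq: (B, T, dim) per-date features; days: (B, T) day of year; pad_mask: (B, T) True where padded
        kv = seq + self.day_proj(days.unsqueeze(-1) / 365.0)
        q = self.queries.unsqueeze(0).expand(seq.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv, key_padding_mask=pad_mask)
        return out                                          # (B, n_queries, dim): fixed-size tokens

agg = FixedSizeSITSAggregator()
seq = torch.randn(2, 17, 128)                               # two series with up to 17 acquisitions
days = torch.randint(0, 365, (2, 17)).float()
pad = torch.zeros(2, 17, dtype=torch.bool); pad[1, 12:] = True   # second series has only 12 dates
tokens = agg(seq, days, pad)                                # (2, 10, 128)
```

Whatever the number and timing of acquisitions, every series ends up as the same number of tokens, which is what makes multi-task heads easy to attach.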
TaxaBind is a multimodal framework that creates a unified embedding space across six ecological data modalities—ground-level images, geographic location, satellite images, text, audio, and environmental features—allowing for improved performance in ecological tasks like species classification, distribution mapping, and cross-modal retrieval.
⬆️: robust zero-shot classification
⬇️: what if we want to add/test new tasks?
Prithvi WxC is a 2.3 billion-parameter AI foundation model designed for weather and climate forecasting, integrating 160 variables from MERRA-2 data to handle diverse weather tasks like forecasting, downscaling, and extreme event estimation.
Using a transformer-based encoder-decoder architecture, it bridges the gap between AI foundation models and traditional weather models, with its pretrained version and fine-tuning workflows available open-source on Hugging Face.
⬆️: interesting benchmark tasks!
⬇️: too few comparisons
Paper: https://arxiv.org/abs/2409.13598
Satellite Metadata-Image Pretraining (SatMIP) is a pretraining approach using metadata in a multimodal learning objective.
SatMIP represents metadata as textual captions and aligns images with metadata in a shared embedding space by solving a metadata-image contrastive task.
The authors also propose SatMIPS, which combines image self-supervision and metadata supervision.
⬆️: great use of metadata
⬇️: no dense tasks in the benchmark
Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/03849.pdf
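A minimal sketch of a symmetric image-metadata contrastive (InfoNCE-style) objective, in the spirit of what the paper describes; the caption template, the encoders, and the temperature are placeholders of mine, not SatMIP's.

```python
import torch
import torch.nn.functional as F

def metadata_caption(meta):
    # Turn a metadata record into a textual caption (template is illustrative only).
    return f"acquired on {meta['date']} at lat {meta['lat']:.2f}, lon {meta['lon']:.2f}, gsd {meta['gsd']}m"

def image_metadata_contrastive_loss(img_emb, meta_emb, temperature=0.07):
    """Symmetric InfoNCE: matching (image, metadata-caption) pairs are positives,
    all other pairs in the batch are negatives."""
    img_emb = F.normalize(img_emb, dim=-1)
    meta_emb = F.normalize(meta_emb, dim=-1)
    logits = img_emb @ meta_emb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# img_emb / meta_emb would come from an image encoder and a text encoder over the captions.
loss = image_metadata_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```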
UrBench is a comprehensive benchmark designed for evaluating large multimodal models (LMMs) in complex multi-view urban scenarios.
UrBench contains 11.6K meticulously curated questions at both region-level and role-level that cover 4 task dimensions: Geo-Localization, Scene Reasoning, Scene Understanding, and Object Understanding.
Evaluations on 21 LMMs show that current LMMs struggle in urban environments in several respects.
⬆️: great research question
⬇️: I would have pushed more on domain shift
Project page: https://opendatalab.github.io/UrBench/
This paper investigates the ability of geoFMs to transfer to new geographic regions in the agricultural domain, where differences in farming practices and class imbalance make transfer learning particularly challenging.
⬆️: the pivotal topic (especially for real-world applications)
⬇️: the limited number of geoFMs considered
Locate Anything on Earth proposes an LAE-Label Engine to annotate a large-scale dataset for remote sensing object detection.
It also proposes an open-vocabulary foundation object detector, based on DINO: LAE-DINO.
⬆️: experiments are very convincing in many aspects
⬇️: the word cloud figures are totally outdated
After a summer break, here we are again surveying papers.
Today’s paper provides a comprehensive survey of foundation models (FMs) in remote sensing (RS), covering FMs released between June 2021 and June 2024.
The authors categorize these models based on their applications and domain-specific tasks.
⬆️: good entry point
⬇️: the results are just reported from other papers, limiting a fair comparison among different FMs
SpectralGPT (#TPAMI) is an MAE-based foundation model, capable of accommodating input images with varying sizes, resolutions, time series, and regions in a progressive training fashion, leveraging 3D token generation for spatial-spectral coupling.
⬆️: great flexibility in the input
⬇️: it falls just short of multimodality (e.g. no SAR)
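To visualize the "3D token generation for spatial-spectral coupling", here is a toy Conv3d patch embedding over a spectral cube; the sizes and the single-channel trick are my own simplification, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SpectralSpatialPatchEmbed(nn.Module):
    """Tokenize a multispectral cube (B, 1, C_bands, H, W) into 3D patches that
    span a few bands and a small spatial window at once."""
    def __init__(self, embed_dim=768, band_patch=3, spatial_patch=8):
        super().__init__()
        self.proj = nn.Conv3d(1, embed_dim,
                              kernel_size=(band_patch, spatial_patch, spatial_patch),
                              stride=(band_patch, spatial_patch, spatial_patch))

    def forward(self, cube):
        x = self.proj(cube)                                 # (B, embed_dim, C', H', W')
        return x.flatten(2).transpose(1, 2)                 # (B, num_3d_tokens, embed_dim)

tokens = SpectralSpatialPatchEmbed()(torch.randn(2, 1, 12, 96, 96))   # e.g. 12 S2 bands
print(tokens.shape)                                          # (2, 576, 768)
```

Because each token mixes neighbouring bands with a spatial window, the masking-and-reconstruction game has to model spectral structure, not just spatial texture.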
RemoteCLIP (#TGRS) is one of the first vision-language foundation models for remote sensing; it aligns text and image embeddings to learn robust visual features with rich semantics. It is tested on a wide variety of tasks.
⬆️: using vision-language
⬇️: not extended to multispectral
GFM (#ICCV ’23) proposes a multi-objective continual pretraining paradigm, which leverages the strong representations of ImageNet while simultaneously providing the freedom to learn remote sensing in-domain features.
⬆️: continual pre-training and carbon footprint assessment
⬇️: limited in the modalities (just RGB)
Prithvi is the geospatial foundation model developed by NASA. Trained on HLS time-series, it employs an MAE with 3D positional encoding, to effectively consider multi-temporality, a pivotal characteristic for remote sensing.
⬆️: addressing multi-temporality; the downstream tasks
⬇️: limited in the geographical extent
Satlas (#ICCV ‘23) proposes both a dataset (SatlasPretrain) and a model (SatlasNet). SatlasNet is one of the very few general-purpose models, with a supervised setting. It is a Swin-based model with multi-head for different tasks.
⬆️: multi-task model
⬇️: supervised setting
Scale-MAE (#ICCV ‘23) is an MAE invariant to different sensors’ resolutions.
It proposes a positional encoding scaled on the input resolution and a bandpass filter decoder to reconstruct low/high-frequency images at lower/higher scales.
⬆️: addressing multi-resolution
⬇️: working only with RGB
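A tiny NumPy sketch of a GSD-aware sin-cos positional encoding in the spirit of the paper: positions are rescaled by the image's ground sample distance relative to a reference, so the encoding reflects physical extent rather than pixel index. This is my simplification (1D, arbitrary reference), not the paper's exact formulation.

```python
import numpy as np

def gsd_sincos_posenc(n_positions, dim, gsd, ref_gsd=1.0):
    """1D sin-cos positional encoding with positions scaled by GSD / reference GSD."""
    pos = np.arange(n_positions, dtype=np.float32) * (gsd / ref_gsd)  # metre-like coordinate
    i = np.arange(dim // 2, dtype=np.float32)
    freqs = 1.0 / (10000 ** (2 * i / dim))
    angles = pos[:, None] * freqs[None, :]                            # (n_positions, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)   # (n_positions, dim)

# The same 16 patch positions get different encodings at 0.3 m and 10 m GSD,
# so two crops covering the same ground extent look alike to the model.
pe_hr = gsd_sincos_posenc(16, 64, gsd=0.3)
pe_lr = gsd_sincos_posenc(16, 64, gsd=10.0)
```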
CROMA (#NeurIPS ‘23) was one of the first cross-modal geospatial self-supervised models. By aligning different modalities through contrastive learning, CROMA reaches great results on classification and segmentation.
⬆️: pioneer work (especially for multimodality)
⬇️: working only with spatio-temporally aligned data
SSL4EO-S12 (#GRSM ‘23) is a great benchmark. Four different computer vision models (MoCo, MAE, DINO and data2vec) are trained on a vast S1 and S2 dataset.
⬆️: many interesting experiments for a great benchmark
⬇️: no custom model
Geospatial Foundation Models (GFMs) are commonly trained on large optical RGB or multi-spectral datasets, although data from various heterogeneous sensors are available in the remote sensing domain.
This leads to a great shift between pre-training and downstream data distributions.
Also, fine-tuning GFMs, to bridge this gap, is computation-intensive and can be ineffective when target datasets are small.
In this paper (at #CVPR2024), the authors present Scaled Low Rank (SLR), a self-supervised adaptation method that boosts downstream linear evaluation accuracy of different GFMs, outperforming full fine-tuning while training only 1-2% of the model parameters.
Note: the paper focuses only on patch-level classification
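The exact SLR formulation is in the paper; below is just a generic scaled low-rank adapter around a frozen linear layer, to show how "1-2% of the parameters" can be enough. The class name, rank, and scale are my own illustrative choices.

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update: y = Wx + s * B(Ax)."""
    def __init__(self, frozen_linear, rank=8, scale=1.0):
        super().__init__()
        self.base = frozen_linear
        for p in self.base.parameters():
            p.requires_grad_(False)                         # keep the pretrained weights frozen
        in_f, out_f = frozen_linear.in_features, frozen_linear.out_features
        self.down = nn.Linear(in_f, rank, bias=False)       # A
        self.up = nn.Linear(rank, out_f, bias=False)        # B
        nn.init.zeros_(self.up.weight)                      # start as a zero (identity-preserving) update
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

layer = LowRankAdapter(nn.Linear(768, 768), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.2%}")       # ~2% for rank 8 on a 768x768 layer
```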
Large pre-trained models should learn “sensor agnostic” representations that generalize across sensor characteristics with minimal fine-tuning. This is complicated by data availability: low-resolution imagery (e.g. Sentinel-2 and Landsat-8) is available in large amounts, while very high-resolution data is less common.
We introduce cross-sensor self-supervised training and alignment for remote sensing (X-STARS). We design a self-supervised training loss, the Multi-Sensor Alignment Dense loss (MSAD), to align representations across sensors, even with vastly different resolutions.
Our X-STARS can be applied to train models from scratch, or to adapt large models pretrained on e.g. low-resolution EO data to new high-resolution sensors, in a continual pretraining framework.
We collect and release MSC-France, a new multi-sensor dataset, on which we train our X-STARS models, then evaluated on different downstream classification and segmentation tasks. We demonstrate that X-STARS outperforms the state-of-the-art with less data across various conditions of data availability and resolutions.
🔼 not another MAE
🔼 interesting performance with small data and continual pre-training
🔽 more modalities to be included
Paper: https://arxiv.org/abs/2405.09922
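A very rough sketch of the kind of dense cross-sensor alignment MSAD stands for: patch features from two sensors of the same scenes are brought to a common grid and pulled together location by location. This is a generic cosine version of the idea, not the exact loss from the paper.

```python
import torch
import torch.nn.functional as F

def dense_alignment_loss(feat_hr, feat_lr):
    """feat_hr: (B, D, Hh, Wh) features from the high-res branch,
    feat_lr: (B, D, Hl, Wl) features from the low-res branch of the same scenes.
    Resample to a common grid and maximize per-location cosine similarity."""
    feat_hr = F.interpolate(feat_hr, size=feat_lr.shape[-2:], mode="bilinear", align_corners=False)
    feat_hr = F.normalize(feat_hr, dim=1)
    feat_lr = F.normalize(feat_lr, dim=1)
    cos = (feat_hr * feat_lr).sum(dim=1)                    # (B, Hl, Wl) cosine per location
    return (1.0 - cos).mean()

loss = dense_alignment_loss(torch.randn(2, 256, 32, 32), torch.randn(2, 256, 8, 8))
```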
Climate change is strongly impacting marine environments.
Accurate, detailed, and regularly updated bathymetry and complex semantic content are crucial for the undermapped shallow water areas facing intense climatological and anthropogenic pressures.
MagicBathyNet is a benchmark dataset comprising image patches of Sentinel-2, SPOT-6, and aerial imagery, bathymetry in raster format, and annotations of seabed classes.
MagicBathyNet is then exploited to benchmark state-of-the-art methods in learning-based bathymetry and pixel-based classification.
Earth Observation images present different challenges w.r.t. natural images. One is the varying information carried by different bands (i.e. channels). Adapting common computer vision models to these new challenges is pivotal.
ChannelViT (#ICLR2024) constructs patch tokens independently from each input channel. It employs a set of learnable channel embeddings to encode channel-specific information, enabling ChannelViT to perform cross-channel and cross-position reasoning.
The authors also introduce Hierarchical Channel Sampling (HCS), which employs a two-step sampling procedure to simulate test-time channel unavailability during training. Unlike channel dropout, HCS covers channel combinations with varying numbers of channels uniformly, boosting robustness.
a note: cool paper, but poor evaluation on remotely sensed images
paper: https://arxiv.org/abs/2309.16108
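A compact sketch of the two ideas as I read them: per-channel patch tokens tagged with a learnable channel embedding, and a two-step hierarchical channel sampling. The sizes, the shared projection, and the sampling range are my own choices, not the paper's code.

```python
import torch
import torch.nn as nn

class ChannelWisePatchEmbed(nn.Module):
    """Each input channel is patchified independently with a shared projection,
    then tagged with a learnable per-channel embedding."""
    def __init__(self, n_channels, patch=16, dim=384):
        super().__init__()
        self.proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)   # shared across channels
        self.channel_emb = nn.Parameter(torch.zeros(n_channels, dim))

    def forward(self, x, channel_ids):
        # x: (B, C_kept, H, W); channel_ids: indices of the channels actually present
        B, C, H, W = x.shape
        tok = self.proj(x.reshape(B * C, 1, H, W))           # (B*C, dim, h, w)
        tok = tok.flatten(2).transpose(1, 2)                  # (B*C, num_patches, dim)
        tok = tok.reshape(B, C, tok.shape[1], -1)             # (B, C, num_patches, dim)
        tok = tok + self.channel_emb[channel_ids][None, :, None, :]
        return tok.reshape(B, -1, tok.shape[-1])              # (B, C*num_patches, dim)

def hierarchical_channel_sampling(n_channels):
    """Two-step HCS-style sampling: first draw how many channels to keep
    (uniform in 1..C), then draw which ones."""
    k = torch.randint(1, n_channels + 1, (1,)).item()
    return torch.randperm(n_channels)[:k].sort().values

ids = hierarchical_channel_sampling(12)                       # e.g. Sentinel-2 bands
x = torch.randn(2, len(ids), 64, 64)
tokens = ChannelWisePatchEmbed(12)(x, ids)
```

The nice property: at test time, missing bands simply mean fewer tokens, not a shape mismatch.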
One of the main characteristics of foundation models should be multimodality.
This paper directly addresses this point, proposing a novel Self-Supervised Modality Fusion model: OmniSat.
OmniSat aligns multiple EO modalities to learn expressive multimodal representations without labels, combining a contrastive and a reconstruction loss.
what I liked most: improved performance in downstream tasks even when just one modality is available
My PhD supervisor published a great short book on neural networks, spanning from convolutions to transformers, SSMs, etc.
It is a great entry point for everyone who wants to approach deep learning and a great tool for those who want to refresh the theory.
It could be of great use to the GeoAI community!
What I liked most: the fluency, the clarity, and the promise of new topics
Diffusion models are one of the trendiest topics in deep learning. They can generate high-quality images under different conditions.
However, diffusion models can also be useful to detect out-of-distribution Earth Observation images!
In this paper, the authors show that the reconstruction error of diffusion models can effectively serve as unsupervised out-of-distribution detectors for remote sensing images, using it as a plausibility score.
Moreover, they introduce ODEED, a novel reconstruction-based scorer using the probability-flow ODE of diffusion models.
They validate their findings under different scenarios (e.g. pre/post flood and non-flooded/flooded images).
what I liked most: the innovative use of diffusion models and the downstream task
a note: curious to see how it adapts to other interesting tasks
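In spirit, the reconstruction-based score is simple; here is a hedged sketch where `noise_image` and `denoise` are placeholders for a trained diffusion model's own forward/reverse routines (the paper's ODEED scorer, based on the probability-flow ODE, is more refined than this).

```python
import torch

@torch.no_grad()
def reconstruction_ood_score(x, noise_image, denoise, t=250):
    """Generic reconstruction-error OOD score for a trained diffusion model.
    `noise_image(x, t)` adds forward-process noise up to step t;
    `denoise(x_t, t)` runs the learned reverse process back to an image.
    Higher score = less plausible under the training distribution = more likely OOD."""
    x_t = noise_image(x, t)
    x_rec = denoise(x_t, t)
    return ((x - x_rec) ** 2).flatten(1).mean(dim=1)          # per-image MSE as (im)plausibility score

# Usage sketch (method names on `model` are hypothetical, not a real library API):
# scores = reconstruction_ood_score(batch, model.add_noise, model.reconstruct, t=250)
# is_ood = scores > threshold    # threshold calibrated on in-distribution validation scores
```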
Forest monitoring is one of the most important tasks for Earth’s ecosystems. Deep learning can help towards this goal.
For this purpose, this paper proposes a Forest Monitoring Benchmark (FoMo-Bench), made of 15 diverse datasets encompassing satellite, aerial, and inventory data, covering a variety of geographical regions, and including multispectral, RGB, SAR, and LiDAR data with various temporal, spatial and spectral resolutions.
The authors also propose FoMo-Net, a baseline masked image modeling foundation model.
what I liked most: the important task
a note: the comparisons do not include other remote sensing pre-trained models
paper: https://arxiv.org/abs/2312.10114
In recent years, there has been an explosion of proposed change detection deep learning architectures in the remote sensing literature.
But, has the field truly made significant progress?
In this paper, the authors perform experiments showing that U-Net is still a top performer.
what I liked most: the simple but important question
a note: only two benchmarks are used, both with high-resolution datasets; maybe we should also try this approach with open low-resolution data?
msGFM is a multisensor geospatial foundation model that effectively unifies data from four key sensor modalities (RGB, S1, S2, DSM).
For data originating from the same geolocations, msGFM proposes a cross-sensor pretraining approach using masked image modeling (MIM), enabling the synthesis of joint representations from diverse sensors.
Notes:
1) cool downstream tasks!
2) MIM is almost always the default choice, what do you think?
3) can RGB and S2 be considered “different modalities”?
paper: https://arxiv.org/abs/2404.01260
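Since MIM keeps coming up, here is a bare-bones random patch-masking step as used in MAE-style pretraining; this is the generic recipe, not msGFM's cross-sensor variant, and the 75% ratio is the usual MAE default rather than necessarily theirs.

```python
import torch

def random_patch_masking(tokens, mask_ratio=0.75):
    """MAE-style masking: keep a random subset of patch tokens for the encoder,
    remember which ones were dropped so the decoder can reconstruct them.
    tokens: (B, N, D) patch embeddings."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)
    ids_shuffle = noise.argsort(dim=1)                        # random permutation per sample
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0.0)                           # 1 = masked (to be reconstructed)
    return visible, mask, ids_keep

visible, mask, ids_keep = random_patch_masking(torch.randn(2, 196, 768))
# The encoder sees only `visible`; the reconstruction loss is computed where mask == 1.
```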
Choosing the proper ground sampling distance is a pivotal decision in remote sensing downstream applications.
In this work, the authors try to set out a clear ensemble of guidelines for fairly comparing semantic segmentation results obtained at various spatial resolutions.
They also propose region-based pixel-wise metrics, allowing for a more detailed analysis of the model performance.
what I liked: very practical and important problem
Multimodality is one of the keys to remote sensing foundation models. Most methods, however, focus on just one modality.
DOFA employs an innovative approach utilizing wavelength as a unifying parameter across various EO modalities to achieve a more cohesive multimodal representation.
DOFA is trained using a masked image modeling strategy, and a distillation loss is included to further optimize its performance.
what I liked the most: the wavelength approach
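A toy sketch of the "wavelength as a unifying parameter" idea: a tiny hypernetwork maps each band's central wavelength to that band's patch-embedding weights, so the same model can ingest sensors with different band sets. Everything here (sizes, the wavelength encoding, the band-wise summation) is my own simplification, not DOFA's architecture.

```python
import torch
import torch.nn as nn

class WavelengthConditionedPatchEmbed(nn.Module):
    """Generate per-band patch-embedding filters from the band's central wavelength,
    so arbitrary band combinations can be projected into a shared token space."""
    def __init__(self, dim=256, patch=16):
        super().__init__()
        self.patch, self.dim = patch, dim
        # hypernetwork: wavelength (micrometres) -> one patch x patch filter per embedding dim
        self.hyper = nn.Sequential(nn.Linear(1, 128), nn.GELU(),
                                   nn.Linear(128, dim * patch * patch))

    def forward(self, x, wavelengths_um):
        # x: (B, C, H, W); wavelengths_um: (C,) central wavelength of each band
        B, C, H, W = x.shape
        w = self.hyper(wavelengths_um.unsqueeze(-1))          # (C, dim*patch*patch)
        w = w.view(C, self.dim, 1, self.patch, self.patch)
        tokens = 0
        for c in range(C):                                    # embed each band with its generated filter
            tokens = tokens + nn.functional.conv2d(x[:, c:c+1], w[c], stride=self.patch)
        return tokens.flatten(2).transpose(1, 2)              # (B, num_patches, dim)

embed = WavelengthConditionedPatchEmbed()
s2_bands = torch.randn(2, 4, 64, 64)
tokens = embed(s2_bands, torch.tensor([0.665, 0.560, 0.490, 0.842]))   # S2 B4, B3, B2, B8 (µm)
```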
Almost every day a new foundation model is out. But, how to test their effectiveness?
PhilEO Bench proposes a new benchmark, based on Sentinel-2 images. It consists of three different tasks: land cover mapping, building density estimation, and road extraction.
what I liked the most: the research question is pivotal
a note: this is just a first step to be extended
SatMAE was one of the first strong foundation models, but it had some flaws.
To limit them, SatMAE++ follows a multiscale strategy, using convolution-based upsampling blocks to reconstruct the image at higher scales, which makes it extensible to more scales.
what I liked most: the easy approach
a small note: we should start including segmentation in benchmarking
SkySense is a remote sensing foundation model (RSFM), trained on a dataset with 21.5 million multimodal temporal sequences.
It shows very promising performance, which confirms the idea that multimodality (both in sensors and in tempo-spatiality) is game-changing for RSFMs.
what I liked most: the Unsupervised Geo-Context Prototype Learning
a small note: what if the other models are trained with this new enormous dataset?
paper: https://arxiv.org/pdf/2312.10115.pdf
xAI methods are typically designed to work on natural images. However, RS imagery has different properties from natural images.
The challenges are, thus, very different. For this reason, I found this work extremely interesting.
It summarizes all the work conducted in xAI for EO. It is a great entry point to this topic.
what I liked most: the research questions to guide the reader
There is a surge of interest in Large Language Models (LLMs) for EO.
Just from 17 January to 9 February, five papers (to my knowledge) came out. Four out of five mention GPT in the title.
Some trends are emerging:
a) the need for specialized multi-modal (not only vision-language) datasets
b) understanding the limitations of non-specialized models
c) giving domain-specific visual clues to the LLMs
d) adapting the models for specific downstream tasks
Papers:
EarthGPT: https://arxiv.org/abs/2401.16822
RS ChatGPT: https://arxiv.org/abs/2401.09083
SkyEyeGPT: https://arxiv.org/abs/2401.09712
Rs-CapRet: https://arxiv.org/abs/2402.06475
Benchmarking GPT-4V: https://arxiv.org/abs/2401.17600
It is increasingly evident that AI4EO challenges are quite different from the classical ML challenges.
Think, for example, of the different spatial and temporal scales, the data volume, the different channels, and the annotations.
This paper (SatML) tries to set a roadmap for a new agenda for satellite machine learning (a.k.a. AI4EO) as an independent research line.
what I liked most: the focus on the importance of dense predictions
a little note: I think some of these challenges were already clear to geomatics specialists. I’d love a stronger and denser collaboration among geomatics-based and ML-based communities, working on this intersection
Using representations that are invariant to the sensor is a pivotal element in remote sensing.
Sensor agnosticism is one of the keys to modern AI4EO systems.
To this purpose, CSMAE (Cross-Sensor MAE) is an MAE-based algorithm for cross-sensor image retrieval.
Specifically, the authors explore four different approaches to shape a CSMAE, testing it on different Sentinel (both 1 and 2) data.
what I liked most: the sensor-agnostic approach
a little note: sensor-agnosticism =/= multimodality
Foundation (pre-trained) models are spreading rapidly in remote sensing (RS). However, as often, a big part of the expertise is taken off-the-shelf from computer vision, leaving little room for the peculiarities of RS.
A clear example (not covered in the paper) is that most foundation models for RS are developed mainly for RGB and/or high-resolution data, even though the available RS data is often multispectral and/or low-resolution (e.g. Sentinel-2).
This paper starts to investigate this issue: how to optimize pre-trained (foundation) models on different RS benchmarks, with a particular focus on image size and normalization.
Also, a big shout out to TorchGeo, the repo from the same authors!
what I liked most: the research question, really underestimated
a little note: I'd have loved more deepening on the multispectral side
Across the globe, several environmental (e.g. temperature) and socioeconomic (e.g. population density) factors shape the visual characteristics of satellite imagery.
These factors manifest in some visual aspects, which can range from the different types of vegetation and agricultural parcels to the architectural design of buildings.
To grasp this correspondence, SatCLIP (Satellite Contrastive Location-Image Pretraining) learns to associate an image with a location, based on the various ground information detectable in the images.
In this way, SatCLIP learns the implicit representations of the image features that characterize a specific location.
The model trained with SatCLIP is then used to solve several spatially-aware downstream tasks (e.g. air temperature prediction, biome classification).
what I appreciated most: the experiments assessing performance per continent and geographic generalization (see Figure 3 and Section 5.2 in the paper)
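To make the location-image pairing concrete, here is a rough sketch of a location encoder over simple lat/lon sinusoidal features, trained against image embeddings with the usual symmetric contrastive loss. The paper's actual location encoders are more sophisticated than this; the toy encoder, frequencies, and coordinates below are mine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLocationEncoder(nn.Module):
    """Map (lon, lat) in degrees to an embedding via sinusoidal features + MLP."""
    def __init__(self, dim=256, n_freqs=8):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freqs).float())
        self.mlp = nn.Sequential(nn.Linear(4 * n_freqs, 512), nn.ReLU(), nn.Linear(512, dim))

    def forward(self, lonlat_deg):
        rad = torch.deg2rad(lonlat_deg)                       # (B, 2)
        ang = rad.unsqueeze(-1) * self.freqs                  # (B, 2, n_freqs)
        feats = torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(1)   # (B, 4*n_freqs)
        return self.mlp(feats)

def clip_style_loss(img_emb, loc_emb, temperature=0.07):
    img_emb, loc_emb = F.normalize(img_emb, dim=-1), F.normalize(loc_emb, dim=-1)
    logits = img_emb @ loc_emb.t() / temperature
    targets = torch.arange(img_emb.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loc_enc = ToyLocationEncoder()
loc_emb = loc_enc(torch.tensor([[12.49, 41.89], [-70.66, -33.45]]))    # Rome, Santiago
loss = clip_style_loss(torch.randn(2, 256), loc_emb)
```

After training, the location encoder alone can be queried anywhere on the globe, which is exactly what the spatially-aware downstream tasks exploit.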