#50 papers challenge

My research New Year's resolution is to read more papers. To keep myself accountable, I decided to post a few lines here about (at least) one paper per week.


I will look for interesting (and possibly lesser-known) papers addressing AI4EO challenges, and I will highlight what I appreciate most in each.

My target is to cover 50 papers this year!

#21 SLR


Geospatial Foundation Models (GFMs) are commonly trained on large optical RGB or multi-spectral datasets, although data from various heterogeneous sensors are available in the remote sensing domain.

This leads to a significant shift between pre-training and downstream data distributions.


Moreover, fine-tuning GFMs to bridge this gap is computationally intensive and can be ineffective when target datasets are small.


In this paper (at #CVPR2024), the authors present Scaled Low Rank (SLR), a self-supervised adaptation method that boosts downstream linear evaluation accuracy of different GFMs, outperforming full fine-tuning while training only 1-2% of the model parameters.
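
To make the idea concrete, here is a minimal sketch of what a scaled low-rank adapter layer could look like in PyTorch. This is my own illustration of the general recipe (frozen pre-trained weights plus a trainable low-rank update with a learnable scale), not the authors' implementation; all names are hypothetical.

```python
import torch
import torch.nn as nn

class ScaledLowRankLinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update with a learnable scale (sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # keep the pre-trained weights frozen
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)  # low-rank projection (trainable)
        self.up = nn.Linear(rank, base.out_features, bias=False)   # low-rank expansion (trainable)
        nn.init.zeros_(self.up.weight)    # update starts at zero, so the layer initially matches the frozen one
        self.scale = nn.Parameter(torch.ones(1))  # learnable scaling of the low-rank update

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))
```

Wrapping the linear layers of a frozen GFM this way keeps the trainable parameters to a few percent of the total, in the same ballpark as the 1-2% reported in the paper.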


Note: the paper focuses only on patch-level classification

Paper: https://openaccess.thecvf.com//content/CVPR2024/papers/Scheibenreif_Parameter_Efficient_Self-Supervised_Geospatial_Domain_Adaptation_CVPR_2024_paper.pdf 


🚀📢 new preprint! (#20) X-STARS


Large pre-trained models should learn “sensor agnostic” representations that generalize across sensor characteristics with minimal fine-tuning. This is complicated by data availability: low-resolution imagery (e.g. Sentinel-2 and Landsat-8) is available in large amounts, while very high-resolution data is less common.


We introduce cross-sensor self-supervised training and alignment for remote sensing (X-STARS). We design a self-supervised training loss, the Multi-Sensor Alignment Dense loss (MSAD), to align representations across sensors, even with vastly different resolutions. 


Our X-STARS can be applied to train models from scratch, or to adapt large models pretrained on e.g. low-resolution EO data to new high-resolution sensors, in a continual pretraining framework.


We collect and release MSC-France, a new multi-sensor dataset, on which we train our X-STARS models and then evaluate them on different downstream classification and segmentation tasks. We demonstrate that X-STARS outperforms the state of the art with less data, across various conditions of data availability and resolution.
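
To give an intuition of what a dense cross-sensor alignment term can look like, here is a toy sketch (a deliberate simplification for this post, not the exact MSAD loss): dense features from two sensors of the same scene are resampled to a common grid and pulled together with a cosine similarity objective.

```python
import torch
import torch.nn.functional as F

def dense_alignment_loss(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Toy dense alignment between two co-registered sensor feature maps.

    feat_a: (B, C, Ha, Wa) features from sensor A (e.g. high-resolution)
    feat_b: (B, C, Hb, Wb) features from sensor B (e.g. low-resolution)
    """
    # Resample sensor B features onto sensor A's grid so that locations match.
    feat_b = F.interpolate(feat_b, size=feat_a.shape[-2:], mode="bilinear", align_corners=False)
    # Maximize per-location cosine similarity between the two feature maps.
    sim = F.cosine_similarity(feat_a, feat_b, dim=1)  # (B, Ha, Wa)
    return 1.0 - sim.mean()
```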

 

🔼 not another MAE

🔼 interesting performance with small data and continual pre-training

🔽 more modalities to be included


Paper: https://arxiv.org/abs/2405.09922


#19 MagicBathyNet

Climate change is strongly impacting marine environments.


Accurate, detailed, and regularly updated bathymetry, combined with complex semantic content, is crucial for undermapped shallow-water areas facing intense climatological and anthropogenic pressures.


MagicBathyNet is a benchmark dataset comprising image patches of Sentinel-2, SPOT-6, and aerial imagery, bathymetry in raster format, and annotations of seabed classes. 


MagicBathyNet is then used to benchmark state-of-the-art methods for learning-based bathymetry estimation and pixel-based classification.

Paper: https://arxiv.org/pdf/2405.15477 

#18 ChannelViT


Earth Observation images present different challenges compared with natural images. One is the varying information carried by different bands (i.e. channels). Adapting common computer vision models to these challenges is pivotal.


ChannelViT (#ICLR2024) constructs patch tokens independently from each input channel. It employs a set of learnable channel embeddings to encode channel-specific information, enabling ChannelViT to perform cross-channel and cross-position reasoning.


The authors also introduce Hierarchical Channel Sampling (HCS), which employs a two-step sampling procedure to simulate test-time channel unavailability during training. Unlike channel dropout, HCS covers channel combinations with varying numbers of channels uniformly, boosting robustness.
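
A rough sketch of both ideas, as I read them (simplified and with hypothetical names, not the official implementation): every channel is patchified separately with a shared projection plus a learnable channel embedding, and HCS first samples how many channels survive, then which ones.

```python
import torch
import torch.nn as nn

class ChannelWisePatchEmbed(nn.Module):
    """Builds patch tokens independently for each input channel (ChannelViT-style sketch)."""

    def __init__(self, num_channels: int, patch_size: int = 16, dim: int = 768):
        super().__init__()
        # One shared projection applied to single-channel patches.
        self.proj = nn.Conv2d(1, dim, kernel_size=patch_size, stride=patch_size)
        # Learnable embedding telling the model which channel a token came from.
        self.channel_embed = nn.Parameter(torch.zeros(num_channels, dim))

    def forward(self, x, channel_ids):
        # x: (B, C, H, W); channel_ids: indices of the channels actually present
        tokens = []
        for i, c in enumerate(channel_ids):
            t = self.proj(x[:, i : i + 1])            # (B, dim, H/p, W/p)
            t = t.flatten(2).transpose(1, 2)          # (B, N, dim)
            tokens.append(t + self.channel_embed[c])  # add channel-specific embedding
        return torch.cat(tokens, dim=1)               # (B, C*N, dim)


def hierarchical_channel_sampling(num_channels: int):
    """Two-step sampling: first draw how many channels survive, then which ones."""
    k = torch.randint(1, num_channels + 1, (1,)).item()
    return torch.randperm(num_channels)[:k].sort().values
```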


a note: cool paper, but poor evaluation on remotely sensed images


paper: https://arxiv.org/abs/2309.16108 


#17 OmniSat


One of the main characteristics of foundation models should be multimodality.


This paper directly addresses this point, proposing a novel Self-Supervised Modality Fusion model: OmniSat.


OmniSat aligns multiple EO modalities to learn expressive multimodal representations without labels. Its training objective combines a contrastive and a reconstruction loss.
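
A toy sketch of how such a combined objective can be put together (my simplification, not OmniSat's exact formulation):

```python
import torch
import torch.nn.functional as F

def multimodal_pretraining_loss(z_a, z_b, recon, target, temperature=0.07, lam=1.0):
    """Toy combination of a cross-modal contrastive loss and a reconstruction loss.

    z_a, z_b: (B, D) embeddings of the same scenes from two modalities
    recon, target: reconstructed and original (masked) inputs
    """
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                   # (B, B) similarity matrix
    labels = torch.arange(z_a.size(0), device=z_a.device)  # matching pairs lie on the diagonal
    contrastive = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
    reconstruction = F.mse_loss(recon, target)
    return contrastive + lam * reconstruction
```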


what I liked most: improved performance in downstream tasks even when just one modality is available


paper: https://arxiv.org/abs/2404.08351


#16 Alice goes to a differentiable wonderland!

 

My PhD supervisor published a great short book on neural networks, spanning from convolutions to transformers, SSMs, and beyond.

 

It is a great entry point for everyone who wants to approach deep learning and a great tool for those who want to refresh the theory.


It could be of great use to the GeoAI community!

 

What I liked most: the fluency, the clarity, and the promise of new topics

 

Book: https://www.sscardapane.it/alice-book 


#15 ODEED


Diffusion models are one of the trendiest topics in deep learning. They can generate high-quality images under different conditions.

However, diffusion models can also be used to detect out-of-distribution Earth Observation images!


In this paper, the authors show that the reconstruction error of diffusion models can effectively serve as unsupervised out-of-distribution detectors for remote sensing images, using it as a plausibility score.


Moreover, they introduce ODEED, a novel reconstruction-based scorer using the probability-flow ODE of diffusion models.


They validate their findings under different scenarios (e.g. pre/post flood and non-flooded/flooded images).
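
In its simplest form, reconstruction-based OOD scoring with a denoising model boils down to something like the sketch below (a generic recipe with a hypothetical denoiser interface, not the probability-flow ODE scorer proposed in the paper):

```python
import torch

@torch.no_grad()
def reconstruction_ood_score(model, x, noise_level=0.5):
    """Toy reconstruction-based OOD score with a denoising model.

    model: a denoiser mapping (noisy image, noise level) -> reconstruction (hypothetical interface)
    x: (B, C, H, W) input images; higher score = more out-of-distribution
    """
    noisy = x + noise_level * torch.randn_like(x)      # perturb the input
    recon = model(noisy, noise_level)                  # reconstruct with the diffusion model
    return ((recon - x) ** 2).flatten(1).mean(dim=1)   # per-image reconstruction error
```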


what I liked most: the innovative use of diffusion models and the downstream task
a note: curious to see how it adapts to other interesting tasks


paper: https://arxiv.org/abs/2404.12667


#14 FoMo-Bench


Forest monitoring is one of the most important tasks for Earth’s ecosystems. Deep learning can help towards this goal.

 

For this purpose, this paper proposes a Forest Monitoring Benchmark (FoMo-Bench), made of 15 diverse datasets encompassing satellite, aerial, and inventory data, covering a variety of geographical regions, and including multispectral, RGB, SAR, and LiDAR data with various temporal, spatial and spectral resolutions.

 

The authors also propose FoMo-Net, a baseline masked image modeling foundation model.

 

what I liked most: the important task

a note: the comparisons do not include other remote sensing pre-trained models


paper: https://arxiv.org/abs/2312.10114 



#13 Change Detection reality check


In recent years, there has been an explosion of proposed change detection deep learning architectures in the remote sensing literature.


But, has the field truly made significant progress?


In this paper, the authors perform experiments showing that U-Net is still a top performer.


what I liked most: the simple but important question

a note: the benchmarks are only two, both with high-resolution datasets; maybe we should also try this approach with open low-resolution data?


paper: https://arxiv.org/abs/2402.06994 


#12 msGFM


msGFM is a multisensor geospatial foundation model that effectively unifies data from four key sensor modalities (RGB, S1, S2, DSM)


For data originating from the same geolocations, msGFM proposes a cross-sensor pretraining approach based on masked image modeling (MIM), enabling the synthesis of joint representations from diverse sensors
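
As a refresher, the MIM backbone behind this kind of pretraining reduces to random token masking like the sketch below (generic MIM, not msGFM's multi-sensor specifics):

```python
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Generic MIM-style random masking of patch tokens.

    tokens: (B, N, D) patch tokens; returns the visible tokens and the mask used.
    """
    B, N, D = tokens.shape
    keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)
    ids = noise.argsort(dim=1)[:, :keep]                       # random subset of tokens to keep
    visible = torch.gather(tokens, 1, ids.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=tokens.device)
    mask.scatter_(1, ids, 0.0)                                 # 1 = masked, 0 = visible
    return visible, mask
```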


Notes:

1)  cool downstream tasks!

2)  MIM is almost always the default choice, what do you think?

3)  can RGB and S2 be considered “different modalities”?


paper: https://arxiv.org/abs/2404.01260



#11 SegEval


Choosing the proper ground sampling distance is a pivotal decision in remote sensing downstream applications.


In this work, the authors try to set out a clear set of guidelines for fairly comparing semantic segmentation results obtained at various spatial resolutions.


They also propose region-based pixel-wise metrics, allowing for a more detailed analysis of the model performance.


what I liked: very practical and important problem 


paper: https://ieeexplore.ieee.org/document/10443941


#10 DOFA


Multimodality is one of the keys to remote sensing foundation models. Most methods, however, focus on just one modality.


DOFA employs an innovative approach utilizing wavelength as a unifying parameter across various EO modalities to achieve a more cohesive multimodal representation.


DOFA is trained using a masked image modeling strategy, and a distillation loss is included to further optimize its performance.
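
My rough reading of the wavelength trick, as a toy sketch (hypothetical names, not DOFA's actual code): a small network generates the patch-embedding weights from each band's central wavelength, so any combination of bands can be projected into the same token space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WavelengthConditionedPatchEmbed(nn.Module):
    """Toy patch embedding whose per-band weights are generated from the band wavelength."""

    def __init__(self, patch_size: int = 16, dim: int = 256, hidden: int = 64):
        super().__init__()
        self.patch_size = patch_size
        self.dim = dim
        # Small network mapping a wavelength (e.g. in micrometers) to a per-band filter.
        self.weight_gen = nn.Sequential(
            nn.Linear(1, hidden), nn.GELU(),
            nn.Linear(hidden, dim * patch_size * patch_size),
        )

    def forward(self, x: torch.Tensor, wavelengths: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); wavelengths: (C,) central wavelength of each band
        B, C, H, W = x.shape
        p, d = self.patch_size, self.dim
        w = self.weight_gen(wavelengths.view(C, 1)).view(C, d, p, p)  # one filter per band
        tokens = 0
        for c in range(C):  # project each band with its generated filter and sum
            tokens = tokens + F.conv2d(x[:, c : c + 1], w[c].unsqueeze(1), stride=p)
        return tokens.flatten(2).transpose(1, 2)  # (B, N, dim)
```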


what I liked the most: the wavelength approach 


paper: https://arxiv.org/pdf/2403.15356.pdf


#9 PhilEO Bench


Almost every day a new foundation model comes out. But how do we test their effectiveness?


PhilEO Bench proposes a new benchmark based on Sentinel-2 images. It consists of three different tasks: land cover mapping, building density estimation, and road extraction


what I liked the most: the research question is pivotal

a note: this is just a first step to be extended 

paper: https://arxiv.org/abs/2401.04464 


#8 SatMAE++


SatMAE was one of the first strong foundation models, but it had some flaws.


To address them, SatMAE++ adopts a multiscale strategy and uses convolution-based upsampling blocks to reconstruct the image at higher scales, making it extensible to additional scales.
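
For a concrete picture, a convolution-based 2x upsampling block along these lines (a generic sketch, not necessarily the paper's exact block) can be as simple as:

```python
import torch.nn as nn

# Generic conv-based 2x upsampling block (illustrative, not SatMAE++'s exact design).
def upsample_block(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2),  # 2x spatial upsampling
        nn.BatchNorm2d(out_ch),
        nn.GELU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),         # refine the upsampled features
    )
```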


what I liked most: the simple approach

a small note: we should start including segmentation in benchmarking

paper: https://arxiv.org/abs/2403.05419 


#7 SkySense


SkySense is a remote sensing foundation model (RSFM), trained on a dataset with 21.5 million multimodal temporal sequences.


It shows very promising performance, which supports the idea that multimodality (across both sensors and space-time) is game-changing for RSFMs.


what I liked most: the Unsupervised Geo-Context Prototype Learning

a small note: what if the other models were trained on this enormous new dataset?


paper: https://arxiv.org/pdf/2312.10115.pdf 



#6 xAI for EO


xAI methods are typically designed to work on natural images. However, RS imagery has different properties from natural images.


The challenges are, thus, very different. For this reason, I found this work extremely interesting.


It summarizes all the work conducted in xAI for EO. It is a great entry point to this topic.


what I liked most: the research questions to guide the reader 


paper: https://arxiv.org/pdf/2402.13791


#5 LLMs for EO


There is a surge of interest in Large Language Models (LLMs) for EO.

Just from 17 January to 9 February, five papers (to my knowledge) came out. Four out of five mention GPT in the title.


Some trends are emerging:

a) the need for specialized multi-modal (not only vision-language) datasets

b) understanding the limitations of non-specialized models

c) giving domain-specific visual clues to the LLMs

d) adapting the models for specific downstream tasks 


Papers:

EarthGPT: https://arxiv.org/abs/2401.16822

RS ChatGPT: https://arxiv.org/abs/2401.09083

SkyEyeGPT: https://arxiv.org/abs/2401.09712

Rs-CapRet: https://arxiv.org/abs/2402.06475

Benchmarking GPT-4V: https://arxiv.org/abs/2401.17600


#4 SatML


It is increasingly evident that AI4EO challenges are quite different from the classical ML challenges.


Think, for example, of the different spatial and temporal scales, the data volume, the different channels, and the annotations.


This paper, SatML, tries to set out a roadmap and a new agenda for satellite machine learning (a.k.a. AI4EO) as a new, independent research line.


what I liked most: the focus on the importance of dense predictions

a little note: I think some of these challenges were already clear to geomatics specialists. I’d love a stronger and denser collaboration between the geomatics and ML communities working at this intersection

paper: https://arxiv.org/pdf/2402.01444  


#3 CSMAE


Using representations that are invariant to the sensor is a pivotal element in remote sensing.


Sensor agnosticism is one of the keys to modern AI4EO systems.

To this end, CSMAE (Cross-Sensor MAE) is an MAE-based algorithm for cross-sensor image retrieval.


Specifically, the authors explore four different approaches to shaping a CSMAE, testing them on Sentinel-1 and Sentinel-2 data.


what I liked most: the sensor-agnostic approach

a little note: sensor agnosticism ≠ multimodality

paper: https://arxiv.org/abs/2401.07782   


#2 Revisiting pre-trained remote sensing model benchmarks


Foundation (pre-trained) models are spreading rapidly in remote sensing (RS). However, as often happens, much of the expertise is taken off the shelf from computer vision, leaving little room for the peculiarities of RS.


A clear example (not covered in the paper) is that most RS foundation models are developed mainly for RGB and/or high-resolution data, even though the most readily available RS data are multispectral and/or low-resolution (e.g. Sentinel-2).


This paper starts to investigate this issue: how to optimize pre-trained (foundation) models on different RS benchmarks, with a particular focus on image size and normalization.


Also, a big shout out to TorchGeo, the repo from the same authors!


what I liked most: the research question, which is really underrated

a little note: I’d have loved a deeper dive into the multispectral side


paper: https://arxiv.org/pdf/2305.13456.pdf

torchgeo: https://torchgeo.readthedocs.io/en/stable/  


#1 SatCLIP


Across the globe, several environmental factors (e.g. temperature) and socioeconomic factors (e.g. population density) shape the visual characteristics of satellite imagery.


These factors manifest in visual aspects that range from different types of vegetation and agricultural parcels to the architectural design of buildings.


To capture this correspondence, SatCLIP (Satellite Contrastive Location-Image Pretraining) learns to associate an image with a location, based on the ground characteristics detectable in the images.

In this way, SatCLIP learns implicit representations of the image features that characterize a specific location.
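
Under the hood the objective is CLIP-style: a location encoder and an image encoder are trained so that the embedding of a coordinate matches the embedding of the image taken there. A minimal sketch (simplified, hypothetical encoders, not the official code):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(loc_emb: torch.Tensor, img_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric contrastive loss between location and image embeddings (SatCLIP-style sketch).

    loc_emb: (B, D) embeddings of lon/lat coordinates from a location encoder
    img_emb: (B, D) embeddings of the satellite images taken at those coordinates
    """
    loc_emb = F.normalize(loc_emb, dim=-1)
    img_emb = F.normalize(img_emb, dim=-1)
    logits = loc_emb @ img_emb.t() / temperature             # (B, B) similarity matrix
    labels = torch.arange(loc_emb.size(0), device=loc_emb.device)
    # Matching location/image pairs lie on the diagonal.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```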


The model trained with SatCLIP is then used to solve several spatially-aware downstream tasks (e.g. air temperature prediction, biome classification).


what I appreciated most: the experiments assessing performance per continent and geographic generalization (see Figure 3 and Section 5.2 in the paper)


paper: https://arxiv.org/pdf/2311.17179.pdf