ML Group

Testing ideas for future work

We had a brainstorming session to come up with interesting dycore-style testcases for ML. We started implementing them, but this was put aside for future work as soon as Casper became available again.

Variables that can be prescribed in the models (e.g. SFNO):

At all vertical levels: p, T, q or RH, u, v, z
Column values: T2m, surface pressure, MSLP, TCWV

The testcases discussed are based on the 'tendency reversal' idea which allows the effect of the perturbation to be isolated. We would compare it to dynamical model output.

Testcases

1. 2D uniform advection over the sphere. Advect a blob of tracer (represented by a moisture perturbation in an upper model level, where it behaves as a tracer) uniformly for a full revolution, so that it should return exactly to its original state. Place the blob either on or off the equator.
2. 2D non-uniform advection over the sphere, e.g. colliding modons (Lin et al. 2017) or Blossey-Durran (2008). Set up the 2D flow in the upper model level as for (1), with a moisture perturbation.
3. Warm bubble test to analyse how buoyancy-driven rising motion is represented, e.g. through winds converging towards a high-RH region of low surface pressure with high pressure aloft. Test over the Sahara (no or limited additional moisture source) or in the tropics over the ocean (large additional moisture source).
4. Vortex street and gap flow (DCMIP2025 2a, 2b), tested by prescribing winds of different magnitude or direction around real orography. We cannot prescribe orography, but we can make use of the orography the ML model has learned.
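
As a minimal sketch of how the uniform advection test could be set up and scored, the snippet below builds solid-body rotation winds, a Gaussian tracer blob and a relative error norm for comparing the field after one revolution with the initial field. The 12-day revolution period, the Gaussian blob shape, the grid and all function names are assumptions chosen for illustration; none of this is part of any ML model's API.

```python
import numpy as np

EARTH_RADIUS = 6.371e6       # m
PERIOD_S = 12 * 86400.0      # assumed time for one full revolution (s)

def solid_body_winds(lat, lon, period_s, alpha=0.0):
    """Solid-body rotation winds (m/s); lat, lon in radians.
    alpha tilts the rotation axis away from the pole (0 = flow along the equator)."""
    u0 = 2.0 * np.pi * EARTH_RADIUS / period_s
    u = u0 * (np.cos(lat) * np.cos(alpha) + np.sin(lat) * np.cos(lon) * np.sin(alpha))
    v = -u0 * np.sin(lon) * np.sin(alpha)
    return u, v

def gaussian_blob(lat, lon, lat0, lon0, width_rad, amplitude=1.0):
    """Gaussian tracer blob as a function of great-circle distance from (lat0, lon0)."""
    cos_d = np.sin(lat0) * np.sin(lat) + np.cos(lat0) * np.cos(lat) * np.cos(lon - lon0)
    dist = np.arccos(np.clip(cos_d, -1.0, 1.0))
    return amplitude * np.exp(-(dist / width_rad) ** 2)

def relative_l2_error(final, initial, lat):
    """Area-weighted relative L2 difference between the final and initial tracer."""
    w = np.cos(lat)
    return np.sqrt(np.sum(w * (final - initial) ** 2) / np.sum(w * initial ** 2))

# Illustrative 2.5-degree grid with the blob placed on the equator.
lat = np.deg2rad(np.linspace(-90.0, 90.0, 73))
lon = np.deg2rad(np.arange(0.0, 360.0, 2.5))
lat2d, lon2d = np.meshgrid(lat, lon, indexing="ij")
u, v = solid_body_winds(lat2d, lon2d, PERIOD_S)
blob = gaussian_blob(lat2d, lon2d, 0.0, np.pi, np.deg2rad(10.0))
```

After the ML model has been stepped through one full revolution, relative_l2_error(final_tracer, blob, lat2d) would be zero for a perfect advection scheme; placing the blob off the equator only changes lat0.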

To quantify the errors in these testcases correctly, it is important to understand the model precision error and the 'quality' of the steady state. Some things to look at are:

How quickly do simulations without perturbations diverge from the initial steady state? This can be quantified for various models through the amplification of nonphysical MSLP anomalies relative to the steady state (e.g. the Bouvier et al. 2024 steady state, or a DJF or JJA steady state).
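
One possible way to put a number on this divergence is sketched below: an area-weighted RMS of the MSLP anomaly at each output time, followed by a log-linear fit to estimate an e-folding time. The function names, the synthetic data and the assumption of roughly exponential growth are illustrative only.

```python
import numpy as np

def area_weighted_rms(anomaly, lat_deg):
    """RMS over the sphere of an anomaly with shape (ntime, nlat, nlon),
    weighting each grid point by cos(latitude)."""
    w = np.cos(np.deg2rad(lat_deg))[:, None]
    w = np.broadcast_to(w, anomaly.shape[-2:])
    return np.sqrt(np.sum(w * anomaly ** 2, axis=(-2, -1)) / np.sum(w))

def e_folding_time(rms, dt_hours):
    """Fit log(rms) = a + t / tau and return tau in hours.
    Only meaningful while the growth is approximately exponential."""
    t = np.arange(len(rms)) * dt_hours
    slope, _ = np.polyfit(t, np.log(rms), 1)
    return 1.0 / slope

# Synthetic example: an anomaly pattern growing with a 48-hour e-folding time.
lat_deg = np.linspace(-90.0, 90.0, 37)
lon_deg = np.arange(0.0, 360.0, 5.0)
pattern = np.cos(np.deg2rad(lat_deg))[:, None] * np.cos(np.deg2rad(lon_deg))[None, :]
times_h = np.arange(20) * 6.0
mslp_anomaly = np.exp(times_h / 48.0)[:, None, None] * pattern

rms_series = area_weighted_rms(mslp_anomaly, lat_deg)
tau_hours = e_folding_time(rms_series, dt_hours=6.0)
```

With real output, mslp_anomaly would be the model MSLP minus the initial steady-state MSLP at each step, and the fit could be restricted to the period before the anomalies saturate.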

Earth2Studio

Overview

We are using Earth2Studio to investigate the performance of machine learning-based weather forecasting models. Our primary focus is comparing the output of Earth2Studio models against the operational GFS (Global Forecast System) over the past 4 days.

Objectives

Evaluate how well ML-based models reproduce near-term forecasts when compared to GFS.
Identify strengths and weaknesses of different model architectures in Earth2Studio.
Explore the impact of recent atmospheric events on model performance.

Progress

Due to technical difficulties, we’ve had limited time to run full-scale experiments. Nonetheless, we have:

Successfully set up Earth2Studio.
Selected forecast targets and initial conditions for the past 4 days.
Begun running inference using several pretrained models in the Earth2Studio suite.
Developed basic visualisation tools to compare GFS and ML outputs (e.g. temperature, precipitation, geopotential height at 500 hPa).
Discussed potential future applications for Earth2Studio (E2S).

Early Observations

Initial side-by-side comparisons suggest that while the ML models capture large-scale features reasonably well, they tend to smooth out finer-scale structures and underestimate extremes in precipitation and wind. This may reflect either training data limitations or the model design itself.
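
The smoothing of finer-scale structures could be quantified with a zonal power spectrum, for example. The sketch below compares the power at high zonal wavenumbers in an ML field with that in a reference field; the function names, the wavenumber cutoff and the synthetic fields are assumptions for illustration, and the latitude average is unweighted for simplicity.

```python
import numpy as np

def zonal_power_spectrum(field):
    """Mean power per zonal wavenumber for a (nlat, nlon) field."""
    coeffs = np.fft.rfft(field, axis=-1) / field.shape[-1]
    return np.mean(np.abs(coeffs) ** 2, axis=0)

def small_scale_power_ratio(ml_field, ref_field, min_wavenumber):
    """Ratio of ML to reference power at zonal wavenumbers >= min_wavenumber.
    Values well below 1 indicate that the ML field is smoother."""
    p_ml = zonal_power_spectrum(ml_field)[min_wavenumber:].sum()
    p_ref = zonal_power_spectrum(ref_field)[min_wavenumber:].sum()
    return p_ml / p_ref

# Synthetic example: same large-scale wave, damped small-scale wave in the "ML" field.
lon = np.deg2rad(np.arange(0.0, 360.0, 2.5))
rows = np.ones((10, 1))
ref_field = rows * (np.cos(2 * lon) + 0.5 * np.cos(20 * lon))
ml_field = rows * (np.cos(2 * lon) + 0.1 * np.cos(20 * lon))

ratio = small_scale_power_ratio(ml_field, ref_field, min_wavenumber=10)
```

In this synthetic case the small-scale amplitude is reduced by a factor of five, so the power ratio is about 0.04. A similar comparison of upper percentiles of precipitation or wind speed could address the underestimated extremes.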

Models

Pangu
FourCastNet
Deep Learning Weather Prediction (DLWP)

The possibilities with Earth2Studio for ML modelling

E2S was set up and running on multiple laptops and on Google Colab within an hour. Test cases then ran fairly quickly (under 20 minutes) with minimal compute power. This makes it a powerful environment for model testing.

Model Benchmarking

Compare Earth2Studio ML models (like GraphCast) with traditional forecasts (e.g. GFS, ECMWF) across recent weather events - this is some of what we have explored.

Region Specific testing

Focus on specific geographic zones, such as the tropics or polar regions, to assess where ML models perform well or need improvement.

Exploring Uncertainty

Investigate ensemble outputs or stochastic models to visualise and quantify forecast uncertainty.

Figure: GFS_FX (Earth2Studio), 2 m temperature
Figure: GFS_FX (Earth2Studio), geopotential height at 500 hPa
Figure: DLWP model (Earth2Studio), 2 m temperature
Figure: DLWP model (Earth2Studio), geopotential height at 500 hPa
Figure: FourCastNet model (Earth2Studio), 2 m temperature
Figure: FourCastNet model (Earth2Studio), geopotential height at 500 hPa

ML on GPU vs. CPU

For the first experiment we ran on Casper, we found that the GPU (left) and CPU (right) runs of the GraphCast model exhibit very different behaviors.

This could be because GPU and CPU treat numerical precision differently.
It could also be related to the ML model architecture.
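
As a generic illustration of the precision hypothesis (not a diagnosis of the GraphCast runs), the snippet below shows that single-precision addition depends on the order of operations. Different hardware and libraries may reduce sums in different orders, so small differences of this kind can appear and may then be amplified over many model steps.

```python
import numpy as np

big = np.float32(1e8)
small = np.float32(1.0)

# In float32 the spacing between representable numbers near 1e8 is 8,
# so adding 1 to 1e8 is rounded away.
left_to_right = (big + small) - big   # 0.0: small is lost
reordered = (big - big) + small       # 1.0: small survives

# The same expressions in float64 agree, because 1e8 + 1 is exactly representable.
left_to_right_64 = (np.float64(1e8) + np.float64(1.0)) - np.float64(1e8)
reordered_64 = (np.float64(1e8) - np.float64(1e8)) + np.float64(1.0)

print(left_to_right, reordered)        # prints 0.0 1.0
print(left_to_right_64, reordered_64)  # prints 1.0 1.0
```

A practical check along these lines would be to compare the GPU and CPU outputs after a single model step, where any difference should still be close to rounding error.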

For SFNO, there were small differences between GPU (left) and CPU (right). Pangu could only run on GPU.

Hakim and Masanam (2024) experiments

These experiments perturb a steady state and use tendency reversal to isolate how the perturbation propagates and amplifies over time. An unperturbed run of the model remains in an (approximate) steady state.

Extratropical cyclone
Tropical heating

Extratropical Cyclone

The first set of experiments replicates the Hakim and Masanam (2024) methodology: a small perturbation that induces a baroclinic wave is added at the upstream end of the Pacific storm track, then the model is integrated forward (here 100 timesteps, i.e. 25 days) with the steady-state (unperturbed) tendency subtracted at each timestep.
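
The integration loop described above can be sketched as follows, assuming the model is available as a function that maps one state to the next. The toy model, the function names and the array sizes are hypothetical stand-ins; a real experiment would wrap the ML model's one-step forecast instead.

```python
import numpy as np

def run_with_tendency_reversal(model_step, base_state, perturbation, n_steps):
    """Integrate a perturbed state while subtracting the one-step tendency
    of the unperturbed base state, and return the anomalies from the base state."""
    # Tendency the model produces from the base state over one step.
    steady_tendency = model_step(base_state) - base_state
    state = base_state + perturbation
    anomalies = [state - base_state]
    for _ in range(n_steps):
        # Removing the base-state tendency keeps an unperturbed run steady.
        state = model_step(state) - steady_tendency
        anomalies.append(state - base_state)
    return np.stack(anomalies)

def toy_step(x):
    """Hypothetical stand-in for one model step: shift, damp and force the state."""
    return 0.9 * np.roll(x, 1) + 1.0

base = np.linspace(0.0, 1.0, 8)
pert = np.zeros(8)
pert[0] = 1e-3

perturbed = run_with_tendency_reversal(toy_step, base, pert, n_steps=5)
unperturbed = run_with_tendency_reversal(toy_step, base, np.zeros(8), n_steps=5)
```

In the toy case the unperturbed anomalies stay at zero (up to rounding), while the perturbed anomalies show the initial bump being shifted and damped, which is the signal the experiments isolate.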

These preliminary results suggest that the baroclinic wave developing in Pangu-Weather is weaker than that in SFNO. Compared with the original Hakim and Masanam (2024) figure for this experiment, the Pangu-Weather disturbance here also appears somewhat weaker.

While the signal of the baroclinic wave strongly suggests that both models internalize atmospheric dynamics, the growing instabilities outside of the perturbed region are also possibly meaningful. The instability that develops in the extratropics of both hemispheres in SFNO is of roughly wavenumber 4-6, while Pangu appears to develop a southern hemispheric high with a smaller rotating low nearby. More analysis is needed before drawing conclusions from results outside the perturbed region.

Weakly propagating for first week of simulation.

Strongly propagating within first timesteps.

Other Stuff - E2Studio

Simple perturbation experiments on SFNO and FCN using a zonal mean flow. We apply tendency reversion and show the difference between the initial state and the model state at each forward timestep.

Perturbed mid-latitudinal zonal flow w/ tendency reversion using E2studio [FCN]

Use Earth2studio
Create a zonally perturbed initial condition
Step forward in time
Use tendency reversion to remove the model state tendency
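
The second step in the list above, creating a zonally perturbed initial condition, could look roughly like the sketch below. It uses plain NumPy rather than the Earth2Studio API, and the Gaussian shape, its location, width and amplitude, the grid and the function names are all assumptions for illustration.

```python
import numpy as np

def zonal_mean_state(field):
    """Replace every latitude row of a (nlat, nlon) field by its zonal mean."""
    return np.broadcast_to(field.mean(axis=-1, keepdims=True), field.shape).copy()

def add_gaussian_perturbation(field, lat_deg, lon_deg, lat0, lon0, width_deg, amplitude):
    """Add a Gaussian bump centred at (lat0, lon0), with distances measured
    in degrees on the latitude-longitude grid (a simplification)."""
    lat2d, lon2d = np.meshgrid(lat_deg, lon_deg, indexing="ij")
    dlon = (lon2d - lon0 + 180.0) % 360.0 - 180.0   # wrap longitude difference
    bump = amplitude * np.exp(-((lat2d - lat0) ** 2 + dlon ** 2) / (2.0 * width_deg ** 2))
    return field + bump

lat_deg = np.linspace(-90.0, 90.0, 73)
lon_deg = np.arange(0.0, 360.0, 2.5)

# Hypothetical zonal wind with a midlatitude jet plus some zonal variation.
u_full = 30.0 * np.sin(2 * np.deg2rad(lat_deg))[:, None] ** 2 \
    + 2.0 * np.cos(3 * np.deg2rad(lon_deg))[None, :]

u_base = zonal_mean_state(u_full)
u_init = add_gaussian_perturbation(u_base, lat_deg, lon_deg,
                                   lat0=45.0, lon0=180.0, width_deg=10.0, amplitude=5.0)
```

The field u_init would then be written into the model's input state in place of the original zonal wind before stepping forward.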

Perturbed mid-latitudinal zonal flow w/ tendency reversion using E2studio [SFNO]

Use Earth2studio
Create a zonally perturbed initial condition
Step forward in time
Use tendency reversion to remove the model state tendency
