Cloud Resolving Modeling on Exascale Computers
Abstract:
I will give an overview of the Energy Exascale Earth System Model (E3SM) project's work porting our atmosphere components to GPUs to accelerate cloud resolving global atmospheric simulations. In order to support DOE's upcoming GPU-based Exascale systems, we have rewritten our atmosphere component model in C++, with Kokkos used to abstract the execution model for on-node parallelism. We have found the C++ approach to be more robust and better supported than a Fortran+directives approach across several different GPUs (NVIDIA, AMD and Intel). For our C++/Kokkos code, we have also worked to maintain Fortran-level performance on CPU systems through the use of "packs" to ensure proper vectorization. We refer to the cloud resolving configuration of the E3SM model as SCREAM, the "Simple Cloud Resolving E3SM Atmosphere Model". It is a full-featured global atmospheric circulation model with state-of-the-art parameterizations for microphysics (P3), moist turbulence (SHOC) and radiation (RRTMGP). SCREAM's nonhydrostatic dynamical core (HOMMEXX) uses a vertically Lagrangian method, a spectral element based horizontal discretization, HEVI IMEX timestepping and a conservative high-CFL semi-Lagrangian transport method. All components of SCREAM have now been ported to C++/Kokkos. I'll describe the performance of the C++ code compared to the original Fortran code, as well as a comparison between various GPUs and CPUs. Our fastest results were obtained on DOE's first Exascale system, Frontier, using 32,000 AMD MI250X GPUs and obtaining atmosphere component speeds greater than 1 SYPD (simulated year per day) at 3.25 km resolution.
Bio:
Mark Taylor specializes in numerical methods for parallel computing and atmospheric flows. He currently serves as Chief Computational Scientist for the DOE's Energy Exascale Earth System Model (E3SM) project. He led the development of the spectral element based dynamical core used in E3SM's atmospheric model. Mark received his Ph.D. from New York University's Courant Institute of Mathematical Sciences in 1992. He joined Sandia National Laboratories in 2004 and was promoted to Distinguished Member of the Technical Staff in 2018. In 2014 he was awarded (with Drs. David Bader and William Collins) The Secretary of Energy Achievement Award for his work unifying the Department of Energy's climate modeling research community, enabling the development of high-resolution fully-coupled climate-system simulations. He is currently a member of the Community Earth System Model's Scientific Steering Committee.
Summary:
E3SM:
8 DOE labs and universities
~50 FTEs spread over more than 100 staff
DOE's effort started in the 1990s, working with the Community Earth System Model (CESM)
Collaborated with NCAR on CESM until 2015
Since 2015, DOE's work has centered on E3SM, which targets DOE's climate questions and exascale computers
Components:
Dynamical Core: HOMMEXX: https://climatemodeling.science.energy.gov/research-highlights/hommexx-10-performance-portable-atmospheric-dynamical-core-energy-exascale
Core: Flux coupler
Land: ELM/MOSART
Land Ice: MALI (MPAS-based)
Atmosphere EAM/EAMxx: https://climatemodeling.science.energy.gov/news/introduction-eamxx-and-its-superior-gpu-performance
Sea Ice: MPAS: https://e3sm.org/mpas-6-0/
Ocean: MPAS: https://e3sm.org/mpas-6-0/
Atmosphere model:
Dynamical core
Sets the global grid and numerical methods
E3SM: Cubed-sphere grid
More structured grid for which it is easier to find a stable numerical method
Cubes are also a good match for GPUs and similar block-structured computational hardware
Column physics (each vertical column is updated independently; see the sketch below)
Spectral element discretization
Static refinement
Cube-shaped grids at different resolutions with scaled triangle grids connecting them
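A minimal sketch of this split (illustrative names and sizes, not E3SM code): the dynamical core couples columns horizontally on the spectral-element grid, while the column physics updates each vertical column independently, which is why the physics maps so naturally onto GPUs.

#include <vector>

struct State {
  int ncol = 1024, nlev = 72;                // illustrative sizes
  std::vector<double> T;                     // temperature, ncol x nlev, flattened
  State() : T(ncol * nlev, 273.0) {}
};

// Grid-wide, communication-heavy step (spectral elements, transport, ...).
void dynamics_step(State&) {}

// Stand-in for microphysics/turbulence/radiation acting on one column.
void column_physics(State& s, int icol) {
  for (int k = 0; k < s.nlev; ++k)
    s.T[icol * s.nlev + k] += 0.0;
}

void atmosphere_step(State& s) {
  dynamics_step(s);
  for (int icol = 0; icol < s.ncol; ++icol)  // independent per column: easy to parallelize
    column_physics(s, icol);
}

int main() { State s; atmosphere_step(s); return 0; }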
Scales:
100km: CMIP6
64 simulated years per day on 85 CPU nodes
25km: Regionally refined
Clouds captured via sub-grid-scale parameterizations tuned through free parameters
3.25km: Storm resolving/cloud permitting (current focus)
1 simulated year per day on 32k GPUs
1km: Full cloud resolving (next goal; see the rough cost-scaling note below)
Small-scale regional models can be run at this scale today
Demonstrate the increased accuracy from this scale
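A rough back-of-envelope note (my own estimate, not a figure from the talk) on why each jump in resolution is so costly: refining the horizontal grid by a factor r multiplies the number of columns by roughly r^2 and shrinks the stable timestep by roughly r, so cost per simulated year grows like r^3 (holding vertical resolution fixed). For example,
\[
r = \frac{100\ \text{km}}{3.25\ \text{km}} \approx 31,
\qquad
\frac{\text{cost}_{3.25\,\text{km}}}{\text{cost}_{100\,\text{km}}} \sim r^{3} \approx 3 \times 10^{4}.
\]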
SCREAM: Cloud resolving model
Features:
Resolved-scale fluid dynamics
Microphysical processes
Aerosols are prescribed
No convection parameterization (deep convection is explicitly resolved)
Simulations are now accurate enough to be directly compared to satellite observations
E.g., they capture the diurnal cycle of precipitation (when rain occurs during the day)
Exascale
Strategy focuses on GPUs
2022: NERSC Perlmutter (8 MW)
6k NVIDIA GPUs
3k CPU-only nodes
2023: OLCF Frontier (30 MW)
38k AMD MI250X GPUs
1 CPU + 4 GPUs per node
2024: ALCF Aurora (40MW)
64k Intel Data Center GPU Max GPUs
2 CPUs + 6 GPUs per node
GPUs are widely available
Easier way to hit LINPACK exascale
Takes a lot of work to port climate models to GPUs
CPUs are easier to work with
Would be fine even if CPUs dropped cache coherency, since MPI is used for communication
Hybrid hardware mapping
Land: CPU (low-resolution)
Cloud-resolving atmosphere: GPUs (high-resolution)
Ocean: CPU (low-resolution today, will increase resolution in future and port to GPUs)
Programming Models
Used by E3SM & SCREAM:
C++ with parallel arrays (Kokkos or YAKL)
Support was available very quickly on new hardware
Others:
Fortran with OpenACC and OpenMP
A less mature model with weaker vendor support
However, a lot of legacy code is still in Fortran
Domain-specific languages
Too niche
Staff not sufficiently familiar with them
Kokkos: C++ library that provides an abstraction layer around on-node parallelism code
Backends for CUDA, HIP, OpenMP, serial, Pthreads, etc.
Well supported; an easy way to use new hardware (minimal example below)
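A minimal Kokkos example of the per-column physics pattern (assumes a Kokkos installation; names and sizes are illustrative, not EAMxx code). The same source runs on CUDA, HIP, OpenMP or serial backends depending on how Kokkos was configured at build time.

#include <Kokkos_Core.hpp>

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    const int ncol = 1024, nlev = 72;            // illustrative sizes
    Kokkos::View<double**> T("T", ncol, nlev);   // lives in device memory on GPU builds

    // One work item per column; the backend is chosen when Kokkos is built,
    // not in the application code.
    Kokkos::parallel_for("column_physics", ncol, KOKKOS_LAMBDA(const int icol) {
      for (int k = 0; k < nlev; ++k)
        T(icol, k) += 0.1;                       // stand-in for a physics update
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}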
Downside of C++
Code is more complicated
Opportunity for computer scientists to help with an easier algorithm specification
Challenging to get C++ code to vectorize on CPUs (hence the "packs" mentioned in the abstract; see the sketch below)
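The "packs" mentioned in the abstract are one answer to the vectorization problem: bundle a small fixed number of values into a struct with element-wise operators, sized to the CPU vector width, so loops over packs vectorize cleanly (on GPU builds the pack size is typically 1). A minimal sketch of the idea, hypothetical and much simpler than the actual SCREAM implementation:

#include <cstdio>

template <typename T, int N>
struct Pack {
  T d[N];
  Pack operator+(const Pack& o) const {
    Pack r;
    for (int i = 0; i < N; ++i)
      r.d[i] = d[i] + o.d[i];      // simple fixed-length loop the compiler can vectorize
    return r;
  }
};

int main() {
  Pack<double, 8> a{}, b{};        // pack width matched to the CPU's vector registers
  for (int i = 0; i < 8; ++i) { a.d[i] = i; b.d[i] = 2.0 * i; }
  Pack<double, 8> c = a + b;
  std::printf("%g\n", c.d[7]);     // prints 21
  return 0;
}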
YAKL: Yet Another Kernel Launcher
Simplified version of Kokkos that looks Fortran-y
E.g. Multi-dimensional arrays
Other GPU porting approaches: Fortran on CPU, Fortran/OpenACC, GT4Py (Python-based DSL), PSyclone (DSL), Julia
Useful features to have:
Hierarchical parallelism (see the TeamPolicy sketch below)
Support for load balancing
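Hierarchical parallelism in Kokkos maps well onto the element/point/level structure of a spectral-element dynamical core: one team per element, team threads across the points of the element, vector lanes across vertical levels. A hypothetical sketch (illustrative sizes, not HOMMEXX code):

#include <Kokkos_Core.hpp>

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    const int nelem = 512, npts = 16, nlev = 72;          // illustrative sizes
    Kokkos::View<double***> q("q", nelem, npts, nlev);

    using Policy = Kokkos::TeamPolicy<>;
    using Member = Policy::member_type;

    Kokkos::parallel_for("hierarchical", Policy(nelem, Kokkos::AUTO),
        KOKKOS_LAMBDA(const Member& team) {
      const int ie = team.league_rank();                  // one team per element
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, npts), [&](const int ip) {
        Kokkos::parallel_for(Kokkos::ThreadVectorRange(team, nlev), [&](const int k) {
          q(ie, ip, k) += 1.0;                            // stand-in update
        });
      });
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}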
Performance
Analysis compares GPU vs CPU-only compute nodes
GPU nodes
Have multiple GPUs and use a lot more power than pure-CPU nodes
Cost more than 4x as much as CPU-only nodes
Whole system power consumption of GPU-based clusters is 3x CPU based clusters
Dynamical core: HOMMEXX
GPU nodes perform best when given many spectral elements per node (more work); relatively worse with fewer
Both GPUs and CPUs are improving steadily over time
Power-efficient hardware (e.g., on Fugaku) may be slower per node, but more such nodes can be packed into a given power envelope
This should work well for cloud resolving models that have a lot of work
Still early for this hardware family
Performance portability
C++ code is slightly faster than Fortran
Suspect this is because in Fortran the parallelization/vectorization is left to the compiler, while in C++ it is hand-engineered (compilers auto-parallelize C++ poorly)
Newer hardware is improving performance
GPU nodes are faster than CPU nodes (but use more power)
At 3 km resolution, performance scales linearly up to ~4k nodes, then efficiency drops
A 1 km resolution model should scale linearly out to ~30k nodes
Digital twins for Climate Science (ML)
Should be possible to train a digital twin on output from simulation-based models
The twin would run much faster than the original model
Many approaches:
Train ML from observations?
Train ML from simulations?
Run low-res model + ML-based bias correction?
Use ML for choosing free parameters?
Global 1km models are expensive, but one can:
Run models regionally, or
Run coarse global models where local regions are run at higher resolution