Cloud Resolving Modeling on Exascale Computers
Abstract:
I will give an overview of the Energy Exascale Earth System Model (E3SM) project's work porting our atmosphere components to GPUs to accelerate cloud resolving global atmospheric simulations. In order to support DOE's upcoming GPU-based Exascale systems, we have rewritten our atmosphere component model in C++, with Kokkos used to abstract the execution model for on-node parallelism. We have found the C++ approach to be more robust and better supported than a Fortran+directives approach across several different GPUs (NVIDIA, AMD and Intel). For our C++/Kokkos code, we have also worked to maintain Fortran-level performance on CPU systems through the use of "packs" to ensure proper vectorization. We refer to the cloud resolving configuration of the E3SM model as SCREAM, the "Simple Cloud Resolving E3SM Atmosphere Model". It is a full-featured global atmospheric circulation model with state-of-the-art parameterizations for microphysics (P3), moist turbulence (SHOC) and radiation (RRTMGP). SCREAM's nonhydrostatic dynamical core (HOMMEXX) uses a vertically Lagrangian method, a spectral element based horizontal discretization, HEVI IMEX timestepping and a conservative high-CFL semi-Lagrangian transport method. All components of SCREAM have now been ported to C++/Kokkos. I'll describe the performance of the C++ code compared to the original Fortran code, as well as a comparison between various GPUs and CPUs. Our fastest results were obtained on DOE's first Exascale system, Frontier, using 32,000 AMD MI250X GPUs and obtaining atmosphere component speeds greater than 1 SYPD (simulated year per day) at 3.25 km resolution.
Bio:
Mark Taylor specializes in numerical methods for parallel computing and atmospheric flows. He currently serves as Chief Computational Scientist for the DOE's Energy Exascale Earth System Model (E3SM) project. He led the development of the spectral element based dynamical core used in E3SM's atmospheric model. Mark received his Ph.D. from New York University's Courant Institute of Mathematical Sciences in 1992. He joined Sandia National Laboratories in 2004 and was promoted to Distinguished Member of the Technical Staff in 2018. In 2014 he was awarded (with Drs. David Bader and William Collins) The Secretary of Energy Achievement Award for his work unifying the Department of Energy's climate modeling research community, enabling the development of high-resolution fully-coupled climate-system simulations. He is currently a member of the Community Earth System Model's Scientific Steering Committee.
Summary:
E3SM:
8 DOE labs and universities
~50 FTEs spread over more than 100 staff
DOE's effort started in the 1990s, working with the Community Earth System Model (CESM)
Collaborated with NCAR on CESM until 2015
Since 2015, DOE's work has centered on E3SM, which targets DOE's climate questions and exascale computers
Components:
Dynamical Core: HOMMEXX: https://climatemodeling.science.energy.gov/research-highlights/hommexx-10-performance-portable-atmospheric-dynamical-core-energy-exascale
Core: Flux coupler
Land: ELM/MOSART
Land Ice: MALI (MPAS-based)
Atmosphere EAM/EAMxx: https://climatemodeling.science.energy.gov/news/introduction-eamxx-and-its-superior-gpu-performance
Sea Ice: MPAS: https://e3sm.org/mpas-6-0/
Ocean: MPAS: https://e3sm.org/mpas-6-0/
Atmosphere model:
Dynamical core
Sets the global grid and numerical methods
E3SM: Cubed-sphere grid
More structured grid for which it is easier to find a stable numerical method
Cubes are also a good match for GPUs and similar block-structured computational hardware
Column physics (each vertical column is updated independently; see the sketch below)
Spectral element discretization
Static refinement
Cube-shaped grids at different resolutions with scaled triangle grids connecting them
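A minimal sketch of this split (illustrative names and sizes, not E3SM code): the dynamical core couples columns horizontally on the spectral-element grid, while the column physics updates each vertical column independently, which is why the physics maps so naturally onto GPUs.

#include <vector>

struct State {
  int ncol = 1024, nlev = 72;                // illustrative sizes
  std::vector<double> T;                     // temperature, ncol x nlev, flattened
  State() : T(ncol * nlev, 273.0) {}
};

// Grid-wide, communication-heavy step (spectral elements, transport, ...).
void dynamics_step(State&) {}

// Stand-in for microphysics/turbulence/radiation acting on one column.
void column_physics(State& s, int icol) {
  for (int k = 0; k < s.nlev; ++k)
    s.T[icol * s.nlev + k] += 0.0;
}

void atmosphere_step(State& s) {
  dynamics_step(s);
  for (int icol = 0; icol < s.ncol; ++icol)  // independent per column: easy to parallelize
    column_physics(s, icol);
}

int main() { State s; atmosphere_step(s); return 0; }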
Scales:
100km: CMIP6
64 simulated years per day on 85 CPU nodes
25km: Regionally refined
Clouds captured via sub-grid-scale parameterizations tuned through free parameters
3.25km: Storm resolving/cloud permitting (current focus)
1 simulated year per day on 32k GPUs
1km: Full cloud resolving (next goal; see the rough cost-scaling note below)
Small-scale regional models can be run at this scale today
Demonstrate the increased accuracy from this scale
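A rough back-of-envelope note (my own estimate, not a figure from the talk) on why each jump in resolution is so costly: refining the horizontal grid by a factor r multiplies the number of columns by roughly r^2 and shrinks the stable timestep by roughly r, so cost per simulated year grows like r^3 (holding vertical resolution fixed). For example,
\[
r = \frac{100\ \text{km}}{3.25\ \text{km}} \approx 31,
\qquad
\frac{\text{cost}_{3.25\,\text{km}}}{\text{cost}_{100\,\text{km}}} \sim r^{3} \approx 3 \times 10^{4}.
\]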
SCREAM: Cloud resolving model
Features:
Resolved-scale fluid dynamics
Microphysical processes
Aerosols are prescribed
No convection parameterization (deep convection is explicitly resolved)
Simulations are now accurate enough to be directly compared to satellite observations
E.g., they capture the diurnal cycle of precipitation (when rain occurs during the day)
Exascale
Strategy focuses on GPUs
2022: NERSC Perlmutter (8 MW)
6k NVIDIA GPUs
3k CPU-only nodes
2023: OLCF Frontier (30 MW)
38k AMD MI250X GPUs
1 CPU + 4 GPUs per node
2024: ALCF Aurora (40MW)
64k Intel Data Center GPU Max GPUs
2 CPUs + 6 GPUs per node
GPUs are widely available
Easier way to hit LINPACK exascale
Takes a lot of work to port climate models to GPUs
CPUs are easier to work with
Would be fine even if CPUs dropped cache coherency, since MPI is used for communication
Hybrid hardware mapping
Land: CPU (low-resolution)
Cloud-resolving atmosphere: GPUs (high-resolution)
Ocean: CPU (low-resolution today, will increase resolution in future and port to GPUs)
Programming Models
Used by E3SM & SCREAM:
C++ with parallel arrays (Kokkos or YAKL)
Support was available very quickly on new hardware
Others:
Fortran with OpenACC and OpenMP
A less mature model with weaker vendor support
However, a lot of legacy code is still in Fortran
Domain-specific languages
Too niche
Staff not sufficiently familiar with them
Kokkos: C++ library that provides an abstraction layer around on-node parallelism code
Backends for CUDA, HIP, OpenMP, serial, Pthreads, etc.
Well supported; an easy way to use new hardware (minimal example below)
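A minimal Kokkos example of the per-column physics pattern (assumes a Kokkos installation; names and sizes are illustrative, not EAMxx code). The same source runs on CUDA, HIP, OpenMP or serial backends depending on how Kokkos was configured at build time.

#include <Kokkos_Core.hpp>

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    const int ncol = 1024, nlev = 72;            // illustrative sizes
    Kokkos::View<double**> T("T", ncol, nlev);   // lives in device memory on GPU builds

    // One work item per column; the backend is chosen when Kokkos is built,
    // not in the application code.
    Kokkos::parallel_for("column_physics", ncol, KOKKOS_LAMBDA(const int icol) {
      for (int k = 0; k < nlev; ++k)
        T(icol, k) += 0.1;                       // stand-in for a physics update
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}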
Downside of C++
Code is more complicated
Opportunity for computer scientists to help with an easier algorithm specification
Challenging to get C++ code to vectorize on CPUs (hence the "packs" mentioned in the abstract; see the sketch below)
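The "packs" mentioned in the abstract are one answer to the vectorization problem: bundle a small fixed number of values into a struct with element-wise operators, sized to the CPU vector width, so loops over packs vectorize cleanly (on GPU builds the pack size is typically 1). A minimal sketch of the idea, hypothetical and much simpler than the actual SCREAM implementation:

#include <cstdio>

template <typename T, int N>
struct Pack {
  T d[N];
  Pack operator+(const Pack& o) const {
    Pack r;
    for (int i = 0; i < N; ++i)
      r.d[i] = d[i] + o.d[i];      // simple fixed-length loop the compiler can vectorize
    return r;
  }
};

int main() {
  Pack<double, 8> a{}, b{};        // pack width matched to the CPU's vector registers
  for (int i = 0; i < 8; ++i) { a.d[i] = i; b.d[i] = 2.0 * i; }
  Pack<double, 8> c = a + b;
  std::printf("%g\n", c.d[7]);     // prints 21
  return 0;
}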
YAKL: Yet Another Kernel Launcher
Simplified version of Kokkos that looks Fortran-y
E.g. Multi-dimensional arrays
Other GPU porting approaches: Fortran on CPU, Fortran/OpenACC, GT4Py (Python-based DSL), PSyclone (DSL), Julia
Useful features to have:
Hierarchical parallelism (see the TeamPolicy sketch below)
Support for load balancing
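Hierarchical parallelism in Kokkos maps well onto the element/point/level structure of a spectral-element dynamical core: one team per element, team threads across the points of the element, vector lanes across vertical levels. A hypothetical sketch (illustrative sizes, not HOMMEXX code):

#include <Kokkos_Core.hpp>

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    const int nelem = 512, npts = 16, nlev = 72;          // illustrative sizes
    Kokkos::View<double***> q("q", nelem, npts, nlev);

    using Policy = Kokkos::TeamPolicy<>;
    using Member = Policy::member_type;

    Kokkos::parallel_for("hierarchical", Policy(nelem, Kokkos::AUTO),
        KOKKOS_LAMBDA(const Member& team) {
      const int ie = team.league_rank();                  // one team per element
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, npts), [&](const int ip) {
        Kokkos::parallel_for(Kokkos::ThreadVectorRange(team, nlev), [&](const int k) {
          q(ie, ip, k) += 1.0;                            // stand-in update
        });
      });
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}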
Performance
Analysis compares GPU vs CPU-only compute nodes
GPU nodes
Have multiple GPUs and use a lot more power than pure-CPU nodes
Cost more than 4x as much as CPU-only nodes
Whole system power consumption of GPU-based clusters is 3x CPU based clusters
Dynamical core: HOMMEXX
GPU nodes perform best when given many spectral elements per node (more work); relatively worse with fewer
Both GPUs and CPUs are improving steadily over time
Power-efficient hardware (e.g., on Fugaku) may be slower per node, but more such nodes can be packed into a given power envelope
This should work well for cloud resolving models that have a lot of work
Still early for this hardware family
Performance portability
C++ code is slightly faster than Fortran
Suspect this is because in Fortran the parallelization/vectorization is left to the compiler, while in C++ it is hand-engineered (compilers auto-parallelize C++ poorly)
Newer hardware is improving performance
GPU nodes are faster than CPU nodes (but use more power)
At 3 km resolution, performance scales linearly up to ~4k nodes, then efficiency drops
A 1 km resolution model should scale linearly out to ~30k nodes
Digital twins for Climate Science (ML)
Should be possible to train a digital twin on output from simulation-based models
The twin would run much faster than the original model
Many approaches:
Train ML from observations?
Train ML from simulations?
Run low-res model + ML-based bias correction?
Use ML for choosing free parameters?
Global 1km models are expensive, but one can:
Run models regionally, or
Run coarse global models where local regions are run at higher resolution