OLCF-5: HPE Frontier Supercomputer

Introduction:


Hewlett Packard Enterprise's Frontier (OLCF-5) is the world's first exascale supercomputer, hosted at the Oak Ridge Leadership Computing Facility in Tennessee, USA. It is based on the HPE Cray EX architecture, is built from AMD CPUs and GPUs, and is the successor to Summit (OLCF-4). On its debut in 2022 it ranked as the world's fastest supercomputer. It is maintained and operated by Oak Ridge National Laboratory together with the US Department of Energy.


Frontier's total development budget is $600 million, and the system occupies 7,300 sq ft of floor space. It has also topped the Green500 list, delivering 62.68 gigaflops per watt, which made it the most energy-efficient supercomputer ranked at the time.


Motive:


The Frontier system uses a 4:1 GPU-to-CPU ratio and top-of-the-line hardware to deliver strong performance for high-performance computing and AI workloads at exascale. To make this performance easy for developers to use, ORNL and Cray partnered with AMD to co-design and develop enhanced GPU programming tools that integrate tightly with the existing AMD ROCm open computing platform. In addition, Frontier supports many of the same compilers, programming models, and tools that have been available to OLCF users on both the Titan and Summit supercomputers.


Design:


The Frontier system consists of more than 100 Cray Shasta cabinets filled with high-density compute blades, interconnected by HPE Slingshot 64-port switches that each provide 12.8 terabits/second of bandwidth. Groups of blades are linked in a dragonfly topology with at most three hops between any two nodes. Cabling is either optical or copper, chosen to minimize cable length, and totals 145 km. Frontier is liquid-cooled, allowing five times the density of air-cooled architectures.
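The three-hop bound of a dragonfly topology can be checked on a small model. The sketch below is a toy, not Frontier's actual wiring: the group count, the number of switches per group, and the rule for placing global links are all assumptions chosen for illustration. Every group is fully connected internally, every pair of groups shares exactly one global link, and a breadth-first search measures the worst-case number of switch-to-switch hops.

```python
from collections import deque
from itertools import combinations


def build_dragonfly(groups, switches_per_group):
    """Return an adjacency map for a toy dragonfly network of switches.

    Switches are (group, index) tuples. The sizes and the global-link
    placement rule are illustrative assumptions, not Frontier's layout.
    """
    nodes = [(g, s) for g in range(groups) for s in range(switches_per_group)]
    adj = {node: set() for node in nodes}

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)

    # Local links: all-to-all inside each group.
    for g in range(groups):
        for s1, s2 in combinations(range(switches_per_group), 2):
            link((g, s1), (g, s2))

    # Global links: exactly one per pair of groups, spread over the switches.
    for g1, g2 in combinations(range(groups), 2):
        link((g1, g2 % switches_per_group), (g2, g1 % switches_per_group))

    return adj


def max_hops(adj):
    """Largest shortest-path length between any two switches (BFS from each)."""
    worst = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        assert len(dist) == len(adj), "network must be connected"
        worst = max(worst, max(dist.values()))
    return worst


if __name__ == "__main__":
    network = build_dragonfly(groups=9, switches_per_group=4)
    print("switches:", len(network), "worst-case hops:", max_hops(network))
```

For this toy network of 36 switches the search reports a worst case of three hops: one local hop to the switch that owns the right global link, one global hop to the destination group, and one local hop to the destination switch.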

*The Linpack benchmark measures a computer's floating-point performance by timing how long it takes to solve a dense system of simultaneous linear equations in a standard form.
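The idea behind the benchmark can be sketched in a few lines. This is a minimal pure-Python illustration, not the HPL code that is run on real systems: it builds a random dense system whose solution is known, solves it by Gaussian elimination with partial pivoting, and converts the elapsed time into a floating-point rate using the conventional operation count of 2n³/3 + 2n². The function names and the problem size are assumptions made for this example.

```python
import random
import time


def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def linpack_like(n, seed=0):
    """Return (flop rate in flop/s, max error) for one n-by-n solve."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [sum(row) for row in a]  # chosen so the solution is all ones
    start = time.perf_counter()
    x = solve(a, b)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    flops = 2.0 * n**3 / 3.0 + 2.0 * n**2  # conventional Linpack count
    error = max(abs(v - 1.0) for v in x)
    return flops / elapsed, error


if __name__ == "__main__":
    rate, error = linpack_like(100)
    print(f"{rate / 1e6:.2f} Mflop/s, max error {error:.2e}")
```

The reported rate depends entirely on the machine and interpreter, so it is only meaningful as a relative number; the error check confirms that the timed solve actually produced the right answer.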


HIP Support:


HIP is a C++ runtime API that allows developers to write portable code that runs on both AMD and NVIDIA GPUs. It is essentially a thin wrapper over whichever platform is installed on the system, CUDA or ROCm. In addition, HIP provides porting tools that help convert CUDA code to the HIP layer, typically with little or no loss of performance compared to the original CUDA application.
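Much of a CUDA-to-HIP port is a mechanical rename, because most HIP runtime calls mirror their CUDA counterparts. The sketch below illustrates that idea with a toy translator. It is not the real porting tooling (AMD ships `hipify-perl` and `hipify-clang` for that), the function name `toy_hipify` is hypothetical, and the rename table covers only a handful of well-known identifiers.

```python
import re

# A few CUDA runtime names and their HIP equivalents. The real tools
# cover far more of the API; this table is deliberately tiny.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaError_t": "hipError_t",
    "cudaSuccess": "hipSuccess",
}

# Match whole identifiers only, trying longer names first so that
# cudaMemcpyHostToDevice is not cut short at cudaMemcpy.
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(name) for name in sorted(CUDA_TO_HIP, key=len, reverse=True))
    + r")\b"
)


def toy_hipify(source):
    """Rename known CUDA identifiers in a source string to HIP ones."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(0)], source)


if __name__ == "__main__":
    cuda_source = (
        "#include <cuda_runtime.h>\n"
        "cudaMalloc(&d_x, bytes);\n"
        "cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);\n"
        "cudaFree(d_x);\n"
    )
    print(toy_hipify(cuda_source))
```

Identifiers that merely contain a CUDA name, such as a user function called `my_cudaFree_wrapper`, are left alone because the pattern matches whole words only. A real port also has to handle kernel launches, libraries, and device-specific tuning, which a rename table cannot do.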


Machine Learning:


Frontier is tuned to run AI workloads. The vendors provide a suite of optimized, scalable data science tools. In the future, widely used frameworks such as TensorFlow, BigDL for Apache Spark, PyTorch, MXNet, Keras, MLlib, scikit-learn, and OpenCV will be available in addition to the Cray Programming Environment Deep Learning Plugin. The plugin improves the algorithms and performance of deep neural network training and provides support for Apache Spark, GraphX, MLlib, the Alchemist framework, and pbdR.