Session 1: Applications
Data-related Optimizations for Various Applications on the Sunway Supercomputing System
Lin Gan, Tsinghua University, and the National Supercomputing Center in Wuxi
As the first computing system in the world to provide a peak performance
of over 100 PFlops and more than 10 million computing cores, the
Sunway TaihuLight supercomputer has been widely applied to many major
applications, such as climate modelling, geophysical exploration, and life
science. The system's main processor, the SW26010, is a homegrown
many-core CPU that contains 260 cores as well as a hierarchy of
different memories. In order to maximise computing efficiency and to
scale up performance, programmers have to consider as many optimisation
methods as possible when dealing with the various kernels from different
applications. Data-related optimisations that fit well with the
many-core architecture are therefore important and necessary.
This talk will cover a set of novel optimisation techniques proposed to
maximise data efficiency, such as tuning data locality and reuse, and
making judicious trade-offs in data movement. Many applications have
benefited from these methods, achieving significant performance
improvements and inspiring scalability.
Benchmarking Stencil on GPUs
Naoya Maruyama, Lawrence Livermore National Laboratory
Stencil computation is one of the most common patterns in scientific computing.
Its low arithmetic intensity often requires extensive optimization for
reducing memory access overheads. We have previously reported a case
study using older generations of NVIDIA GPUs, which presented several
effective (and not very effective) optimization techniques. In this
talk, we will update the performance study with newer generations of
GPUs and discuss key architectural differences in those accelerators
from the perspective of memory-bound computations.
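To illustrate why stencils are typically memory-bound, here is a minimal sketch (in Python rather than CUDA, with illustrative cost counts that assume no cache reuse):

```python
def jacobi_1d(u):
    """One sweep of a 3-point Jacobi stencil. Per interior point:
    roughly 3 flops (2 adds, 1 multiply) against 3 loads + 1 store."""
    return [u[0]] + [(u[i - 1] + u[i] + u[i + 1]) / 3.0
                     for i in range(1, len(u) - 1)] + [u[-1]]

def arithmetic_intensity(flops_per_point, bytes_per_point):
    # flops per byte moved: the ratio that decides memory-boundness
    return flops_per_point / bytes_per_point

# 3 flops vs. 4 doubles moved (32 bytes), assuming no cache reuse:
print(arithmetic_intensity(3, 32))  # 0.09375 flops/byte
```

With modern GPUs sustaining far more flops per byte of memory bandwidth than this, such a kernel sits well below the machine balance point, which is why the optimizations discussed target memory traffic rather than arithmetic.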
Design and Optimization of a Multi-stencil CFD Solver
Aparna Chandramowlishwaran, University of California, Irvine
Solving the Navier-Stokes equations is a challenging multi-scale, multi-physics
problem. Our solver is a realistic structured mesh finite-volume code
and presents many challenges of real applications. Specifically, our
solver consists of multi-stencils with different computational
intensities and memory access patterns. While the challenges of
single-stencil kernels still apply to this problem, solving a
multi-stencil kernel poses new challenges, since each stencil has a
unique pattern and requires access to different neighbors. In this talk,
I will outline how multiple stencils impact optimization and discuss
techniques for improving locality and parallelism by trading off
redundant work. We also ask whether CFD applications can be expressed in
stencil DSLs, whether such an implementation can deliver a sufficient
combination of optimizations to compete with hand-tuned code, and what
its limitations are.
Need for data locality in deep learning
Prasanna Balaprakash, Argonne National Laboratory
The potential of machine learning (ML) has prompted hardware vendors to
design and develop new custom hardware (such as GPUs, TPUs, and FPGAs)
to accelerate ML applications. Consequently, the pace of advancement of
these chips has exceeded the ability of ML algorithms to fully utilize
them. In particular, many custom chips provide optimizations for data
locality and they need algorithms with significant data
locality---however, not all machine learning workloads are currently up
to the task. Several ML algorithms do not expose enough locality to
fully exploit the new generation of hardware. In this talk, we will give
a high-level overview of a class of deep learning algorithms that do
not expose data locality and discuss possible strategies to address the
problem.
Blocking Optimization for Tensor Decomposition
Jee Choi, IBM Research
Many social and scientific domains give rise to data with multi-way
relationships that can naturally be represented by tensors, or
multi-dimensional arrays. Decomposing, or factoring, tensors can
reveal latent properties that are otherwise difficult to see. However,
due to the relatively recent rise in popularity of tensor decomposition
in HPC, its performance optimization challenges are poorly understood.
In this presentation, I will explain the steps taken to identify and
isolate the major bottlenecks in tensor decomposition algorithms, and
demonstrate significant speedup over prior state-of-the-art using
various cache blocking mechanisms.
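To make the blocking idea concrete, here is a hypothetical sketch of the MTTKRP kernel (a usual bottleneck of CP decomposition) in plain Python; the bucketing by output mode stands in for the cache-blocking mechanisms discussed in the talk:

```python
def mttkrp(nnz, B, C, I, R):
    """Naive sparse MTTKRP: for each nonzero (i, j, k, val) of the tensor,
    M[i, :] += val * (B[j, :] * C[k, :]). Output rows are hit in
    whatever order the nonzeros happen to be stored."""
    M = [[0.0] * R for _ in range(I)]
    for i, j, k, val in nnz:
        for r in range(R):
            M[i][r] += val * B[j][r] * C[k][r]
    return M

def mttkrp_blocked(nnz, B, C, I, R, bs=2):
    """Same computation, but nonzeros are first bucketed by blocks of the
    output index i, so each block of M rows stays cache-resident."""
    buckets = {}
    for e in nnz:
        buckets.setdefault(e[0] // bs, []).append(e)
    ordered = [e for b in sorted(buckets) for e in buckets[b]]
    return mttkrp(ordered, B, C, I, R)
```

Both variants compute the same result; only the traversal order, and hence the locality of the accesses to M, differs.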
Session 2: Memory Technologies
Exposing the locality of new memory hierarchies to HPC applications
Brice Goglin, Inria Bordeaux
Memory hierarchies are getting deeper and larger, with multiple NUMA nodes and different memory technologies. Many hardware details could be exposed to HPC users to ease their task and data placement decisions. Hardware
characteristics such as latencies and bandwidths, or the path(s)
between cores and each memory bank, could be used as decision criteria.
However, users would rather have simple abstractions to understand this
hardware organization. We will present how we expose such information
in the hwloc tool in a simple and convenient, yet still exhaustive,
manner.
System abstractions to facilitate data movement in supercomputers with deep memory and interconnect hierarchy
Francois Tessier, Argonne National Laboratory
Supercomputers are increasingly being architected with deeper and more
diverse memory subsystems, such as NVRAM, MCDRAM, and network-attached
burst buffers, and with interconnect topologies such as dragonflies and
tori. From an application perspective, this makes it increasingly
complex to fully exploit these system characteristics for data movement.
Our research tackles this problem; one focus is effective abstractions
of the various memory and storage tiers, as well as the network
topology, to ensure scalable and portable performance.
Bandwidth and Object Placement Studies on HBM
Didem Unat, Koç University
High Bandwidth Memory (HBM) in HPC systems allows memory-bound
applications to perform better. Due to its limited size, HBM works in
tandem with DRAM, which acts as the high-capacity memory component. We
study the effect of such heterogeneous systems on HPC applications. In
particular, we study the Intel KNL and its HBM, namely MCDRAM. Through
various benchmarks, we observe that the bandwidth of MCDRAM is roughly
5x that of traditional DDR. With such significant gains, it becomes
important to devise an allocation scheme for the objects in an
application. We also find that the read speed of the different memories
on KNL is higher than the write speed. Therefore, to maximize the
performance of an application, we suggest keeping write-intensive data
in MCDRAM. We propose a placement algorithm which takes into account the
individual read and write reference counts and the sizes of all objects
in the application.
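As an illustrative sketch (not the exact algorithm from the talk), a greedy variant of such a placement might rank objects by write references per byte and fill MCDRAM first:

```python
def place_objects(objs, hbm_capacity):
    """Greedy write-intensity placement. objs is a list of tuples
    (name, size_bytes, read_refs, write_refs); returns name -> memory.
    Write-heavy objects go to MCDRAM first, since writes benefit most."""
    placement = {}
    # Rank by write references per byte (a simple density heuristic).
    ranked = sorted(objs, key=lambda o: o[3] / o[1], reverse=True)
    free = hbm_capacity
    for name, size, reads, writes in ranked:
        if size <= free:
            placement[name] = "MCDRAM"
            free -= size
        else:
            placement[name] = "DDR"
    return placement

objs = [("A", 4, 100, 900), ("B", 8, 500, 50), ("C", 6, 10, 300)]
print(place_objects(objs, 10))  # A and C fit in MCDRAM; B spills to DDR
```

The algorithm proposed in the talk weighs both read and write reference counts; the sketch keeps only the write side to stay short.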
Exploiting Data Locality for Unified CPU/GPU Memory Space using OpenMP
Lingda Li, Brookhaven National Laboratory
To facilitate GPU programming and accelerate applications with large
working sets, recent GPUs support a unified CPU/GPU memory address
space. In such systems, the CPU and GPU can conveniently address each
other's memory, and data is moved between the different memories on
demand.
However, our investigation shows that the current data migration
mechanism is not aware of data locality, resulting in poor performance
and redundant data movement. The upcoming OpenMP 5.0 will include new
data locality features to address the complex memory hierarchy in
today's systems; however, the currently proposed features do not take
unified memory into consideration and cannot address its performance
problems.
To solve this problem, we propose to extend OpenMP data locality features
to improve unified memory performance. The proposed extensions expose
different GPU memory management choices for programmers to exploit GPU
data locality explicitly. In scenarios with complex data locality,
programmers can also pass hints so that the OpenMP compiler and
runtime can make better GPU data management decisions. Preliminary
evaluation shows that our proposal can significantly improve GPU
performance and reduce redundant data movement between CPU and GPU for
benchmarks with large working sets.
Deep Memory Abstractions: Beyond Caches and NUMA Nodes
Swann Perarnau, Argonne National Laboratory
Future platforms are expected to feature several layers of memory, with a mix
of on-package (MCDRAM or HBM), regular DDR and NVRAM resources available
to applications depending on the platform. To use these deep memory
hierarchies efficiently, applications will need to integrate portable
and efficient memory management strategies dealing with placing and
moving data according to computational needs.
Many vendors currently advocate for the hardware and operating system to
abstract these deep memory layers either as new cache levels (KNL cache
mode) or as NUMA topologies (KNL flat mode). These abstractions are
familiar to users and require less development work overall.
Unfortunately, their implementations are costly, prone to errors and
result in set-in-stone policies that might not provide the best
performance for some data access patterns.
In this talk, I will review some of the issues that current abstractions for deep memory
exhibit. I will then advocate instead for the development of
system-level abstractions that explicitly integrate information relevant
to topology, data placement and data migration. Such abstractions might
then be used by programming languages or runtimes to design better
memory management policies transparent to end users.
Chapel's New Adventures in Intra-node Locality
Brad Chamberlain, Cray Inc.
I'll start this talk with a brief review of Chapel's key concepts for
expressing data locality: locales, on-clauses, and domain maps. From
there, I will describe our recent implementation efforts tuning for, and
giving the user control over, locality on current processor
architectures such as those with multiple NUMA domains and the MCDRAM
available on Intel Xeon Phi ("KNL"). As time permits, I will attempt to
shine a light on the interplay between memory locality and the network
interface for PGAS languages like Chapel. I expect to back up my
characterizations with performance measurements from benchmarks and/or
proxy applications as much as possible.
Extreme Heterogeneity and Flexibility Challenges for Data Locality
Andrew A. Chien, University of Chicago
Technology scaling is driving extreme customization (multiple
accelerators) and flexible configurability (memory structure). We will
highlight these trends, and the challenges they represent for
programming and efficiency.
Session 3: Runtime Systems
Kokkos Task-DAG: Memory Management and Locality Challenges Conquered
H. Carter Edwards, Sandia National Laboratories
Kokkos' new task-DAG capability supports the portable parallel execution
of a dynamic, heterogeneous set of fine-grain tasks with a dynamic
directed acyclic graph (DAG) of dependences. Significant memory
management and memory locality challenges had to be conquered for
portability, performance, and usability on attached GPUs. First, tasks
must be scheduled and executed on the GPU's lean runtime without
intervention or interaction with the CPU. Second, a dynamic task-DAG
requires thread-scalable memory management that operates within a
bounded span of GPU memory. Third, the GPU does not provide coherency
among cores' L1 caches. Fourth, algorithms written using the task-DAG
capability must be insulated from all of these challenges.
Asynchronous Task-Based Polar Decomposition on Single Node Manycore Architectures
Hatem Ltaief, KAUST
We introduce the first asynchronous, task-based formulation of the polar
decomposition and its corresponding implementation on manycore
architectures. Based on a new fine-grained formulation of the iterative
QR dynamically-weighted Halley algorithm (QDWH) for the calculation of
the polar decomposition, the novel task-based implementation is capable
of taking advantage of the identity structure of the matrix involved
during the QDWH iterations, which decreases the overall algorithmic
complexity. Furthermore, the artifactual synchronization points have
been weakened compared to previous implementations, unveiling look-ahead
opportunities for better data locality and hardware occupancy. The
overall QDWH-based polar decomposition can then be represented as a
directed acyclic graph (DAG), where nodes represent computational tasks
and edges define the inter-task data dependencies. The StarPU dynamic
runtime system is employed to traverse the DAG, to track the various
data dependencies and to asynchronously schedule the computational tasks
on the underlying hardware resources, resulting in an out-of-order task
scheduling. Benchmarking experiments show significant improvements
against existing state-of-the-art high performance implementations
(i.e., Intel MKL and Elemental) for the polar decomposition on latest
shared-memory vendors' systems (i.e., Intel Haswell/Broadwell/Knights
Landing, NVIDIA K80/P100 GPUs and IBM Power8), while maintaining high
numerical accuracy.
Addressing Data Locality via Adaptive Runtimes and Higher Level Abstractions
Laxmikant (Sanjay) Kale, University of Illinois at Urbana-Champaign
Data locality is increasing in importance in HPC. I will argue, based on
experience with Charm++ and other higher-level abstractions, and on
multiple parallel applications, that (a) the encapsulation provided by
over-decomposed entities promotes locality, often without the need for
hierarchical abstractions; (b) persistence is an important property of
HPC applications that helps in the management of locality, especially
via adaptive RTSs; and (c) macro-dataflow as an intermediate
representation empowers an aRTS to do effective locality-aware
scheduling and data movement. I will also express my views on how, and
which, higher-level abstractions can help improve locality.
Data Locality - Old, New and Future in Legacy Fortran and C++ codes
Martin Berzins, SCI Institute, University of Utah
Data locality in legacy Fortran and C++ code is addressed from two
points of view: rethinking the design of a potential Navy
next-generation weather code that uses legacy Fortran, and the
asynchronous task-based Uintah framework, which has legacy C++
components. In particular, we look at the performance of the code on
modern architectures such as GPUs and Intel KNLs. At the same time, we
look forward to how the codes can evolve to future architectures beyond
the existing targets of today. In both cases we consider how the
required performance portability may be achieved with minimum effort on
the user's part by changing how tasks are executed, in particular how
this may be done by the runtime system, and what the costs of doing so
in a portable way are.
HiHAT: Retargetable Infrastructure for Tomorrow's Systems
CJ Newburn, NVIDIA
The increase of diversity in future platforms breeds complexity. There
will be more kinds of specialized computing elements, and the memory
system will be composed of more kinds, more layers, and more locations.
Software abstractions enable application developers to enjoy the
benefits of this specialization without having to become experts in a
myriad of technologies. "Ninja" experts will be able to provide
highly-tuned access to each new feature, and those operating in a tuning
role will be able to configure software infrastructure to dispatch user
requests to target-specific implementations.
A new community-wide project called HiHAT, for Hierarchical Heterogeneous
Asynchronous Tasking, serves as retargetable infrastructure underneath
tasking and language runtimes, and enables them to be ported to make
optimal use of new architectural features. We introduce HiHAT's design,
and show how it provides a way to make best use of innovative new
technologies. We present an initial HiHAT prototype, evaluate its
overheads, and demonstrate its applicability to a sample application,
molecular orbitals, on a system with a mix of CPUs and GPUs and
different kinds of memory.
Data Locality Challenges in Irregular Applications for Exascale Programing
Min Si, Argonne National Laboratory
Although the well-known Bulk Synchronous Parallel model still plays an
important role in HPC application programming, a completely data-driven
and irregular computation model is being utilized by a set of
applications, especially in the chemistry and bioinformatics domains.
The model usually relies on a global shared address space concept where
the physical distribution of application data is hidden from application
algorithms, and the parallelism is based on a random distribution mode
in which every process randomly requests a computation task for the next
chunk of the ``global shared'' data in parallel to achieve better
computational load balancing. This model, however, may run into critical
communication bottlenecks when scientists start looking into more
complex domain problems in exascale computing, which show an increasing
demand for data movement. In this talk, we focus on the communication
model of these irregular applications and discuss potential
optimizations for exascale computing that utilize data locality
techniques.
Session 4: Performance Modeling
Performance model generation: Challenges and Approaches
Boyana Norris, University of Oregon
Data locality in modern architectures is often difficult to exploit
because memory is used and managed by multiple software and hardware
layers, mostly outside the control of the HPC application developer.
Performance tools provide a rather coarse-grained view of memory
performance, which is not necessarily helpful in pinpointing
locality-related problems or optimizations. In this talk, we overview
some approaches to generating parameterized performance bounds that can
produce models of memory and overall performance that are not specific
to a single architecture. The goal is to enable a better understanding
of the data locality characteristics of a particular implementation
without relying on noisy and coarse-grained measurements.
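The simplest instance of such a parameterized bound is the classic roofline model, which caps attainable performance by either peak compute or memory bandwidth times arithmetic intensity; the models discussed in the talk generalize well beyond this sketch:

```python
def roofline_bound(ai, peak_flops, mem_bw):
    """Attainable flop/s = min(peak compute, arithmetic intensity * BW).
    ai is in flops per byte; peak_flops in flop/s; mem_bw in bytes/s."""
    return min(peak_flops, ai * mem_bw)

# Illustrative numbers: a 0.125 flops/byte kernel on a node with
# 3 Tflop/s peak and 100 GB/s of memory bandwidth is memory-bound.
print(roofline_bound(0.125, 3e12, 100e9))  # 1.25e+10 flop/s
```

Because the bound is a function of machine parameters (peak, bandwidth) rather than of measurements, the same model can be instantiated for different architectures.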
Coarse-Grained Data Locality Optimization Algorithm
Mohamed Wahib, AIST-TokyoTech Open Innovation Lab
Optimization for data locality, targeted by compilers (e.g. loop fusion)
and programmers (e.g. loop tiling), is traditionally limited in
granularity to affine code regions. On the other hand, some categories
of programs can benefit from static data locality optimization applied
at a coarser granularity: code in loosely coupled segments. These
include,
for example, a) applications with memory-bounded kernels offloaded to
discrete accelerators, b) coarse-grained pipelined stream programs, and
c) static tasks of task-based parallelism models deployed in manycore
processors. Such coarse-grained optimization comes down to applying a
transformation based on fission and fusion of the loosely-coupled code
segments under constraints of inter-dependencies, precedencies, and
architecture features. However, the solution space for this type of
optimization can grow exponentially, rendering deterministic methods
intractable for large-sized real-world applications. In this talk, we
introduce an efficient evolutionary algorithm, LOGGA (Locality
Optimization Grouped Genetic Algorithm), designed for coarse-grained
data locality optimization.
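A sketch of the kind of objective such a grouped genetic algorithm could minimize (the encoding below is hypothetical, not LOGGA's actual one): a grouping's estimated data traffic, where arrays shared by segments fused into the same group are streamed once instead of once per segment:

```python
def data_traffic(groups, accesses, sizes):
    """Estimated data movement of a fission/fusion grouping.
    groups: list of lists of segment ids (a candidate solution);
    accesses: segment id -> set of array names it touches;
    sizes: array name -> bytes. Arrays shared within a group count once."""
    total = 0
    for group in groups:
        arrays = set()
        for seg in group:
            arrays |= accesses[seg]
        total += sum(sizes[a] for a in arrays)
    return total

accesses = {0: {"u"}, 1: {"u", "v"}, 2: {"v"}}
sizes = {"u": 100, "v": 100}
print(data_traffic([[0], [1], [2]], accesses, sizes))  # 400: no fusion
print(data_traffic([[0, 1], [2]], accesses, sizes))    # 300: fusing 0,1 saves a pass over u
```

The evolutionary search then explores such groupings under the dependence, precedence, and architecture constraints, which this sketch omits.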
Unveiling the Interplay Between Global Link Arrangements and Network Management Algorithms on Dragonfly Networks
Vitus J. Leung, Sandia National Laboratories
Network messaging delay historically constitutes a large portion of the
wall-clock time for High Performance Computing (HPC) applications, as
these applications run on many nodes and involve intensive communication
among their tasks. The dragonfly network topology has emerged as a
promising solution for building exascale HPC systems owing to its low
network diameter and large bisection bandwidth. A dragonfly includes local
links that form groups and global links that connect these groups via
high bandwidth optical links. Many aspects of the dragonfly network
design are yet to be explored, such as the performance impact of the
connectivity of the global links, i.e., global link arrangements, the
bandwidth of the local and global links, or the job allocation
algorithm.
This paper first introduces a packet-level simulation
framework to model the performance of HPC applications in detail. The
proposed framework is able to simulate known MPI (message passing
interface) routines as well as applications with custom-defined
communication patterns for a given job placement algorithm and network
topology. Using this simulation framework, we investigate the coupling
between global link bandwidth and arrangements, communication pattern
and intensity, job allocation and task mapping algorithms, and routing
mechanisms in dragonfly topologies. We demonstrate that by choosing the
right combination of system settings and workload allocation algorithms,
communication overhead can be decreased by up to 44%. We also show that
circulant arrangement provides up to 15% higher bisection bandwidth
compared to the other arrangements; but for realistic workloads, the
performance impact of link arrangements is less than 3%.
Towards provably locality-enhancing work stealing algorithms
Kenjiro Taura, University of Tokyo
Random work stealing is the algorithm of choice for load-balancing
dynamically generated, irregular, and/or potentially unbalanced
workloads among cores on a shared-memory node. It performs well
primarily because (1) it has no central bottlenecks and (2) it preserves
a certain degree of the temporal locality that exists in the serial
execution, in the sense that tasks executed back-to-back in the serial
execution tend to be executed back-to-back on the same core in parallel
executions as well. Yet there is little hope that the original random
work stealing, which chooses a victim worker uniformly among all
workers, will scale to thousands of cores, let alone thousands of
distributed-memory nodes, due to the uniformly random communication
patterns the task migration itself causes and the resulting remote data
access patterns that do not respect machine hierarchies. It is thus
natural to ask whether we can improve this "uniformly random" work
stealing into a locality-aware one whose task migrations better respect
the machine's locality hierarchy. A number of heuristics have been
proposed, but we are not aware of any algorithm with a provable
running-time guarantee better than that of uniformly random victim
selection. In this talk, I shed some light on why such an algorithm is
not easy to come by and explain an algorithm we have worked on so far.
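A hedged sketch of the heuristic flavor being improved upon (the victim-selection rule below is illustrative, not the provable algorithm from the talk): bias steals toward the thief's own locality group, with uniform stealing as a special case:

```python
import random

def pick_victim(self_id, n_workers, group_size, p_local=0.75):
    """Hierarchical victim selection: with probability p_local, steal
    from the thief's own group (e.g. its NUMA node); otherwise steal
    uniformly at random. p_local = 0 recovers uniform work stealing."""
    my_group = self_id // group_size
    while True:
        if random.random() < p_local:
            victim = my_group * group_size + random.randrange(group_size)
        else:
            victim = random.randrange(n_workers)
        if victim != self_id:
            return victim
```

The difficulty alluded to in the talk is that such biased rules are easy to state but hard to analyze: the classic work-stealing time bounds rely on victims being chosen uniformly at random.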
OpenMP, Unified Memory, and Prefetching
Hal Finkel, Argonne National Laboratory
State-of-the-art accelerator hardware, such as the NVIDIA GPUs that will power upcoming
supercomputers, will support unified virtual memory (UVM). This will
allow the GPU to fetch data from the host via on-demand paging, both for
convenience and to support working sets larger than the device's
memory. While OpenMP's accelerator offload model generally must be
implemented by fully copying mapped data to and from the device, with
UVM this is no longer the case. In fact, the runtime library does not
need to explicitly copy any data at all. However, on-demand paging
proves to be an expensive way to access data. Using cost-model-driven
software prefetching, and automated pipelining, informed by
profile-guided compiler analyses, we can mitigate the costs of on-demand
paging while still reaping the benefits of UVM. This talk will discuss
the implementation of these techniques within LLVM and its OpenMP
runtime along with some performance results.
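A back-of-the-envelope version of the underlying cost model (all constants below are illustrative, not measured): demand paging pays a fault round-trip per page, while one bulk prefetch amortizes that latency:

```python
def demand_paging_cost(pages, fault_us, transfer_us_per_page):
    # every page pays a fault round-trip plus its transfer time
    return pages * (fault_us + transfer_us_per_page)

def prefetch_cost(pages, launch_us, transfer_us_per_page):
    # one prefetch call pays a single launch overhead, then streams pages
    return launch_us + pages * transfer_us_per_page

# Illustrative: 1000 pages, 20 us per fault, 1 us transfer, 50 us launch.
print(demand_paging_cost(1000, 20, 1))  # 21000 us
print(prefetch_cost(1000, 50, 1))       # 1050 us
```

A compiler armed with such a model, plus profile data on which pages are actually touched, can decide when issuing an explicit prefetch beats relying on on-demand paging.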
Matrix-Free Techniques for Fast PDE Solvers
Karl Rupp, Argonne National Laboratory
Essential techniques for high-performance matrix-free residual computations in
numerical partial differential equation (PDE) solvers are presented via
two example applications. First, a variable-coefficient Laplace problem
motivates a technique we call "thread transposition" to vectorize over
elements on GPUs. Second, a Q2-P1 mixed finite element discretization of
the Stokes equations outlines vectorization opportunities on CPUs and
GPUs for tensor bases. These examples show how small details can lead to
substantial differences in the implementations, hence one must not
expect to find a one-size-fits-all solution for PDE solvers.
Session 5: Compilers and DSLs
How Smart Should Compilers Be When Handling Abstractions for Locality?
Michelle Strout, University of Arizona
As early as the HPF language, there have been mechanisms to tell
compilers how to partition data, how to lay out data, or other
locality-centric information. Examples of these mechanisms include type
declarations,
pragmas, and function calls. These mechanisms require the compiler to
contain various levels of complexity (or smarts). For an example of a
compiler needing “smarts”, the HPF compiler was responsible for dividing
up the computation given a programmer-specified domain decomposition.
In the Loop Chain project, we are currently developing loop chain
pragmas for scheduling across loops in a way that balances data locality
and parallelism. We have an ongoing debate about what programmers
should specify in terms of the loop chain schedule versus what the
compiler should figure out. In this talk, I overview how loop chain
pragmas navigate this tradeoff and compare that with some other
examples.
The Loop Chain Abstraction: A Demonstration
Catherine Olschanowsky, Boise State University
Exposing opportunities for parallelization while explicitly managing
data locality is the primary challenge in porting and optimizing
existing computational science simulation codes to improve performance
and accuracy. OpenMP provides many mechanisms for expressing
parallelism, but it primarily remains the programmer's responsibility to
group computations to improve data locality. The Loop Chain abstraction,
in which a summary of data access patterns is included as pragmas
associated with parallel loops, provides compilers with sufficient
information to automate the parallelism-versus-data-locality tradeoff.
In this talk, I will present the syntax and semantics of loop chain
pragmas for indicating information about the loops that belong to a
chain, and the orthogonal specification of a high-level schedule (i.e.,
a transformation plan) for the whole loop chain. A computational fluid
dynamics mini-application will be used to demonstrate the pragmas.
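The transformation space a loop chain schedule describes can be illustrated with two loops over the same data; the pragma syntax itself is not reproduced here, only the fusion a schedule might authorize:

```python
def separate(u, v):
    """Two loops run to completion one after the other: the intermediate
    array a is written out and then re-read, costing an extra pass."""
    a = [2.0 * x for x in u]               # loop 1: a = 2u
    return [x + y for x, y in zip(a, v)]   # loop 2: b = a + v

def fused(u, v):
    """The same chain with the loops fused: each element of u is
    consumed while still hot in cache, with no intermediate array."""
    return [2.0 * x + y for x, y in zip(u, v)]
```

Given the data-access summary in the pragmas, a compiler can verify that the two schedules compute the same result and pick whichever best balances locality against the parallelism each loop exposes.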
Compiler-Based Code Generation for Stencil Computations with Bricked Data Layout
Protonu Basu, Lawrence Berkeley National Laboratory
Optimizing stencil computations on modern architectures with wide SIMD
requires several code optimization techniques to improve data locality,
manage computational intensity, and generate SIMD-friendly code. We
present an automated code generation technique which uses a bricked data
layout. A structured grid is laid out as a collection of small bricks,
with the brick volume being a multiple of the SIMD width and an
autotuning candidate. Furthermore, bricks lack ghost zones, which
differentiates this technique from domain decomposition; a stencil
computation thus involves a collection of neighboring bricks. We discuss
stencil optimizations with the bricked data layout in combination with
techniques such as iteration space tiling, data tiling, stencil
reordering, and low-level optimizations to target intrinsics or generate
native-compiler-friendly code. Finally, we discuss how this can be used
to implement domain-specific languages or tools.
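The address arithmetic behind such a layout can be sketched as follows (a 2D toy with a 4x4 brick; the real generator works on 3D grids with SIMD-width-multiple brick volumes):

```python
BRICK = 4  # brick edge length; in practice tied to the SIMD width

def to_bricked(i, j, bricks_per_row):
    """Map a grid coordinate (i, j) to (brick id, offset inside brick).
    Points in the same brick are contiguous in memory, so a stencil
    touches a handful of whole bricks instead of strided rows."""
    brick = (i // BRICK) * bricks_per_row + (j // BRICK)
    offset = (i % BRICK) * BRICK + (j % BRICK)
    return brick, offset

def from_bricked(brick, offset, bricks_per_row):
    """Inverse mapping, for checking that the layout round-trips."""
    bi, bj = divmod(brick, bricks_per_row)
    oi, oj = divmod(offset, BRICK)
    return bi * BRICK + oi, bj * BRICK + oj
```

Because bricks carry no ghost zones, a stencil at a brick boundary simply indexes into the neighboring brick, which is what distinguishes this from classic domain decomposition.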
Two-way Python-Fortran transpilation for performance and maintainability
Mateusz Bysiek, Tokyo Institute of Technology
Python enables easier expression of complex computational problems.
However, this simplicity of high-level languages is usually associated
with a huge performance penalty. In this talk, we introduce an
HPC-oriented approach to bidirectional source-to-source translation
between Fortran and Python, which opens up the possibility of migrating
legacy scientific code (written in Fortran) to Python while preserving
its original performance. Furthermore, we will give an outlook into
potential future applications of this approach, with emphasis on
abstractions of data locality.
DAPP - Data-Centric Parallel Programming for Shared and Distributed Memory Systems
Tal Ben-Nun, ETH Zurich
With the rapidly changing landscape of computing architectures and systems,
application programming complexity is increasing beyond the skill-set of
the average scientist. We present DAPP, a unified programming
abstraction and system that facilitates performance-portable
heterogeneous computing on distributed memory environments. DAPP
consists of a three-stage pipeline, transforming programs into
specialized per-architecture code by applying optimizations such as
tiling, fusion and vectorization. This program transformation is made
possible by decoupling memory access from processing in the programming
phase, and using performance modeling for specialization and tuning.
Specifically, we formulate a Data-Centric Intermediate Representation
(DCIR) that encodes data movement costs and enables flexible,
potentially custom, program optimization routines. DAPP programs are
implemented over the Python programming language, commonly used in
scientific applications and industry. We show how Python can be extended
without modifying the language itself, using runtime AST parsing and
modification for code transformation purposes. We demonstrate the
capabilities of DAPP on several applications in linear algebra and
climate modeling, showing that it maintains both expressiveness and high
performance.