PADAL19


Please submit your abstracts by August 10th.

Data locality in task scheduling: a runtime system perspective
Samuel Thibault, INRIA

The StarPU task-based runtime system makes compromises between task distribution and data movement cost on heterogeneous platforms. To this end, it has for years used "dmdas", a variant of the well-known HEFT scheduling heuristic, which integrates the cost of data management and task priorities. Additionally, data prefetching and LRU-like data eviction help optimize for data locality, resulting in very good performance on platforms equipped with GPUs.

Experiments with out-of-core situations have, however, shown some limits of the HEFT strategy. The very narrow bandwidth of hard disks exacerbates a defect of HEFT: taking task priorities into account before deciding task placement leads to degraded locality. The "dmdar" variant, which favors locality over task priority to some extent, achieves better performance in such situations. This shows that a proper balance still needs to be found between task priorities (which must not be neglected, to keep good levels of parallelism and load balance) and data locality (to keep data transfers within the available bandwidth). StarPU provides a scheduling platform for implementing various heuristics and experimenting with them on the different applications ported on top of it.
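The placement trade-off described above can be sketched with a minimal HEFT-style heuristic (an illustrative model with made-up costs, not StarPU's actual dmdas implementation): each task, taken in priority order, goes to the worker that minimises its estimated finish time, where the estimate adds a transfer penalty whenever an input is not already resident on that worker.

```python
# Minimal HEFT-style placement sketch (hypothetical cost model).
def heft_place(tasks, n_workers, transfer_cost):
    worker_free = [0.0] * n_workers   # time at which each worker becomes idle
    data_loc = {}                     # data item -> worker currently holding it
    schedule = {}
    for name, cost, inputs in tasks:  # tasks assumed sorted by priority
        best = None
        for w in range(n_workers):
            # pay a transfer penalty for each input not resident on w
            penalty = sum(transfer_cost for d in inputs
                          if data_loc.get(d, w) != w)
            finish = worker_free[w] + penalty + cost
            if best is None or finish < best[0]:
                best = (finish, w)
        finish, w = best
        worker_free[w] = finish
        for d in inputs:
            data_loc[d] = w           # data is now resident on worker w
        schedule[name] = (w, finish)
    return schedule
```

With an expensive transfer cost (think disk bandwidth in the out-of-core case), the penalty term dominates and tasks follow their data, which is the dmdar-style bias toward locality.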

Session 1: Programming Models and Runtime Systems

Data Locality for Extreme Heterogeneous Systems
John Shalf, LBNL

The recent explosive growth in data analytics applications that rely on machine and deep learning techniques is seismically changing the landscape of datacenter architectures. These techniques, used for example in face and object recognition in pictures and video, place a tremendous load on datacenters with their need for intense compute performance, and have led to the wide adoption of graphics processing unit (GPU) accelerator and manycore (CPU) technologies, which are pushing current datacenter interconnect and memory architectures to their limits. The effective execution performance of these massively parallel architectures is determined by how data is moved among the numerous compute and memory resources, and is dramatically affected by the enormous energy consumption associated with the necessary huge movements of data. Energy consumption, dominated by the cost of data movement, is now perhaps the single determining factor of future datacenter scalability, and if architecture specialization and extreme heterogeneity are successful in improving compute performance, the data-movement challenges will be further exacerbated.

MAESTRO: Middleware for Memory and Data-Awareness in Workflows
François Tessier, Swiss National Supercomputing Centre

Optimization of data movement is of paramount importance in the exascale era given the increasing I/O requirements expressed by HPC and HPDA workloads. However, the software stack has not been designed to meet these needs, while the emergence of new tiers of memory and storage continues to make this task more complex.

This talk will introduce the three-year EU-funded Maestro project. The goal of Maestro is to develop a memory- and data-aware middleware for data movement orchestration within workflows. Data is encapsulated in objects along with Maestro-related metadata, while data-movement decisions are made based on workflow annotations and real-time I/O monitoring. The main concepts behind Maestro will be presented, as well as some aspects of the middleware design that the consortium proposes to implement.

Thoughts on Autonomous Resource Management for HPC
Ron Brightwell, Sandia National Laboratories

HPC systems have become too complex to manage every critical resource. Application developers today spend a significant amount of time and effort mapping an application to a machine, largely because locality management is done by hand. The current situation will become untenable as systems and applications continue to become more heterogeneous. This talk will discuss some of the challenges and basic capabilities required to enable autonomous resource management that allows for mapping a machine to an application.

Multi-Threading Effective Task Scheduling for Heterogeneous Computing
Karim Djemame, University of Leeds

Efficient application scheduling is critical for achieving high performance in heterogeneous computing systems. This problem has proved to be NP-complete, driving research efforts toward low-complexity heuristics that produce good-quality schedules. Although this problem has been extensively studied in the literature, all related algorithms assume the computation costs of application tasks on processors are available a priori, ignoring the fact that the time needed to run/simulate all these tasks is orders of magnitude higher than finding a good-quality schedule, especially in heterogeneous systems. Moreover, low-complexity heuristics consider application tasks as single-thread implementations only, but in practice tasks are normally split into multiple threads. We propose two new methods applicable to several task scheduling algorithms, addressing the above problems in heterogeneous computing systems. We showcase both methods using the well-known Heterogeneous Earliest Finish Time (HEFT) algorithm, but this work is applicable to other algorithms too, such as Heterogeneous Critical Parent Trees (HCPT), High-Performance Task Scheduling (HPS), Performance Effective Task Scheduling (PETS) and Critical-Path-on-a-Processor (CPOP). First, we propose a methodology to reduce the number of computation costs required by HEFT (and therefore the number of simulations), without sacrificing the length of the output schedule. Second, we give heuristics to find which tasks are going to be executed as single-thread and which as multi-thread implementations, as well as the number of threads used, without requiring all the computation costs. Experimental results considering both random graphs and real-world applications show that extending HEFT with the two proposed methods achieves better schedule lengths, while at the same time requiring up to 24× fewer simulations.

Runtime-assisted locality abstraction using virtual topologies: status and roadmap
Mustafa Abduljabbar, Chalmers University of Technology

As accelerators are equipped with deeper and more heterogeneous memory hierarchies, applications need to cope with architectural advancements whose data transfer rates are not aligned with their arithmetic capabilities. Therefore, it is essential that runtime systems become aware of these attributes to enhance scheduling. Existing solutions are either restrictive in their platform-specific assumptions or difficult to program. To address the problem of expressing locality, we present a runtime-assisted abstraction that flexibly supports the mapping of software to hardware topologies and is provisioned using the XiTAO programming model. The runtime dynamically learns and chooses efficient, locality-aware resource partitionings. A clear advantage of this approach is that it supports both data reuse via caches and data placement in memories using the virtual software topologies. In this talk, I will describe the roadmap of this work and the potential spectrum of applications in both the HPC and AI worlds.

Programming for Affinity
Christian Terboven, RWTH Aachen University

Affinity is a key factor to achieve performance on contemporary heterogeneous systems with hierarchical memory. In such systems, programming for affinity requires a holistic consideration: placement of data within different kinds and locations of memory, binding of threads and execution of tasks close to data, minimization of the distance between host data and accelerator device for offloading, and so forth. This talk will summarize our corresponding contributions to OpenMP and present our current work and ideas to foster the discussion on programming abstractions for data locality.

Session 2: DSL/Compilers

FPGA Cluster as Custom Computing Engine for Supercomputers
Kentaro Sano, Center for Computational Science, RIKEN

Custom computing with dedicated circuits on FPGAs (Field-Programmable Gate Arrays) is a promising way to accelerate computations that general-purpose multi-core processors are not good at. Our team has developed a prototype system with a cluster of Stratix 10 FPGAs and a data-flow compiler which generates a pipelined custom hardware module to be embedded onto an FPGA and executed as stream computing. In this talk, I introduce the concept and design of the FPGA cluster as a custom computing engine for existing supercomputers.

A Data-Centric Approach to High Performance Computing Applications
Alexandros Nikolaos Ziogas, ETH Zurich

Extreme heterogeneity in high performance computing has led to a plethora of programming models for on-node programming. The increasing complexity of those approaches and the lack of a unifying model have rendered the task of developing performance-portable applications intractable. To address these challenges, we present the Data-centric Parallel Programming (DAPP) concept, which decouples program definition from its optimized implementation. The latter is realized through the Stateful DataFlow multiGraph (SDFG), a data-centric intermediate representation that combines fine-grained data dependencies with high-level control flow and is amenable to program transformations. We demonstrate the potential of the data-centric viewpoint with OMEN, a state-of-the-art quantum transport (QT) solver. We reduce the original 15k lines of C++ code of OMEN to 3k lines of Python code and 2k SDFG nodes. We subsequently tune the generated code for two of the fastest supercomputers in the world (June '19), and achieve up to two orders of magnitude higher performance: a sustained 85.45 Pflop/s on 4,560 nodes of Summit (42.55% of peak) in double precision, and 90.89 Pflop/s in mixed precision.

A Versatile Software Systolic Execution Model for GPU Memory-Bound Kernels
Mohamed Wahib, AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory, National Institute of Advanced Industrial Science and Technology

This talk proposes a versatile high-performance execution model, inspired by systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs. We formulate a systolic model that shifts partial sums by CUDA warp primitives for the computation. We also employ register files as a cache resource in order to operate the entire model efficiently. We demonstrate the effectiveness and versatility of the proposed model for a wide variety of stencil kernels that appear commonly in HPC, and also convolution kernels (increasingly important in deep learning workloads). Our algorithm outperforms the top reported state-of-the-art stencil implementations, including implementations with sophisticated temporal and spatial blocking techniques, on the two latest Nvidia architectures: Tesla V100 and P100. For 2D convolution of general filter sizes and shapes, our algorithm is on average 2.5× faster than Nvidia’s NPP on V100 and P100 GPUs.
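The shift-based systolic scheme can be illustrated with a pure-Python analogue (the actual implementation uses CUDA warp primitives such as register shuffles, which this sketch only emulates): each "lane" holds one input element, and the running partial sum is shifted one lane per coefficient.

```python
# Toy emulation of a shift-based systolic 1-D stencil: partial sums
# travel across lanes, one shift per coefficient, standing in for a
# CUDA warp shuffle (e.g. __shfl_up_sync). No shared-memory traffic.
def systolic_stencil(x, coeffs=(0.25, 0.5, 0.25)):
    n, r = len(x), len(coeffs)
    acc = [0.0] * n
    for step, c in enumerate(coeffs):
        # every lane adds its own element scaled by the current coefficient
        acc = [acc[i] + c * x[i] for i in range(n)]
        if step < r - 1:
            # shift partial sums down by one lane (warp-shuffle analogue)
            acc = [0.0] + acc[:-1]
    # after r-1 shifts, lane i holds the stencil result centred at i-1;
    # only interior points have all their neighbours
    return acc[r - 1:]
```

Because each lane only ever reads its own register and receives shifted values from a neighbour, the data movement is entirely register-to-register, which is the property the proposed model exploits.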

Getting the mesh abstraction right: Delivering and generalising domain-specific optimisations for mesh-based computations
Paul H J Kelly, Imperial College London

Most compilers take code as input.  Firedrake starts with a DSL, called UFL, that captures the weak form of the PDE you are solving, together with a specification of the discretisation.  We also do something like compilation for the specific mesh on which we compute – this is often called an "inspector-executor" strategy: we inspect the mesh and the code and derive a reusable schedule.
The mesh can be thought of as a graph – but it has various special properties.  It is multipartite: it consists of distinct sets of topological entities – vertices, edges, faces, cells – with maps between them.  The graph of a mesh has low maximum degree and low chromatic number.  It usually has a high diameter, and is easily partitioned with controlled partition-to-partition communication.  Implementations of PDE solvers need to lower the abstract graph concept to a concrete storage layout, data distribution and numbering.  Implementations also need to schedule the computation to respect the dependencies implied by the mesh, and to minimise data movement.  In many cases, we have a sequence of mesh computations, so scheduling may span multiple sweeps.
This talk will offer some reflections on the space of potential scheduling strategies available to us, offering some examples of what has been done, what is possible, and where the frontiers lie.

Session 3: Performance Models and Tools

Cost Model for Application Configuration with Heterogeneity
Anshu Dubey, Argonne National Laboratory

Heterogeneity in both platforms and in solvers within applications makes configuring an application for reasonable performance an increasingly challenging problem. Cost-benefit analysis in some form is necessary to select configuration parameters for any instance of running an application on a platform. We have been exploring cost models that capture the interaction between components of the application and components of the platform. Once cost models have been generated, costs can be tracked and measured, and databases can be built, which can become useful resources for determining the configuration parameters. I will present a preliminary cost model for FLASH, a highly-configurable multiphysics software package with reusable components, designed for solving a large class of problems that involve fluid flows and need adaptive mesh refinement (AMR).

Cache-Aware Roofline Model: Uncovering micro-architecture upper-bounds
Aleksandar Ilic, INESC-ID/IST, University of Lisbon, Portugal

Modern micro-architectures have been the subject of constant technological improvements, which has led to the increased complexity of current computing systems. This complexity imposes several challenges when determining which hardware resources are responsible for preventing applications from achieving the maximum performance of the micro-architecture. To address this issue, simple models and analysis methods, such as the Cache-Aware Roofline Model, have been proposed to provide first-order insights into application execution, thus allowing software developers and hardware engineers to derive the most suitable optimization techniques to extract the maximum potential of micro-architectures. In this talk, a set of Cache-Aware Roofline Models is presented, aiming to provide a more accurate characterization of real-world applications compared to state-of-the-art approaches.
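The model's core bound can be stated in a few lines (the peak and bandwidth numbers below are illustrative, not measurements; the cache-aware variant simply adds one bandwidth roof per level of the memory hierarchy):

```python
# Roofline bound: attainable performance is capped by the lesser of the
# compute peak and (bandwidth x arithmetic intensity), per memory level.
def roofline(ai, peak_gflops=1000.0, bw_gbs=None):
    # bandwidth roofs per memory level, in GB/s (illustrative numbers)
    bw_gbs = bw_gbs or {"L1": 2000.0, "L2": 800.0, "DRAM": 100.0}
    return {lvl: min(peak_gflops, bw * ai) for lvl, bw in bw_gbs.items()}
```

A kernel with low arithmetic intensity sits under the DRAM roof while remaining compute-bound with respect to L1, which is exactly the kind of first-order diagnosis the model provides.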

TopoMatch: a generic tool for topology mapping
Emmanuel Jeannot, INRIA

In this talk we will present a generic tool for topology mapping called TopoMatch. This tool is the successor of TreeMatch, which was designed for tree topologies. With this new tool, we leverage the Scotch graph partitioner to provide a tool that is able to perform process mapping for any arbitrary architecture. We will present the tool, its API and its features, as well as early results.

Level-Spread: A New Job Allocation Policy for Dragonfly Networks
Vitus J Leung, Sandia National Laboratories

The dragonfly network topology has attracted attention in recent years owing to its high radix and constant diameter. However, the influence of job allocation on communication time in dragonfly networks is not fully understood. Recent studies have shown that random allocation is better at balancing the network traffic, while compact allocation is better at harnessing the locality in dragonfly groups. Based on these observations, this paper introduces a novel allocation policy called Level-Spread for dragonfly networks. This policy spreads jobs within the smallest network level that a given job can fit in at the time of its allocation. In this way, it simultaneously harnesses node adjacency and balances link congestion. To evaluate the performance of Level-Spread, we run packet-level network simulations using a diverse set of application communication patterns, job sizes, and communication intensities. We also explore the impact of network properties such as the number of groups, number of routers per group, machine utilization level, and global link bandwidth. Level-Spread reduces the communication overhead by 16% on average (and up to 71%) compared to the state-of-the-art allocation policies.
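The policy's core decision can be sketched as follows (a hypothetical level structure for illustration, not the authors' simulator code): walk the levels from smallest to largest and pick the first one with enough free nodes.

```python
# Level-Spread core idea: place the job within the smallest network
# level (e.g. router < group < machine) that can hold it right now,
# then spread it across that level's units.
def level_spread(job_size, levels):
    # levels: (name, free_nodes) pairs ordered from smallest to largest
    for name, free in levels:
        if job_size <= free:
            return name   # allocate, spreading within this level
    return None           # no level fits: the job must wait
```

Small jobs thus stay compact (harnessing group locality) while large jobs are spread machine-wide (balancing link congestion), matching the observations the abstract cites.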

Communication monitoring for threads
Didem Unat, Koç University

Inter-thread communication is a vital performance indicator in shared-memory systems. Prior works on identifying inter-thread communication employed hardware simulators or binary instrumentation and suffered from inaccuracy or high overheads—both space and time—making them impractical for production use. We propose ComDetective, which produces accurate communication matrices while introducing low runtime and memory overheads, thus making it practical for production use. ComDetective employs hardware performance counters to sample memory-access events and uses hardware debug registers to sample communicating pairs of threads. ComDetective can differentiate communication as true or false sharing between threads. Its runtime and memory overheads are only 1.30× and 1.27×, respectively, for the 18 applications studied at a 500K sampling period. Using ComDetective, we produce insightful communication matrices for microbenchmarks, the PARSEC benchmark suite, and several CORAL applications, and compare the generated matrices against their MPI counterparts. Guided by ComDetective, we optimize a few codes and achieve up to 13% speedup.

Session 4: Apps and Core Kernels

Low-Rank Cholesky Factorization Toward Climate and Weather Prediction Applications
Hatem Ltaief, KAUST

Climate and weather can be predicted statistically via geospatial Maximum Likelihood Estimates (MLE), as an alternative to running large ensembles of forward models. The MLE-based iterative optimization procedure requires solving large-scale linear systems by performing a Cholesky factorization on a symmetric positive-definite covariance matrix—a demanding dense factorization in terms of memory footprint and computation. We propose a novel solution to this problem: at the mathematical level, we reduce the computational requirement by exploiting the data sparsity structure of the matrix off-diagonal tiles by means of low-rank approximations; and, at the programming-paradigm level, we integrate PaRSEC, a dynamic, task-based runtime, to reach unparalleled levels of efficiency for solving extreme-scale linear algebra matrix operations. The resulting solution leverages fine-grained computations to facilitate asynchronous execution while providing a flexible data distribution to mitigate load imbalance and improve data locality. Performance results are reported using 3D synthetic datasets up to 42M geospatial locations on 130K cores, which represent a cornerstone toward fast and accurate predictions of environmental applications.

Unstructured computational meshes and data locality
Xing Cai, Simula Research Laboratory, Norway

Many scientific and engineering applications rely on unstructured computational meshes to capture the irregular shapes and intricate details involved. With respect to software implementation, unstructured meshes require indirectly-indexed, irregular accesses to data arrays. Attaining data locality in the memory hierarchy is thus challenging. This talk touches on two related topics. First, we look at the ordering/clustering of entities in an unstructured mesh with respect to cache efficiency. Second, we re-examine the currently widely-used strategy of mesh partitioning, which is based on partitioning a corresponding graph with edge-cut as the optimisation objective. Mismatches between this mainstream methodology of data decomposition and the increasingly heterogeneous computing platforms will be discussed.

Halo-Update Communication Layer for Hybrid Computing
Mauro Bianco, ETH - CSCS

Programming hybrid nodes is often a formidable task and requires novel programming models and abstractions, such as multi-threading and multi-tasking. The necessity of running at scale on multiple hybrid nodes poses particular challenges for halo-update operations since data must be moved between different address spaces within and across nodes. The GHEX (Generic Halo-Exchange for Exascale) project aims at providing a library for performing halo-update operations in structured and unstructured grid applications, by abstracting the particular address spaces and various transport layers in order to enable transparent exchange of information. The objective is to allow different node architectures, with potentially different transport layers, to execute halo-update operations, but also provide support for multi-threading and multi-tasking programming models to better exploit the parallelism on the system. We will present the main components of the library and some initial results when running on hybrid CPU and GPU machines.
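The operation GHEX abstracts can be illustrated on a 1-D decomposition in a single address space (a toy sketch only; GHEX itself handles multiple address spaces, grids and transport layers):

```python
# Toy halo update on a 1-D domain decomposition: each subdomain owns
# interior cells plus `halo` ghost cells per side; ghosts are filled
# from the neighbouring subdomain's boundary interior cells.
def halo_update(subdomains, halo=1):
    for i, d in enumerate(subdomains):
        left = subdomains[i - 1] if i > 0 else None
        right = subdomains[i + 1] if i < len(subdomains) - 1 else None
        if left is not None:
            d[:halo] = left[-2 * halo:-halo]    # fill left ghost cells
        if right is not None:
            d[-halo:] = right[halo:2 * halo]    # fill right ghost cells
    return subdomains
```

In the real setting each `subdomains[i]` may live on a different node or GPU, and this copy becomes an MPI message or a device-to-device transfer, which is precisely the variability the library hides behind one interface.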

Constructing a Streaming SpMV Kernel for FPGAs
Naoya Maruyama, Lawrence Livermore National Laboratory

We discuss how SpMV can be implemented on FPGAs. Specifically, we present a streaming design of SpMV that attempts to optimize memory loads from a CSR matrix. We give preliminary results using Intel OpenCL on an Arria 10 FPGA board.
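For reference, the CSR access pattern the streaming design optimises looks like this in plain Python (a sketch of the textbook kernel, not the FPGA design itself): the row pointers, column indices and values stream sequentially, while accesses to the input vector are irregular gathers.

```python
# Textbook SpMV over a CSR matrix: rows are delimited by row_ptr,
# nonzeros stream in order, and x is gathered at irregular positions.
def spmv_csr(row_ptr, col_idx, vals, x):
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[col_idx[k]]   # irregular gather from x
        y.append(acc)
    return y
```

The sequential streams map naturally onto FPGA memory channels; the hard part, which the talk addresses, is feeding the `x[col_idx[k]]` gathers efficiently.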

Session 5: Memory Management 

Locality and Hardware Topology Management in MPI
Guillaume Mercier, INRIA

Exposing Heterogeneous Memory Characteristics to HPC Applications
Brice Goglin, INRIA

Emerging memory technologies such as HBM or non-volatile memories, memory-side caches and PCI-attached memories bring a new level of complexity to HPC platforms. As envisioned with the Knights Landing processor, there is a new need to provide ways to identify which target memory should be used for specific buffer allocations. Unfortunately, there are now many different cases that cannot be hardwired into applications anymore. We discuss in this talk the variety of heterogeneous memory architectures and propose to expose their characteristics to users as attributes, such as performance and locality.
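Exposing characteristics as attributes would let allocation policies be written generically rather than hardwired per platform, along these lines (the attribute names and numbers are hypothetical illustrations; hwloc exposes related topology information):

```python
# Attribute-based selection of an allocation target among heterogeneous
# memories (hypothetical attributes: bandwidth GB/s, latency ns, capacity GB).
def pick_memory(targets, need_bandwidth=False, min_capacity=0):
    ok = [t for t in targets if t["capacity"] >= min_capacity]
    # bandwidth-bound buffers want the fastest stream; otherwise
    # prefer the lowest-latency memory that is large enough
    key = (lambda t: -t["bandwidth"]) if need_bandwidth else (lambda t: t["latency"])
    return min(ok, key=key)["name"] if ok else None
```

The point of the attribute interface is that the same policy code works whether the fast tier is MCDRAM, HBM, or something not yet shipped.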

Towards Intelligent Management of Heterogeneous Memory: A Reinforcement Learning Approach
Balazs Gerofi, RIKEN

The past decade has brought an explosion of emerging memory technologies. Various high-bandwidth memory types, e.g., 3D stacked DRAM (HBM), GDDR and
multi-channel DRAM (MCDRAM) as well as byte addressable non-volatile storage class memories (SCM), e.g., phase-change memory (PCM), resistive RAM (ReRAM) and the recent 3D XPoint, are already in production or expected to become available in the near future.

Management of such heterogeneous memory types is a major challenge for application developers, not only in terms of placing data structures into the most suitable memory, but also in adaptively moving content as application characteristics change over time.

This talk advocates intelligent system level solutions for heterogeneous memory management and is composed of three parts. First, system support for a low overhead memory access tracking facility is introduced. Second, a brief overview of reinforcement learning (RL) is given. Finally, we provide discussions on RL's prospects in the domain of system level memory management.

Profile-guided scope-based data allocation method for heterogeneous memory architecture

The complexity of High Performance Computing node memory systems is increasing in order to address applications' growing memory usage and the widening gap between computation and memory-access speeds. As these technologies have been used in HPC supercomputers for only a few years, it remains difficult to know whether it is better to manage them with hardware or software solutions, so both are being studied in parallel. For both solutions, the problem consists in choosing which data to store in which memory at any time.

In this presentation, we propose a new profile-guided, scope-based approach which reduces the complexity of the data allocation problem, thus enhancing the precision of state-of-the-art analyses. We have implemented our method in a framework made of GCC plugins, dynamic libraries and Python scripts, allowing us to test the method on several benchmarks. We have evaluated our method on an Intel Knights Landing processor. To this end we have run LULESH and HydroMM, two hydrodynamic codes, and MiniFE, a finite element mini-application. We have compared our framework's performance on these codes to several straightforward solutions: MCDRAM as a cache, in hybrid mode, and in flat mode using the numactl command and the existing AutoHBW dynamic library.

Memory Layout Abstraction in Modern C++ for HPC
Bert Wesarg, Technische Universität Dresden, ZIH

The talk gives a short overview of three independent performance-portable C++ abstraction libraries for PGAS distributed data structures and algorithms, parallel kernel acceleration, and low-level memory accesses. These libraries are combined into a convenient programming interface in the ongoing MEPHISTO project. The focus will be on the abstraction of memory accesses, and we show the need for these abstractions in the HPC field. We then raise questions about how current and emerging memory hierarchies can be better utilized with memory-access abstraction, and how this could influence the future of the C++ language for HPC.