PADAL17

Abstracts

Submit your abstract by July 5th.

Session 1: Applications


Data-related Optimizations for Various Applications on the Sunway Supercomputing System
Lin Gan, Tsinghua University, and the National Supercomputing Center in Wuxi

As the first computing system in the world to provide a peak performance of over 100 PFlops, and the first to have more than 10M computing cores, the Sunway TaihuLight supercomputer has been widely applied to many major applications, such as climate modelling, geophysics exploration, and life science. The major processor of the system, the SW26010 CPU, is a homegrown many-core chip that contains 260 cores as well as a hierarchy of different memories. In order to maximise computing efficiency and to scale up performance, programmers have to consider as many optimisation methods as possible when dealing with various kernels from different applications. Data-related optimisations that fit well with the many-core architecture are therefore important and necessary. This talk will cover a set of novel optimisation techniques proposed to maximise data efficiency, such as tuning for data locality and reuse, and wise trade-offs in data movement. Many applications have benefited from these methods, achieving significant performance improvements and inspiring scalability.


Benchmarking Stencil on GPUs       
Naoya Maruyama, Lawrence Livermore National Laboratory

Stencil computation is one of the most common patterns in scientific computing. Its low arithmetic intensity often requires extensive optimization to reduce memory access overheads. We have previously reported a case study using older generations of NVIDIA GPUs, which presented several effective (and not very effective) optimization techniques. In this talk, we will update the performance study with newer generations of GPUs and discuss key architectural differences in those accelerators from the perspective of memory-bound computations.
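The memory-bound nature of stencils is visible in even the simplest kernel. As an illustrative sketch (not code from the study), a 1-D 3-point Jacobi sweep performs only a few flops per point while touching several array elements:

```python
import numpy as np

def jacobi3(u):
    """One sweep of a 1-D 3-point Jacobi stencil.  Each interior point
    costs ~3 flops but needs 3 loads and 1 store, so performance is
    bounded by memory bandwidth rather than compute."""
    v = u.copy()
    v[1:-1] = (u[:-2] + u[1:-1] + u[2:]) / 3.0
    return v

u = np.zeros(8)
u[4] = 3.0
u = jacobi3(u)  # the point value diffuses to its neighbors
```

Optimizations such as spatial and temporal blocking aim to serve more of those loads from cache or registers, which is exactly where the GPU generations compared in the talk differ.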

Design and Optimization of a Multi-stencil CFD Solver       
Aparna Chandramowlishwaran, University of California, Irvine

Solving the Navier-Stokes equations is a challenging multi-scale, multi-physics problem. Our solver is a realistic structured-mesh finite-volume code and presents many challenges of real applications. Specifically, our solver consists of multiple stencils with different computational intensities and memory access patterns. While the challenges of single-stencil kernels still apply to this problem, solving a multi-stencil kernel poses new challenges, since each stencil has a unique pattern and requires access to different neighbors. In this talk, I will outline how multiple stencils impact optimization and discuss techniques for improving locality and parallelism by trading off redundant work. We also ask whether CFD applications can be expressed in stencil DSLs, whether such an implementation can deliver a sufficient combination of optimizations to compete with hand-tuned code, and what its limitations are.

Need for data locality in deep learning     
Prasanna Balaprakash, Argonne National Laboratory        

The potential of machine learning (ML) has prompted hardware vendors to design and develop new custom hardware (such as GPUs, TPUs, and FPGAs) to accelerate ML applications. Consequently, the pace of advancement of these chips has exceeded the ability of ML algorithms to fully utilize them. In particular, many custom chips provide optimizations for data locality and need algorithms with significant data locality; however, not all machine learning workloads are currently up to the task. Several ML algorithms do not expose enough locality to fully exploit the new generation of hardware. In this talk, we will give a high-level overview of a class of deep learning algorithms that do not expose data locality and discuss possible strategies to address the problem.

Blocking Optimization for Tensor Decomposition      
Jee Choi, IBM Research
    
Many social and scientific domains give rise to data with multi-way relationships that can naturally be represented by tensors, or multi-dimensional arrays. Decomposing, or factoring, tensors can reveal latent properties that are otherwise difficult to see. However, due to the relatively recent rise in popularity of tensor decomposition in HPC, its performance optimization challenges are poorly understood. In this presentation, I will explain the steps taken to identify and isolate the major bottlenecks in tensor decomposition algorithms, and demonstrate significant speedups over the prior state of the art using various cache blocking mechanisms.
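The central kernel in CP tensor decomposition is the MTTKRP. As a hedged illustration of what cache blocking means here (a dense toy version; the talk concerns real tensors and real blocking schemes), compare a naive and a k-blocked MTTKRP, which compute identical results but traverse the data differently:

```python
import numpy as np

def mttkrp_naive(X, B, C):
    """M[i, r] = sum_{j,k} X[i,j,k] * B[j,r] * C[k,r]."""
    I, J, K = X.shape
    M = np.zeros((I, B.shape[1]))
    for j in range(J):
        for k in range(K):
            M += np.outer(X[:, j, k], B[j] * C[k])
    return M

def mttkrp_blocked(X, B, C, bk=4):
    """Same computation, but the k mode is processed in blocks so the
    reused rows of C and slices of X stay cache resident (illustrative
    blocking only)."""
    I, J, K = X.shape
    M = np.zeros((I, B.shape[1]))
    for k0 in range(0, K, bk):
        for j in range(J):
            for k in range(k0, min(k0 + bk, K)):
                M += np.outer(X[:, j, k], B[j] * C[k])
    return M
```

Since the summation is merely reordered, both variants agree exactly up to floating-point reassociation.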

Session 2: Memory Technologies

Exposing the locality of new memory hierarchies to HPC applications              
Brice Goglin, Inria Bordeaux
 
Memory hierarchies get deeper and larger, with multiple NUMA nodes and different memory technologies. Many hardware details could be exposed to HPC users to ease their task and data placement decisions. Hardware characteristics such as latencies and bandwidths, or the path(s) between cores and each memory bank, could be used as decision criteria. However, users would rather have simple abstractions to understand this hardware organization. We will present how we try to expose such information in the hwloc tool in a simple and convenient but still exhaustive manner.

System abstractions to facilitate data movement in supercomputers with deep memory and interconnect hierarchy   
Francois Tessier, Argonne National Laboratory

Supercomputers are increasingly being architected with deeper and more diverse memory subsystems, such as NVRAM, MCDRAM, and network-attached burst buffers, and with interconnect topologies such as dragonflies and tori. From an application perspective, this leads to increased complexity in fully exploiting these system characteristics for data movement. Our research tackles this problem; one focus is on effective abstractions of the various memory and storage tiers, as well as the network topology, to ensure scalable and portable performance.


Bandwidth and Object Placement Studies on HBM     
Didem Unat, Koç University     

High Bandwidth Memory (HBM) in HPC systems allows memory-bound applications to perform better. Due to its limited size, HBM works in tandem with DRAM, which acts as the high-capacity memory component. We study the effect of such heterogeneous systems on HPC applications. In particular, we study the Intel KNL and its HBM, namely MCDRAM. Through various benchmarks, we find that the bandwidth of MCDRAM is roughly 5x that of traditional DDR. With such significant gains, it becomes important to devise an allocation scheme for the objects in an application. We observe that the read speed of the different memories on KNL is higher than the write speed. Therefore, to maximize the performance of an application, we suggest keeping write-intensive data in MCDRAM. We also propose a placement algorithm which takes into account the individual read and write reference counts and the sizes of all objects in the application.
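A minimal sketch of such a placement scheme (one illustrative reading of the heuristic described above, not the authors' actual algorithm): rank objects by write references per byte and greedily fill the MCDRAM capacity:

```python
def place_objects(objs, hbm_capacity):
    """Greedily place write-intensive objects in MCDRAM.
    `objs` maps an object name to (size, read_count, write_count);
    objects are ranked by writes per byte, since writes benefit the
    most from MCDRAM's bandwidth.  Returns the names placed in MCDRAM."""
    ranked = sorted(objs, key=lambda n: objs[n][2] / objs[n][0], reverse=True)
    in_hbm, used = set(), 0
    for name in ranked:
        size = objs[name][0]
        if used + size <= hbm_capacity:
            in_hbm.add(name)
            used += size
    return in_hbm
```

The algorithm in the talk also weighs read reference counts; this sketch keeps only the write-intensity criterion for brevity.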

Exploiting Data Locality for Unified CPU/GPU Memory Space using OpenMP     
Lingda Li, Brookhaven National Laboratory       

To facilitate GPU programming and accelerate applications with large working sets, recent GPUs support a unified CPU/GPU memory address space. In such systems, the CPU and GPU can conveniently address each other's memory, and data is moved between the different memories on demand. However, our investigation shows that the current data migration mechanism is not aware of data locality, resulting in poor performance and redundant data movement. The upcoming OpenMP 5.0 will include new data locality features to address the complex memory hierarchy in today's systems; however, the currently proposed features do not take unified memory into consideration and cannot address its performance problems.
To solve this problem, we propose extending OpenMP's data locality features to improve unified memory performance. The proposed extensions expose different GPU memory management choices for programmers to exploit GPU data locality explicitly. In scenarios with complex data locality, programmers are also allowed to pass hints so that the OpenMP compiler and runtime can make better GPU data management decisions. Preliminary evaluation shows that our proposal can significantly improve GPU performance and reduce redundant data movement between the CPU and GPU for benchmarks with large working sets.

Deep Memory Abstractions: Beyond Caches and NUMA Nodes        
Swann Perarnau, Argonne National Laboratory   

Future platforms are expected to feature several layers of memory, with a mix of on-package (MCDRAM or HBM), regular DDR and NVRAM resources available to applications depending on the platform. To use these deep memory hierarchies efficiently, applications will need to integrate portable and efficient memory management strategies dealing with placing and moving data according to computational needs.

Many vendors currently advocate for the hardware and operating system to abstract these deep memory layers either as new cache levels (KNL cache mode) or as NUMA topologies (KNL flat mode). These abstractions are familiar to users and require less development work overall. Unfortunately, their implementations are costly, prone to errors, and result in set-in-stone policies that might not provide the best performance for some data access patterns.

In this talk, I will review some of the issues the current abstractions for deep memory exhibit. I will then advocate instead for the development of system-level abstractions that explicitly integrate information relevant to topology, data placement and data migration. Such abstractions might then be used by programming languages or runtimes to design better memory management policies transparent to end users.

Chapel's New Adventures in Intra-node Locality   
Brad Chamberlain, Cray Inc.

I'll start this talk with a brief review of Chapel's key concepts for expressing data locality: locales, on-clauses, and domain maps.  From there, I will describe our recent implementation efforts tuning for, and giving the user control over, locality on current processor architectures such as those with multiple NUMA domains and the MCDRAM available on Intel Xeon Phi ("KNL").  As time permits, I will attempt to shine a light on the interplay between memory locality and the network interface for PGAS languages like Chapel.  I expect to back up my characterizations with performance measurements from benchmarks and/or proxy applications as much as possible.

Extreme Heterogeneity and Flexibility Challenges for Data Locality
Andrew A. Chien, University of Chicago

Technology scaling is driving extreme customization (multiple accelerators) and flexible configurability (memory structure). We will highlight these trends, and the challenges they represent for programming and efficiency.


Session 3: Runtime Systems

Kokkos Task-DAG: Memory Management and Locality Challenges Conquered   
H. Carter Edwards, Sandia National Laboratories


Kokkos' new task-DAG capability supports the portable parallel execution of a dynamic, heterogeneous set of fine-grain tasks with a dynamic directed acyclic graph (DAG) of dependences. Significant memory management and memory locality challenges had to be conquered for portability, performance, and usability on attached GPUs. First, tasks must be scheduled and executed by the GPU's lean runtime without intervention from, or interaction with, the CPU. Second, a dynamic task-DAG requires thread-scalable memory management that operates within a bounded span of GPU memory. Third, the GPU does not provide coherency among the cores' L1 caches. Fourth, algorithms written using the task-DAG capability must be insulated from all of these challenges.


Asynchronous Task-Based Polar Decomposition on Single Node Manycore Architectures     
Hatem Ltaief, KAUST    

We introduce the first asynchronous, task-based formulation of the polar decomposition and its corresponding implementation on manycore architectures. Based on a new fine-grained formulation of the iterative QR dynamically-weighted Halley algorithm (QDWH) for the calculation of the polar decomposition, the novel task-based implementation is capable of taking advantage of the identity structure of the matrix involved during the QDWH iterations, which decreases the overall algorithmic complexity. Furthermore, the artifactual synchronization points have been weakened compared to previous implementations, unveiling look-ahead opportunities for better data locality and hardware occupancy. The overall QDWH-based polar decomposition can then be represented as a directed acyclic graph (DAG), where nodes represent computational tasks and edges define the inter-task data dependencies. The StarPU dynamic runtime system is employed to traverse the DAG, to track the various data dependencies, and to asynchronously schedule the computational tasks on the underlying hardware resources, resulting in an out-of-order task scheduling. Benchmarking experiments show significant improvements against existing state-of-the-art high-performance implementations (i.e., Intel MKL and Elemental) for the polar decomposition on the latest vendor shared-memory systems (i.e., Intel Haswell/Broadwell/Knights Landing, NVIDIA K80/P100 GPUs, and IBM Power8), while maintaining high numerical accuracy.
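For intuition only, the classical Newton iteration for the polar factor, a simpler unweighted relative of the QDWH iteration used in this work, can be sketched as follows:

```python
import numpy as np

def polar_newton(A, tol=1e-12, maxit=100):
    """Newton iteration X <- (X + X^{-T}) / 2 converging to the
    orthogonal polar factor U of A = U H.  QDWH replaces this with a
    dynamically weighted, inverse-free QR-based iteration; this
    unweighted variant is shown only to convey the structure of the
    computation, not the algorithm of the talk."""
    X = A.copy()
    for _ in range(maxit):
        Xn = 0.5 * (X + np.linalg.inv(X).T)
        if np.linalg.norm(Xn - X) <= tol * np.linalg.norm(Xn):
            X = Xn
            break
        X = Xn
    U = X
    H = 0.5 * (U.T @ A + (U.T @ A).T)  # symmetrized Hermitian factor
    return U, H
```

Each iteration is dominated by a dense factorization/inversion, which is why a fine-grained task-DAG formulation with look-ahead pays off on manycore hardware.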

Addressing Data Locality via Adaptive Runtimes and Higher Level Abstractions    
Laxmikant (Sanjay) Kale, University of Illinois at Urbana-Champaign     

Data locality is increasing in importance in HPC. Based on experience with Charm++ and other higher-level abstractions, and on multiple parallel applications, I will argue that (a) the encapsulation provided by over-decomposed entities promotes locality, often without the need for hierarchical abstractions; (b) persistence is an important property of HPC applications that helps in the management of locality, especially via adaptive RTSs; and (c) macro-dataflow as an intermediate representation empowers an aRTS to do effective locality-aware scheduling and data movement. I will also express my views on how, and which, higher-level abstractions can help improve locality.

Data Locality - Old, New and Future in Legacy Fortran and C++ codes     
Martin Berzins, SCI Institute, University of Utah     

Data locality in legacy Fortran and C++ codes is addressed from the point of view of rethinking the design of a potential Navy next-generation weather code, which uses legacy Fortran, and from the point of view of the asynchronous task-based Uintah framework, which has legacy C++ components. In particular, we look at the performance of the codes on modern architectures such as GPUs and Intel KNLs. At the same time, we look forward to how the codes can evolve to future architectures beyond the existing targets of today. In both cases we look at how the required performance portability may be achieved with minimal effort on the user's part through changes to how tasks are executed, in particular how this may be done by the runtime system, and what the costs of doing so in a portable way are.


HiHAT: Retargetable Infrastructure for Tomorrow's Systems
CJ Newburn, NVIDIA

The increase of diversity in future platforms breeds complexity.  There will be more kinds of specialized computing elements, and the memory system will be composed of more kinds, more layers, and more locations.  Software abstractions enable application developers to enjoy the benefits of this specialization without having to become experts in a myriad of technologies.  "Ninja" experts will be able to provide highly-tuned access to each new feature, and those operating in a tuning role will be able to configure software infrastructure to dispatch user requests to target-specific implementations.
A new community-wide project called HiHAT, for Hierarchical Heterogeneous Asynchronous Tasking, serves as retargetable infrastructure underneath tasking and language runtimes, and enables them to be ported to make optimal use of new architectural features.  We introduce HiHAT's design, and show how it provides a way to make best use of innovative new technologies.  We present an initial HiHAT prototype, evaluate its overheads, and demonstrate its applicability to a sample application, molecular orbitals, on a system with a mix of CPUs and GPUs and different kinds of memory.


Data Locality Challenges in Irregular Applications for Exascale Programming
Min Si, Argonne National Laboratory         

Although the well-known Bulk Synchronous Parallel model still plays an important role in HPC application programming, a completely data-driven and irregular computation model is being utilized by a set of applications, especially in the chemistry and bioinformatics domains. This model usually relies on a global shared address space, where the physical distribution of application data is hidden from application algorithms, and on a parallelism scheme in which every process randomly requests a computation task for the next chunk of the ``global shared'' data, in order to achieve better computation load balancing. This model, however, may run into a critical communication bottleneck as scientists start looking into more complex domain problems in exascale computing, which show an increasing demand for data movement. In this talk, we focus on the communication model of these irregular applications and discuss potential optimizations for exascale computing that utilize data locality techniques.



Session 4: Performance Modeling


Performance model generation: Challenges and Approaches       
Boyana Norris, University of Oregon       

Data locality in modern architectures is often difficult to exploit because memory is used and managed by multiple software and hardware layers, mostly outside the control of the HPC application developer. Performance tools provide a rather coarse-grained view of memory performance, which is not necessarily helpful in pinpointing locality-related problems or optimizations. In this talk, we overview some approaches to generating parameterized performance bounds that can produce models of memory and overall performance that are not specific to a single architecture. The goal is to enable a better understanding of the data locality characteristics of a particular implementation without relying on noisy and coarse-grained measurements.

Coarse-Grained Data Locality Optimization Algorithm       
Mohamed Wahib, AIST-TokyoTech Open Innovation Lab     

Optimization for data locality, targeted by compilers (e.g., loop fusion) and programmers (e.g., loop tiling), is traditionally limited in granularity to affine code regions. On the other hand, some categories of programs can benefit from static data locality optimization applied at a coarser granularity: code in loosely coupled segments. These include, for example, (a) applications with memory-bound kernels offloaded to discrete accelerators, (b) coarse-grained pipelined stream programs, and (c) static tasks of task-based parallelism models deployed on manycore processors. Such coarse-grained optimization comes down to applying a transformation based on fission and fusion of the loosely coupled code segments under constraints of inter-dependencies, precedences, and architectural features. However, the solution space for this type of optimization can grow exponentially, rendering deterministic methods intractable for large real-world applications. In this talk, we introduce an efficient evolutionary algorithm, LOGGA (Locality Optimization Grouped Genetic Algorithm), designed for coarse-grained data locality optimization.

Unveiling the Interplay Between Global Link Arrangements and Network Management Algorithms on Dragonfly Networks   
Vitus J. Leung, Sandia National Laboratories

Network messaging delay historically constitutes a large portion of the wall-clock time for High Performance Computing (HPC) applications, as these applications run on many nodes and involve intensive communication among their tasks. The dragonfly network topology has emerged as a promising solution for building exascale HPC systems, owing to its low network diameter and large bisection bandwidth. A dragonfly network includes local links that form groups and global links that connect these groups via high-bandwidth optical links. Many aspects of dragonfly network design are yet to be explored, such as the performance impact of the connectivity of the global links (i.e., global link arrangements), the bandwidth of the local and global links, and the job allocation algorithm.

This paper first introduces a packet-level simulation framework to model the performance of HPC applications in detail. The proposed framework is able to simulate known MPI (message passing interface) routines as well as applications with custom-defined communication patterns for a given job placement algorithm and network topology. Using this simulation framework, we investigate the coupling between global link bandwidth and arrangements, communication pattern and intensity, job allocation and task mapping algorithms, and routing mechanisms in dragonfly topologies. We demonstrate that by choosing the right combination of system settings and workload allocation algorithms, communication overhead can be decreased by up to 44%. We also show that circulant arrangement provides up to 15% higher bisection bandwidth compared to the other arrangements; but for realistic workloads, the performance impact of link arrangements is less than 3%.

Towards provably locality-enhancing work stealing algorithms   
Kenjiro Taura, University of Tokyo

Random work stealing is the algorithm of choice for load-balancing dynamically generated, irregular, and/or potentially unbalanced workloads among cores on a shared-memory node.  It performs well primarily because (1) it has no central bottlenecks and (2) it preserves a certain degree of the temporal locality that exists in the serial execution, in the sense that tasks executed back-to-back in the serial execution tend to be executed back-to-back on the same core in parallel executions as well.  Yet there is little hope that the original random work stealing, which chooses a victim worker uniformly among all workers, will scale to thousands of cores, let alone thousands of distributed-memory nodes, due to the uniformly random communication patterns that task migration itself causes and the resulting remote data access patterns that do not respect machine hierarchies.  It is thus natural to ask whether we can improve this "uniformly random" work stealing into a locality-aware one whose task migrations better respect the machine's locality hierarchy.  A number of heuristics have been proposed, but we are not aware of any algorithm with a provable running-time guarantee better than uniformly random victim selection.  In this talk, I shed some light on why such an algorithm is not easy to come by and explain an algorithm we have worked on so far.
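To make the design space concrete, here is one common locality-aware heuristic sketched in Python (an illustration only; it is emphatically not the provably efficient algorithm the talk describes, and the 0.5 level probability is an arbitrary assumption):

```python
import math
import random

def pick_victim(thief, n_workers, arity=2, rng=random):
    """Hierarchy-aware victim selection: pick a level of a balanced
    machine hierarchy with geometrically decreasing probability (so
    nearby subtrees are favored), then steal from a uniformly random
    worker in the thief's subtree at that level."""
    levels = max(1, math.ceil(math.log(n_workers, arity)))
    level = 1
    while level < levels and rng.random() < 0.5:
        level += 1
    span = arity ** level
    base = (thief // span) * span
    candidates = [w for w in range(base, min(base + span, n_workers))
                  if w != thief]
    if not candidates:  # degenerate subtree; fall back to uniform
        candidates = [w for w in range(n_workers) if w != thief]
    return rng.choice(candidates)
```

Heuristics of this shape keep most steals within a NUMA node or socket, but, as the abstract notes, proving a running-time bound that beats uniform victim selection is the hard part.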



OpenMP, Unified Memory, and Prefetching       
Hal Finkel, Argonne National Laboratory    

State-of-the-art accelerator hardware, such as the NVIDIA GPUs that will power upcoming supercomputers, will support unified virtual memory (UVM). This will allow the GPU to fetch data from the host via on-demand paging, both for convenience and to support working sets larger than the device's memory. While OpenMP's accelerator offload model generally must be implemented by fully copying mapped data to and from the device, with UVM this is no longer the case. In fact, the runtime library does not need to explicitly copy any data at all. However, on-demand paging proves to be an expensive way to access data. Using cost-model-driven software prefetching, and automated pipelining, informed by profile-guided compiler analyses, we can mitigate the costs of on-demand paging while still reaping the benefits of UVM. This talk will discuss the implementation of these techniques within LLVM and its OpenMP runtime along with some performance results.

Matrix-Free Techniques for Fast PDE Solvers   
Karl Rupp, Argonne National Laboratory
  
  
Essential techniques for high-performance matrix-free residual computations in numerical partial differential equation (PDE) solvers are presented via two example applications. First, a variable-coefficient Laplace problem motivates a technique we call "thread transposition" to vectorize over elements on GPUs. Second, a Q2-P1 mixed finite element discretization of the Stokes equations outlines vectorization opportunities on CPUs and GPUs for tensor bases. These examples show how small details can lead to substantial differences in the implementations; hence, one must not expect to find a one-size-fits-all solution for PDE solvers.
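As a minimal sketch of the matrix-free idea (illustrative only, not the talk's GPU implementation), the 2-D 5-point Laplacian can be applied directly from its stencil without ever assembling a sparse matrix:

```python
import numpy as np

def apply_laplacian(u, h):
    """Matrix-free operator application y = A u for the 2-D 5-point
    finite-difference Laplacian on a uniform grid with spacing h
    (boundary rows left untouched).  The matrix A is never formed;
    only the stencil is evaluated, which is the basis of matrix-free
    residual computations."""
    y = np.zeros_like(u)
    y[1:-1, 1:-1] = (4.0 * u[1:-1, 1:-1]
                     - u[:-2, 1:-1] - u[2:, 1:-1]
                     - u[1:-1, :-2] - u[1:-1, 2:]) / h**2
    return y
```

Skipping matrix assembly trades memory traffic for recomputation, which is precisely the trade-off the talk's vectorization techniques optimize.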

Session 5: Compilers and DSLs


How Smart Should Compilers Be When Handling Abstractions for Locality?   
Michelle Strout, University of Arizona    

As early as the HPF language, there have been mechanisms to specify to compilers how to partition data, how to lay out data, or other locality-centric information.  Examples of these mechanisms include type declarations, pragmas, and function calls.  These mechanisms require the compiler to contain various levels of complexity (or smarts).  As an example of a compiler needing “smarts”, the HPF compiler was responsible for dividing up the computation given a programmer-specified domain decomposition.  In the Loop Chain project, we are currently developing loop chain pragmas for scheduling across loops in a way that balances data locality and parallelism.  We have an ongoing debate about what programmers should specify in terms of the loop chain schedule versus what the compiler should figure out.  In this talk, I give an overview of how loop chain pragmas navigate this tradeoff and compare them with some other examples.

The Loop Chain Abstraction: A Demonstration   
Catherine Olschanowsky, Boise State University   

Exposing opportunities for parallelization while explicitly managing data locality is the primary challenge in porting and optimizing existing computational science simulation codes to improve performance and accuracy. OpenMP provides many mechanisms for expressing parallelism, but it remains primarily the programmer's responsibility to group computations to improve data locality. The Loop Chain abstraction, where a summary of data access patterns is included as pragmas associated with parallel loops, provides compilers with sufficient information to automate the parallelism versus data locality tradeoff.

In this talk I will present the syntax and semantics of loop chain pragmas for conveying information about the loops that belong to the chain, along with an orthogonal specification of a high-level schedule (i.e., a transformation plan) for the whole loop chain. A computational fluid dynamics mini-application will be used to demonstrate the pragmas.
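The pragma syntax itself is the subject of the talk; purely to illustrate the kind of schedule transformation a loop chain enables, here is the fusion of two loops that share an intermediate array (a sketch, not generated code):

```python
def chain_unfused(a):
    """Two separate loops: the intermediate array b is written in full
    by the first loop and re-read by the second, so for large inputs
    every b value makes a round trip through main memory."""
    b = [x * 2.0 for x in a]
    c = [x + 1.0 for x in b]
    return c

def chain_fused(a):
    """A fused schedule such as a loop-chain compiler could produce:
    each intermediate value is consumed immediately after it is
    produced, so it stays in a register instead of memory."""
    return [x * 2.0 + 1.0 for x in a]
```

Both schedules compute the same result; the loop chain's data-access summary is what lets the compiler prove the fusion legal and choose between such schedules.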

Compiler-Based Code Generation for Stencil Computations with Bricked Data Layout
Protonu Basu, Lawrence Berkeley National Laboratory


Optimizing stencil computations on modern architectures with wide SIMD requires several code optimization techniques to improve data locality, manage computational intensity, and generate SIMD-friendly code. We present an automated code generation technique which uses a bricked data layout. A structured grid is laid out as a collection of small bricks, with the brick volume being a multiple of the SIMD width and an autotuning candidate. Furthermore, bricks lack ghost zones, which differentiates this technique from domain decomposition; stencil computations thus involve a collection of neighboring bricks. We discuss stencil optimizations with the bricked data layout in combination with techniques such as iteration space tiling, data tiling, stencil reordering, and low-level optimizations to target intrinsics or generate native-compiler-friendly code. Finally, we discuss how this can be used to implement domain-specific languages or tools.
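As a hedged sketch of what a bricked layout means (address arithmetic only; the actual code generator, SIMD mapping, and tuning are the subject of the talk), a 2-D grid can be reordered into contiguous bricks like this:

```python
import numpy as np

BRICK = 4  # brick edge length; in practice a multiple of the SIMD width

def to_bricks(grid):
    """Reorder a 2-D grid (side divisible by BRICK) into a bricked
    layout: each BRICK x BRICK brick is stored contiguously, so a
    stencil sweep touches whole bricks instead of long strided rows.
    No ghost zones are added; neighbor values come from adjacent
    bricks, as in the talk."""
    n = grid.shape[0]
    nb = n // BRICK
    return (grid.reshape(nb, BRICK, nb, BRICK)
                .transpose(0, 2, 1, 3)
                .reshape(nb, nb, BRICK, BRICK))

def brick_read(bricks, i, j):
    """Element (i, j) of the original grid, addressed through bricks."""
    return bricks[i // BRICK, j // BRICK, i % BRICK, j % BRICK]
```

Because each brick is contiguous, a brick-sized stencil tile maps naturally onto SIMD vectors, and the brick volume becomes the autotuning knob mentioned above.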

Two-way Python-Fortran transpilation for performance and maintainability         
Mateusz Bysiek, Tokyo Institute of Technology  

Python enables easier expression of complex computational problems. However, the simplicity of high-level languages usually comes with a huge performance penalty. In this talk, we introduce an HPC-oriented approach to bidirectional source-to-source translation between Fortran and Python, which opens up the possibility of migrating legacy scientific code (written in Fortran) to Python while preserving its original performance. Furthermore, we will give an outlook on potential future applications of this approach, with emphasis on abstractions of data locality.

DAPP - Data-Centric Parallel Programming for Shared and Distributed Memory Systems   
Tal Ben-Nun, ETH Zurich   


With the rapidly changing landscape of computing architectures and systems, application programming complexity is increasing beyond the skill set of the average scientist. We present DAPP, a unified programming abstraction and system that facilitates performance-portable heterogeneous computing on distributed memory environments. DAPP consists of a three-stage pipeline, transforming programs into specialized per-architecture code by applying optimizations such as tiling, fusion, and vectorization. This program transformation is made possible by decoupling memory access from processing in the programming phase, and by using performance modeling for specialization and tuning. Specifically, we formulate a Data-Centric Intermediate Representation (DCIR) that encodes data movement costs and enables flexible, potentially custom, program optimization routines. DAPP programs are implemented on top of the Python programming language, which is commonly used in scientific applications and industry. We show how Python can be extended without modifying the language itself, using runtime AST parsing and modification for code transformation purposes. We demonstrate the capabilities of DAPP on several applications in linear algebra and climate modeling, showing that it maintains both expressiveness and high performance.



