

Session 1: Real Life Applications - Part I

Haohuan Fu, Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer

This talk reports our efforts on refactoring and optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight supercomputer, which uses a many-core processor consisting of management processing elements (MPEs) and clusters of computing processing elements (CPEs). To map the large code base of CAM onto the millions of cores of the Sunway system, we take OpenACC-based refactoring as the major approach and apply source-to-source translation tools to exploit the most suitable parallelism for the CPE clusters and to fit the intermediate variables into the limited on-chip fast buffer. For individual kernels, comparing the original ported version using only MPEs with the refactored version using both the MPEs and the CPE clusters, we achieve speedups of up to 22x for the compute-intensive kernels. For the 25km-resolution CAM global model, we manage to scale to 24,000 MPEs and 1,536,000 CPEs, achieving a simulation speed of 2.81 model years per day.

X. Sherry Li, Locality challenges for sparse factorization-based solvers

Factorization-based algorithms (e.g., sparse LU) play a significant role in developing robust, scalable solvers. Their memory access patterns are very different from those of iterative solvers based primarily on sparse matrix-vector products. In this talk, we first highlight the computational, memory access, and communication patterns of the two classes of sparse factorization methods, supernodal and multifrontal, for which we have developed the widely used software packages SuperLU and STRUMPACK, the latter further enhanced with hierarchical matrix algebra. We then present some techniques to exploit the newer heterogeneous node architectures, such as nodes with GPU accelerators or Intel Xeon Phi. We see significant challenges for these classes of algorithms in effectively utilizing the new architectural features.

Johannes Langguth, Hierarchical partitioning of unstructured meshes in cardiac electrophysiology

Unstructured meshes are widely used in computational science and provide numerous advantages over structured meshes in many applications. With respect to data locality, however, their irregular data structures and access patterns pose severe challenges, especially on modern heterogeneous clusters with deep memory hierarchies, which are poised to become the standard in the foreseeable future. We discuss several of the challenges involved in hierarchical partitioning, such as preserving locality in complex accelerator-equipped nodes, load balancing between heterogeneous devices, and reordering for cache efficiency, as well as solutions to these problems implemented in a cell-centered finite volume simulation code from cardiac electrophysiology. Finally, we present our current steps toward truly hierarchical partitioning software for automatic data placement across heterogeneous clusters with varying node architectures.

Anshu Dubey, The Accelerated Climate Model for Energy (ACME) on Exascale Computers

Numerical models of the Earth's climate system are used to understand and predict future climate change and its impacts. The Accelerated Climate Model for Energy (ACME) is a US Department of Energy (DOE) project tasked with developing Earth System Models for simulating the feedbacks between the climate system and our energy use. Earth System Models require computational capabilities at the exascale and beyond and the ACME project is specifically targeted at utilizing exascale systems being deployed by the DOE, including GPU-accelerated and many-core systems. We will describe the ACME model and the challenges of developing these complex models for exascale machines. We will present our current approaches for managing memory and parallelism with results on today’s pre-exascale architectures. Finally, we will describe some new efforts to explore alternative programming models that may provide better tools for managing the memory hierarchies, increased parallelism and other complexities at exascale.

Session 2: Real Life Applications - Part II

Rio Yokota, Improving data-locality of fast multipole methods

Fast multipole methods (FMM) have O(N) complexity but are compute bound. They have O(log P) communication complexity but highly asynchronous data dependencies. These features make them an interesting alternative for solving elliptic equations on next-generation large-scale computers. However, there is ample room for increasing the data locality of FMM. A few examples will be described in this talk.

Mauro Bianco, Generative Programming for Complex Stencil Applications

Stencil computations are classical patterns in scientific computing. Developing efficient and portable applications for simulating complex phenomena can be challenging. These applications need to execute several stencils on several input fields, and while trivial implementations turn out to be bandwidth limited, more efficient solutions require intricate implementations. By using generative programming techniques, the intricate implementation can be left to library developers, while application-level code can remain more abstract. We illustrate how these techniques are made effective in the GridTools libraries to efficiently run complex stencil-like computations on multiple platforms.
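As a minimal sketch of why such intricate implementations pay off (a 1-D toy with invented function names, not the GridTools API): chaining two stencils naively streams the whole field through memory once per stencil, while a fused schedule produces each output in a single pass.

```python
# Two chained 1-D three-point stencils. The naive version streams the full
# array through memory once per stencil (bandwidth bound); the fused version
# produces each output in a single pass, trading a little recomputation for
# far less memory traffic. Names are illustrative, not GridTools.

def naive(a):
    b = [(a[i - 1] + a[i] + a[i + 1]) / 3.0 for i in range(1, len(a) - 1)]
    return [b[i + 1] - b[i - 1] for i in range(1, len(b) - 1)]

def fused(a):
    def smooth(i):
        return (a[i - 1] + a[i] + a[i + 1]) / 3.0
    # One sweep; smooth() values are recomputed instead of stored globally.
    return [smooth(i + 1) - smooth(i - 1) for i in range(2, len(a) - 2)]

a = [float(i * i % 7) for i in range(16)]
assert naive(a) == fused(a)
```

A generative library can derive such fused schedules (and their 3-D, tiled, multi-platform counterparts) from an abstract description of the stencil chain, which is the role GridTools plays.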

Michelle Strout, The Loop Chain Abstraction for Specifying Data Locality

Exposing opportunities for parallelization while explicitly managing data locality is the primary challenge in porting and optimizing existing computational science simulation codes to improve performance and accuracy. The most popular programming models in these codes, such as MPI, require that programmers explicitly determine the data and computation distribution. This has led to good scaling between nodes, but parallelism and locality are needed within the node as well. There are many approaches for implementing shared memory parallelism, but with most of them it is the programmer's responsibility to group computations to improve data locality. We propose the loop chain abstraction for providing compilers with sufficient information to automate the parallelism versus data locality tradeoff. The programmer specifies data access patterns for the sequence of loops in the loop chain, thus enabling the compiler to group iterations across loops as well as within individual loops to tune the tradeoff between parallelism and locality. In this talk, I will present our initial results in using and automating the loop chain abstraction within the context of computational fluid dynamics applications.
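A sketch of the underlying idea under simplifying assumptions (1-D arrays, a one-point halo, names invented for illustration): once the access pattern of each loop in the chain is known, iterations of the second loop can be interleaved with tiles of the first so that data written by loop 1 is still in cache when loop 2 reads it.

```python
# Loop chain of two loops: loop 1 writes v from u; loop 2 reads a one-point
# halo of v to write w. 'chained' runs the loop-2 iterations as soon as the
# v values they need exist, instead of sweeping all of loop 1 first.

def reference(u):
    n = len(u)
    v = [2.0 * x for x in u]
    w = [0.0] * n
    for i in range(1, n - 1):
        w[i] = v[i - 1] + v[i] + v[i + 1]
    return w

def chained(u, tile=4):
    n = len(u)
    v = [0.0] * n
    w = [0.0] * n
    done = 1                      # next loop-2 iteration to execute
    for lo in range(0, n, tile):
        hi = min(lo + tile, n)
        for i in range(lo, hi):   # tile of loop 1
            v[i] = 2.0 * u[i]
        # Loop-2 iterations whose halo (i-1, i, i+1) is now fully written.
        stop = hi - 1 if hi < n else n - 1
        for i in range(done, stop):
            w[i] = v[i - 1] + v[i] + v[i + 1]
        done = stop
    return w

u = [float((3 * i) % 11) for i in range(10)]
assert chained(u) == reference(u)
```

In the loop chain abstraction the programmer states only the per-loop access patterns; deriving a legal shifted-tile schedule like the one hand-coded above is left to the compiler.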

Hatem Ltaief, Locality-Aware Cholesky Factorization on Manycore Architectures

The talk will describe several variants of the Cholesky factorization applied to dense and block low-rank matrices using locality-aware scheduling techniques. Impacts on performance will be reported on Intel KNL and NVIDIA GPUs in the context of spatial statistics and computational astronomy applications.

Session 3: Accelerators and Emerging Processor Technologies - Part I

Mohamed Wahib, Daino: A High-level Framework for Parallel and Efficient AMR on GPUs

Adaptive mesh refinement (AMR) methods reduce the computational requirements of problems by increasing resolution only in areas of interest. In practice, however, efficient AMR implementations are difficult because the mesh hierarchy management must be optimized for the underlying hardware. The architectural complexity of GPUs can make efficient AMR particularly challenging in GPU-accelerated supercomputers. This talk presents a compiler-based high-level framework that can automatically transform serial uniform-mesh code annotated by the user into parallel adaptive-mesh code optimized for GPU-accelerated supercomputers. We also present a method for empirical analysis of a uniform mesh to project an upper bound on the achievable speedup of a GPU-optimized AMR code. We show experimental results on three production applications. The speedups of code generated by our framework are comparable to hand-written AMR code while achieving good strong and weak scaling up to 1,000 GPUs.

H. Carter Edwards, Kokkos Spaces: Expressing Locality in Heterogeneous Node Architectures

The Kokkos programming model includes abstractions for execution spaces, memory spaces, and accessibility relationships among these spaces. For example, a compute node's main memory is a Kokkos Host memory space, and parallel computations can be run on CPUs using Serial, Threads, or OpenMP execution spaces. When a compute node has an attached NVIDIA GPU, a Kokkos Cuda memory space is available and parallel computations can be run on a Cuda execution space. Development work is underway to enhance Kokkos' execution and memory space abstractions to partition a node's NUMA-affinity memory and CPU cores into decoupled instances of Host memory and execution spaces; access multiple attached GPUs as instances of Cuda memory and execution spaces; and query locality relationships between these spaces. This enhancement will enable applications to dispatch multiple concurrent heavy tasks to different space instances, with each task leveraging internal parallelism. It is required to support coarse-grain, inter-node tasking programming models such as Legion, Uintah, and DARMA. We will present requirements, design, and prototyping lessons learned for this under-development enhancement to Kokkos spaces.

Tobias Fuchs, Locality Hierarchies - An Abstraction of Locality in Distributed, Heterogeneous Systems

Single-node hardware design is shifting to a heterogeneous nature, and many of today's largest HPC systems are clusters that combine accelerators in heterogeneous compute device architectures. The need for new programming abstractions on the path to the exascale era has been widely recognized, and variants of the Partitioned Global Address Space (PGAS) programming model are discussed as a promising approach in this respect. DASH is a C++ template library that provides distributed data structures with support for hierarchical locality in a PGAS programming model. Portable efficiency, an essential goal in the design of DASH, can only be achieved with programming abstractions of hardware locality that allow data structures and algorithms to be optimized for the underlying system at compile time and run time. Established tools like LIKWID and hwloc provide reliable interfaces to query the hardware topology at node level but fail to construct a global representation of distributed locality domains and do not support accelerator architectures like Intel MIC. We present Locality Hierarchies, an abstraction of distributed, hierarchical locality represented as a modifiable data structure. The underlying model supports heterogeneous systems as a first-class use case and introduces a well-defined concept of distance for arbitrary distributed hardware hierarchies. Using common range-based algorithms as motivating examples, we explain how our approach facilitates locality-aware load balancing and process mapping on SuperMIC compute nodes.

Hal Finkel, A story about restrict, OpenMP, and the limits of abstraction in C++

An important goal of modern programming environments is to isolate the programmer from the details of the target architectures while providing portable performance. Most high-performance-computing applications are written in C, C++, or Fortran, many adorned with OpenMP directives. In this context, how effective can a compiler be at hiding architectural details while providing portable performance? The answer depends on how the compiler is implemented, but it also depends on the semantics specified for the base programming language and the OpenMP directives. There are also dependencies on subjective factors, such as a compiler vendor's understanding of user expectations and tolerances. I'll talk about how all of these factors combine to place limits on the level of abstraction the compiler can provide. I'll also discuss developments within the C++ standards committee, including parallel algorithms and alias-set attributes, that affect the way high-performance abstractions can be created. In addition, I'll discuss how some applications on which I work are choosing to make the abstraction vs. explicit management trade-off.

Session 4: Accelerators and Emerging Processor Technologies - Part II

Jinpil Lee, Task Parallelism Support in the XcalableMP PGAS Language for Many-core Clusters

Many-core architecture, consisting of a number of cores grouped into multiple NUMA nodes, is widely used in modern HPC systems. Task parallelism using OpenMP dependent tasks is a promising programming model for many-core architecture because it can exploit parallelism in irregular applications with fine-grain synchronization. In this talk, we introduce XcalableMP, a PGAS language for cluster computing, and its new language construct for describing task parallelism. The extended task syntax specifies execution dependencies among tasks which may cause synchronization between nodes. The presentation will show the basic concept of task parallelism in XcalableMP and how to exploit memory locality in many-core clusters with NUMA architecture.

CJ Newburn, Applying hierarchy to scaled systems

Many applications involve a high degree of communication; the speedups they can achieve on hardware platforms are limited by how well those platforms support strong scaling. Even applications that exhibit significant phases of independence often need to communicate periodically, and the efficiency of weaker scaling hinges on how well tuners manage locality and communication. In this talk, we propose topologies that enable greater scaling and highlight tasking-framework and programming-model features needed to enable a combination of strong and weak scaling. The hierarchical system design is supported by a combination of features that enable universal load/store access to distributed memories and that can manage the scope of synchronization. These are matched with programming model features that ease access to distributed memory while minimizing the scope of synchronization. Finally, we provide an evaluation of some workloads from high performance computing and deep learning that illustrates the benefits of such hardware and programming model support.

Naoya Maruyama, FPGA Performance Evaluation and Locality Optimization

Recently, programming FPGAs has become more approachable thanks to the availability of high-level synthesis tools. Most notably, both major FPGA vendors now provide an OpenCL compiler targeting their own FPGAs. Thus, although OpenCL may not be the best programming interface for application development, FPGAs can now be used with OpenCL much like other accelerators such as GPUs. However, it remains to be seen whether such usage of FPGAs gives performance advantages over existing accelerators. We will present OpenCL performance optimization for FPGAs with comparisons against GPUs and CPUs. In particular, we will show how fine-grained locality can be exploited on FPGAs, which has been demonstrated to be crucial for performance in FPGA programming.

Didem Unat, Leveraging AMR Metadata in Asynchronous Runtime

Hiding communication overheads is widely known as one of the most challenging tasks in optimizing parallel codes. As hardware architectures evolve, this problem is expected to become even more pronounced. We present Perilla, a data-driven runtime system for task graph-based execution, allowing the application to tolerate communication delays at an affordable programming cost. Unlike most existing runtime systems, Perilla is specially tailored to an AMR software framework. Leveraging high-level knowledge of data communication and locality embedded in the framework enables a unique design and implementation of Perilla on complex node architectures, as well as facilitating the process of specifying a task graph. Both experimental and analytical results on a multigrid solver show that the programmer can achieve significant speedups with only a modest amount of code modification. Perilla also serves as an example of how one can harness the computing resources of a future exascale system with domain specialization.

Kenjiro Taura, Scrutinizing Locality, Load Balancing, and Overhead Trade Offs in Task Parallel Runtime Systems

With increasing concurrency both within and across compute nodes, task parallel programming models are believed to play increasingly important roles in HPC applications. By dynamically scheduling tasks both within and across cores (and potentially across nodes), they achieve good load balancing and tolerance to unpredictable latencies due to communication, noise, synchronization, etc. It is challenging, however, to reconcile the benefits of dynamic scheduling with locality: dynamic scheduling is basically about mapping ready tasks to available cores based on unpredictable conditions that arise at runtime, while maintaining locality is largely about, in microscopic terms, binding tasks to the vicinity of the data they access. There are thus many proposals to make dynamic schedulers locality-aware: one forms a hierarchy of resources and tries to move tasks between cores close to each other; another proposes an API to explicitly specify the affinity of tasks to cores; yet another maintains the working set size of tasks running on a group of cores sharing a last-level cache. Each of these proposals, in one way or another, constrains where tasks can move and adds a slight overhead to each tasking operation, whose cost must be kept to a minimum. It is thus important to assess the effects and pitfalls of these locality-aware schedulers in light of the trade-off between locality benefits and potential loss of processor utilization; yet we do not have the right tools and conceptual frameworks to precisely understand such trade-offs. As a step toward this goal, we have been developing a tracing tool for task parallel programs, DAG Recorder, and its analysis/visualization tool DAGViz, which can precisely record where and when tasks executed.
In this talk, we demonstrate that this tool helps us analyze the aforementioned trade-offs: it can display performance loss due to lost processor utilization (the amount of time processors unnecessarily waste even though there are tasks ready to execute), and it can display elapsed time and performance counter values of individual tasks, allowing us to understand the effect of task scheduling on cache miss counts, for example. Combined, we hope the tool can help us scrutinize the effects of locality-aware schedulers.

Session 5: Performance Modeling for Data Locality   

Cy P Chan, Locality-Aware Performance Optimization and Modeling of Adaptive Mesh Refinement Codes for Exascale

The solution of partial differential equations using adaptive mesh refinement (AMR) has proven to be a computationally efficient approach for simulating a broad range of physical phenomena in which feature sizes vary widely in scale. Significant effort has gone into tuning AMR-based applications for current high performance computers; however, the push towards exascale computing is forcing changes in the way these high performance systems are designed. The evolution in hardware design for exascale will compel scientists to reexamine several core aspects of AMR in order to achieve high performance on future architectures, including a) greater attention to data locality and b) hiding communication latency through the increased use of asynchronous task-based execution. We introduce two new modeling tools: ProgrAMR, which produces skeletonized task-graph representations from high-level expressions of algorithms, and Mota Mapper, which is a Multi-Objective Topology-Aware data mapping library that we use to map block-structured AMR boxes to network nodes in a way that simultaneously reduces communication costs while balancing compute loads and memory constraints. Using these tools, we examine the implications of changes in system architecture and execution models on the performance and scalability of multigrid linear solvers, whose execution typically contributes a significant fraction to the total simulation time of AMR applications that utilize them.

Pietro Cicotti, ADAMANT: tools for characterizing and managing data movement

In HPC and extreme-scale computing, optimizing data movement is a critical aspect of performance and energy efficiency. New tools are needed to cope with the increasing complexity of understanding and managing locality, while dealing with deep and heterogeneous memories, communication, and layered storage systems. In this talk I will introduce the ADAMANT project, a toolkit for characterizing and managing data movement, and present progress to date with example applications.

Emmanuel Jeannot, Process placement, metrics and models: an experimental study

In this talk, we will present experimental results gathered about process placement. We will report how process placement impacts performance, how we can model performance, and how performance relates to high-level metrics.

Shigang Li, Cache-Oblivious MPI All-to-All Collectives

In the multi-/many-core era, the performance of MPI collectives depends more on the intra-node communication component. However, the communication algorithms generally inherit from the inter-node versions and ignore cache complexity. We propose cache-oblivious algorithms for MPI_Alltoall, MPI_Allgather, and their neighborhood counterparts. To exploit data locality, data blocks are copied into the receive buffers in Morton order in parallel. We provide a detailed cache complexity analysis of the proposed algorithms and prove that the algorithm for MPI_Alltoall is asymptotically optimal. We also propose NUMA-aware algorithms, which are still cache-oblivious within each processor, to minimize the total distance of data transfer. Experimental results show that our cache-oblivious implementations achieve significant performance improvements over naive implementations based on a shared heap for small and medium block sizes. For MPI_Alltoall, our implementation outperforms MPICH3 by 211% on average on Intel Xeon Phi, outperforms MVAPICH2 by 203% on average on Intel Xeon E7-8890, and outperforms MVAPICH2 by 123% on average on a 256-node Xeon E5-2680 cluster for block sizes less than 1KB.
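To illustrate the ordering at the heart of this approach (a toy sketch of Z-order traversal, not the paper's implementation): visiting the (source, destination) pairs of a p x p exchange along a Morton curve keeps successive block copies close together at every scale of the memory hierarchy, with no cache-size parameter anywhere in the code.

```python
# Morton (Z-order) enumeration of the (src, dst) pairs of a p x p all-to-all
# exchange, p a power of two. A row-major traversal strides far apart in the
# peers' buffers; the Z-order traversal recursively finishes small sub-squares
# first, which is what makes the block-copy loop cache-oblivious.

def morton_pairs(p):
    """All (src, dst) pairs of a p x p exchange, visited in Z-order."""
    def deinterleave(z):
        x = y = 0
        for b in range(z.bit_length()):
            if b % 2 == 0:          # even bits -> src index
                x |= (z >> b & 1) << (b // 2)
            else:                   # odd bits -> dst index
                y |= (z >> b & 1) << (b // 2)
        return x, y
    return [deinterleave(z) for z in range(p * p)]

pairs = morton_pairs(4)
# The first few copies stay inside a 2x2 corner before moving on:
# [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0), ...]
assert pairs[:4] == [(0, 0), (1, 0), (0, 1), (1, 1)]
```

In the actual collectives, each pair corresponds to one block copy into a receive buffer, and the Z-order list is split among threads so the copies also proceed in parallel.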