Program (with abstracts)

9:00 - 10:00 — Welcome and Invited Keynote

Geoffrey Mainland and Tiark Rompf.

Chairs' Welcome to FHPC 2015

Milind Kulkarni (Purdue University).

Regularizing the irregular: analyses and transformations for recursive, irregular applications. (invited keynote)

10:00 - 10:30 — Break

10:30 - 11:20 — Code Generation and Auto-tuning

Naoki Takashima, Hiroki Sakamoto andYukiyoshi Kameyama.

Generate and Offshore: Type-safe and Modular Code Generation for Low-Level Optimization

We have designed the Asuna system which supports implicitly heterogeneous multi-stage programming in MetaOCaml. Using this system, programmers can write code generators in a high-level language, and generated code can be translated to a program in low-level languages such as C and LLVM. The high-level code generators can make use of all the features of MetaOCaml such as algebraic data types and higher-order functions while the generated code may include low-level CPU instructions such as vector arithmetic operations.

We can write program in the modular and type-safe programming style and can directly represent low-level optimizations at the same time. Asuna is a multi-target system, that means a single code generator can generate code in C or LLVM, without changing the generator. The translation by Asuna preserves typing and every generated code is guaranteed to be well typed and well scoped.

In this paper, we explain the practical side of Asuna, by showing that how one can implement low-level optimizations using the examples of matrix multiplication. We also show that this techniques can be applied to other examples such as Strassen's algorithm and Fast Fourier transformation.

Michael Vollmer, Bo Joel Svensson, Eric Holk and Ryan Newton.

Meta-Programming and Auto-Tuning in the Search for High Performance GPU Code

Writing high performance GPGPU code is often difficult and time-consuming, potentially requiring laborious manual tuning of low-level details. Despite these challenges, the cost in ignoring GPUs in high performance computing is increasingly large.

Auto-tuning is a potential solution to the problem of tedious manual tuning. We present a framework for auto-tuning GPU kernels which are expressed in an embedded DSL, and which expose compile-time parameters for tuning. Our framework allows for kernels to be polymorphic over what search strategy will tune them, and allows search strategies to be implemented in the same meta-language as the kernel-generation code (Haskell).

Further, we show how to use functional programming abstractions to enforce regular (hyper-rectangular) search spaces.

We also evaluate several common search strategies on a variety of kernels, and demonstrate that the framework can tune both EDSL and ordinary CUDA code.

11:20 - 11:40 — Coffee Break

11:40 - 12:30 — Automatic Partitioning

Bo Joel Svensson, Michael Vollmer, Eric Holk, Trevor McDonell and Ryan Newton.

Converting Data-Parallelism to Task-Parallelism by Rewrites: Purely Functional Programs Across Multiple GPUs

High-level domain-specific languages for array processing on the GPU are increasingly common, but they typically only run on a single GPU. As computational power is distributed across more devices, languages must target multiple devices simultaneously. To this end, we present a compositional translation that fissions data-parallel programs in the Accelerate language, allowing subsequent compiler and runtime stages to map computations onto multiple devices for improved performance---even programs that begin as a single data-parallel kernel.

Frederik Meisner Madsen, Robert Clifton-Everest, Manuel Chakravarty and Gabriele Keller.

Functional Array Streams

Regular array languages for high performance computing based on aggregate operations provide a convenient parallel programming model, which enables the generation of efficient code for SIMD architectures, such as GPUs. However, the data sets that can be processed with current implementations are severely constrained by the limited amount of main memory available in these architectures.

In this paper, we propose an extension of the embedded array language Accelerate with a notion of sequences, resulting in a two level hierarchy which allows the programmer to specify a partitioning strategy which facilitates automatic resource allocation. Depending on the available memory, the runtime system processes the overall data set in streams of chunks appropriate to the hardware parameters.

In this paper, we present the language design for the sequence operations, as well as the compilation and runtgime support, and demonstrate with a set of benchmarks the feasibility of this approach.

12:30 - 14:00 — Lunch

14:00 - 14:50 — Tutorial

Alexander Slesarenko and Alexey Romanov (Huawei Technologies).

Scalan: A Framework for Domain-Specific Hotspot Optimization

While high-level abstractions greatly simplify program development, they ultimately need to be eliminated to produce high- performance code. This can be done using generative programming; one particularly usable approach is Lightweight Modular Staging.

We present Scalan, a framework which enables compilation of high-level object-oriented-functional code into high-performance low-level code. It extends the basic LMS approach by making rewrite rules and compilation stages first-class and extending the graph IR with object-oriented features.

14:50 - 15:10 — Break

15:10 - 16:00 — Skeletons and Chairs' Report

David Duke and Fouzhan Hosseini.

Skeletons for Distributed Topological Computation

Parallel implementation of topological algorithms is highly desirable, but the challenges, from reconstructing algorithms around independent threads through to runtime load balancing, have proven to be formidable. This problem, made all the more acute by the diversity of hardware platforms, has led to new kinds of implementation platform for computational science, with sophisticated runtime systems managing and coordinating large threadcounts to keep processing elements heavily utilized. While simpler and more portable than direct management of threads, these approaches still entangle program logic with resource management. Similar kinds of highly parallel runtime system have also been developed for functional languages. Here, however, language support for higher-order functions allows a cleaner separation between the algorithm and `skeletons' that express generic patterns of parallel computation. We report results on using this technique to develop a distributed version of the Joint Contour Net, a generalization of the Contour Tree to multifields. We present performance comparisons against a recent Haskell implementation using shared-memory parallelism, and initial work on a skeleton for distributed memory implementation that utilizes an innovative strategy to reduce inter-process communication overheads.

Tiark Rompf and Geoffrey Mainland.

Chair's Report

16:00 - 16:30 — Tea Break

16:30 - 17:30 — Concluding Discussion