Accepted papers

David Duke and Fouzhan Hosseini.

Skeletons for Distributed Topological Computation

Parallel implementation of topological algorithms is highly desirable, but the challenges, from reconstructing algorithms around independent threads through to runtime load balancing, have proven to be formidable.This problem, made all the more acute by the diversity of hardware platforms, has led to new kinds of implementation platform for computational science, with sophisticated runtime systems managing and coordinating large threadcounts to keep processing elements heavily utilized.While simpler and more portable than direct management of threads, these approaches still entangle program logic with resource management. Similar kinds of highly parallel runtime system have also been developed for functional languages.Here, however, language support for higher-order functions allows a cleaner separation between the algorithm and `skeletons' that express generic patterns of parallel computation.We report results on using this technique to develop a distributed version of the Joint Contour Net, a generalization of the Contour Tree to multifields.We present performance comparisons against a recent Haskell implementation using shared-memory parallelism, and initial work on a skeleton for distributed memory implementation that utilizes an innovative strategy to reduce inter-process communication overheads.

Naoki Takashima, Hiroki Sakamoto andYukiyoshi Kameyama.

Generate and Offshore: Type-safe and Modular Code Generation for Low-Level Optimization

We have designed the Asuna system which supports implicitly heterogeneous multi-stage programming in MetaOCaml. Using this system, programmers can write code generators in a high-level language, and generated code can be translated to a program in low-level languages such as C and LLVM. The high-level code generators can make use of all the features of MetaOCaml such as algebraic data types and higher-order functions while the generated code may include low-level CPU instructions such as vector arithmetic operations.

We can write program in the modular and type-safe programming style and can directly represent low-level optimizations at the same time. Asuna is a multi-target system, that means a single code generator can generate code in C or LLVM, without changing the generator. The translation by Asuna preserves typing and every generated code is guaranteed to be well typed and well scoped.

In this paper, we explain the practical side of \asuna, by showing that how one can implement low-level optimizations using the examples of matrix multiplication. We also show that this techniques can be applied to other examples such as Strassen's algorithm and Fast Fourier transformation.

Michael Vollmer, Bo Joel Svensson, Eric Holk and Ryan Newton.

Meta-Programming and Auto-Tuning in the Search for High Performance GPU Code

Writing high performance GPGPU code is often difficult and time-consuming, potentially requiring laborious manual tuning of low-level details. Despite these challenges, the cost in ignoring GPUs in high performance computing is increasingly large.

Auto-tuning is a potential solution to the problem of tedious manual tuning. We present a framework for auto-tuning GPU kernels which are expressed in an embedded DSL, and which expose compile-time parameters for tuning. Our framework allows for kernels to be polymorphic over what search strategy will tune them, and allows search strategies to be implemented in the same meta-language as the kernel-generation code (Haskell).

Further, we show how to use functional programming abstractions to enforce regular (hyper-rectangular) search spaces.

We also evaluate several common search strategies on a variety of kernels, and demonstrate that the framework can tune both EDSL and ordinary CUDA code.

Bo Joel Svensson, Michael Vollmer, Eric Holk, Trevor McDonell and Ryan Newton.

Converting Data-Parallelism to Task-Parallelism by Rewrites: Purely Functional Programs Across Multiple GPUs

High-level domain-specific languages for array processing on the GPU are increasingly common, but they typically only run on a single GPU. As computational power is distributed across more devices, languages must target multiple devices simultaneously. To this end, we present a compositional translation that fissions data-parallel programs in the Accelerate language, allowing subsequent compiler and runtime stages to map computations onto multiple devices for improved performance---even programs that begin as a single data-parallel kernel.

Frederik Meisner Madsen, Robert Clifton-Everest, Manuel Chakravarty and Gabriele Keller.

Functional Array Streams

Regular array languages for high performance computing based on aggregate operations provide a convenient parallel programming model, which enables the generation of efficient code for SIMD architectures, such as GPUs. However, the data sets that can be processed with current implementations are severely constrained by the limited amount of main memory available in these architectures.

In this paper, we propose an extension of the embedded array language Accelerate with a notion of sequences, resulting in a two level hierarchy which allows the programmer to specify a partitioning strategy which facilitates automatic resource allocation. Depending on the available memory, the runtime system processes the overall data set in streams of chunks appropriate to the hardware parameters.

In this paper, we present the language design for the sequence operations, as well as the compilation and runtgime support, and demonstrate with a set of benchmarks the feasibility of this approach.