Programme (with abstracts)
9:00 - 10:00 — Welcome and Invited Talk
Mary Sheeran.
Chairs' Welcome to FHPC 2014
Dimitrios Vytiniotis (Microsoft Research Cambridge).
Ziria: wireless programming for hardware dummies (invited talk)
Abstract: Software-defined radio (SDR) brings the flexibility of software to the domain of wireless protocol design, promising both an ideal platform for research and innovation and the rapid deployment of new protocols on existing hardware. Most existing SDR platforms require careful hand-tuning of low-level code to be useful in the real world. In this talk I will describe Ziria, an SDR platform that is both easily programmable and performant. Ziria introduces a programming model that builds on ideas from functional programming and that is tailored to wireless physical layer tasks. The model captures the inherent and important distinction between data and control paths in this domain. I will describe the programming model, give an overview of the execution model, compiler optimizations, and current work. We have used Ziria to produce an implementation of 802.11a/g and a partial implementation of LTE.
Joint work with Bozidar Radunovic (MSRC), Gordon Stewart (PhD student at Princeton), Mahanth Gowda (PhD student at UIUC), and Geoffrey Mainland (Drexel).
10:30 - 11:20 — Applications
Christian Harrington, Nicolai Dahl, Peter Sestoft and David Raymond Christiansen.
Pension Reserve Computations on GPUs
Abstract: New regulations from the European Union, called Solvency II, require that life insurance and pension providers perform more complicated calculations to demonstrate their solvency. At the same time, exploiting alternative computational paradigms such as GPGPU requires a high degree of expertise about the hardware and ties the computational infrastructure to one particular platform. In an industry where contracts literally last a lifetime, this is far from optimal. We demonstrate the feasibility of an alternative approach in which life insurance and pension products are represented in a high-level, declarative, domain-specific language from which platform-specific high-performance code can be generated. Specifically, we generate CUDA C code after applying both domain- and platform-specific optimizations. This code significantly outperforms equivalent code running on conventional CPUs.
David Duke, Fouzhan Hosseini and Hamish Carr.
Parallel Computation of Multifield Topology: Experience of Haskell in a Computational Science Application
Abstract: Codes for computational science and downstream analysis (visualization and/or statistical modeling) have historically been dominated by imperative thinking, but this situation is evolving, both through adoption of higher-level tools such as Matlab, and through osmosis of functional ideas into the next generation of toolkits driven by the vision of extreme-scale computing. However, this is still a long way from seeing a functional language like Haskell used in a live application. This paper makes three contributions to functional programming in computational science. First, we describe how use of Haskell was interleaved in the development of the first practical approach to multifield topology, and its application to the analysis of data from nuclear simulations that has led to new insight into fission. Second, we report on subsequent development of the functional codebase, and present results that show that Haskell outperforming the imperative codebase in sequential execution, and achieving good parallel scaling. Finally we consider the broader question of how, where, and why functional programming may - or may not - find further use in computational science.
11:40 - 12:30 — Compilation I
Georgios Fourtounis and Nikolaos Papaspyrou.
An efficient representation for lazy constructors using 64-bit pointers
Abstract: Pointers in the AMD64 architecture contain unused space, a feature often exploited by modern programming language implementations. We use this property in a defunctionalizing compiler for a subset of Haskell, generating fast programs having a compact memory representation of their runtime structures. We demonstrate that, in most cases, the compact representation is faster, uses less memory and has better cache characteristics. Our prototype shows competitive performance when compared to GHC with full optimizations on.
Troels Henriksen, Martin Elsman and Cosmin E. Oancea.
Size Slicing - A Hybrid Approach to Size Inference in Futhark
Abstract: We present a shape inference analysis for a purely-functional language, named Futharc, that supports nested parallelism via array combinators such as map, reduce, filter, scan. Our approach is to compute precise shape information as runtime values, which can be efficiently accessed and effectively optimized by the standard compiler optimizations. Instead of restricting the language or sacrificing ease of use, we allow the occasional shape-misbehaved construct. The latter is treated with a fallback technique that preserves the program asymptotic by computing the array's shape after the array's creation point but before its first use. This corresponds to a shape-dependent system with existentially-quantified types.
We optimize the common case to negligible overhead via size slicing: a technique that separates the computation of the array's shape from its values. This corresponds to eliminating the existential quantifier of the shape-dependent type. We report negligible asymptotic overhead on several mini-benchmarks and three real-world applications.
12:30 - 14:00 — Lunch
14:00 - 14:50 — Optimizing Compilation
Joel Svensson and Josef Svenningsson.
Defunctionalizing Push Arrays
Abstract: Recent work on domain specific languages (DSLs) for high performance array programming has given rise to a number of array representations. In Feldspar and Obsidian there are two different kinds of arrays, called Pull and Push arrays and in Repa there is a even higher number of different array types. The reason for having multiple array types is to obtain code that performs better. Pull and Push arrays provide this by guaranteeing that operations fuse automatically. It is also the case that some operations are easily implemented and perform well on Pull arrays, while for some operations, Push arrays provide better implementations. Repa has Pull arrays, called delayed arrays for the same reason and so-called cursored arrays which are important for performance in stencil operations. But do we really need to have more than one array representation? In this paper we derive a new array representation from Push arrays that have all the good qualities of Pull and Push arrays combined. This new array representation is obtained via defunctionalization of a Push array API.
Amos Robinson, Ben Lippmeier and Gabriele Keller.
Fusing Filters with Integer Linear Programming
Abstract: The key to compiling functional, collection oriented array programs into efficient code is to minimise memory trafffc.
Simply fusing subsequent array operations into a single computation is not sufficient; we also need to cluster separate traversals of the same array into a single traversal.
Previous work demonstrated how Integer Linear Programming (ILP) can be used to cluster the operators in a general data-flow graph into subgraphs, which can be individually fused.
However, these approaches can only handle operations which preserve the size of the array, thereby missing out on some optimisation opportunities.
This paper addresses this shortcoming by extending the ILP approach with support for size-changing operations, using an external ILP solver to find good clusterings.
15:10 - 16:00 — Programming Patterns
Prabhat Totoo and Hans-Wolfgang Loidl.
Lazy Data-Oriented Evaluation Strategies
Abstract: This paper presents a number of flexible parallelism control mechanisms in the form of evaluation strategies for tree-like data structures implemented in Glasgow parallel Haskell. We achieve additional flexibility by using laziness and circular programs in the coordination code. Heuristics-based parameter selection is employed to auto-tune these strategies for improved performance on a shared-memory machine without programmer-specified parameters. In particular for unbalanced trees we demonstrate improved performance on a state-of-the-art multi-core server: giving a speedup of up to 37.5 on 48 cores for a constructed test program, and up to 15 for two other non-trivial applications using these strategies, a Barnes-Hut implementation of the n-body problem and a sparse matrix multiplication implementation.
Felix Hargreaves, Daniel Merkle and Peter Schneider-Kamp.
Group Communication Patterns for High Performance Computing in Scala
Abstract: We developed a Functional Object-Oriented PARallel framework (FOOPAR) for high-level high-performance computing in Scala. Central to this framework are Distributed Memory Parallel Data structures (DPDs), i.e., collections of data distributed in a shared nothing system together with parallel operations on these data.
In this paper, we first present FOOPAR’s architecture and the idea of DPDs and group communications. Then, we show how DPDs can be implemented elegantly and efficiently in Scala based on the Traversable/Builder pattern, unifying Functional and Object-Oriented Programming.
We prove the correctness and safety of one communication algorithm and show how specification testing (via ScalaCheck) can be used to bridge the gap between proof and implementation. Furthermore, we show that the group communication operations of FOOPAR outperform those of the MPJ Express open source MPI-bindings for Java, both asymptotically and empirically.
FOOPAR has already been shown to be capable of achieving close-to-optimal performance for dense matrix-matrix multiplication via JNI. In this article, we present results on a parallel implementation of the Floyd-Warshall algorithm in FOOPAR, achieving more than 94% efficiency compared to the serial version on a cluster using 100 cores for matrices of dimension 38000 × 38000.
16:30 - 17:30 — Compilation II (Heterogeneous Systems) / Concluding Discussion
Hai Liu, Laurence Day, Neal Glew, Todd Anderson and Rajkishore Barik.
Native Offload of Haskell Repa Programs to GPGPU
Abstract: In light of recent hardware advances, General Purpose Graphics Processing Units (GPGPUs) are becoming increasingly commonplace, and demand novel programming models to account for their radically different architecture. For the most part, existing approaches to programming GPGPUs within a high-level programming language choose to embed a domain specific language (DSL) within a host metalanguage and implement a compiler mapping programs written within that DSL to code in low-level languages such as OpenCL or CUDA. An alternative, underexplored, approach is to compile a restricted subset of the host language itself directly down to OpenCL/CUDA. We believe more research should be done to compare these two approaches and their relative merits. As a step in this direction, we implemented a quick proof of concept of the alternative approach. Specifically, we extend the Repa library with a computeG function to offload a computation to the GPU. As long as the requested computation meets certain restrictions, we can compile it to OpenCL 2.0 using the shared virtual memory feature recently added, and we can successfully run nine benchmarks on an Intel integrated GPU. We obtain the expected performance from the GPU on six of those benchmarks, and not far off the expected performance on two more. In this paper, we describe an offload primitive for Haskell, how to extend Repa to use it, how to implement that primitive in the Intel Labs Haskell Research Compiler, and evaluate the approach on nine benchmarks, comparing to two different CPUs, and for one benchmark to hand-written OpenCL code.
Thibaut Lutz and Vinod Grover.
LambdaJIT: A Dynamic Compiler for Heterogeneous Optimizations of STL Algorithms
Abstract: C++11 introduced a set of new features to extend the core language and the standard library. Amongst the new features are basic blocks for concurrency management like threads and atomic operation support, and a new syntax to declare single purpose, one off functions, called lambda functions, which integrate nicely to the Standard Template Library (STL). The STL provides a set of high level algorithms operating on data ranges, often applying a user defined function, which can now be expressed as a lambda function. Together, an STL algorithm and a lambda function provides a concise and efficient way to express a data traversal pattern and localized computation.
This paper presents LambdaJIT; a C++11 compiler and a runtime system which enable lambda functions used alongside STL algorithms to be optimized or even re-targeted at runtime. We use compiler integration of the new C++ features to analyze and automatically parallelize the code whenever possible. The compiler also injects part of a program’s internal representation into the compiled program binary, which can be used by the runtime to re-compile and optimize the code. We take advantage of the features of lambda functions to create runtime optimizations exceeding those of traditional offline or online compilers. Finally, the runtime can use the embedded intermediate representation with a different backend target to safely offload computation to an accelerator such as a GPU, matching and even outperforming CUDA by up to 10%.