8:20 Introduction by the Chairs
8:30 Invited Talk (Chair: Vikram Adve):
9:15 Session 1 - Analysis and Formalization (Chair: Hans Boehm)
Cilk Runtime Support for Deterministic Parallel Random-Number Generation
Charles E. Leiserson
MIT Computer Science and Artificial Intelligence Laboratory
Existing concurrency platforms for dynamic multithreading (dthreading), such as Cilk, encapsulate the nondeterminism of scheduling. Cilk programs containing no determinacy races execute deterministically, even though Cilk's randomized scheduler may schedule the computation differently from run to run. Unfortunately, none of today's dthreading concurrency platforms provides libraries for repeatable parallel random-number generation. We contend that essential support is missing from the runtime systems of Cilk and other dthreading concurrency platforms.
We propose a new mechanism, called "pedigrees", which can be built into runtime systems to enable efficient deterministic parallel random-number generation. The idea of a pedigree is that each function instantiation can be uniquely identified by a path through the "invocation tree" -- the tree of function activation frames generated by spawns and calls -- which is deterministic in a dthreaded program containing no determinacy races. A random number can then be generated by hashing the pedigree. Experiments with the MIT Cilk runtime system show that the overhead for maintaining a pedigree mechanism is minimal.
We have built library implementations of several deterministic parallel random-number generators that use pedigrees, including a generalization of linear congruential generators, a tabular method, and variants of SHA-1 hashing. Although these deterministic parallel random-number generators tend to be slower per function call than a nondeterministic parallel version of the popular Mersenne twister, on practical benchmark algorithms, such as quicksort and a Monte Carlo simulation, the impact on overall running time is relatively small, especially if additional support for a "frame data cache" is provided by the runtime system.
This research was done jointly with Tao Benjamin Schardl and Jim Sukha of MIT CSAIL.
Prof. Leiserson's research centers on the theory of parallel computing, especially as it relates to engineering reality. He wrote the first paper on systolic architectures with H.T. Kung and invented the retiming method for digital circuit optimization. He designed and led the implementation of the network architecture for the Connection Machine Model CM-5 Supercomputer produced by Thinking Machines Corporation, which incorporated the fat-tree interconnection network he developed at MIT. Fat-trees are now the preferred interconnect strategy for Infiniband technology. His textbook, Introduction to Algorithms, coauthored with Ronald L. Rivest and Thomas H. Cormen was named "Best 1990 Professional and Scholarly Book in Computer Science and Data Processing" by the Association of American Publishers. Currently in its third edition with an additional coauthor Clifford Stein, it is the leading textbook on computer algorithms and, according to Citeseerx, the second most cited publication in computer science. As Director of System Architecture at Akamai Technologies, he led the engineering team that developed a world-wide content-distribution network numbering over 20,000 servers. He introduced the notion of cache-oblivious algorithms, which exploit the memory hierarchy near optimally while containing no tuning parameters for cache size or cache-line length. He developed the Cilk multithreaded programming technology, which featured the first provably efficient work-stealing scheduler. He led the development of several Cilk-based parallel chess-playing programs which have won numerous prizes in international competition. With Matteo Frigo, he founded Cilk Arts, Inc., which developed the Cilk++ multicore concurrency platform and was acquired by Intel Corporation in 2009. He was for many years the head of the computer-science program for the Singapore-MIT Alliance distance-education collaboration. His annual workshop on Leadership Skills for Engineering and Science Faculty, cotaught with Chuck McVinney, has educated hundreds of faculty at MIT and around the world in the nontechnical issues involved in leading technical teams in academia. He is a Margaret MacVicar Faculty Fellow at MIT, the highest recognition at MIT for undergraduate teaching. He is an ACM Fellow and a senior member of IEEE and SIAM.
Deterministic Execution with Domain Specific Languages
Exploiting the power of heterogeneous parallel hardware is complicated because it requires application code development with multiple programming models. I will present a much simpler single programming model approach to heterogeneous parallelism that uses domain specific languages (DSLs). DSLs raise the level of abstraction and present a familiar sequential programming model from which parallelism is extracted implicitly. It is the responsibility of the compiler and runtime system to ensure that parallel execution does not result in non-deterministic behavior. Nondeterministic execution behavior is only possible at the explicit request of the programmer. I will describe two example DSLs that are implemented using these principles: OptiML, a DSL for machine learning and Liszt a DSL for mesh-based PDEs. Both of these DSLs are embedded in the Scala programming language using modular staging and a DSL development framework called Delite.
Kunle Olukotun is a Professor of Electrical Engineering and Computer Science at Stanford University and he has been on the faculty since 1992. Olukotun is a pioneer in multicore processor design and is well known for leading the Stanford Hydra research project which developed one of the first chip multiprocessors (CMP) with support for thread-level speculation (TLS). Olukotun founded Afara Websystems to develop high-throughput, low power server systems with chip multiprocessor technology. The Afara microprocessor technology, called Niagara, was acquired by Sun Microsystems in 2002. Niagara derived processors now power all Sun SPARC-based servers. Olukotun is actively involved in research in computer architecture, parallel programming environments and scalable parallel systems. Olukotun co-lead the Transactional Coherence and Consistency (TCC) project whose goal was to simplify parallel programming for average programmers. Olukotun currently directs the Stanford Pervasive Parallelism Lab (PPL) which seeks to proliferate the use of parallelism in all application areas using Domain Specific Languages (DSLs). Olukotun is an ACM Fellow (2006) and IEEE fellow (2007) for contributions to multiprocessors on a chip and multi threaded processor design. He has authored many papers on CMP design and parallel software and recently completed a book on CMP architecture. Olukotun received his Ph.D. in Computer Engineering from The University of Michigan.
Automatic Verification of Determinism for Structured Parallelism
A major productivity hurdle for parallel programming is the lack of determinism. A parallel program may yield different results on different executions for the same input, depending on the order in which operations are interleaved. In this talk, I will present a static analysis for proving a parallel program deterministic. The main idea is to leverage the structure of the program to reduce determinism verification to an independence property that can be proved using only sequential analysis. The sequential static analysis uses powerful numerical abstractions to establish the independence property: spatial and temporal disjointness. The analysis has been used to successfully validate the determinism of a suite of parallel benchmarks derived from those used in the high performance computing community.
Martin Vechev is a Research Staff Member at the IBM T.J. Watson Center in New York. His interests are in software analysis, concurrency and verification. He obtained his PhD in 2007 from Cambridge University.