SC'08 Workshop: Bridging Multicore's Programmability Gap

Processor designers are aggressively reclaiming "Moore's Law" in performance by packaging multiple execution units on a single die, a.k.a. "multicore". This approach is similar to the massively parallel processor (MPP) systems of the 1980s and early 1990s, and it brings with it the same problem that hampered MPP adoption: how to develop software that takes advantage of this computing power.

What has emerged today is a "Programmability Gap", the gap between multicore-based systems and current languages, compilers and software development techniques. This workshop features a series of academic and industry speakers whose work presents a broad cross section of the current landscape of solutions addressing this programmability gap.

Workshop organizers:

B. Scott Michel (The Aerospace Corporation)
Hans Zima (NASA Jet Propulsion Laboratory)

Workshop Schedule

(Note: Follow this link to the slides for each presentation)

Time                    Speaker and Topic

8:45 AM - 9:00 AM       Introductory remarks

9:00 AM - 10:00 AM      Brad Chamberlain, Cray
                        "Chapel: an HPC language in a multicore world"

10:00 AM - 10:30 AM     Morning Break

10:30 AM - 11:15 AM     Vivek Sarkar, Rice University
                        "Multicore Programming Models and their Implementation Challenges"

11:15 AM - 12:00 PM     Don Stewart
                        "Beautiful Parallelism: Harnessing Multicores with Haskell"

12:00 PM - 1:30 PM

1:30 PM - 2:30 PM       David Bader, Georgia Tech
                        "Accelerating Applications with Cell Broadband Engine, Graphics, and Multithreaded Processors"

2:30 PM - 3:00 PM       Richard Schooler, VP Software Engineering, Tilera Corporation

3:00 PM - 3:30 PM       Afternoon Break

3:30 PM - 4:15 PM       Franz Franchetti, Carnegie Mellon University
                        "Spiral: Generating Parallel Software for Linear Transforms (And Beyond)"

4:15 PM - 5:00 PM       Mary Hall, University of Utah

Talk Abstracts

Brad Chamberlain (Cray): Chapel: an HPC language in a multicore world

Chapel is a new programming language being developed by Cray Inc. as part of the DARPA-led High Productivity Computing Systems program (HPCS). Chapel strives to increase productivity for supercomputer users by supporting higher levels of abstraction compared to current parallel programming models while also supporting the ability to optimize to performance that meets or surpasses current technologies.  While not specifically targeted at the emerging world of mainstream multicore programming, Chapel has been designed to support general parallel programming and its implementation has been designed for portability in order to develop codes on multicore desktop workstations.  In this talk, I will give an overview of Chapel and strive to enumerate its potential benefits and challenges in helping to bridge the multicore programmability gap.

Bio: Bradford Chamberlain is a Principal Engineer at Cray Inc., where he works on parallel programming models, focusing primarily on the design and implementation of the Chapel language in his role as technical lead for that project.  Before starting at Cray in 2002, he spent a year at a start-up working at the opposite end of the hardware spectrum to design a parallel language (SilverC) for reconfigurable embedded hardware.  Brad received his Ph.D. in Computer Science & Engineering from the University of Washington in 2001 where his work focused on the design and implementation of the ZPL parallel array language, particularly on its concept of the region--a first-class index set supporting global-view distributed array programming. While at UW, he also dabbled in algorithms for accelerating the rendering of complex 3D scenes.  Brad remains associated with the University of Washington as an affiliate faculty member and recently co-led a seminar there that focused on the design of Chapel.  He received his Bachelor's degree in Computer Science from Stanford University with honors in 1992.

Vivek Sarkar (Rice University): Multicore Programming Models and their Implementation Challenges

The computer industry is at a major inflection point in its hardware roadmap due to the end of a decades-long trend of exponentially increasing clock frequencies.  It is widely agreed that spatial parallelism in the form of multiple power-efficient cores must be exploited to compensate for this lack of frequency scaling, and that this trend will lead to manycore chips containing hundreds of general-purpose and special-purpose cores.  Unlike previous generations of hardware evolution, this shift towards multicore and manycore computing will have a profound impact on software by creating new challenges in the management of parallelism, locality, energy, and fault-tolerance.  These software challenges are further compounded by the need to enable parallelism in workloads and application domains that have traditionally not had to worry about multiprocessor parallelism.

In this talk, we will focus on the programming problem for tightly coupled homogeneous and heterogeneous multicore processors.  We present early experiences with the new Habanero Multicore Software Research project at Rice University, which encompasses work on programming models, compilers, runtimes, and concurrency libraries to enable portable software that can run unchanged on a range of homogeneous and heterogeneous multicore systems.  The Habanero project takes a two-level approach to programming models, with a high-level model based on Intel Concurrent Collections (formerly known as TStreams) for parallelism-oblivious domain experts, and a lower-level model based on the high-productivity X10 language for parallelism-aware developers.  We show how the Habanero execution model can be used as a foundation to understand a variety of multicore programming models including Cilk, CUDA, Intel Threading Building Blocks, Java Concurrency Utilities, the .NET Parallel Extensions (Task Parallel Library and PLINQ), and OpenMP, and we also discuss compiler and runtime implementation challenges that must be overcome to enable mainstream applications to use these models on multicore systems.

Don Stewart (Galois, Inc.): "Beautiful Parallelism: Harnessing Multicores with Haskell"

Haskell is a general-purpose, purely functional programming language. If you want to program a parallel machine, a purely functional language such as Haskell is a good choice: purity ensures the language is by default safe for parallel execution (whilst traditional imperative languages are by default unsafe).

This foundation has enabled Haskell to become something of a melting pot for high-level approaches to concurrent and parallel programming, all backed by an industrial-strength compiler and language toolchain and available now for mainstream multicore programming.

In this talk I will introduce the features Haskell provides for writing high-level parallel and concurrent programs. In particular, we'll focus on lightweight semi-explicit parallelism, using annotations to express opportunities for parallelism. We'll then describe mechanisms for explicitly parallel programs, focusing on software transactional memory (STM) for shared-memory communication. Finally, we'll look at how Haskell's nested data parallelism allows programmers to use rich data types in data-parallel programs, which are automatically transformed into flat data-parallel versions for efficient execution on multicore processors.
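As a rough illustration of the first two mechanisms named above (not taken from the talk itself), the sketch below uses `par` and `pseq` as parallelism annotations and an STM transaction for shared-memory communication. The function names `parFib`, `fib`, and `transfer`, the cutoff of 20, and the account balances are all hypothetical choices made for this example. It imports from `GHC.Conc` in `base` so that it needs no extra packages; the same primitives are more commonly imported from `Control.Parallel` (package `parallel`) and `Control.Concurrent.STM` (package `stm`).

```haskell
module Main where

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM_, replicateM_)
import GHC.Conc (STM, TVar, atomically, newTVarIO, readTVar, writeTVar, par, pseq)

-- Plain sequential Fibonacci, used below the (hypothetical) cutoff.
fib :: Int -> Integer
fib n
  | n < 2     = toInteger n
  | otherwise = fib (n - 1) + fib (n - 2)

-- Semi-explicit parallelism: `par` only hints that `x` may be evaluated
-- in parallel; `pseq` makes sure `y` is evaluated before the sum is formed.
-- The result is the same whether or not the hint is taken.
parFib :: Int -> Integer
parFib n
  | n < 20    = fib n
  | otherwise = x `par` (y `pseq` (x + y))
  where
    x = parFib (n - 1)
    y = parFib (n - 2)

-- STM: move an amount between two shared balances as one atomic transaction.
transfer :: TVar Int -> TVar Int -> Int -> STM ()
transfer from to amount = do
  a <- readTVar from
  b <- readTVar to
  writeTVar from (a - amount)
  writeTVar to (b + amount)

main :: IO ()
main = do
  print (parFib 25)                      -- prints 75025

  -- Ten threads each perform 100 one-unit transfers between two accounts.
  acctA <- newTVarIO 1000
  acctB <- newTVarIO 0
  done  <- newEmptyMVar
  forM_ [1 .. 10 :: Int] $ \_ -> forkIO $ do
    replicateM_ 100 (atomically (transfer acctA acctB 1))
    putMVar done ()
  replicateM_ 10 (takeMVar done)         -- wait for all workers

  -- The combined balance is unchanged because each transfer is atomic.
  total <- atomically ((+) <$> readTVar acctA <*> readTVar acctB)
  print total                            -- prints 1000
```

With GHC, the program has to be compiled with `-threaded` and run with `+RTS -N` for the `par` hints and the forked threads to use more than one core; without those flags it still produces the same output on a single core.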

Biography: Don is an Australian hacker and engineer at Galois, Inc., in Portland, Oregon. Galois' mission is to create trustworthiness and assurance in critical systems, with an emphasis on language design, compiler construction and formal methods. Don is co-author of an upcoming book from O'Reilly Media, "Real World Haskell".

David Bader (Georgia Tech): "Accelerating Applications with Cell Broadband Engine, Graphics, and Multithreaded Processors"

While we are still witnessing Moore's Law in the steady production of chips that amass billions of transistors, we have clearly reached plateaus in clock frequency, power, and single-stream performance. This new era has caused a rethinking of microprocessor design in search of innovations that will allow the performance of scientific applications to continue improving at an exponential rate. One technology that holds promise combines traditional microprocessors with special-purpose, very high performance, low-power chips such as the IBM Cell Broadband Engine, graphics processors, and FPGAs to accelerate the performance of computational science and engineering applications. The use of these chip accelerators will likely be a path forward, yet new challenges await, such as system-level design, partitioning applications to accelerators, and tools for designing applications. The Sony-Toshiba-IBM Cell Broadband Engine is a heterogeneous multicore architecture that consists of a traditional microprocessor (PPE) with eight SIMD coprocessing units (SPEs) integrated on-chip. Because of its performance capabilities, the Cell BE is being considered as an application accelerator for next-generation petascale supercomputers. Another promising technology, the Cray XMT, a massively multithreaded, latency-tolerant architecture, accelerates performance on applications that use massive-scale data analytics. The XMT employs fine-grained threads to tolerate latency for irregular applications that are often challenging to parallelize on traditional cache-based architectures.  In this talk, we will discuss the programmability gap for these future technologies, and propose a path forward that combines user annotations and advanced compilation systems with novel architectural designs to better exploit multicore and manycore systems.

Franz Franchetti (Carnegie Mellon University): "Spiral: Generating Parallel Software for Linear Transforms (And Beyond)"

Spiral is a program and hardware design generation system for linear transforms such as the discrete Fourier transform, discrete cosine transforms, filters, and others. For a user-selected transform, Spiral autonomously generates different algorithms, represented in a declarative form as mathematical formulas, and their implementations to find the best match to the given target platform. Besides the search, Spiral performs deterministic optimizations on the formula level, effectively restructuring the code in ways that are impractical at the code or design level.
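To give a sense of what an algorithm "represented as a mathematical formula" can look like, the Cooley-Tukey fast Fourier transform is commonly written as a factorization of the DFT matrix using Kronecker (tensor) products. The formula below is the standard textbook form, included here only as an illustration and not taken from the talk; it assumes the usual notation in which I_n is the n x n identity matrix, T is a diagonal matrix of twiddle factors, and L is a stride permutation.

```latex
\mathrm{DFT}_{mn} \;=\; (\mathrm{DFT}_m \otimes I_n)\; T^{mn}_{n}\; (I_m \otimes \mathrm{DFT}_n)\; L^{mn}_{m}
```

Read from right to left, the input is first permuted, then m independent transforms of size n are applied, the results are scaled by twiddle factors, and finally n transforms of size m are applied at stride n. Applying such a rule recursively with different choices of m and n yields many different algorithms for the same transform.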

In this talk, we give a short overview of Spiral. We then explain how Spiral generates efficient programs for parallel platforms including vector architectures, shared memory and multicore platforms, distributed memory platforms, the Cell BE processor and GPUs, as well as hardware designs (Verilog) and automatically partitioned software/hardware implementations. As with all optimizations in Spiral, parallelization and partitioning are performed at a high abstraction level of algorithm representation, using rewriting systems. We also discuss how Spiral is currently being extended beyond its original problem domain, using coding algorithms (Viterbi decoding and JPEG 2000 encoding) and image formation (synthetic aperture radar, SAR) as examples. Lastly, we will discuss how Spiral generates general-size self-adaptive libraries solely from a problem specification.