SC'09: User Experience and Advances in Bridging Multicore's Programmability Gap

Multicore's "programmability gap" refers to the mismatch between traditionally sequential software development and today's multicore and accelerated computing environments. New parallel languages break with the conventional HPC programming paradigm by offering high-level abstractions for control and data distribution, thus providing direct support for the specification of efficient parallel algorithms for multi-level system hierarchies based on multicore architectures. Together with the emergence of architecture-aware compilation technology, these developments signify important contributions to bridging this programmability gap.

Last year's "Bridging Multicore's Programmability Gap" workshop examined up-and-coming languages such as Chapel, X10, and Haskell and their approaches to bridging the gap. This year's workshop focuses on user experience with new languages for challenging applications in multicore environments, progress in the accelerated computing software development lifecycle, and new compilation and programming environment technology supporting emerging languages.


B. Scott Michel (The Aerospace Corporation)
Hans Zima (NASA Jet Propulsion Laboratory)
Nehal Desai (The Aerospace Corporation)

Workshop Schedule


Time                 Speaker and Topic
8:45 AM – 9:00 AM    Introductory Remarks
9:00 AM – 10:00 AM   Kathy Yelick, NERSC
10:00 AM – 10:30 AM  Morning Break
10:30 AM – 11:15 AM  Brad Chamberlain, Cray: An Example-based Introduction to Global-view Programming in Chapel
11:15 AM – 12:00 PM  Bob Numrich, MSI: Co-Arrays and Multiple Cores
12:00 PM – 1:30 PM   Lunch Break
1:30 PM – 2:15 PM    Piyush Mehrotra, NASA Ames: Impact of Resource Contention on Application Performance in Multicore Systems
2:15 PM – 3:00 PM    Eric Stahlberg, Wittenberg University and OSC and OpenFPGA: Teaching Multicore Programming: Challenges and Lessons
3:00 PM – 3:30 PM    Afternoon Break
3:30 PM – 4:30 PM    Thomas Sterling, LSU: Scaling to Mega-multicore through Advanced Execution Models

Talk Abstracts and Speaker Bios

Katherine Yelick (NERSC)

Biography: Katherine Yelick is the Director of the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory and a Professor of Electrical Engineering and Computer Sciences at the University of California at Berkeley. She is the co-author of two books and more than 100 refereed technical papers on parallel languages, compilers, algorithms, libraries, architecture, and storage. She co-invented the UPC and Titanium languages and demonstrated their applicability across architectures through the use of novel runtime and compilation methods. She also co-developed techniques for self-tuning numerical libraries, including the first self-tuned library for sparse matrix kernels, which automatically adapts the code to properties of the matrix structure and machine. Her work includes performance analysis and modeling as well as optimization techniques for memory hierarchies, multicore processors, communication libraries, and processor accelerators. She has worked with interdisciplinary teams on application scaling, and her own applications work includes parallelization of a model for blood flow in the heart. She earned her Ph.D. in Electrical Engineering and Computer Science from MIT and has been a professor of Electrical Engineering and Computer Sciences at UC Berkeley since 1991, with a joint research appointment at Berkeley Lab since 1996. She has received multiple research and teaching awards and is a member of the California Council on Science and Technology and of the National Academies committee on Sustaining Growth in Computing Performance.

Brad Chamberlain (Cray): An Example-based Introduction to Global-view Programming in Chapel

Biography: Bradford Chamberlain is a Principal Engineer at Cray Inc., where he works on parallel programming models, focusing primarily on the design and implementation of the Chapel language in his role as technical lead for that project. Before starting at Cray in 2002, he spent a year at a start-up working at the opposite end of the hardware spectrum to design a parallel language (SilverC) for reconfigurable embedded hardware. Brad received his Ph.D. in Computer Science & Engineering from the University of Washington in 2001, where his work focused on the design and implementation of the ZPL parallel array language, particularly on its concept of the region, a first-class index set supporting global-view distributed array programming. While at UW, he also dabbled in algorithms for accelerating the rendering of complex 3D scenes. Brad remains associated with the University of Washington as an affiliate faculty member and recently co-led a seminar there that focused on the design of Chapel. He received his Bachelor's degree in Computer Science from Stanford University with honors in 1992.

Bob Numrich (MSI): Co-Arrays and Multiple Cores

Abstract: The co-array parallel programming model will be a standard feature of Fortran 2008. The co-array model is an SPMD model in which a single program is replicated a fixed number of times, with the local memory associated with each replication called an "image". Physical processors (cores) are assigned to images to perform work on data in the associated memory, and co-dimensions provide an explicit syntax for referencing data associated with other images. Multiple co-dimensions can be matched to hierarchies within a multi-core system. The first co-dimension might, for example, match the number of cores sharing memory on a particular chip, the second co-dimension the number of chips in a node, and the third co-dimension the number of nodes in the system. One core might be assigned to each image, with shared memory partitioned among cores. Alternatively, one core might be assigned to an image associated with all the shared memory, with the other cores sharing the work through OpenMP threads spawned by the core owning the image. The effectiveness of such a programming model will depend, to a large extent, on run-time issues outside the co-array model and beyond the control of the programmer: cache coherency protocols, memory partitioning algorithms, overheads for spawning threads, and bandwidth to local memory will all affect performance in unknown ways.
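As a purely illustrative sketch of matching co-dimensions to a machine hierarchy, the snippet below mimics how Fortran 2008's `this_image()` would decompose a flat image index into (core, chip, node) co-subscripts for a coarray declared, say, `real :: a(n)[4,2,*]`; the hierarchy sizes are invented, not from the talk:

```python
# Illustrative sketch (sizes invented): mapping a flat, 1-based image index
# to hierarchical co-subscripts, as Fortran 2008's this_image() would for a
# coarray declared with three co-dimensions, e.g.  real :: a(n)[4,2,*]
CORES_PER_CHIP = 4   # first co-dimension: cores sharing a chip's memory
CHIPS_PER_NODE = 2   # second co-dimension: chips in a node

def image_coords(image):
    """Return 1-based (core, chip, node) co-subscripts for image index `image`."""
    i = image - 1
    core = i % CORES_PER_CHIP + 1
    chip = (i // CORES_PER_CHIP) % CHIPS_PER_NODE + 1
    node = i // (CORES_PER_CHIP * CHIPS_PER_NODE) + 1
    return core, chip, node

print(image_coords(1))   # (1, 1, 1) -- first core on the first chip of node 1
print(image_coords(9))   # (1, 1, 2) -- image 9 starts the second node
```

The point of the multiple co-dimensions is exactly this: a reference like `a(i)[c,p,n]` names data on a specific core, chip, and node without the programmer computing the flat index by hand.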

Eric Stahlberg (Wittenberg University and OSC): Teaching Multicore Programming: Challenges and Lessons

Abstract: The presentation will focus on the challenge of teaching multicore programming to students newly acquainted with parallel computing. Effective techniques and approaches that apply across languages will be emphasized. A new NSF-funded effort to include parallel and accelerated computing concepts across multiple undergraduate computer science classes will provide a forward-looking perspective on approaches for introducing students to the emerging fundamentals of software application development for parallel computing environments.

Biography: Dr. Stahlberg has been actively working on the challenges of parallel programming since 1992, when he worked at Argonne National Laboratory as a postdoctoral researcher exploring effective programming techniques for large MPP systems and distributed parallel systems. While at Cray, Dr. Stahlberg developed effective programming abstractions supporting both MPP and SMP programming in a single code base. In 1997, while employed by Oxford Molecular, Dr. Stahlberg worked with engineers at Intel to incorporate the new OpenMP standard into commercial chemistry applications. Most recently, Dr. Stahlberg has been focusing on efforts to integrate heterogeneous acceleration technologies into mainstream software development in a robust and reliable manner. As an adjunct instructor, he has been teaching computer science, computational science, and software engineering for over 15 years. Dr. Stahlberg is currently employed at Wittenberg University as a visiting computational scientist directing the school's computational science program.

Thomas Sterling (Louisiana State University): Scaling to Mega-multicore through Advanced Execution Models

Abstract: The effective exploitation of multicore architectures is proving challenging even for single nodes comprising between one and eight sockets. Contention for socket pins, memory access, cache behavior, and TLB global address translation is motivating new methods of user-thread-based computing, programming, and runtime support. TBB, Cilk, Concert, Qthreads, and other user-thread packages have been developed to provide dynamic use of multicore components for applications. But what of large-scale systems comprising not just a few sockets but tens of thousands or more, as in the highest-capability systems at the end of the next decade? Studies suggest that Exascale systems before 2020 may incorporate more than a hundred thousand 3-D devices (multi-chip stacks) of a thousand cores each. How are Exascale systems to support computations with sufficient multi-core devices to yield billion-way parallelism (a conservative estimate)? Further, how is this to be achieved within the constraints of power consumption, where as much as 50 Gigaflops per watt performance/power efficiency may be essential? It is proposed that one important stratagem will be the derivation and adoption of a new model of computation to mark this HPC phase change and to enable new system architecture designs and their operation. This presentation will describe an approach to multi-core programming that extends the scale of such systems beyond conventional practices to hundreds of racks and hundreds of millions of cores (by the end of the next decade) while retaining semantic consistency system-wide.

Biography: Dr. Thomas Sterling is a Professor of Computer Science at Louisiana State University, a Faculty Associate at the California Institute of Technology, a Distinguished Visiting Scientist at Oak Ridge National Laboratory, and a Fellow of the Computer Science Research Institute at Sandia National Laboratories. He received his PhD as a Hertz Fellow from MIT in 1984. Dr. Sterling is probably best known as the “father” of Beowulf clusters and for his research on Petaflops computing architecture. He was one of several researchers to receive the Gordon Bell Prize for his work on Beowulf in 1997. In 1996, he started the interdisciplinary HTMT project to conduct a detailed point design study of an innovative Petaflops architecture. He currently leads the MIND memory accelerator architecture project for scalable data-intensive computing and is an investigator on the DOE-sponsored Fast-OS project to develop a new generation of configurable lightweight parallel runtime software systems. Thomas is co-author of five books and holds six patents.

Piyush Mehrotra (NASA Ames): Impact of Resource Contention on Application Performance in Multicore Systems

Abstract: Contention for shared resources in the memory hierarchy can have a profound effect on the performance of applications running on high-end computers based on commodity multicore microprocessors. In this talk, we describe our methodology of differential performance analysis to quantify this contention effect for a collection of parallel benchmarks and applications. In particular, by comparing runs that use different patterns of assigning processes to cores, we can characterize the contention for a specific shared resource. In the study we used a subset of the HPCC benchmarks, the NAS Parallel Benchmarks, and several applications of interest to NASA. We ran them on high-end computing platforms that use four different quad-core microprocessors: Intel Clovertown, Intel Harpertown, AMD Barcelona, and Intel Nehalem-EP. The results help further our understanding of the requirements these codes place on their execution environments and also of each computer's ability to deliver performance.
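The differential-analysis idea can be sketched in a few lines; the function name and the timings below are invented for illustration and are not taken from the study:

```python
# Hypothetical sketch of differential performance analysis (names and numbers
# invented): run the same benchmark under two process-to-core assignments that
# differ only in whether the processes share one resource (e.g. a socket's
# cache or memory bus), and attribute the slowdown to that resource.

def contention_factor(t_shared: float, t_isolated: float) -> float:
    """Slowdown attributable to the shared resource; > 1.0 indicates contention."""
    return t_shared / t_isolated

# Made-up timings (seconds): four ranks packed on one socket vs. spread
# one rank per socket.
print(round(contention_factor(12.6, 9.0), 2))  # 1.4
```

Repeating this comparison once per level of the memory hierarchy (shared L2, shared socket, shared node) is what lets the methodology attribute slowdowns to a specific shared resource rather than to "the machine" as a whole.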