The GPU Mekong Project - Simplified Multi-GPU Programming


The main objective of (GPU) Mekong is to provide a simplified path to scale out the execution of GPU programs from one GPU to almost any number, independent of whether the GPUs are located within one host or distributed at the cloud or cluster level. Unlike existing solutions, this work proposes to maintain the GPU’s native programming model, which relies on bulk-synchronous, thread-collective execution; that is, no hybrid solutions such as OpenCL/CUDA programs combined with message passing are required. As a result, we can maintain the simplicity and efficiency of GPU computing in the scale-out case, together with high productivity and performance.

In essence, Mekong allows for resource aggregation of compute and memory without exposing the typical programming complexities that are associated with such aggregations. Instead of having multiple GPU devices with a complex, partitioned Bulk Synchronous Parallel (BSP) domain and multiple memory resources within a partitioned Global Address Space (GAS) domain, Mekong aggregates these resources in a way such that the user only sees flat domains, while automated techniques ensure that partitioning is leveraged for improved locality.

Leveraging the beauty of data-parallel programming styles for simplified BSP and GAS aggregations

We observe that data-parallel languages like OpenCL or CUDA can greatly simplify parallel programming, so that hybrid solutions like sequential code enriched with vector instructions are not required. The inherent domain decomposition principle of these languages ensures a fine granularity when partitioning the code, typically resulting in a mapping of one single output element to one thread and reducing the need for work agglomeration. The BSP programming paradigm and its associated slackness regarding the ratio of virtual to physical processors allow effective latency hiding techniques that make large caching structures obsolete. At the same time, typical BSP code exhibits substantial amounts of locality, as the rather flat memory hierarchy of thread-parallel processors has to rely on large amounts of data reuse to keep their vast number of processing units busy.
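To make the "one output element per thread" mapping concrete, here is a minimal sketch in plain Python. It is not Mekong code and not OpenCL. The names `vector_add_kernel` and `launch` are hypothetical, and the sequential loop only stands in for what a GPU would execute in parallel.

```python
def vector_add_kernel(gid, a, b, out):
    """Body executed by one work-item: it computes exactly one output element."""
    out[gid] = a[gid] + b[gid]


def launch(kernel, global_size, *args):
    """Sequential stand-in for a data-parallel launch over a flat index space.

    On a GPU, every value of gid would be handled by its own thread;
    here we simply iterate over the index space.
    """
    for gid in range(global_size):
        kernel(gid, *args)


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(a)
launch(vector_add_kernel, len(a), a, b, out)
```

Because each work-item writes only its own output element, the index space can be split at any boundary without changing the result, which is the property that fine-grained partitioning relies on.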

In the GPU Mekong project, we leverage these observations to design a compile- and run-time system that allows for programming an arbitrary number of thread-parallel processors like GPUs with a single OpenCL (future: CUDA) program. As opposed to other state-of-the-art research, the actual number of GPUs is hidden from the user at design time and during the execution, allowing an easy migration from single-device execution to multi-device.

We base our approach on compilation techniques including static code analysis and code transformations regarding host and device code. We initially focus on multiple GPU devices within one machine boundary (a single computer), allowing us to hide the complications of multi-device programming from the user (cudaSetDevice, streams, events, and similar). Our initial tool stack is based on OpenCL programs as input, LLVM as the compilation infrastructure and CUDA backends to orchestrate data movement and kernel launches on any number of GPUs.
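As an illustration of what hiding the device count can mean, the following sketch splits a flat index space into contiguous per-device chunks and runs the same kernel on each chunk. It is written in plain Python under stated assumptions: the block partitioning, the names `partition`, `scale_kernel`, and `launch_multi`, and the sequential loops are all hypothetical and do not describe the actual Mekong implementation, which also has to manage data movement between device memories.

```python
def partition(global_size, num_devices):
    """Split [0, global_size) into contiguous, near-equal (offset, count) chunks."""
    base, extra = divmod(global_size, num_devices)
    chunks = []
    offset = 0
    for dev in range(num_devices):
        # The first `extra` devices take one additional element each.
        count = base + (1 if dev < extra else 0)
        chunks.append((offset, count))
        offset += count
    return chunks


def scale_kernel(gid, src, dst):
    """One work-item: doubles a single element."""
    dst[gid] = 2 * src[gid]


def launch_multi(kernel, global_size, num_devices, *args):
    """Run the kernel chunk by chunk; each chunk stands in for one device."""
    for offset, count in partition(global_size, num_devices):
        for gid in range(offset, offset + count):
            kernel(gid, *args)


src = list(range(10))
dst_single = [0] * len(src)
dst_multi = [0] * len(src)
launch_multi(scale_kernel, len(src), 1, src, dst_single)  # one device
launch_multi(scale_kernel, len(src), 3, src, dst_multi)   # three devices
```

The caller passes the same kernel and the same flat index space in both cases; only the `num_devices` argument differs, and a runtime system could choose that value without involving the user.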

Future efforts will include support for multiple GPUs at cluster/system level, so one can leverage the availability of a large number of GPUs within a cluster, cloud or similar by programming them with a single data-parallel program.

About the name

With Mekong we are actually referring to the Mekong Delta, a huge river delta in southwestern Vietnam where one of the longest rivers of the world splits into an abundant number of distributaries before this huge water stream finally empties into the South China Sea. It forms a large triangle that embraces a variety of physical landscapes, and is famous among backpackers and tourists as a travel destination.

What actually motivated us to choose Mekong as a name is the fact that a single huge stream is transformed into a large number of distributaries, an effect that we are also seeing in our GPU project: Mekong as a project aims to transform a single data stream into a large number of smaller streams that embrace smaller islands (computational units, memory) that mostly operate independently except for interactions like data distribution, communication, and synchronization.

The Mekong project was previously called GCUDA, and you might find a few references to this old name. As we moved from CUDA to OpenCL as the primary input language, the project name was changed to reflect its new focus.

About the researchers

The Mekong project was initiated by the Computer Engineering Group, Institute of Computer Engineering, at Ruprecht-Karls University of Heidelberg, Germany. It initially received funding in the form of a Google Faculty Research Award and is now funded by the German Ministry for Education and Research (BMBF).

Current team:
  • Holger Fröning, PI (holger.froening (at)
  • Vincent Heuveline, co-PI (vincent.heuveline (at)
  • Simon Gawlok, PhD student (simon.gawlok (at)
  • Alexander Matz, PhD student (alexander.mat (at)
  • Lorenz Braun, PhD student (lorenz.braun (at)
Associated partners:
  • Tobias Grosser (ETHZ)
  • Axel Köhler, Stefan Kramer (NVIDIA Germany)
For additional questions or comments, please contact the PI: Holger Fröning, holger.froening (at)


As we are still in the "hot" development phase, the code is not yet publicly available. Two comments:
  • You can expect it to be available soon on a site like GitHub or similar. Stay tuned for updates!
  • If you can't wait, contact us so we can discuss possible opportunities!


We are currently working intensively on the tool stack for Mekong. This page will host all our dissemination material, including documentation, research papers, code, examples, tutorials and more.

Publications, reports, etc.


We gratefully acknowledge the sponsorship we have received from Google (Google Research Award, 2014) and the German Excellence Initiative, with substantial equipment grants from NVIDIA.