The goal of the Stanford ESC project is to significantly increase the energy efficiency of many-core supercomputers. Our primary research focus is on data coherency, movement, and communication in chips and systems. We target a wide range of high-performance applications, with the goal of reducing energy usage through several different mechanisms.

Computing systems today are power limited and undergoing an unprecedented shift toward parallelism. Currently, commodity processors have up to twelve cores, and vendor roadmaps show this number doubling every 18 months. This shift toward multi-core processors is driven by two factors: instruction-level parallelism (ILP) reaching its practical limit, and the end of voltage scaling. Without voltage scaling, each technology generation becomes more power dense than the previous one, making on-chip energy efficiency (operations per joule) a major issue. Often the energy of a floating-point calculation in isolation is an order of magnitude less than the total energy per flop averaged across execution. This energy overhead is spent primarily on data and instruction supply, a problem exacerbated by complex cache hierarchies and coherence protocols.

The Efficient Supercomputing (ESC) project builds on the knowledge gained from our previous ELM embedded architecture and focuses on supercomputing chips and systems. In particular, we are focusing on the tradeoff between energy efficiency and ease of programming. Our goal is to reduce energy and performance overhead in the memory subsystem by exposing architectural features to the programmer, without introducing a radically different programming model. ESC's architecture maintains a global, shared address space with hardware coherence but exposes more of the memory hierarchy than is typical. We support block, stride, and gather operations within the memory hierarchy. In addition to mechanisms for data movement, we support active messages, which essentially move instructions to the data. Processor locality will also be exposed to the programmer, enabling threads that share data to be co-located and minimizing communication costs. Through these features, we will significantly reduce the energy per flop while maintaining programmability.
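
To make the three data-movement patterns named above concrete, here is a minimal Python sketch of what block, stride, and gather transfers select from memory. It is only an illustration under stated assumptions: memory is modeled as a flat Python list, and the function names (block_copy, stride_copy, gather_copy) are hypothetical and are not ESC's actual interface.

```python
# Illustrative model only: memory is a flat list of words.
# All function names here are hypothetical, not part of ESC.

def block_copy(mem, base, length):
    """Block transfer: `length` contiguous words starting at `base`."""
    return [mem[base + i] for i in range(length)]

def stride_copy(mem, base, stride, count):
    """Strided transfer: `count` words, each `stride` words apart."""
    return [mem[base + i * stride] for i in range(count)]

def gather_copy(mem, indices):
    """Gather transfer: words at an arbitrary list of addresses."""
    return [mem[i] for i in indices]

# 16 words of memory holding the values 100..115.
mem = list(range(100, 116))

block = block_copy(mem, 4, 4)            # words at addresses 4, 5, 6, 7
strided = stride_copy(mem, 0, 4, 4)      # words at addresses 0, 4, 8, 12
gathered = gather_copy(mem, [9, 2, 15])  # words at addresses 9, 2, 15
```

In each case a single request describes many words, which is the point of expressing the transfer as one operation rather than as a sequence of individual loads.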

Last updated: 9/23/2010