Research

Bottleneck Analysis - Development of a detailed performance model of a processor to find the performance bottlenecks.

• Criticality-driven Design Space Exploration - Development of techniques to perform fast design space exploration of a processor. [pdf][presentation]

It has become increasingly difficult to perform design space exploration (DSE) of computer systems with a short turnaround time because of exploding design spaces, increasing design complexity and long-running workloads. Researchers have used classical search/optimization techniques such as simulated annealing and genetic algorithms to accelerate the DSE. While these techniques are better than an exhaustive search, a substantial amount of time must still be dedicated to DSE, which is a serious bottleneck in reducing research/development time. These techniques do not perform the DSE quickly enough, primarily because they treat the computer system as a “black box”: they do not leverage any insight into how the different design parameters of a computer system interact to improve or degrade performance at a design point.

We propose using criticality analysis to guide the classical search/optimization techniques. We perform criticality analysis to find the design parameter which is most detrimental to the performance at a given design point. Criticality analysis at a given design point provides a localized view of the region around the design point without performing simulations at the neighboring points. On the other hand, a classical search/optimization technique has a global view of the design space and avoids getting stuck at a local maximum. We use this synergistic behavior between the criticality analysis (good locally) and the classical search/optimization techniques (good globally) to accelerate the DSE.

FabScalar - Development of a detailed synthesizable RTL model of an out-of-order superscalar processor, parameterized by pipeline width, depth and structure sizes. [pdf]

A growing body of work has compiled a strong case for the single-ISA heterogeneous multi-core paradigm. A single-ISA heterogeneous multi-core provides multiple, differently-designed superscalar core types that can streamline the execution of diverse programs and program phases. No prior research has addressed the “Achilles’ heel” of this paradigm: design and verification effort is multiplied by the number of different core types.

This work frames superscalar processors in a canonical form, so that it becomes feasible to quickly design many cores that differ in the three major superscalar dimensions: superscalar width, pipeline depth, and sizes of structures for extracting instruction-level parallelism (ILP). From this idea, we develop a toolset, called FabScalar, for automatically composing the synthesizable register-transfer-level (RTL) designs of arbitrary cores within a canonical superscalar template. The template defines canonical pipeline stages and the interfaces among them. A Canonical Pipeline Stage Library (CPSL) provides many implementations of each canonical pipeline stage that differ in their superscalar width and depth of sub-pipelining. An RTL generation tool uses the template and CPSL to automatically generate an overall core of the desired configuration. Validation experiments are performed along three fronts to evaluate the quality of RTL designs generated by FabScalar: functional and performance (instructions-per-cycle (IPC)) validation, timing validation (cycle time), and confirmation of suitability for standard ASIC flows. With FabScalar, a chip with many different superscalar core types is conceivable.

Retention-aware Placement in DRAM (RAPID) - Integration of RAPID, a novel software approach to reduce the refresh power of the DRAM to vanishingly small levels, into a wireless sensor node. [poster]

Slipstream - Optimization of the Instruction Removal (IR) Predictor (a key component inside Slipstream Processors) to make it small enough to be implemented inside a real processor.

Loop Transformation using SUIF - Automation of Loop Peeling, Loop Unrolling and Loop Splitting of input C code using a compiler infrastructure called SUIF.
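For readers unfamiliar with these transformations, the hand-written sketch below shows what each one does to a simple loop. It is illustrative only: the project operated on C code through SUIF, whereas this example uses Python, the function names are hypothetical, and "loop splitting" is assumed here to mean index-set splitting (one loop broken into two loops over adjacent index ranges). All four versions compute the same result.

```python
def baseline(a):
    """Original loop: the first iteration is a special case inside the body."""
    out = [0] * len(a)
    for i in range(len(a)):
        if i == 0:
            out[i] = a[i]
        else:
            out[i] = a[i] + a[i - 1]
    return out


def peeled(a):
    """Loop peeling: the first iteration is pulled out of the loop,
    which removes the branch from the loop body."""
    out = [0] * len(a)
    if a:
        out[0] = a[0]
    for i in range(1, len(a)):
        out[i] = a[i] + a[i - 1]
    return out


def unrolled(a):
    """Loop unrolling by 2 (applied to the peeled loop), with a
    remainder loop for a leftover iteration."""
    n = len(a)
    out = [0] * n
    if n:
        out[0] = a[0]
    i = 1
    while i + 1 < n:
        out[i] = a[i] + a[i - 1]
        out[i + 1] = a[i + 1] + a[i]
        i += 2
    while i < n:  # remainder iteration, if any
        out[i] = a[i] + a[i - 1]
        i += 1
    return out


def split(a, mid):
    """Index-set splitting: one loop over [1, n) becomes two loops over
    [1, mid) and [mid, n)."""
    n = len(a)
    out = [0] * n
    if n:
        out[0] = a[0]
    mid = max(1, min(mid, n))  # keep the split point inside the range
    for i in range(1, mid):
        out[i] = a[i] + a[i - 1]
    for i in range(mid, n):
        out[i] = a[i] + a[i - 1]
    return out
```

Peeling is the special case of splitting in which the first range contains a single iteration.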