RAMP Gold comprises about 36,000 lines of SystemVerilog with minimal third-party IP blocks. Our first production system targets the Xilinx Virtex-5 and is deployed on a low-cost XUP board.

RAMP Gold employs many advanced FPGA optimizations and is designed from the ground up with reliability in mind. The following figure shows the structure of RAMP Gold. The timing and functional models are both host-multithreaded. The functional model maintains architected state and correctly executes the ISA. The timing model determines how much time an instruction takes to execute in the target machine and schedules threads to execute on the functional model accordingly. The interface between the functional and timing models is designed to be simple and extensible to facilitate rapid evaluation of alternative target memory hierarchies and microarchitectures.

Functional Model

The functional model is a 64-thread feed-through pipeline with each thread simulating an independent target core. The functional model supports the full SPARC V8 ISA in hardware, including floating point and precise exceptions. It also has sufficient hardware to run an operating system, including MMUs, timers, and interrupt controllers. We have validated the functional model using the SPARC V8 certification suite from SPARC International, and we can boot the Linux 2.6.21 kernel as well as ROS, a prototype manycore research operating system.

Timing Model

The timing model tracks the performance of the 64 target cores. Our initial target processor is an in-order single issue core that sustains one instruction per cycle, except for instruction and data cache misses. Each target core has private L1 instruction and data caches. The cores share a lockup-free L2 cache via a nonblocking crossbar interconnect. Each L2 bank connects to a DRAM controller, which models delay through a first-come-first-serve queue with a fixed service rate. A detailed model of cache coherence timing on realistic interconnects is among our future work; we expect it to fit within the current design framework. Only the timing model has to change to model a wide variety of systems, amortizing the considerable design effort of the functional model. For example, we can model performance of a system with large caches by keeping only the cache metadata in the timing model. In our current timing model, we store all L1 and L2 cache tags in a large number of BRAMs on FPGAs and leverage the parallelism in the circuit to perform multiple highly associative lookups in one host cycle. On the Virtex-5 LX110T FPGA, the design supports up to 12MB of total target cache.

Most of the timing model parameters can be configured at runtime by writing control registers that reside on the I/O bus. Among these are the size, block size, and associativity of L1 and L2 caches, the number of L2 cache banks and their latencies, and DRAM bandwidth and latency.

To measure target performance, we implement 657 64-bit hardware performance counters. Among the events we count are target L1 and L2 cache hits, misses, and writebacks; instructions retired by type; and target clock cycles. We also have a number of host performance counters to measure the simulator’s performance itself. The counters are accessed via a ring interconnect to ease routing pressure.

The timing model largely consists of behavioral SystemVerilog and relies on the memory compiler for BRAM allocation. This coding style enables the rapid prototyping of different microarchitectural ideas. Leveraging many high-level language constructs in SystemVerilog, our manycore timing model comprises only 1,000 lines of code.

Implementation Highlights

  • Works on a single $750 low cost FPGA board.
  • No special CAD tools required (see Document).
  • Two orders of magnitude faster than Simics+GEMS (see RAMP Gold papers)
  • Runtime configurable timing model parameters without resynthesis
  • Full RTL verification environment and software infrastructure (compilers, OS, applications and etc.)