The FPU test chip, FPMax, implements four efficient FPUs (floating-point multiply-add units) generated by the FPU generator. Specifically, we pick one configuration from the efficient frontier of each {single precision, double precision} x {throughput, latency} application. Using an advanced 28nm SOI technology, we aim to show that our FPUs deliver top-tier energy and area efficiency. The chip was submitted for fabrication in August 2013. More updates will be provided later.
We use the 28nm fully-depleted SOI (FDSOI) technology from STMicroelectronics, specifically the flip-well technology. It offers a wide body-bias range (from -0.3V to 3.0V) and thus the flexibility to trade off performance against leakage. See the research paper for more details: Ultra-wide body-bias range LDPC decoder in 28nm UTBB FDSOI technology
The following diagram shows the system architecture of the chip. A few points to note:
The four FPUs are on different power domains (PD_0 to PD_3) separated from the main power domain (PD_MAIN).
Only one of the four FPUs is selected and enabled by FPControl via DPSelector at any time, while the others stay inactive.
Several RAM blocks are implemented to interleave on-chip testing with off-chip communication (through JTAG). The FPU fetches data from the operand RAM and writes results to the result RAM in sync with the system clock, while programming and checking are done through JTAG.
We avoid storing operands directly in the instructions. Instead, operands are stored in the operand RAMs, and only the indices of the operands are encoded in the instructions. As a result, the total number of bits held in high-speed registers is greatly reduced.
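To make the operand-indexing point concrete, here is a minimal Python sketch of the idea. It is only an illustration under stated assumptions, not the chip's actual instruction format: the RAM depth (`RAM_DEPTH`), the field layout, and the helper names `encode`, `decode`, and `execute` are all hypothetical. It assumes a double-precision multiply-add with three operands (a * b + c).

```python
# Illustrative model of index-based instructions; all sizes and names
# below are assumptions for this sketch, not taken from the FPMax design.

OPERAND_BITS = 64                           # width of one double-precision operand
NUM_OPERANDS = 3                            # multiply-add: a * b + c
RAM_DEPTH = 64                              # hypothetical operand RAM depth
INDEX_BITS = (RAM_DEPTH - 1).bit_length()   # bits needed to address one entry


def encode(idx_a, idx_b, idx_c):
    """Pack three operand RAM indices into one instruction word."""
    for idx in (idx_a, idx_b, idx_c):
        if not 0 <= idx < RAM_DEPTH:
            raise ValueError(f"operand index {idx} out of range")
    return (idx_a << (2 * INDEX_BITS)) | (idx_b << INDEX_BITS) | idx_c


def decode(word):
    """Unpack an instruction word back into its three operand indices."""
    mask = (1 << INDEX_BITS) - 1
    return (
        (word >> (2 * INDEX_BITS)) & mask,
        (word >> INDEX_BITS) & mask,
        word & mask,
    )


def execute(word, operand_ram):
    """Look up the operands by index and compute a * b + c.

    Plain Python floats are used here, so the result is rounded twice;
    this does not model a fused multiply-add datapath.
    """
    a, b, c = (operand_ram[i] for i in decode(word))
    return a * b + c


# Operand bits carried per instruction under each scheme.
bits_direct = NUM_OPERANDS * OPERAND_BITS    # operands embedded in the instruction
bits_indexed = NUM_OPERANDS * INDEX_BITS     # only indices embedded
```

With these assumed sizes, each instruction carries 18 bits of operand indices instead of 192 bits of operand data, which is the kind of reduction in high-speed register bits described above. The actual saving on the chip depends on the real RAM depth and instruction format.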
The following figure shows the floorplan of the chip in the CAD tool. The four FPUs sit in the four corners of the chip, and the Clock Gen lies in the middle of the left edge.