Combining the Booth type, the tree type, and the multiplier bit width exposes an interesting design tradeoff between power, area, and performance. Because generators are object-oriented, with a constructor for each module, we can support and explore any combination of the three. The generator also lets us encode valuable placement hints at the design stage, which dramatically improve the quality of results in terms of timing, area, and power.
The generator allows us to embed design knowledge at every level, from design abstraction down to layout realization, using alternative methods and algorithms.
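As an illustration of how such a parameterized generator exposes the design space, the following minimal sketch enumerates every combination of Booth type, tree type, and bit width discussed here. The class `MultiplierConfig`, its fields, and the function `enumerate_design_space` are hypothetical names chosen for this example; they are not the actual generator's interface.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class MultiplierConfig:
    """Hypothetical parameter bundle for one multiplier instance."""
    booth: int   # 2 for Booth 2, 3 for Booth 3
    tree: str    # summation tree topology
    width: int   # significand width in bits

    def name(self) -> str:
        return f"{self.tree}_booth{self.booth}_{self.width}b"


# Parameter values taken from the text; the string labels are assumptions.
BOOTH_TYPES = (2, 3)
TREE_TYPES = ("wallace", "overturned_staircase")
WIDTHS = (11, 24, 53, 113)  # half, single, double, quad precision


def enumerate_design_space():
    """Return one configuration per (Booth type, tree type, width) triple."""
    return [
        MultiplierConfig(booth, tree, width)
        for booth, tree, width in product(BOOTH_TYPES, TREE_TYPES, WIDTHS)
    ]


if __name__ == "__main__":
    for cfg in enumerate_design_space():
        print(cfg.name())
```

Each configuration would then be handed to the generator and the synthesis flow; that step is outside the scope of this sketch.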
Energy-delay tradeoff for half-precision multipliers (11-bit multipliers) in 45nm. Wallace Booth 2 is the most efficient design across the entire range.
Energy-delay tradeoff for single-precision multipliers (24-bit multipliers) in 45nm. Wallace Booth 2 remains the most efficient design across the entire range, since wire effects are negligible compared to gate strengths.
Energy-delay tradeoff for double-precision multipliers (53-bit multipliers) in 45nm. Because this larger design is dominated by wires, Overturned Staircase becomes more efficient than Wallace thanks to its shorter wires, and Booth 3 begins to appear as an alternative for low-energy designs.
Energy-delay tradeoff for quad-precision multipliers (113-bit multipliers) in 45nm. Because this larger design is dominated by wires, Overturned Staircase becomes more efficient than Wallace thanks to its shorter wires, and Booth 3 begins to appear as an alternative for low-energy designs.
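One reason Booth 3 becomes attractive at larger widths is that it produces fewer partial products for the tree to sum. The sketch below counts them, assuming an unsigned N-bit significand and standard radix-4 (Booth 2) and radix-8 (Booth 3) recoding, for which the count is ceil((N + 1) / k) with k = 2 or 3. The function name `booth_partial_products` is hypothetical, and the sketch ignores the cost of the hard 3x multiple that Booth 3 requires.

```python
def booth_partial_products(width: int, booth: int) -> int:
    """Number of partial products for an unsigned `width`-bit multiplier.

    Assumes standard Booth recoding that retires `booth` multiplier bits
    per partial product (2 for Booth 2, 3 for Booth 3).
    """
    if width <= 0 or booth not in (2, 3):
        raise ValueError("width must be positive and booth must be 2 or 3")
    # Ceiling division of (width + 1) by booth, using integer arithmetic.
    return -(-(width + 1) // booth)


if __name__ == "__main__":
    for width in (11, 24, 53, 113):
        b2 = booth_partial_products(width, 2)
        b3 = booth_partial_products(width, 3)
        print(f"{width:>3}-bit: Booth 2 -> {b2:>2}, Booth 3 -> {b3:>2}")
```

For the 53-bit case this gives 27 partial products for Booth 2 and 18 for Booth 3, so the absolute saving grows with the width.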
Using the smaller generators, we now put everything together for the FMA unit at the heart of our FPU generator and then explore the design space. At this level, our generator can produce either a fused multiply-add or a cascade architecture that allows early issue of accumulation-dependent instructions, with possibilities for more aggressive clock gating. We further support early forwarding of unrounded results, as first implemented in the Power6 FPU. To reduce dynamic power, all sub-units are clock gated based on instruction type (NOP, FADD, FMUL, FMADD).
Throughput tradeoffs for FMA and CMA: (left) single precision, (right) double precision. Clearly, fused architectures (FMAs) perform better, because cascade architectures trade additional logic for latency savings. Most designs on the Pareto curve are Booth 3, because Booth 3 minimizes area, especially for large N. The tree structure, on the other hand, matters less, since it mainly determines delay rather than area or energy.
Latency tradeoffs for FMA and CMA: (left) single precision, (right) double precision. The cascade designs (CMAs) perform better because the latency of a dependent instruction is smaller than in the fused architecture, and because the latency of add (without mul) instructions is also reduced. These results again represent thousands of runs across various configurations, but this time the designs on the Pareto curve are mostly Wallace trees (with a few OS1 entries) and mostly Booth 2. Pipe depth varied from 5 to 12, and up to 16 in some cases.
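Selecting the designs that make it to the Pareto curve from thousands of runs can be sketched as follows. This is a generic Pareto filter, not the authors' tooling: it assumes each run is summarized by two costs to be minimized (called delay and energy here) plus a label, and the function name `pareto_front` and all sample values are invented for illustration.

```python
def pareto_front(points):
    """Return the non-dominated (delay, energy, label) tuples.

    A point is dominated if another point is no worse in both costs and
    strictly better in at least one. Both costs are minimized.
    """
    front = []
    best_energy = float("inf")
    # Sorting by delay (then energy) means any earlier point is at least
    # as fast, so a point survives only if it strictly lowers the energy.
    for delay, energy, label in sorted(points):
        if energy < best_energy:
            front.append((delay, energy, label))
            best_energy = energy
    return front


if __name__ == "__main__":
    # Made-up numbers, purely to exercise the function.
    runs = [
        (1.0, 5.0, "design_a"),
        (2.0, 3.0, "design_b"),
        (2.5, 4.0, "design_c"),  # dominated by design_b
        (3.0, 1.0, "design_d"),
        (3.0, 2.0, "design_e"),  # dominated by design_d
    ]
    for delay, energy, label in pareto_front(runs):
        print(label, delay, energy)
```

The sort makes the filter O(n log n), which is comfortable for a few thousand runs.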