Floating-Point literature is rich with implementation tricks and concepts to accelerate FP calculations. For example, let's look at multipliers. Multipliers are an important arithmetic component of FPUs and consume sizable area and energy resources. A double precision fused multiply-add unit's multiplier can account for 45% of its energy consumption.
Multiplication is usually implemented by summing up partial products (PPs), or shifted multiplicands scaled by multiplier digits, similar to how elementary school students perform long multiplication. For example when multiplying 123 x 214, the partial products are: 3 x 214=642, 20 x 214=4280 and 100 x 214=21400 and the final product is 642 + 4280 + 21400 = 26322. Similarly, binary multiplication is divided into the generation of PPs and the addition of PPs.
For multiplying two numbers x and y, the simplest partial product generation method is to ANDs each multiplicand x with the ith bit from the multiplier y to create n partial products. Each partial product PPi is logically shifted by i positions to the left. The final result is the summation of these n vectors.
One can half the number of partial products by generating a PP for every group of 2 bits instead of a single bit. As such, each PP can can be either 0, x, 2x or 3x . Generating the 2x multiple is easy by shifting x one bit to the left. However generating the 3x multiple is harder and adds to the multiplier delay.
Modified Booth 2 algorithms address this issue by looking at overlapping groups of 3 multiplier bits (y2(i + 1) − 1:2i − 1) and generating multiples from -2x, -x, 0, x, 2x. This means that there are fewer multiples to generate because the negative multiples are generated by inverting the positive multiples and adding one at the LSB position (-x = ~x + 1 and -2x = ~ (2x) +1).
This idea can be generalized by looking at more bits for generating every partial product. While this reduces the number of partial product, the logic for choosing between different multiples doubles for every extra bit limiting the utility of this approach. This is illustrated in the following table that shows the supported PP generation options.
Comparison of optimized implementations of unsigned group encoding and modified Booth encoding schemes. All areas and delays are reported in comparison to the area and delay of one CSA cell. Modified Booth 2 has the lowest delay for most values of n while modified Booth 3 provides 10% improvement in area and thus energy for large values of n. Unsigned group encoding 2 and modified Booth 4 give similar area savings but have worse delays than modified Booth 3.
Once partial products are generated, they are successively reduced using carry save adders (CSA) until only 2 partial products are left which are then fed into a carry propagate adder (CPA) to generate the final result. The partial products may be reduced serially using arrays interconnection which has minimum wires and maximum delay or in parallel using irregular Wallace tree which has maximum wire length and minimum delay. Intermediate approaches of regular trees such as overturned staircase or ZM trees are also supported as PP reduction options.