Every beginning programmer, after finishing their "hello-world" program and wanting to do some useful calculation, is told to use these weird keywords, float or double, to represent numbers. This unfortunate naming is based on how computers approximate real numbers. The ability to perform billions of calculations per second on real numbers makes modern computers extremely useful. 3D graphics, virtual worlds for gaming, scientific simulations, and computer-aided design (CAD) tools are just a subset of the applications made possible by this capability. Computers use binary floating-point numbers as a finite approximation to the normalized scientific representation of real numbers. In the past, several formats with different precisions and conventions were used, but fortunately the formats and the operations defined on them were standardized by IEEE 754 in 1985 and universally adopted by industry. The most common formats are the 32-bit single-precision (float) and 64-bit double-precision (double) formats. Computer performance is routinely measured in FLoating-point Operations Per Second (FLOPS), with current top supercomputers achieving performance in excess of 10 petaFLOPS.
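As a concrete illustration of this standardized encoding, the following sketch (the example value 6.25 is arbitrary, not taken from the text) decodes the three fields of an IEEE 754 single-precision number: a sign bit, an 8-bit exponent biased by 127, and a 23-bit fraction with an implicit leading one.

```c
/* Minimal sketch: decode the sign, exponent, and fraction fields of an
 * IEEE 754 single-precision (32-bit) float. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float x = 6.25f;              /* 6.25 = 1.5625 * 2^2 in normalized form */
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);

    uint32_t sign     = bits >> 31;           /* 1 bit                     */
    uint32_t exponent = (bits >> 23) & 0xFF;  /* 8 bits, biased by 127     */
    uint32_t fraction = bits & 0x7FFFFF;      /* 23 bits, implicit leading 1 */

    printf("sign=%u exponent=%u (unbiased %d) fraction=0x%06X\n",
           sign, exponent, (int)exponent - 127, fraction);
    /* Prints: sign=0 exponent=129 (unbiased 2) fraction=0x480000 */
    return 0;
}
```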
Initially, a dedicated hardware unit for floating-point calculations could not fit on processor chips and had to be implemented on a separate chip, starting with the Intel 8087 co-processor chip in 1980. With technology scaling, it became economical for subsequent processors to integrate the floating-point unit (FPU) on the same chip, starting with the Intel 486 processor in 1989. In 1990, IBM introduced the RS/6000, the first processor with an FPU based on a fused multiply-add (FMA) dataflow, which computes A x B + C in a single operation, offering increased precision and speed. To this day, most processor designs implement their FPUs using separate multiplication and addition pipelines. This is starting to change since the FMA operation was added in the 2008 revision of the IEEE 754 floating-point standard. Recent Intel (Haswell) and AMD processors have added support for FMA instructions.
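The precision benefit of FMA comes from rounding only once: a separate multiply and add rounds the intermediate product before the addition. A small sketch (the operand values are illustrative, not from the text) makes the difference visible using the standard C library's fma():

```c
/* Sketch: fma(a, b, c) rounds once, while a*b + c rounds twice,
 * so the fused form can preserve a tiny result that the separate
 * operations lose entirely. Compile with -lm. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double a = 1.0 + pow(2.0, -27);
    double b = 1.0 - pow(2.0, -27);
    double c = -1.0;

    /* Exact result is -2^-54; rounding the product to double loses it. */
    double separate = a * b + c;      /* product rounds to 1.0, result 0.0 */
    double fused    = fma(a, b, c);   /* single rounding of the exact value */

    printf("separate: %.17e\n", separate);  /* 0.0                      */
    printf("fused:    %.17e\n", fused);     /* -5.55e-17, i.e. -2^-54   */
    return 0;
}
```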
Figure: Intel 8086 microprocessor on the same module with the Intel C8087 FPU co-processor chip, 1980 (source: http://www.cpu-galaxy.at)
Graphics Processing Units (GPUs) were introduced to provide dedicated hardware for accelerating 3D graphics performance. Initial designs were fixed-function and provided little programmability. However, recent designs employ thousands of mostly single-precision FPUs to provide the high throughput needed for parallel graphics workloads. This highly parallel compute power has been exposed for general-purpose programming through languages such as CUDA and OpenCL. On the other hand, scientific computations typically require double-precision FP calculations, which are commonly supplied by CPUs. Finally, there is increasing competition among mobile chips that perform fast and energy-efficient graphics calculations using only 16-bit half precision to save energy. Thus, different classes of applications require FPUs designed for different precisions, and the most efficient way to build a double-precision multiplier is not necessarily the best way to build a half-precision one. With this explosion in the variety of FPUs needed, the right approach is to encode designer knowledge and tricks into a generator and use it to choose the best parameters for each design configuration; this is why we built the FPU generator.
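One way to see why a single parameterized description can cover all of these precisions is that the IEEE 754 binary formats differ only in two widths: the exponent and fraction fields. The sketch below (the struct and names are illustrative, not the generator's actual interface) captures half, single, and double precision with just those two parameters:

```c
/* Sketch: the IEEE 754 binary formats as a two-parameter family
 * (exponent width, fraction width); sign bit and bias follow from these. */
#include <stdio.h>

typedef struct {
    const char *name;
    int exp_bits;   /* exponent field width                              */
    int frac_bits;  /* fraction field width (excluding the implicit 1)   */
} fp_format;

int main(void) {
    const fp_format formats[] = {
        { "half",    5, 10 },  /* 16-bit */
        { "single",  8, 23 },  /* 32-bit */
        { "double", 11, 52 },  /* 64-bit */
    };

    for (int i = 0; i < 3; i++) {
        const fp_format *f = &formats[i];
        int total = 1 + f->exp_bits + f->frac_bits;   /* sign + exp + frac */
        int bias  = (1 << (f->exp_bits - 1)) - 1;     /* exponent bias     */
        printf("%-6s: %2d bits total, exponent bias %4d\n",
               f->name, total, bias);
    }
    return 0;
}
```

A hardware generator follows the same idea: the datapath is described once in terms of these widths, and each configuration (half, single, double, or anything in between) is produced by choosing the parameters rather than redesigning the unit by hand.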
Recent GPUs pack an order of magnitude higher theoretical single-precision floating-point performance than CPUs.