Floating Point Representation
Course Content Specification
Describe and exemplify floating-point representation of positive and negative real numbers, using the terms mantissa and exponent.
Describe the relationship between the number of bits assigned to the mantissa/exponent, and the range and precision of floating-point numbers.
So far we have used binary to represent integers (whole numbers) and two's complement to store negative numbers. But there is a problem: we cannot yet store real numbers (numbers with decimal portions). It is also cumbersome to store very large integers.
We are accustomed to using fixed point notation, where the decimal point is fixed: any digits to the right of the decimal point are the decimal portion and any digits to the left are the integer portion.
E.g. 10.75
10 is the integer portion and 0.75 is the decimal portion. To store real numbers and very large numbers, the computer instead uses Floating Point Representation.
The number above demonstrates the location of the mantissa and exponent.
So in this instance the mantissa would be 25 and the exponent would be 5
The computer only stores the mantissa and the exponent. It does not need to store the base, because the base is always 2, and this saves memory.
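As a rough illustration of storing only a mantissa and an exponent, the sketch below uses Python's standard math.frexp, which splits a number into a mantissa m (with 0.5 <= m < 1 for positive numbers) and a base 2 exponent. This is only an analogy for the idea described here, not the exact layout the hardware uses; the sample value 10.75 is the one from earlier in these notes.

```python
import math

value = 10.75

# frexp returns (mantissa, exponent) such that value == mantissa * 2**exponent
mantissa, exponent = math.frexp(value)

print(mantissa, exponent)                # 0.671875 4
# ldexp is the inverse: it rebuilds the value from the two stored parts
print(math.ldexp(mantissa, exponent))    # 10.75
```

The base 2 never appears in the stored pair; it is implied by the calculation that rebuilds the value.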
Advantage of floating point
It takes up less space for storing large numbers and allows real numbers to be stored.
Disadvantage of floating point
The main disadvantage of floating point is that the computer has to split its storage space between the mantissa and the exponent. If too few bits are assigned to the mantissa, numbers have to be rounded to fit, which causes rounding errors. There has to be a tradeoff between accuracy and the range of numbers that we can represent.
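To make the rounding problem concrete, here is a minimal sketch that keeps only a chosen number of mantissa bits and throws the rest away. The helper name store_with_mantissa_bits is hypothetical, and it assumes positive values and simple truncation rather than any particular rounding rule.

```python
import math

def store_with_mantissa_bits(value, bits):
    """Keep only `bits` binary digits of the mantissa of a positive value."""
    mantissa, exponent = math.frexp(value)       # value == mantissa * 2**exponent
    scaled = math.floor(mantissa * 2 ** bits)    # digits that do not fit are dropped
    return math.ldexp(scaled / 2 ** bits, exponent)

for bits in (4, 8, 16):
    stored = store_with_mantissa_bits(0.1, bits)
    # Print the bit count, the value actually stored, and the error
    print(bits, stored, abs(0.1 - stored))
```

With 4 mantissa bits the value 0.1 is stored as 0.09375, and with 8 bits as 0.099609375, so the error shrinks as the mantissa grows.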
Floating Point Number Structure
There is an IEEE (Institute of Electrical and Electronics Engineers) standard that defines the structure of a floating point number: IEEE 754-2008. It defines 4 sizes of floating point numbers:
16 bit sometimes known as Half precision
32 bit sometimes known as Single precision
64 bit sometimes known as Double precision
128 bit sometimes known as Quadruple precision
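As a quick check of these sizes, Python's standard struct module has format codes for three of the four formats (e for half, f for single, d for double). It has no code for quadruple precision, so that size is left out of this sketch.

```python
import struct

# struct format codes: e = half precision, f = single, d = double
for name, code in (("half", "e"), ("single", "f"), ("double", "d")):
    size_in_bits = struct.calcsize("<" + code) * 8   # calcsize gives bytes
    print(name, size_in_bits)
```

This prints 16, 32 and 64 bits for half, single and double precision respectively.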
A 32 bit floating number (single precision) has the following structure.
Sign bit and Mantissa (24 bits) - the sign part is considered part of the mantissa
Exponent 8 bits
A worked example
In decimal first
250.03125
First you convert the integer part of the number into binary (as you have done previously)
250 = 1111 1010
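Now to convert the decimal portion of the number (although this would usually be done for you in the exam).
Decimal fraction => .03125
Multiply the fraction by 2. The whole-number part of the result (0 or 1) is the next binary digit, and the fractional part is carried forward as the remainder (r). Continue until the remainder is 0, then read the digits from top to bottom.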
0.03125 * 2 = 0 r 0.0625
0.0625 * 2 = 0 r 0.125
0.125 * 2 = 0 r 0.25
0.25 * 2 = 0 r 0.5
0.5 * 2 = 1 r 0
Binary fraction = 0.00001
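The repeated doubling steps above can be sketched in a few lines of Python. The function name fraction_to_binary and the max_digits limit are assumptions made for this illustration; the limit is there because many fractions (such as 0.1) never reach a remainder of 0.

```python
def fraction_to_binary(fraction, max_digits=16):
    """Convert a decimal fraction (0 <= fraction < 1) to binary by repeated doubling."""
    digits = ""
    while fraction > 0 and len(digits) < max_digits:
        fraction *= 2
        if fraction >= 1:
            digits += "1"     # the whole-number part becomes the next binary digit
            fraction -= 1     # only the fractional part is carried forward
        else:
            digits += "0"
    return "0." + (digits or "0")

print(fraction_to_binary(0.03125))   # 0.00001
print(fraction_to_binary(0.75))      # 0.11
```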
So far we have : 1111 1010.00001 (250.03125)
But we need it in the format .1111 1010 0000 1 (the point moved to the left of the first 1). The point has moved 8 places to the left, so the exponent is 8.
So back to our example
Sign Bit = 0 (the number is positive)
Mantissa = .1111 1010 0000 1
Exponent = 0000 1000 (8)
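The step of moving the point can also be written as a short sketch. The function name normalise is hypothetical, and it assumes the input is a non-zero binary string that contains a point.

```python
def normalise(binary):
    """Move the point to just left of the first 1; return (mantissa_digits, exponent)."""
    whole, _, frac = binary.partition(".")
    digits = whole + frac
    point = len(whole)                  # where the point currently sits in `digits`
    first_one = digits.index("1")       # position of the first 1
    mantissa = digits[first_one:].rstrip("0")
    exponent = point - first_one        # number of places the point moved left
    return mantissa, exponent

print(normalise("11111010.00001"))   # ('1111101000001', 8)
print(normalise("1010.11"))          # ('101011', 4)
```

The first call uses 250.03125 from the worked example and the second uses 10.75 from earlier in these notes.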
What about small numbers?
If we are trying to convert the number: 0.0625
In binary this would be 0.0001
The leading bit after the . has to be a 1, so this time the point has to move 3 places to the right to give .1, which means the exponent is negative.
So as the exponent is -3 this would be stored using two's complement notation: 1111 1101
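As a reminder of how a negative exponent looks in two's complement, here is a minimal sketch. The function name exponent_bits and the default width of 8 bits are assumptions chosen to match the 8 bit exponent used in these notes.

```python
def exponent_bits(exponent, width=8):
    """Return `exponent` as a two's complement bit string of `width` bits."""
    low, high = -(2 ** (width - 1)), 2 ** (width - 1) - 1
    if not low <= exponent <= high:
        raise ValueError("exponent does not fit in the given width")
    # Masking with 2**width - 1 wraps negative values into two's complement form
    return format(exponent & (2 ** width - 1), "0{}b".format(width))

print(exponent_bits(8))    # 00001000
print(exponent_bits(-2))   # 11111110
print(exponent_bits(-3))   # 11111101
```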
Tutorial Video on creating the Decimal Portion
Although it is unlikely to be asked in the exam, I have put together a small video on how to create the decimal portion of a floating point number, such as the 0.5 in 12.5.
Floating Point Worked Examples
Worked Example 1
We are using
8 bits for the exponent
16 bits for the mantissa
(1 of which is the sign bit)
102.9375 = 1100110.1111
Sign = 0 (+ve)
Number = 1100110.1111 -> Needs to be .11001101111
Exponent = 7 = 0000 0111
Number = 0 110011011110000 00000111
Worked Example 2
We are using
8 bits for the exponent
16 bits for the mantissa
(1 of which is the sign bit)
250.75 = 11111010.11
Sign = 0 (+ve)
Number = 11111010.11 -> Needs to be .1111101011
Exponent = 8 = 00001000
Number = 0 111110101100000 00001000
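The two worked examples so far can be checked with a short sketch. It assumes the layout used in these examples (1 sign bit, 15 further mantissa bits normalised to .1xxx, and an 8 bit two's complement exponent); the function name encode is hypothetical, and it does not check that the exponent fits in 8 bits.

```python
import math

def encode(value, mantissa_bits=15, exponent_bits=8):
    """Encode a non-zero value as 'sign mantissa exponent' bit strings."""
    sign = "1" if value < 0 else "0"
    mantissa, exponent = math.frexp(abs(value))     # 0.5 <= mantissa < 1
    scaled = int(mantissa * 2 ** mantissa_bits)     # digits that do not fit are dropped
    mantissa_str = format(scaled, "0{}b".format(mantissa_bits))
    # Masking turns a negative exponent into its two's complement bit pattern
    exponent_str = format(exponent & (2 ** exponent_bits - 1),
                          "0{}b".format(exponent_bits))
    return sign + " " + mantissa_str + " " + exponent_str

print(encode(102.9375))   # 0 110011011110000 00000111
print(encode(250.75))     # 0 111110101100000 00001000
```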
Worked Example 3 - negative exponent
We are using
8 bits for the exponent
16 bits for the mantissa
(1 of which is the sign bit)
0.0009765625 = 0.0000000001
Sign = 0 (+ve)
Number = 0.0000000001 -> Needs to be .1
Exponent = -9 = 1111 0111
Number = 0 100000000000000 11110111
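Going the other way is a useful check on a result like this one. The sketch below rebuilds a value from its sign, mantissa and exponent bits, assuming the same layout as the worked examples; the function name decode is hypothetical. The first call uses the sign 0, the mantissa .1 padded to 15 bits, and the exponent 1111 0111 from this example.

```python
def decode(sign, mantissa, exponent):
    """Rebuild the value from sign, mantissa and two's complement exponent bits."""
    # The mantissa digits sit to the right of the point: .1xxx
    fraction = int(mantissa, 2) / 2 ** len(mantissa)
    power = int(exponent, 2)
    if exponent[0] == "1":              # a leading 1 marks a negative exponent
        power -= 2 ** len(exponent)
    value = fraction * 2 ** power
    return -value if sign == "1" else value

print(decode("0", "100000000000000", "11110111"))   # 0.0009765625
print(decode("0", "111110101100000", "00001000"))   # 250.75
```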
Allocating more bits to Mantissa
As can be seen from the examples earlier, the more bits we allocate to the mantissa, the more accurately (precisely) we can represent numbers, including decimal fractions.
Allocating more bits to Exponent
Whereas the more bits we allocate to the exponent, the more places we can move the point, which means we can represent a larger range of numbers.
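The tradeoff can be sketched numerically. The example below assumes 23 bits shared between the mantissa and a two's complement exponent (plus a sign bit), in the style of the worked examples; the function name range_and_precision and the three splits are chosen purely for illustration.

```python
def range_and_precision(mantissa_bits, exponent_bits):
    """Return (largest value, smallest mantissa step) for one bit allocation."""
    largest_exponent = 2 ** (exponent_bits - 1) - 1     # two's complement maximum
    largest_mantissa = 1 - 2.0 ** -mantissa_bits        # .111...1 in binary
    largest_value = largest_mantissa * 2.0 ** largest_exponent
    mantissa_step = 2.0 ** -mantissa_bits               # gap between neighbouring mantissas
    return largest_value, mantissa_step

for mantissa_bits, exponent_bits in ((19, 4), (15, 8), (13, 10)):
    print(mantissa_bits, exponent_bits,
          range_and_precision(mantissa_bits, exponent_bits))
```

Moving bits from the mantissa to the exponent makes the largest value grow very quickly, while the mantissa step gets bigger, so the numbers are stored less precisely.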
Mnemonic: MARE (Mantissa Accuracy Range Exponent)