Floating Point Representation

Course Content Specification

So far we have used binary to represent integers (whole numbers) and two's complement to store negative numbers. But there is a problem: at present we cannot store real numbers (numbers with decimal portions). It is also cumbersome to store large integers.

We are accustomed to using fixed point notation, where the decimal point stays in one place: any digits to the left of the decimal point are the integer part and any digits to the right are the decimal portion.

E.g. 10.75 

10 is the integer portion and 0.75 is the decimal portion. To get around the problem of storing numbers like this, the computer uses floating point representation.

The number above demonstrates the location of the mantissa and exponent.

So in this instance the mantissa would be 25 and the exponent would be 5

The computer only stores the mantissa and the exponent. It does not need to store the base, as it already knows that this will always be 2. This saves memory.
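As a rough illustration of this split, Python's standard math.frexp function breaks a number into a mantissa and a base 2 exponent. This is only a sketch of the idea: it shows the two values as ordinary Python numbers rather than as stored bits, and 10.75 is simply the example number used above.

```python
import math

# math.frexp returns (mantissa, exponent) such that
# value == mantissa * 2**exponent and 0.5 <= mantissa < 1.
mantissa, exponent = math.frexp(10.75)

print(mantissa, exponent)        # 0.671875 4
print(mantissa * 2 ** exponent)  # 10.75
```

The base (2) never appears in the result, only the mantissa and the exponent.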

Advantage of floating point

It takes up less space for storing large numbers and allows real numbers to be stored.

Disadvantage of floating point

The main disadvantage of floating point is that the computer has to split its storage space between the mantissa and the exponent. If too few bits are assigned to the mantissa, some numbers cannot be stored exactly and rounding errors occur. There has to be a tradeoff between accuracy and the range of numbers that we can represent.
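Rounding errors of this kind are easy to see on a real machine. The short sketch below assumes that Python floats are 64 bit floating point numbers (true on ordinary desktop installs) and uses the standard fractions module to show the value that is actually stored.

```python
from fractions import Fraction

# 0.1 has no exact binary fraction, so the stored mantissa is rounded.
stored = Fraction(0.1)             # the exact value Python actually holds
print(stored == Fraction(1, 10))   # False

# The small errors show up when values are combined.
print(0.1 + 0.2 == 0.3)            # False

# Fractions built from halves and quarters fit the mantissa exactly.
print(0.5 + 0.25 == 0.75)          # True
```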

Floating Point Number Structure

There is an IEEE (Institute of Electrical and Electronics Engineers) standard that defines the structure of a floating point number. It is IEEE 754-2008. It defines 4 sizes of floating point numbers.

A 32 bit floating number (single precision) has the following structure.
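In the IEEE 754 standard itself, a single precision number is split into 1 sign bit, 8 exponent bits and 23 mantissa bits. The sketch below uses Python's standard struct module to pull those three fields out of a real 32 bit float; the function name single_precision_fields is made up for this example. Be aware that the real standard stores the exponent with an offset of 127 (not two's complement) and leaves out the leading 1 of the mantissa, so the bit patterns differ from the simplified layout used in the worked examples in these notes.

```python
import struct

def single_precision_fields(value):
    """Return the (sign, exponent, mantissa) bit strings of a 32 bit float."""
    # Pack the number as a big-endian 32 bit float, then reread the same
    # four bytes as an unsigned integer so the bits can be printed.
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    bits = format(raw, "032b")
    return bits[0], bits[1:9], bits[9:]

sign, exponent, mantissa = single_precision_fields(10.75)
print(sign, exponent, mantissa)
# 0 10000010 01011000000000000000000
```

Here the exponent field 10000010 is 130, which is the true exponent 3 plus the offset 127.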

Lesson Video - Floating Point Representation

Floating Point Representations.MP4

A worked example

In decimal first

250.03125

First you convert the integer part of the mantissa into binary (as you have done previously)

250 = 1111 1010

Now convert the decimal portion of the mantissa (although this would usually be done for you in the exam).

Decimal fraction => .03125

Multiply the fraction by 2. The whole number part of the result (0 or 1) is the next binary digit and the remainder (r) is carried forward to the next step. Continue until the remainder is 0.

0.03125 * 2 = 0 r 0.0625

0.0625 * 2 = 0 r 0.125

0.125 * 2 = 0 r 0.25

0.25 * 2 = 0 r 0.5

0.5 * 2 = 1 r 0

Binary fraction = 0.00001
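The repeated multiplication above can be sketched as a short Python function. The name fraction_to_binary and the max_bits limit are assumptions made for this example; the limit is there because some fractions (such as 0.1) never reach a remainder of 0.

```python
def fraction_to_binary(fraction, max_bits=16):
    """Convert a decimal fraction (0 <= fraction < 1) to binary digits."""
    bits = ""
    while fraction != 0 and len(bits) < max_bits:
        fraction *= 2
        if fraction >= 1:
            bits += "1"      # whole number part is 1
            fraction -= 1    # carry the remainder forward
        else:
            bits += "0"      # whole number part is 0
    return bits

print(fraction_to_binary(0.03125))  # 00001
```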

So far we have: 1111 1010.00001 (250.03125)

But we need it in the format .1111 1010 0000 1 (the decimal point to the left of the first 1). The point has moved 8 places to the left, so the exponent is 8.

So back to our example

Sign Bit = 0 (the number is positive)

Mantissa = .1111 1010 0000 1 (the binary digits of 250.03125 with the point moved to the front)

Exponent = 0000 1000 (8)
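The whole conversion can be sketched in Python as follows. This is a sketch under stated assumptions rather than a standard routine: the function name to_floating_point is made up, and the default sizes (a 15 bit mantissa and an 8 bit two's complement exponent) are taken from the layout used in the worked examples in these notes, not from IEEE 754.

```python
def to_floating_point(value, mantissa_bits=15, exponent_bits=8):
    """Return (sign, mantissa, exponent) bit strings for a non-zero number."""
    if value == 0:
        raise ValueError("zero has no leading 1 to normalise on")
    sign = "1" if value < 0 else "0"
    value = abs(value)
    exponent = 0
    # Move the point until 0.5 <= value < 1, so the first mantissa bit is 1.
    while value >= 1:
        value /= 2
        exponent += 1
    while value < 0.5:
        value *= 2
        exponent -= 1
    # Peel off mantissa bits by repeated doubling; bits that do not fit are lost.
    mantissa = ""
    for _ in range(mantissa_bits):
        value *= 2
        bit = int(value)
        mantissa += str(bit)
        value -= bit
    # Masking to exponent_bits gives the two's complement bit pattern.
    mask = 2 ** exponent_bits - 1
    exponent_field = format(exponent & mask, f"0{exponent_bits}b")
    return sign, mantissa, exponent_field

print(to_floating_point(250.03125))
# ('0', '111110100000100', '00001000')
```

The unused mantissa bits on the right are simply filled with zeros.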

What about small numbers?

If we are trying to convert the number: 0.0625

In binary this would be 0.0001

The leading bit after the . has to be a 1, so this time the decimal point has to move 3 places to the right (giving .1), which means the exponent is a negative number.

So as the exponent is -3, this would be stored using two's complement notation: 1111 1101
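As a reminder of how a negative exponent becomes a bit pattern, here is a minimal two's complement sketch in Python. The function name twos_complement and the default width of 8 bits are assumptions for this example, chosen to match the 8 bit exponents used in these notes.

```python
def twos_complement(number, bits=8):
    """Return the two's complement bit pattern of number in the given width."""
    low, high = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    if not low <= number <= high:
        raise ValueError(f"{number} does not fit in {bits} bits")
    # Adding 2**bits to a negative number gives its two's complement pattern.
    if number < 0:
        number += 2 ** bits
    return format(number, f"0{bits}b")

print(twos_complement(8))    # 00001000
print(twos_complement(-1))   # 11111111
print(twos_complement(-9))   # 11110111
```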

Tutorial Video on creating the Decimal Portion

Although it is unlikely to be asked in the exam, I have put together a small video on how to create the decimal portion of a floating point number, such as the 0.5 in 12.5.

Creating the decimal portion.MP4

Floating Point Worked Examples

Worked Example 1

We are using 102.9375 = 1100110.1111

Sign = 0 (+ve)

Number = 1100110.1111 -> Needs to be .11001101111

Exponent = 7 = 0000 0111

Number = 0  110011011110000 00000111 

Worked Example 2

We are using 250.75 = 11111010.11

Sign = 0 (+ve)

Number = 11111010.11 -> Needs to be .1111101011

Exponent = 8 = 00001000

Number = 0  111110101100000 00001000 

Worked Example 3 - negative exponent

We are using 0.0009765625 = 0.0000000001

Sign = 0 (+ve)

Number = 0.0000000001 -> Needs to be .1

Exponent = -9 = 1111 0111

Number = 0  100000000000000 11110111
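One way to check answers like these is to decode the bits back into a decimal number. The sketch below assumes the same layout as the worked examples (a sign bit, a mantissa whose point sits to the left of the first bit, and a two's complement exponent); the function name from_floating_point is made up for this example.

```python
def from_floating_point(sign, mantissa, exponent):
    """Decode sign, mantissa and exponent bit strings back into a number."""
    # Mantissa bits are worth 1/2, 1/4, 1/8, ... reading left to right.
    fraction = sum(
        int(bit) / 2 ** place for place, bit in enumerate(mantissa, start=1)
    )
    # The exponent is two's complement, so a leading 1 means it is negative.
    power = int(exponent, 2)
    if exponent[0] == "1":
        power -= 2 ** len(exponent)
    value = fraction * 2 ** power
    return -value if sign == "1" else value

print(from_floating_point("0", "110011011110000", "00000111"))  # 102.9375
print(from_floating_point("0", "111110101100000", "00001000"))  # 250.75
```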

Further Practice

Convert the following decimal numbers into single precision floating point numbers:

Allocating more bits to Mantissa

As can be seen from the example earlier, the more bits we have for the mantissa, the more accurately we can represent decimal fractions.

Allocating more bits to Exponent

The more bits we have for the exponent, the more places we can move the decimal point, which means we can represent a larger range of numbers.

Mnemonic:  MARE (Mantissa Accuracy Range Exponent)
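The two halves of the mnemonic can be tried out with a small Python sketch. Both function names are made up for this example, the first function assumes a positive number, and bits that do not fit in the mantissa are simply dropped (real hardware usually rounds instead).

```python
import math

def store_with_mantissa_bits(value, mantissa_bits):
    """Return a positive value as kept with only mantissa_bits of mantissa."""
    mantissa, exponent = math.frexp(value)   # value == mantissa * 2**exponent
    scale = 2 ** mantissa_bits
    truncated = math.floor(mantissa * scale) / scale  # drop bits that do not fit
    return truncated * 2 ** exponent

def exponent_range(exponent_bits):
    """Smallest and largest exponent a two's complement field can hold."""
    return -(2 ** (exponent_bits - 1)), 2 ** (exponent_bits - 1) - 1

# Mantissa = Accuracy: with 8 bits the fraction part is lost.
print(store_with_mantissa_bits(250.03125, 8))    # 250.0
print(store_with_mantissa_bits(250.03125, 15))   # 250.03125

# Exponent = Range: more bits let the point move further.
print(exponent_range(8))                         # (-128, 127)
```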