November 7

Post date: Nov 15, 2014 9:50:20 PM

Floating-point representation

In floating-point representation, whether decimal or binary, a number is made up of three sections: the sign (positive or negative), the shifter (how many places the decimal point should be shifted), and the fixed-point number (the digits, with the decimal point in a fixed position).

example 3.18

Shows the decimal number 7,425,000,000,000,000,000,000.00 in scientific notation.

Scientific notation=+7.425*10²¹ or +7.425E21.

example 3.19

Shows the number -0.000000000000000232 in scientific notation.

Scientific notation=-2.32*10^-14 or -2.32E-14.

example 3.20

Shows the number (1010010000000000000000000000000.00)₂ in floating-point format.

Solution=+1.01001*2³².

example 3.21

Shows the number -(0.00000000000000000000000101)₂ in floating-point format.

Solution=-1.01*2^-24.

Normalization

Decimal: +/- d.xxxxx (note: d is 1 to 9 and each x is 0 to 9)

Binary: +/- 1.yyyyy (note: each y is 0 or 1)
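
The binary normalized form can be sketched in Python as follows. This is only an illustration, not part of the original notes: the helper name normalize_binary is made up here, and it relies on the standard math.frexp function, which returns a significand between 0.5 and 1, so the result is rescaled to the 1.yyyyy form.

```python
import math

def normalize_binary(x):
    """Return (sign, significand, exponent) with
    x == sign * significand * 2**exponent and 1 <= significand < 2.
    Hypothetical helper; assumes x is a nonzero finite float."""
    sign = -1 if x < 0 else 1
    # math.frexp gives |x| = m * 2**e with 0.5 <= m < 1,
    # so double m and lower e by one to get the 1.yyyyy form.
    m, e = math.frexp(abs(x))
    return sign, m * 2, e - 1

print(normalize_binary(5.75))        # (1, 1.4375, 2), and 1.4375 is (1.0111) in binary
print(normalize_binary(-0.0234375))  # (-1, 1.5, -6), and 1.5 is (1.1) in binary
```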

Sign, Exponent, Mantissa

Sign: Stored using 1 bit (0 or 1).

Exponent: Power of 2 to shift the decimal point. Note that the power can be positive or negative. The Excess representation is the method used to store the exponent.

Mantissa: The binary integer to the right of the decimal point. The mantissa is a fractional part that, together with the sign, is treated like an integer stored in sign-and-magnitude representation.

The Excess system

In this system, both positive and negative integers are stored as unsigned integers. To represent a positive or negative integer, a positive integer (called a bias) is added to each number to shift them uniformly to the non-negative side. The value of this bias is 2^(m-1) - 1, where m is the size in bits of the memory location that stores the exponent. For m = 8 the bias is 2^7 - 1 = 127, which is where the name Excess_127 comes from.
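
A minimal sketch of the Excess system in Python, for illustration only. The function names bias, to_excess and from_excess are invented for this example, and the sketch covers the plain Excess system without any reserved exponent patterns.

```python
def bias(m):
    """Bias of an Excess system that uses m bits: 2**(m-1) - 1."""
    return 2 ** (m - 1) - 1

def to_excess(value, m):
    """Store a signed integer as an unsigned m-bit pattern by adding the bias."""
    stored = value + bias(m)
    if not 0 <= stored < 2 ** m:
        raise ValueError("value does not fit in the given number of bits")
    return format(stored, f"0{m}b")

def from_excess(bits):
    """Recover the signed integer by subtracting the bias."""
    return int(bits, 2) - bias(len(bits))

print(bias(8))                  # 127
print(to_excess(2, 8))          # 10000001
print(to_excess(-6, 8))         # 01111001
print(from_excess("10010100"))  # 21
```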

IEEE standards

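The Institute of Electrical and Electronics Engineers (IEEE) has defined several standards for storing floating-point numbers. The two most commonly used are single precision (32 bits: 1 sign bit, 8 exponent bits, 23 mantissa bits, Excess_127) and double precision (64 bits: 1 sign bit, 11 exponent bits, 52 mantissa bits, Excess_1023).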

Storing of IEEE standard floating point numbers

Store the sign in S (0 for positive, 1 for negative). Change the number to binary and normalize it. Then find the values of E and M, and finally concatenate S, E and M.

example 3.23

Show the Excess_127 representation of the decimal number 5.75.

a. S=0 (positive).

b. Binary=(101.11)₂.

c. Normalization=(1.0111)₂*2².

d. E=2+127=129=(10000001)₂, M=(0111)₂, padded with nineteen zeros on the right to make it 23 bits.

e. Hence, the representation becomes 01000000101110000000000000000000.

example 3.24

Show the Excess_127 representation of the decimal number -161.875.

a. S=1 (negative).

b. Binary=(10100001.111)₂.

c. Normalization=(1.0100001111)₂*2⁷.

d. E=7+127=134=(10000110)₂, M=(0100001111)₂, padded with thirteen zeros on the right to make it 23 bits.

e. Hence, the representation becomes 11000011001000011110000000000000.

example 3.25

Show the Excess_127 representation of the decimal number -0.0234375.

a. S=1 (negative).

b. Binary=(0.0000011)₂.

c. Normalization=(1.1)₂*2^-6.

d. E=-6+127=121=(01111001)₂, M=(1)₂, padded with twenty-two zeros on the right to make it 23 bits.

e. Hence, the representation becomes 10111100110000000000000000000000.
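
The storing steps used in examples 3.23 to 3.25 can be sketched in Python. This is an illustration under stated assumptions, not a full IEEE implementation: the names to_excess_127 and reference_bits are made up here, the input must be nonzero, finite, inside the normalized single-precision range, and exactly representable in 23 mantissa bits (extra bits would simply be truncated). The standard struct module is used only as a cross-check.

```python
import math
import struct

def to_excess_127(x):
    """Build the 32-bit pattern S + E + M for x (hypothetical helper)."""
    if x == 0 or math.isinf(x) or math.isnan(x):
        raise ValueError("only nonzero finite numbers are handled in this sketch")
    s = "1" if x < 0 else "0"              # step a: sign
    m, e = math.frexp(abs(x))              # |x| = m * 2**e with 0.5 <= m < 1
    shifter = e - 1                        # exponent of the 1.yyyyy form
    if not 1 <= shifter + 127 <= 254:
        raise ValueError("exponent outside the normalized single-precision range")
    e_bits = format(shifter + 127, "08b")  # step d: E = shifter + 127
    fraction = m * 2 - 1                   # digits to the right of the binary point
    m_bits = format(int(fraction * 2 ** 23), "023b")  # M padded to 23 bits
    return s + e_bits + m_bits             # step e: concatenate S, E, M

def reference_bits(x):
    """Bit pattern produced by Python's own single-precision packing."""
    return format(struct.unpack(">I", struct.pack(">f", x))[0], "032b")

for x in (5.75, -161.875, -0.0234375):
    bits = to_excess_127(x)
    # Print the value, its pattern, and whether it agrees with struct.
    print(x, bits, bits == reference_bits(x))
```

Each printed line should end in True, and the patterns should match step e of the three examples.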

Retrieving numbers stored in IEEE standard floating point format

Find the values of S, E and M. If S=0, set the sign to positive; otherwise, set the sign to negative. Then find the shifter (E-127). Denormalize the mantissa (restore the leading 1 and move the binary point by the shifter) to get the binary number, and convert that binary number to decimal to find the absolute value. Last, add the sign.

example 3.26

The bit pattern (11001010000000000111000100001111)₂ is stored in Excess_127 format. Show its value in decimal notation.

a. Divide it into S, E and M.

S=1, E=10010100, M=00000000111000100001111

b. The sign is negative.

c. The shifter=E-127=148-127=21.

d. Denormalization=(1.00000000111000100001111)₂*2²¹.

e. The binary number=(1000000001110001000011.11)₂.

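f. The absolute value=2104387.75 (2²¹=2097152, plus 4096+2048+1024+64+2+1=7235 from the remaining integer bits, plus 0.75 from the fraction bits).

g. The number=-2104387.75.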
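
The retrieval steps can be checked with a short Python sketch. It is only an illustration: the function name from_excess_127 is invented here, it handles normalized numbers only, and it uses the standard fractions module so that the arithmetic is exact.

```python
from fractions import Fraction

def from_excess_127(bits):
    """Decode a 32-bit pattern of a normalized number (hypothetical helper)."""
    if len(bits) != 32:
        raise ValueError("expected exactly 32 bits")
    s, e_bits, m_bits = bits[0], bits[1:9], bits[9:]   # split into S, E, M
    shifter = int(e_bits, 2) - 127                     # shifter = E - 127
    # Denormalize: restore the leading 1, then apply the shifter.
    significand = 1 + Fraction(int(m_bits, 2), 2 ** 23)
    value = significand * Fraction(2) ** shifter
    return -value if s == "1" else value

# S, E and M of example 3.26, concatenated.
pattern = "1" + "10010100" + "00000000111000100001111"
print(float(from_excess_127(pattern)))  # -2104387.75
```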

Overflow and Underflow

This representation cannot store numbers with very small or very large absolute values. An attempt to store a number with a very small absolute value results in an underflow condition, while an attempt to store a number with a very large absolute value results in an overflow condition. We leave the calculation of boundary values (+largest, -largest, +smallest, -smallest) as problems.
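
Both conditions can be observed from Python by forcing a value into 32 bits with the standard struct module. This is a small sketch: the helper name store_single is made up for this example, and the behavior shown (an exception on overflow, a silent zero on underflow) is how struct reacts, not something the notes above specify.

```python
import struct

def store_single(x):
    """Round-trip x through a 32-bit float (hypothetical helper)."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# Absolute value too small for single precision: it underflows to zero.
print(store_single(1e-50))   # 0.0

# Absolute value too large for single precision: struct raises OverflowError.
try:
    store_single(1e39)
except OverflowError as err:
    print("overflow:", err)
```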

Logic Operations at bit level

A bit can take one of the two values, 0 or 1. If we interpret 0 as the value false and 1 as the value true, we can apply the operations defined in Boolean Algebra.
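
As a small illustration, the usual Boolean operations NOT, AND, OR and XOR can be applied to single bits in Python. The choice of these four operations is an assumption made for this sketch, since the notes above do not list them.

```python
def bit_not(a):
    """NOT on a single bit (0 becomes 1, 1 becomes 0)."""
    return 1 - a

print("a NOT")
for a in (0, 1):
    print(a, bit_not(a))

print("a b AND OR XOR")
for a in (0, 1):
    for b in (0, 1):
        # &, | and ^ are Python's bitwise AND, OR and XOR operators.
        print(a, b, a & b, a | b, a ^ b)
```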
