Check The Programming Section

Floating-point Representation

The Floating-point Standard

Over the years, a variety of floating-point representations have been used in computer systems. A standard for floating-point arithmetic was first established in 1985 as IEEE 754, and since the 1990s the most commonly encountered representations are those defined by the IEEE. A floating-point number is, in general, represented approximately to a fixed number of significant digits (the significand) and scaled using an exponent in some fixed base; the base for the scaling is normally two, ten, or sixteen. A number that can be represented exactly is of the following form:

Significand X base^exponent

where the significand, base, and exponent are integers and the base is greater than or equal to 2 (two). For example,

1.2345 = 12345 X 10^-4

where the significand, base, and exponent are 12345, 10, and -4 respectively. The term floating point means that the decimal point (in a computer, the binary point) can float, that is, it can be placed anywhere among the digits of the significand; its position is determined by the exponent. This representation can be thought of as a form of scientific notation.
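The decomposition into significand, base, and exponent can be checked with a small sketch. It uses Python's standard decimal module; the helper name decompose is only an illustration and is not part of any standard.

```python
from decimal import Decimal


def decompose(text):
    """Return (significand, base, exponent) for a decimal literal given as text."""
    sign, digits, exponent = Decimal(text).as_tuple()
    # Join the individual digits back into one integer significand.
    significand = int("".join(str(d) for d in digits))
    if sign:
        significand = -significand
    return significand, 10, exponent


significand, base, exponent = decompose("1.2345")
print(significand, base, exponent)  # 12345 10 -4
```

Multiplying the significand by the base raised to the exponent gives back the original number.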

Binary Representation of a Single-Precision Floating-Point Number

Consider the fractional decimal number 9.2345. The steps below convert this number to a binary number.

Steps to convert the floating-point number to binary

First, convert the integer part of the decimal number to its equivalent binary number.
Then convert the fractional part to its equivalent binary number, up to the desired precision. The number 9.2345 has 4 digits after the decimal point.

1. Conversion of the integer part to binary

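Start by dividing the number by 2 and saving the remainder.
Keep dividing the quotient by 2 until it becomes zero (0).

For 9, the divisions give 9 / 2 = 4 remainder 1, 4 / 2 = 2 remainder 0, 2 / 2 = 1 remainder 0, and 1 / 2 = 0 remainder 1. Reading the remainders from bottom to top (last to first), the binary equivalent of the integer part is 1001.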
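The repeated-division steps can be sketched in Python as follows. This is a minimal illustration for non-negative integers; the function name int_to_binary is an assumption made for this example.

```python
def int_to_binary(n):
    """Convert a non-negative integer to a binary string by repeated division by 2."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, remainder = divmod(n, 2)  # quotient carries on, remainder is saved
        remainders.append(str(remainder))
    # The remainders are read from last to first (bottom to top).
    return "".join(reversed(remainders))


print(int_to_binary(9))  # 1001
```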

2. Conversion of the fractional part to binary

Start by multiplying the fractional part by 2.
The integral part of the result is the first digit of the binary fraction.
Repeat the process on the remaining fractional part until the desired precision is reached.
A single-precision floating-point number is accurate to about 6 significant decimal digits; this example stops after 6 binary digits.

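For 0.2345, the successive products are 0.469, 0.938, 1.876, 1.752, 1.504, and 1.008, whose integral parts are 0, 0, 1, 1, 1, and 1. Reading these integral parts from top to bottom (first to last), the binary equivalent of the fractional part is 001111. The complete binary number is 1001.001111, which is written in scientific form as 1.001001111 X 2³.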
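The repeated-multiplication steps can be sketched in the same way. The example below uses Python's fractions module so that 0.2345 is held exactly rather than as a rounded binary value; the function name fraction_to_binary and its bits parameter are assumptions made for this illustration.

```python
from fractions import Fraction


def fraction_to_binary(text, bits):
    """Convert a decimal fraction in [0, 1), given as text, to `bits` binary digits."""
    frac = Fraction(text)
    digits = []
    for _ in range(bits):
        frac *= 2
        bit = int(frac)  # integral part of the product, either 0 or 1
        digits.append(str(bit))
        frac -= bit      # keep only the fractional part for the next step
    return "".join(digits)


print(fraction_to_binary("0.2345", 6))  # 001111
```

Because the expansion is cut off after 6 digits, 1001.001111 is exactly 9.234375, a close approximation of 9.2345 rather than the number itself.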

Scientific Representation of Single Floating-Point Number in Binary

A single-precision floating-point number takes 32 bits, and its scientific representation has three parts:

A sign bit, which is 0 for a positive number and 1 for a negative number
An exponent field of 8 bits
A mantissa field of 23 bits

Sign Field

The number 9.2345 is positive, so the sign field is zero (0).

Exponent Field

The second part of the scientific form is the exponent. The exponent field has 8 bits, so the largest value it can hold is 255. How, then, is a negative exponent represented? To cover negative exponents, the stored value is 127 greater than the real exponent e of the term 2^e in the scientific form. The stored exponent is therefore 127 + e, which in this case is 127 + 3 = 130. Converting 130 to binary gives 10000010.

Mantissa Field

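The mantissa field is also called the significand. It is 23 bits long. In this example, 1.001001111 X 2³, the mantissa is the part to the left of the exponent term, and only the digits after the binary point are stored: 001001111, padded on the right with zeros to fill 23 bits. The leading 1 is implicit and is not stored.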
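The three fields can be assembled and checked with a short sketch. It assumes the 6-digit binary fraction worked out in this example, so the bit pattern encodes 9.234375 (the value of 1001.001111), not 9.2345 itself; the variable names are illustrative only.

```python
import struct

sign = 0                      # positive number
exponent = 3                  # from 1.001001111 X 2^3
fraction_bits = "001001111"   # digits after the binary point

exponent_field = format(exponent + 127, "08b")   # biased exponent, 8 bits
mantissa_field = fraction_bits.ljust(23, "0")    # pad with zeros to 23 bits
bit_pattern = f"{sign}{exponent_field}{mantissa_field}"

# Reinterpret the 32 bits as a big-endian IEEE 754 single-precision value.
value = struct.unpack(">f", int(bit_pattern, 2).to_bytes(4, "big"))[0]

print(exponent_field)  # 10000010
print(mantissa_field)  # 00100111100000000000000
print(value)           # 9.234375
```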

Scientific Representation of Double Floating-Point Number in Binary

Instead of 32 bits, a double-precision floating-point number takes 64 bits in memory, with the following field lengths for sign, exponent, and mantissa:

As in single precision, the sign field takes 1 bit only.
The exponent field is 11 bits long instead of 8.
The mantissa field takes the remaining 52 bits.
The exponent field stores a value 1023 greater than the true exponent e, so to get its binary form we add 1023 to e and convert the result to binary.
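The 64-bit layout can be inspected with a similar sketch based on Python's struct module. The helper name double_fields is an assumption made for this illustration, and the input 9.234375 is the value of the binary number 1001.001111 from the single-precision example.

```python
import struct


def double_fields(x):
    """Split a Python float (IEEE 754 double) into sign, exponent, and mantissa bit strings."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))  # the 64 bits as an unsigned integer
    bits = format(raw, "064b")
    return bits[0], bits[1:12], bits[12:]  # 1 + 11 + 52 bits


sign, exponent_field, mantissa_field = double_fields(9.234375)
print(sign)                              # 0
print(exponent_field)                    # 10000000010
print(int(exponent_field, 2) - 1023)     # 3
print(mantissa_field[:9])                # 001001111
```

Subtracting 1023 from the stored exponent recovers the true exponent 3, and the first mantissa bits match the single-precision case.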
