Over the years, a variety of floating-point representations have been used in computer systems. A standard for floating-point arithmetic was first established in 1985 as IEEE 754, and since the 1990s the most commonly encountered representations are those defined by the IEEE. A floating-point number is, in general, represented approximately to a fixed number of significant digits (the significand) and scaled using an exponent in some fixed base; the base for the scaling is normally two, ten, or sixteen. A number that can be represented exactly is of the following form:
$\text{significand} \times \text{base}^{\text{exponent}}$
where the significand, base, and exponent are integers and the base is greater than or equal to 2 (two). For example,
$1.2345 = 12345 \times 10^{-4}$
where the significand, base, and exponent are 12345, 10, and -4 respectively. The term floating refers to the fact that the decimal point (in a computer, the binary point) can float, that is, it can be placed anywhere among the digits of the significand; its position is determined by the exponent. This type of representation can be thought of as scientific notation.
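The significand, base, and exponent decomposition can be checked with a minimal sketch using Python's standard `decimal` module. The helper name `decompose` is hypothetical and only illustrates the form described above for base 10.

```python
from decimal import Decimal

def decompose(text):
    """Return (significand, base, exponent) for a decimal literal given as a string."""
    sign, digits, exponent = Decimal(text).as_tuple()
    # Join the stored decimal digits into one integer significand.
    significand = int("".join(map(str, digits)))
    if sign:
        significand = -significand
    return significand, 10, exponent

significand, base, exponent = decompose("1.2345")
print(significand, base, exponent)  # 12345 10 -4
```

Passing the number as a string keeps the digits exactly as written, so the result is not affected by binary rounding.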
Consider the fractional decimal number 9.2345. The following steps convert it to a binary number.
First, convert the integer part to its binary equivalent.
Then, convert the fractional part to its binary equivalent, up to the required precision. For the number 9.2345, the precision is 4 decimal places.
The integer part is converted by dividing the number by 2 and saving the remainder.
Keep dividing the quotient by 2 until it becomes zero (0).
Reading the remainders from bottom to top (last to first), the binary equivalent of the integer part 9 is 1001.
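The repeated-division steps for the integer part can be sketched as follows. The function name `int_to_binary` is an illustrative choice, not something defined in the text.

```python
def int_to_binary(n):
    """Convert a non-negative integer to a binary string by repeated division by 2."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, remainder = divmod(n, 2)  # keep the quotient, save the remainder
        remainders.append(str(remainder))
    # The last remainder is the most significant bit, so read them in reverse.
    return "".join(reversed(remainders))

print(int_to_binary(9))  # 1001
```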
The fractional part is converted by multiplying it by 2.
The integral part of the product is the first digit of the binary fraction.
Repeat the process on the remaining fractional part until the desired precision is reached.
A single-precision floating-point number is accurate to about 6 decimal digits; this example stops after 6 binary digits.
So the binary equivalent of the fractional part is 001111, obtained by reading the integral parts from top to bottom. The complete binary number is 1001.001111, which in scientific form is $1.001001111 \times 2^{3}$.
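The repeated-multiplication steps and the final normalization can be sketched as below. This is a minimal illustration: the helper `frac_to_binary` is a hypothetical name, and `fractions.Fraction` is used so that 0.2345 is handled exactly instead of as an already-rounded binary float.

```python
from fractions import Fraction

def frac_to_binary(frac, places):
    """Convert a fraction in [0, 1) to `places` binary digits by repeated doubling."""
    bits = []
    for _ in range(places):
        frac *= 2
        bit = int(frac)        # integral part of the product: 0 or 1
        bits.append(str(bit))
        frac -= bit            # continue with the remaining fractional part
    return "".join(bits)

int_bits = "1001"                                   # integer part of 9.2345
frac_bits = frac_to_binary(Fraction("0.2345"), 6)   # fractional part, 6 places
print(int_bits + "." + frac_bits)                   # 1001.001111

# Normalize: move the binary point so that one digit remains in front of it.
exponent = len(int_bits) - 1
normalized = int_bits[0] + "." + int_bits[1:] + frac_bits
print(normalized, "x 2^", exponent)                 # 1.001001111 x 2^ 3
```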
A single-precision floating-point number takes 32 bits, and its scientific representation has three parts, as follows:
A sign bit, which marks the number as positive (0) or negative (1)
An exponent part comprising 8 bits
A mantissa part comprising 23 bits
The number 9.2345 above is positive, so the sign field is zero (0).
The second part of the scientific form is the exponent. The exponent field has 8 bits, so the largest value it can hold is 255. How, then, is a negative exponent represented? To cover negative exponents, the stored value is 127 greater than the real exponent e of the term $2^{e}$ in the scientific form. The stored exponent is therefore 127 + e, which in this case is 127 + 3 = 130. Converting 130 to binary gives 10000010.
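The mantissa field is also called the significand. It is 23 bits long. In this example, $1.001001111 \times 2^{3}$, the mantissa is taken from the part to the left of the exponent term, using only the digits after the binary point: 001001111, padded with trailing zeros to fill 23 bits. The leading 1 before the binary point is implicit and is not stored.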
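Putting the three fields together, the 32-bit pattern for this worked example can be assembled and decoded with Python's standard `struct` module. This is a sketch of the hand calculation only; the variable names are illustrative. Because the fraction was cut off after six binary places, the pattern decodes to 9.234375 rather than exactly 9.2345.

```python
import struct

sign_field = "0"                              # positive number
exponent_field = format(127 + 3, "08b")       # biased exponent 130 -> 10000010
mantissa_field = "001001111".ljust(23, "0")   # digits after the point, zero-padded

bits = sign_field + exponent_field + mantissa_field
print(bits)  # 01000001000100111100000000000000

# Reinterpret the 32-bit pattern as a big-endian IEEE 754 single-precision float.
value = struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))[0]
print(value)  # 9.234375
```

Keeping more binary places for the fractional part before padding would bring the decoded value closer to 9.2345.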
Instead of 32 bits, a double-precision floating-point number has a 64-bit memory representation, with the following field lengths for sign, exponent, and mantissa:
As in single precision, the sign field takes 1 bit only.
The exponent field is 11 bits long instead of 8.
The mantissa field takes the remaining 52 bits.
The exponent field is biased by 1023 rather than 127: to get the stored exponent, add 1023 to the true exponent e, then convert the result to binary.
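The double-precision layout can be inspected in the same way. The sketch below assumes Python's `float` is an IEEE 754 double, which holds on common platforms; `double_fields` is a hypothetical helper name.

```python
import struct

def double_fields(x):
    """Split a float into its 1-bit sign, 11-bit exponent and 52-bit mantissa fields."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))  # 64-bit pattern as an integer
    sign = raw >> 63
    exponent_field = (raw >> 52) & 0x7FF       # 11 bits, biased by 1023
    mantissa_field = raw & ((1 << 52) - 1)     # low 52 bits
    return sign, exponent_field, mantissa_field

sign, exponent_field, mantissa_field = double_fields(9.2345)
print(sign, exponent_field, exponent_field - 1023)  # 0 1026 3
```

Subtracting 1023 from the stored exponent recovers the true exponent 3, matching the single-precision example.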