Lesson 1

Floating Point Binary And Normalisation

This presentation goes over the ideas of the SIGN bit, EXPONENT bits and MANTISSA bits. It describes 32 bit floating point and 64 bit floating point numbers.

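In C++ we can declare number variables as INT (integer), FLOAT or DOUBLE. On most current platforms FLOAT and INT take up 32 bits and DOUBLE takes up 64 bits. FLOAT is short for floating point number. Because some of its bits are used for the EXPONENT, the RANGE of values that a FLOAT can hold is far larger than that of INT (up to about 3.4 x 10^38, against about 2.1 x 10^9), and it can hold non-integer numbers whereas INT is only whole numbers. The trade-off is PRECISION: a FLOAT has only 24 bits for the significant digits, so it cannot hold every whole number that a 32 bit INT can.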

Other languages have similar variable declarations. .NET also includes DECIMAL (128 bits), which is more ACCURATE at holding decimal fractions and is therefore better suited to currency calculations. (SINGLE is used in some languages instead of FLOAT.)

Take a read of this article.

Inside the modern computer there is an FPU (Floating Point Unit) that is designed to handle floating point numbers separately from integers. (DECIMAL is handled as an integer that has been scaled, e.g. 1.23 can be stored as 123 with a scale of 100, both integer values. DECIMAL is not handled by the FPU. Some embedded computing devices do not have an FPU at all.)

So how do we work out a floating point number?


CHECK OUT 'numericalmethodsguy' on YouTube. Look up 'Floating point representation'.

Assignments on EDMODO.

Online decimal-converter
