Floating-point representation is used to handle real numbers (i.e., numbers that can have fractional parts) in computer systems. Unlike integers, which are stored as whole numbers in plain binary, floating-point numbers have a more complex representation that allows for a much wider range of values, including very small and very large numbers.
Understanding how floating-point numbers are represented and manipulated is crucial for programming, especially in fields that rely on heavy mathematical computation, such as scientific computing, graphics, and engineering simulations.
In the context of computer organization and assembly language, floating-point numbers are stored in a specific format, and operations such as addition, subtraction, multiplication, and division are handled by the CPU in a specialized way.
A floating-point number is represented in three main components: a sign, an exponent, and a mantissa (also called the significand).
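In scientific notation, a floating-point number looks like this:
$$(-1)^{\text{sign}} \times \text{mantissa} \times \text{base}^{\text{exponent}}$$
Where the sign selects a positive or negative value, the mantissa holds the significant digits, and the exponent scales the value by a power of the base (10 in decimal, 2 in binary).
For example, $1.234 \times 10^{3} = 1234$. Here, the mantissa is 1.234 and the exponent is 3.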
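As a quick check of the decimal case, Python's scientific-notation formatting exposes the same mantissa/exponent split. This is only a sketch: the helper name `sci_parts` and the input 1234.0 are assumptions chosen to match a mantissa of 1.234 and an exponent of 3.

```python
def sci_parts(x, digits=3):
    """Split x into a decimal (mantissa, exponent) pair via scientific-notation formatting."""
    text = format(x, f".{digits}e")        # e.g. '1.234e+03'
    mantissa, exponent = text.split("e")   # '1.234' and '+03'
    return float(mantissa), int(exponent)

print(sci_parts(1234.0))  # (1.234, 3)
```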
The IEEE 754 standard is the most commonly used standard for representing floating-point numbers in computers. It defines formats for both single precision and double precision floating-point numbers. These formats specify how to divide a floating-point number into its sign, exponent, and mantissa.
In single precision (32-bit format), the floating-point number is represented as follows:
Structure:
| Sign (1 bit) | Exponent (8 bits) | Mantissa (23 bits) |
|---|---|---|
| 0 = positive, 1 = negative | Stored with a bias of 127 | Fraction bits after the implicit leading 1 |
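The layout above can be inspected directly. The following is a minimal sketch, assuming Python's standard `struct` module is used to reinterpret the bytes of a 32-bit float; the function name `float32_fields` is illustrative, not part of any library.

```python
import struct

def float32_fields(x):
    """Split x (rounded to single precision) into its sign, exponent, and mantissa fields."""
    # Pack as a big-endian 32-bit float, then reinterpret the same bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # 23 bits, the fraction after the implicit leading 1
    return sign, exponent, mantissa

# -6.5 is -1.101 (binary) x 2^2, so the sign is 1 and the unbiased exponent is 2.
sign, exponent, mantissa = float32_fields(-6.5)
print(sign, exponent - 127, format(mantissa, "023b"))
```

Subtracting 127 from the stored exponent recovers the true power of two.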
In double precision (64-bit format), the floating-point number is represented as:
Structure:
| Sign (1 bit) | Exponent (11 bits) | Mantissa (52 bits) |
|---|---|---|
| 0 = positive, 1 = negative | Stored with a bias of 1023 | Fraction bits after the implicit leading 1 |
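The same idea applies to the 64-bit format. The sketch below (helper names `float64_fields` and `rebuild_normal` are assumptions for illustration) extracts the three fields of a double and rebuilds the value from them, which works for normalized numbers only.

```python
import struct

def float64_fields(x):
    """Split a double into its sign (1 bit), exponent (11 bits), and mantissa (52 bits)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF      # stored with a bias of 1023
    mantissa = bits & ((1 << 52) - 1)    # fraction after the implicit leading 1
    return sign, exponent, mantissa

def rebuild_normal(sign, exponent, mantissa):
    """Rebuild a normalized double: (-1)^sign * (1 + mantissa / 2^52) * 2^(exponent - 1023)."""
    return (-1) ** sign * (1 + mantissa / 2**52) * 2.0 ** (exponent - 1023)

fields = float64_fields(0.1)
print(fields)
print(rebuild_normal(*fields) == 0.1)  # True
```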
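Floating-point numbers are typically normalized, meaning the mantissa is scaled so that its leading bit is 1. For example, in binary, $101.1$ is normalized to $1.011 \times 2^{2}$.
Normalization ensures that the number is represented with the most precision possible. Because the leading bit of a normalized number is always 1, IEEE 754 does not store it, which frees one extra bit for the fraction.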
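Normalization can be sketched with `math.frexp`, which returns a fraction in the range [0.5, 1); rescaling it by 2 gives the IEEE-style form with a leading 1. The helper name `normalize` and the sample value 5.5 (binary 101.1) are assumptions for illustration.

```python
import math

def normalize(x):
    """Return (significand, exponent) with x == significand * 2**exponent and 1 <= |significand| < 2."""
    if x == 0 or math.isinf(x) or math.isnan(x):
        raise ValueError("only finite, non-zero values can be normalized")
    m, e = math.frexp(x)   # frexp gives x == m * 2**e with 0.5 <= |m| < 1
    return m * 2, e - 1    # rescale so the leading binary digit is 1

print(normalize(5.5))  # (1.375, 2), i.e. binary 1.011 x 2^2
```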
However, numbers very close to zero can be too small to normalize, because the exponent cannot go below its minimum value. In these cases, denormalized (also called subnormal) representation is used: the exponent field is set to all zeros and the implicit leading bit of the mantissa is 0 instead of 1. This allows the system to represent numbers smaller than the smallest normalized value, at the cost of fewer significant bits.
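The boundary between normalized and subnormal values can be observed for 64-bit doubles. This is a small sketch using `sys.float_info.min` (the smallest positive normalized double); the helper `exponent_field` is a hypothetical name.

```python
import struct
import sys

def exponent_field(x):
    """Return the raw 11-bit exponent field of a double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return (bits >> 52) & 0x7FF

smallest_normal = sys.float_info.min   # 2**-1022
smallest_subnormal = 5e-324            # 2**-1074, the smallest positive double

print(exponent_field(smallest_normal))      # 1: smallest exponent field of a normalized number
print(exponent_field(smallest_normal / 2))  # 0: the result is subnormal but still non-zero
print(smallest_subnormal / 2)               # 0.0: nothing smaller can be represented
```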
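IEEE 754 also defines representations for certain special cases: zero (exponent and mantissa all zeros, with the sign bit giving +0 or -0), infinity (exponent all ones, mantissa zero), and NaN, "not a number" (exponent all ones, mantissa non-zero).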
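A short sketch for inspecting the bit patterns of the special values (signed zero, infinity, and NaN) in double precision. The helper name `bits64` is an assumption, and the exact mantissa bits of a NaN can vary by platform, so only its general shape is checked.

```python
import math
import struct

def bits64(x):
    """Return the 64-bit pattern of a double as a 16-digit hex string."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return format(bits, "016x")

print(bits64(0.0))         # 0000000000000000
print(bits64(-0.0))        # 8000000000000000 (only the sign bit set)
print(bits64(math.inf))    # 7ff0000000000000 (exponent all ones, mantissa zero)
print(bits64(-math.inf))   # fff0000000000000
print(bits64(math.nan))    # exponent all ones, mantissa non-zero
print(math.nan == math.nan)  # False: NaN compares unequal even to itself
```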
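Once floating-point numbers are represented in binary, the CPU uses specialized hardware to perform arithmetic operations like addition, subtraction, multiplication, and division. For addition and subtraction, the operations typically follow these steps: align the exponents by shifting the mantissa of the smaller operand, add or subtract the mantissas, normalize the result, and round it to fit the available mantissa bits.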
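The usual steps for addition (align the exponents, add the mantissas, normalize the result) can be sketched with a toy format. Everything here is an assumption for illustration: the 8-bit `PRECISION`, the `(mantissa, exponent)` pair meaning `mantissa * 2**exponent`, positive operands only, and truncation in place of the guard bits and rounding that real hardware uses.

```python
PRECISION = 8  # hypothetical toy mantissa width in bits

def toy_add(a, b):
    """Add two positive toy floats given as (mantissa, exponent) pairs."""
    (ma, ea), (mb, eb) = a, b
    # Step 1: align exponents by shifting the operand with the smaller exponent right.
    if ea < eb:
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    mb >>= ea - eb            # bits shifted out are lost (truncation, for simplicity)
    # Step 2: add the aligned mantissas.
    m, e = ma + mb, ea
    # Step 3: normalize so the mantissa fits in PRECISION bits again.
    while m >= (1 << PRECISION):
        m >>= 1
        e += 1
    return m, e

# 192 + 8.0625: the low bits of the smaller operand are shifted out during alignment.
print(toy_add((0b11000000, 0), (0b10000001, -4)))  # (200, 0), i.e. 200 instead of 200.0625
```

The lost 0.0625 in the example is the kind of error that the rounding step in real hardware is designed to control.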
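In assembly language, floating-point operations are performed with dedicated instructions that execute on a floating-point unit (FPU) rather than on the integer ALU.
For example, in x86 assembly, the x87 FPU keeps its operands on a register stack (st(0) to st(7)) and provides instructions such as FLD (load), FADD, FSUB, FMUL, FDIV, and FSTP (store and pop).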
Here’s a simple example of adding two floating-point numbers in x86 assembly:
FLD DWORD PTR [num1] ; Push num1 (assumed 32-bit) onto the FPU stack; it becomes st(0)
FLD DWORD PTR [num2] ; Push num2; it becomes st(0) and num1 moves to st(1)
FADDP st(1), st(0) ; st(1) = st(1) + st(0), then pop, leaving the sum in st(0)
FSTP DWORD PTR [result] ; Store st(0) into result and pop the stack
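Floating-point numbers have finite precision, so a result that does not fit in the available mantissa bits must be rounded, and these rounding errors can accumulate over a calculation. IEEE 754 defines several rounding modes: round to nearest with ties to even (the default), round toward zero, round toward positive infinity, and round toward negative infinity.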
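The effect of finite precision and the behavior of the rounding directions can be illustrated as follows. Python's standard library does not expose the hardware rounding mode for binary floats, so this sketch uses the `decimal` module's rounding constants as a decimal analogue of the IEEE 754 modes; the `MODES` table and `round_all` helper are assumptions for illustration.

```python
from decimal import Decimal, ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR, ROUND_HALF_EVEN

# 0.1 has no exact binary representation, so the stored value is slightly off.
print(0.1 + 0.2 == 0.3)   # False
print(Decimal(0.1))       # the exact value actually stored for 0.1

# Decimal analogues of the four IEEE 754 rounding directions.
MODES = {
    "nearest, ties to even": ROUND_HALF_EVEN,
    "toward zero": ROUND_DOWN,
    "toward +infinity": ROUND_CEILING,
    "toward -infinity": ROUND_FLOOR,
}

def round_all(text):
    """Round a decimal string to an integer under each rounding direction."""
    value = Decimal(text)
    return {name: int(value.quantize(Decimal("1"), rounding=mode))
            for name, mode in MODES.items()}

print(round_all("2.5"))
print(round_all("-2.5"))
```

The tie cases 2.5 and -2.5 are useful inputs because they are exactly where the four directions disagree.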
Floating-point representation in computers allows for the efficient handling of real numbers, enabling complex calculations in scientific and engineering applications. The IEEE 754 standard ensures that floating-point arithmetic is consistent across platforms, although precision issues and rounding errors are inevitable due to finite storage. Understanding how floating-point numbers are represented and manipulated at the hardware level is essential for low-level programming and assembly language operations.