Computer Organization and Assembly Language, Topic 14 of 35

Floating Point

Floating Point Representation in Computer Organization and Assembly Language

Floating-point representation is used to approximate real numbers (i.e., numbers that can have a fractional part) in computer systems. Unlike integers, which store whole values exactly within a fixed range, floating-point numbers use a more complex encoding that covers a much wider range of magnitudes, from very small to very large, at the cost of limited precision.

Understanding how floating-point numbers are represented and manipulated is crucial for programming, especially in fields that rely on heavy numerical computation, such as scientific computing, graphics, and engineering simulations.

In the context of computer organization and assembly language, floating-point numbers are stored in a specific bit-level format, and operations such as addition, subtraction, multiplication, and division are carried out by dedicated floating-point hardware and instructions rather than by the integer ALU.

1. Components of a Floating-Point Number

A floating-point number is represented in three main components:

Sign (S): Indicates whether the number is positive (0) or negative (1).
Exponent (E): Determines the magnitude of the number. It effectively shifts the binary point of the mantissa left or right.
Mantissa (or Fraction) (M): The significant digits of the number, also called the significand.

Written in a form similar to scientific notation, the value of a floating-point number is:

$$N = (-1)^S \times M \times 2^{E - Bias}$$

Where:

$S$ is the sign bit (0 for positive, 1 for negative).
$M$ is the mantissa or fraction.
$E$ is the exponent.
$Bias$ is a constant value that allows for both positive and negative exponents (depending on the format being used).

Example of Scientific Notation:

$$1.234 \times 10^3$$

Here, the mantissa is 1.234 and the exponent is 3.
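
A binary analogue of this example, as a minimal sketch: Python's standard `math.frexp` splits a value into a fraction and a power of two, and the helper `binary_scientific` (a name made up for this illustration, not part of the course material) rescales the result so the leading digit is 1.

```python
import math

def binary_scientific(x):
    """Return (mantissa, exponent) with x == mantissa * 2**exponent
    and 1 <= |mantissa| < 2 (hypothetical helper for illustration)."""
    if x == 0:
        return 0.0, 0
    m, e = math.frexp(x)   # frexp gives 0.5 <= |m| < 1
    return m * 2, e - 1    # rescale so the leading binary digit is 1

mantissa, exponent = binary_scientific(1234.0)
print(mantissa, exponent)  # 1.205078125 10
```

So 1234 can be written as 1.205078125 × 2^10, just as it can be written as 1.234 × 10^3.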

2. IEEE 754 Standard for Floating-Point Representation

The IEEE 754 standard is the most commonly used standard for representing floating-point numbers in computers. It defines formats for both single precision and double precision floating-point numbers. These formats specify how to divide a floating-point number into its sign, exponent, and mantissa.

a. Single Precision (32-bit)

In single precision (32-bit format), the floating-point number is represented as follows:

1 bit for the sign (S)
8 bits for the exponent (E)
23 bits for the mantissa (M)

Structure:

Field	Sign	Exponent	Mantissa
Width	1 bit	8 bits	23 bits
Bit positions	31	30-23	22-0

Formula:

$$N = (-1)^S \times 1.M \times 2^{E - 127}$$

The bias for single precision is 127: the exponent field stores the true exponent plus 127, so a stored value of 127 means $2^0$.
The mantissa has an implicit leading 1 for normalized numbers, so only the 23 fraction bits are stored, giving 24 bits of effective precision.
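
To make the single-precision layout concrete, the following sketch uses Python's standard `struct` module to pull the three fields out of a value and then rebuilds the value from the formula. The function names `decode_float32` and `encode_normal` are assumptions made for this example, and the reconstruction covers normalized numbers only.

```python
import struct

def decode_float32(x):
    """Split x (rounded to single precision) into
    (sign, exponent field, fraction field)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction

def encode_normal(sign, exponent, fraction):
    """Apply N = (-1)^S * 1.M * 2^(E - 127); normalized numbers only."""
    return (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)

s, e, m = decode_float32(-6.25)
print(s, e, format(m, "023b"))  # 1 129 10010000000000000000000
print(encode_normal(s, e, m))   # -6.25
```

Here -6.25 is -1.1001 × 2^2 in binary, so the stored exponent is 2 + 127 = 129 and the fraction field holds 1001 followed by zeros.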

b. Double Precision (64-bit)

In double precision (64-bit format), the floating-point number is represented as:

1 bit for the sign (S)
11 bits for the exponent (E)
52 bits for the mantissa (M)

Structure:

Field	Sign	Exponent	Mantissa
Width	1 bit	11 bits	52 bits
Bit positions	63	62-52	51-0

Formula:

$$N = (-1)^S \times 1.M \times 2^{E - 1023}$$

The bias for double precision is 1023 (i.e., the exponent is stored with an offset of 1023).
Like single precision, the mantissa has an implicit leading 1, giving 53 bits of effective precision.
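
The same field extraction works for double precision. This is a small sketch using Python's `struct` module (Python floats are IEEE 754 doubles on common platforms); `decode_float64` is a name chosen for this example.

```python
import struct

def decode_float64(x):
    """Split a double into (sign, exponent field, fraction field)."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return sign, exponent, fraction

s, e, m = decode_float64(0.1)
print(s, e - 1023, hex(m))  # 0 -4 0x999999999999a
```

The repeating pattern in the fraction field shows that 0.1 has no finite binary expansion, so the stored value is only the nearest representable double.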

3. Normalization and Denormalization

Floating-point numbers are typically normalized, meaning the mantissa is scaled such that the leading bit is 1. For example, in binary:

$$1.0101_2 \times 2^3$$

Normalization ensures that the number is represented with the most precision possible.

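However, normalized numbers cannot get arbitrarily close to zero: the smallest normalized magnitude is $2^{1 - Bias}$ (for example, $2^{-126}$ in single precision). To fill the gap between zero and that value, denormalized (also called subnormal) numbers are used, where the exponent field is all 0s, the effective exponent is fixed at the minimum value ($1 - Bias$), and the leading bit of the mantissa is 0 instead of 1. This allows the system to represent numbers smaller than any normalized number, with fewer significant bits as values approach zero (gradual underflow).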
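
As a quick check of where normalized numbers stop and subnormal numbers begin in single precision, this sketch reinterprets a few bit patterns as floats. The helper `bits_to_float32` is a name made up for the example.

```python
import struct

def bits_to_float32(bits):
    """Reinterpret a 32-bit pattern as a single-precision value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

smallest_normal = bits_to_float32(0x00800000)     # exponent field 1, mantissa 0
largest_subnormal = bits_to_float32(0x007FFFFF)   # exponent field 0, mantissa all 1s
smallest_subnormal = bits_to_float32(0x00000001)  # exponent field 0, mantissa 1

print(smallest_normal == 2.0 ** -126)        # True
print(smallest_subnormal == 2.0 ** -149)     # True
print(largest_subnormal < smallest_normal)   # True
```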

4. Handling Special Cases

IEEE 754 also defines representations for certain special cases:

Zero: Represented by an exponent of all 0s and a mantissa of all 0s. The sign bit determines if it's +0 or -0.
Infinity: Represented by an exponent of all 1s and a mantissa of all 0s. The sign bit determines if it's positive or negative infinity.
NaN (Not a Number): Represented by an exponent of all 1s and a non-zero mantissa. NaN is used for undefined or unrepresentable results (e.g., 0/0 or the square root of a negative number).
Subnormal Numbers: Represent very small values that are too small to be normalized. These numbers are represented with an exponent of all 0s and a non-zero mantissa.
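
The rules above reduce to a few comparisons on the exponent and mantissa fields. A minimal sketch for single-precision bit patterns follows; the function name `classify_float32` and the returned labels are choices made for this example.

```python
def classify_float32(bits):
    """Classify a 32-bit IEEE 754 pattern by its exponent and mantissa fields."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0:
        return "zero" if fraction == 0 else "subnormal"
    if exponent == 0xFF:
        return "infinity" if fraction == 0 else "nan"
    return "normal"

for bits in (0x00000000, 0x80000000, 0x00000001, 0x3F800000, 0x7F800000, 0x7FC00000):
    sign = "-" if bits >> 31 else "+"
    print(f"{bits:#010x} {sign} {classify_float32(bits)}")
```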

5. Floating-Point Arithmetic Operations

Once floating-point numbers are represented in binary, the CPU uses specialized hardware to perform arithmetic operations like addition, subtraction, multiplication, and division. The operations typically follow these steps:

a. Addition and Subtraction:

Align the exponents: Shift the mantissa of the number with the smaller exponent to the right, increasing its exponent by one per shift, until both numbers have the same exponent.
Perform the addition or subtraction on the mantissas.
Normalize the result if needed, adjusting the exponent.
Handle rounding if necessary.
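
The four steps above can be sketched in software for single precision. This is an illustration under simplifying assumptions, not how any particular FPU is built: it handles only positive normalized inputs, ignores overflow and special values, and aligns by keeping every bit (Python integers are unbounded) where hardware would shift the smaller operand right and track guard, round, and sticky bits. The names `to_parts` and `add_positive` are made up for the example.

```python
import struct

def to_parts(x):
    """Return (significand, exponent) with x == significand * 2**exponent,
    where significand is the 24-bit integer 1.M (positive normal float32 only)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    e = (bits >> 23) & 0xFF
    return (1 << 23) | (bits & 0x7FFFFF), e - 127 - 23

def add_positive(a, b):
    sa, ea = to_parts(a)
    sb, eb = to_parts(b)
    # Step 1: align both significands to a common exponent.
    e = min(ea, eb)
    sa <<= ea - e
    sb <<= eb - e
    # Step 2: add the aligned significands.
    s = sa + sb
    # Step 3: normalize back to 24 significant bits.
    extra = s.bit_length() - 24
    rem = s & ((1 << extra) - 1)   # bits that will be dropped
    half = 1 << (extra - 1)
    s >>= extra
    e += extra
    # Step 4: round to nearest, ties to even.
    if rem > half or (rem == half and (s & 1)):
        s += 1
        if s == 1 << 24:           # rounding carried into a new bit
            s >>= 1
            e += 1
    return s * 2.0 ** e

print(add_positive(1.5, 2.25))         # 3.75
print(add_positive(1.0, 2.0 ** -24))   # 1.0 (an exact tie rounds to even)
```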

b. Multiplication:

Multiply the mantissas.
Add the exponents, subtracting the bias once so that the result is not biased twice.
Normalize the result and adjust the exponent.
Handle rounding and possible overflow/underflow.
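
The multiplication steps can be sketched the same way for single precision. The code below is an illustration under stated assumptions: normalized inputs and results only, with no overflow, underflow, or special-value handling. The names `fields` and `multiply` are made up for the example, and the sign is computed by XORing the two sign bits.

```python
import struct

BIAS = 127

def fields(x):
    """Return (sign, biased exponent, 24-bit significand 1.M) of a normal float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, (1 << 23) | (bits & 0x7FFFFF)

def multiply(a, b):
    sa, ea, ma = fields(a)
    sb, eb, mb = fields(b)
    sign = sa ^ sb
    prod = ma * mb                   # multiply the mantissas (47 or 48 bits)
    exp = ea + eb - BIAS             # add the exponents, removing one bias
    shift = prod.bit_length() - 24   # normalize back to 24 bits
    exp += shift - 23                # product in [2, 4) bumps the exponent
    rem = prod & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    sig = prod >> shift
    if rem > half or (rem == half and (sig & 1)):   # round to nearest even
        sig += 1
        if sig == 1 << 24:
            sig >>= 1
            exp += 1
    value = sig * 2.0 ** (exp - BIAS - 23)
    return -value if sign else value

print(multiply(1.5, 2.25))    # 3.375
print(multiply(-3.0, 0.5))    # -1.5
```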

c. Division:

Divide the mantissas.
Subtract the exponents, adding the bias back so that the result stays biased.
Normalize the result and adjust the exponent.
Handle rounding and possible overflow/underflow.
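
Division follows the same pattern. In this sketch (normalized single-precision inputs and results only; `fields` and `divide` are names made up for the example) the quotient is computed with extra bits, and a non-zero remainder acts as a sticky bit when rounding.

```python
import struct

BIAS = 127

def fields(x):
    """Return (sign, biased exponent, 24-bit significand 1.M) of a normal float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, (1 << 23) | (bits & 0x7FFFFF)

def divide(a, b):
    sa, ea, ma = fields(a)
    sb, eb, mb = fields(b)
    sign = sa ^ sb
    q, r = divmod(ma << 25, mb)      # divide the mantissas, keeping extra bits
    exp = ea - eb + BIAS             # subtract the exponents, restoring the bias
    shift = q.bit_length() - 24      # normalize: 1 or 2 extra bits
    exp += shift - 2                 # quotient below 1 lowers the exponent
    rem = q & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    sig = q >> shift
    # Round to nearest even; a non-zero remainder r means "above the tie".
    if rem > half or (rem == half and (r != 0 or (sig & 1))):
        sig += 1
        if sig == 1 << 24:
            sig >>= 1
            exp += 1
    value = sig * 2.0 ** (exp - BIAS - 23)
    return -value if sign else value

print(divide(3.375, 2.25))   # 1.5
print(divide(1.0, -4.0))     # -0.25
```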

6. Assembly and Floating-Point Operations

In assembly language, floating-point operations use dedicated instructions that execute on a floating-point unit (FPU), separate from the integer instructions.

For example, the x87 FPU instructions in x86 assembly operate on a stack of eight registers, st(0) through st(7):

FLD: Load a floating-point value into the FPU stack.
FADD: Add two floating-point numbers.
FMUL: Multiply two floating-point numbers.
FDIV: Divide two floating-point numbers.
FSUB: Subtract two floating-point numbers.

Here’s a simple example of adding two floating-point numbers in x86 assembly:

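; Assumes num1, num2, and result are 32-bit (single-precision) variables; MASM-style syntax
FLD    DWORD PTR [num1]    ; Push num1 onto the FPU stack: st(0) = num1
FLD    DWORD PTR [num2]    ; Push num2: st(0) = num2, st(1) = num1
FADDP  st(1), st(0)        ; FADD with pop: st(1) = st(1) + st(0), then pop, so st(0) = num1 + num2
FSTP   DWORD PTR [result]  ; Store st(0) into memory and pop, leaving the stack empty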

7. Precision and Rounding

Floating-point numbers have finite precision, and therefore, errors can arise in calculations due to rounding. IEEE 754 defines several rounding modes:

Round to nearest: The result is rounded to the nearest representable number; on a tie, the value whose last mantissa bit is even is chosen. This is the default mode.
Round toward zero: The result is rounded towards zero (truncated).
Round up (toward +∞): The result is rounded towards positive infinity.
Round down (toward -∞): The result is rounded towards negative infinity.
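
The difference between the four modes is easiest to see on a value that lies exactly halfway between two representable numbers. The sketch below rounds an exact rational value to a fixed number of binary fraction digits; it works on values rather than on IEEE 754 bit patterns, and the function name `round_to_bits` and the mode labels are assumptions made for this example.

```python
import math
from fractions import Fraction

def round_to_bits(x, frac_bits, mode):
    """Round x to a multiple of 2**-frac_bits under the named rounding mode."""
    scaled = Fraction(x) * 2**frac_bits
    if mode == "nearest":
        n = round(scaled)         # Python rounds Fraction ties to even
    elif mode == "toward_zero":
        n = math.trunc(scaled)
    elif mode == "up":
        n = math.ceil(scaled)
    elif mode == "down":
        n = math.floor(scaled)
    else:
        raise ValueError(mode)
    return Fraction(n, 2**frac_bits)

x = Fraction(-11, 8)   # -1.375, i.e. -1.011 in binary
for mode in ("nearest", "toward_zero", "up", "down"):
    print(mode, float(round_to_bits(x, 2, mode)))
# nearest -1.5
# toward_zero -1.25
# up -1.25
# down -1.5
```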

Conclusion

Floating-point representation in computers allows for the efficient handling of real numbers, enabling complex calculations in scientific and engineering applications. The IEEE 754 standard ensures that floating-point arithmetic is consistent across platforms, although precision issues and rounding errors are inevitable due to finite storage. Understanding how floating-point numbers are represented and manipulated at the hardware level is essential for low-level programming and assembly language operations.

