    Computer Organization and Assembly Language
    COMP2118
    Floating Point

    Floating Point Representation in Computer Organization and Assembly Language

    Floating-point representation lets computer systems handle real numbers, that is, numbers that may have a fractional part. Unlike integers, which are stored as plain binary whole numbers, floating-point numbers use a more complex encoding that covers a much wider range of magnitudes, from very small to very large, at the cost of finite precision.

    Understanding how floating-point numbers are represented and manipulated is crucial for programming, especially in fields that rely on heavy numerical computation, such as scientific computing, graphics, and engineering simulations.

    In the context of computer organization and assembly language, floating-point numbers are stored in a specific format, and operations such as addition, subtraction, multiplication, and division are handled by the CPU in a specialized way.

    1. Components of a Floating-Point Number

    A floating-point number is represented in three main components:

    • Sign (S): Indicates whether the number is positive (0) or negative (1).
    • Exponent (E): Determines the magnitude of the number. It essentially "shifts" the binary point in the number.
    • Mantissa (or Fraction) (M): The significant digits of the number.

    The scientific notation of a floating-point number looks like this:

    $N = (-1)^S \times M \times 2^{E - \text{Bias}}$

    Where:

    • $S$ is the sign bit (0 for positive, 1 for negative).
    • $M$ is the mantissa or fraction.
    • $E$ is the exponent.
    • $\text{Bias}$ is a constant value that allows for both positive and negative exponents (depending on the format being used).

    Example of Scientific Notation:

    Example: $1.234 \times 10^3$

    Here, the mantissa is 1.234 and the exponent is 3.

    2. IEEE 754 Standard for Floating-Point Representation

    The IEEE 754 standard is the most commonly used standard for representing floating-point numbers in computers. It defines formats for both single precision and double precision floating-point numbers. These formats specify how to divide a floating-point number into its sign, exponent, and mantissa.

    a. Single Precision (32-bit)

    In single precision (32-bit format), the floating-point number is represented as follows:

    • 1 bit for the sign (S)
    • 8 bits for the exponent (E)
    • 23 bits for the mantissa (M)

    Structure:

    Sign (1 bit) | Exponent (8 bits) | Mantissa (23 bits)
    0 or 1       | 8 bits            | 23 bits
    Formula:
    $N = (-1)^S \times 1.M \times 2^{E - 127}$
    • The bias for single precision is 127 (i.e., the exponent is stored with an offset of 127).
    • For normalized numbers (exponent field between 1 and 254), the mantissa has an implicit leading 1. That bit is not stored, so the 23 stored bits give 24 bits of precision.
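
The single-precision layout and formula above can be checked with a short Python sketch. It uses the standard `struct` module to view a value's 32 bits; the function names `decode_float32` and `float32_value` are illustrative, and the rebuild step assumes a normalized number.

```python
import struct

def decode_float32(x):
    """Split x (rounded to single precision) into its IEEE 754 fields."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF        # 23 bits, the part after the implicit "1."
    return sign, exponent, fraction

def float32_value(sign, exponent, fraction):
    """Rebuild the value of a normalized number from its three fields."""
    significand = 1 + fraction / 2**23    # restore the implicit leading 1
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)

fields = decode_float32(-6.25)
print(fields)                   # (1, 129, 4718592)
print(float32_value(*fields))   # -6.25
```

Here -6.25 is -1.5625 × 2^2, so the stored exponent is 2 + 127 = 129 and the fraction field holds the bits of 0.5625.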

    b. Double Precision (64-bit)

    In double precision (64-bit format), the floating-point number is represented as:

    • 1 bit for the sign (S)
    • 11 bits for the exponent (E)
    • 52 bits for the mantissa (M)

    Structure:

    Sign (1 bit) | Exponent (11 bits) | Mantissa (52 bits)
    0 or 1       | 11 bits            | 52 bits
    Formula:
    $N = (-1)^S \times 1.M \times 2^{E - 1023}$
    • The bias for double precision is 1023 (i.e., the exponent is stored with an offset of 1023).
    • Like single precision, normalized numbers (exponent field between 1 and 2046) have an implicit leading 1 in the mantissa, so the 52 stored bits give 53 bits of precision.
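
Python's own `float` is a double on typical platforms, so the 64-bit layout can be inspected directly. The following is a minimal sketch; `decode_float64` is an illustrative name, not a library function.

```python
import struct

def decode_float64(x):
    """Split a Python float into its double-precision sign, exponent and fraction."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                     # 1 bit
    exponent = (bits >> 52) & 0x7FF       # 11 bits, stored with a bias of 1023
    fraction = bits & ((1 << 52) - 1)     # 52 bits after the implicit "1."
    return sign, exponent, fraction

sign, exponent, fraction = decode_float64(0.1)
print(sign, exponent, hex(fraction))   # 0 1019 0x999999999999a

# With 53 bits of precision, 2**53 + 1 has no exact representation.
print(2.0**53 + 1 == 2.0**53)          # True
```

The repeating pattern in the fraction of 0.1 shows that it has no finite binary expansion, so the stored value is only the nearest representable double.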

    3. Normalization and Denormalization

    Floating-point numbers are typically normalized, meaning the mantissa is scaled such that the leading bit is 1. For example, in binary:

    $1.0101 \times 2^3$

    Normalization ensures that the number is represented with the most precision possible.

    However, normalized numbers cannot get arbitrarily close to zero: the smallest normalized magnitude is reached when the exponent field is 1. Values below that are stored in denormalized (also called subnormal) form, where the exponent field is all 0s and the implicit leading bit of the mantissa is 0 instead of 1. This lets the system represent numbers smaller than any normalized value, with fewer significant bits as the value approaches zero.
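
The boundary between normalized and subnormal values can be seen by building doubles straight from bit patterns. This is a sketch for double precision only; the helpers `fields64`, `from_bits64` and `subnormal_value` are illustrative names.

```python
import struct
import sys

def fields64(x):
    """Return (sign, exponent field, fraction field) of a double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

def from_bits64(bits):
    """Reinterpret a 64-bit integer as a double."""
    (x,) = struct.unpack(">d", struct.pack(">Q", bits))
    return x

def subnormal_value(fraction):
    """Value of a positive subnormal: leading bit 0, exponent fixed at 1 - 1023."""
    return (fraction / 2**52) * 2.0 ** -1022

smallest_normal = from_bits64(1 << 52)   # exponent field 1, fraction 0
smallest_subnormal = from_bits64(1)      # exponent field 0, fraction 1

print(smallest_normal == sys.float_info.min)       # True
print(smallest_subnormal)                          # 5e-324
print(subnormal_value(1) == smallest_subnormal)    # True
print(smallest_subnormal / 2)                      # 0.0
```

Halving the smallest subnormal gives 0.0 because nothing representable lies between it and zero.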

    4. Handling Special Cases

    IEEE 754 also defines representations for certain special cases:

    • Zero: Represented by an exponent of all 0s and a mantissa of all 0s. The sign bit determines if it's +0 or -0.
    • Infinity: Represented by an exponent of all 1s and a mantissa of all 0s. The sign bit determines if it's positive or negative infinity.
    • NaN (Not a Number): Represented by an exponent of all 1s and a non-zero mantissa. NaN is used for undefined or unrepresentable results (e.g., 0/0 or the square root of a negative number).
    • Subnormal Numbers: Represent very small values that are too small to be normalized. These numbers are represented with an exponent of all 0s and a non-zero mantissa.
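
The special cases above depend only on the exponent and mantissa fields, so a single-precision bit pattern can be classified with a few comparisons. A minimal sketch, with `classify_float32` as an illustrative name:

```python
def classify_float32(bits):
    """Classify a 32-bit IEEE 754 pattern by its exponent and mantissa fields."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0xFF:                       # exponent all 1s
        return "infinity" if fraction == 0 else "NaN"
    if exponent == 0:                          # exponent all 0s
        return "zero" if fraction == 0 else "subnormal"
    return "normalized"

for pattern in (0x00000000, 0x80000000, 0x7F800000, 0x7FC00000, 0x00000001, 0x3F800000):
    sign = "-" if pattern >> 31 else "+"
    print(f"{pattern:#010x} {sign} {classify_float32(pattern)}")
```

The sign bit is reported separately because it does not affect the category: 0x80000000 is -0 and 0xFF800000 is negative infinity.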

    5. Floating-Point Arithmetic Operations

    Once floating-point numbers are represented in binary, the CPU uses specialized hardware to perform arithmetic operations like addition, subtraction, multiplication, and division. The operations typically follow these steps:

    a. Addition and Subtraction:

    1. Align the exponents: Shift the mantissa of the number with the smaller exponent to the right until both numbers have the same exponent.
    2. Perform the addition or subtraction on the mantissas.
    3. Normalize the result if needed, adjusting the exponent.
    4. Handle rounding if necessary.
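
The four addition steps can be sketched on positive numbers written as `(significand, exponent)` pairs, each standing for `significand * 2**exponent` with a 24-bit integer significand as in single precision. This is an illustration under those assumptions, not how any particular FPU is built: Python integers are unbounded, so the sketch aligns by widening the larger operand and rounds once at the end, whereas hardware shifts the smaller operand right and keeps a few extra guard bits. The names `P`, `round_to_p_bits` and `fp_add` are hypothetical.

```python
P = 24  # significand width in bits: 23 stored bits plus the implicit leading 1

def round_to_p_bits(sig, exp):
    """Normalize sig * 2**exp to a P-bit significand, rounding to nearest, ties to even."""
    extra = sig.bit_length() - P
    if extra > 0:
        dropped = sig & ((1 << extra) - 1)    # bits that fall off the right
        half = 1 << (extra - 1)
        sig >>= extra
        exp += extra
        if dropped > half or (dropped == half and (sig & 1) == 1):
            sig += 1                          # round up
            if sig.bit_length() > P:          # rounding carried into a new bit
                sig >>= 1
                exp += 1
    elif extra < 0:
        sig <<= -extra                        # too few bits: shift left
        exp += extra
    return sig, exp

def fp_add(a, b):
    """Add two positive (significand, exponent) pairs."""
    (sig_a, exp_a), (sig_b, exp_b) = a, b
    exp = min(exp_a, exp_b)                   # step 1: bring both to a common exponent
    sig_a <<= exp_a - exp
    sig_b <<= exp_b - exp
    total = sig_a + sig_b                     # step 2: add the significands
    return round_to_p_bits(total, exp)        # steps 3 and 4: normalize and round

one = (1 << 23, -23)            # 1.0
tiny = (1 << 23, -47)           # 2**-24, half the spacing between floats near 1.0
print(fp_add(one, tiny) == one)               # True
sig, exp = fp_add((3 << 22, -23), (1 << 23, -25))   # 1.5 + 0.25
print(sig * 2.0**exp)                         # 1.75
```

The first result shows why alignment matters: the smaller operand lies entirely below the last kept bit of the larger one, so the sum rounds back to 1.0.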

    b. Multiplication:

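    1. Multiply the mantissas. The sign of the result is the XOR of the two sign bits.
    2. Add the exponents (subtracting the bias once, since both stored exponents include it).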
    3. Normalize the result and adjust the exponent.
    4. Handle rounding and possible overflow/underflow.

    c. Division:

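    1. Divide the mantissas. The sign of the result is the XOR of the two sign bits.
    2. Subtract the exponents (adding the bias back, since it cancels in the subtraction).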
    3. Normalize the result and adjust the exponent.
    4. Handle rounding and possible overflow/underflow.
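
The mantissa and exponent steps for multiplication and division can be mimicked in Python with `math.frexp` and `math.ldexp`. This is only a sketch: `frexp` returns a signed mantissa in [0.5, 1) rather than the 1.M form and an unbiased exponent, so no bias correction or separate sign handling appears, and rounding of the mantissa product is left to the hardware. The names `fp_mul` and `fp_div` are illustrative, and division by zero is not handled.

```python
import math

def fp_mul(a, b):
    """Multiply by combining mantissas and exponents separately."""
    ma, ea = math.frexp(a)       # a == ma * 2**ea
    mb, eb = math.frexp(b)
    return math.ldexp(ma * mb, ea + eb)   # multiply mantissas, add exponents

def fp_div(a, b):
    """Divide by combining mantissas and exponents separately."""
    ma, ea = math.frexp(a)
    mb, eb = math.frexp(b)
    return math.ldexp(ma / mb, ea - eb)   # divide mantissas, subtract exponents

print(math.frexp(12.0))      # (0.75, 4)
print(fp_mul(3.0, 4.0))      # 12.0
print(fp_div(1.0, 4.0))      # 0.25
```

`ldexp` performs the final scaling by a power of two, which plays the role of the normalize-and-adjust step.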

    6. Assembly and Floating-Point Operations

    In assembly language, floating-point operations use dedicated instructions that execute on a floating-point unit (FPU) rather than on the integer ALU.

    For example, in x86 assembly:

    • FLD: Load a floating-point value into the FPU stack.
    • FADD: Add two floating-point numbers.
    • FMUL: Multiply two floating-point numbers.
    • FDIV: Divide two floating-point numbers.
    • FSUB: Subtract two floating-point numbers.

    Here’s a simple example of adding two floating-point numbers in x86 assembly:

    ; Assumes num1, num2, and result are 32-bit (single-precision) variables
    FLD    DWORD PTR [num1]    ; Push num1 onto the FPU stack: st(0) = num1
    FLD    DWORD PTR [num2]    ; Push num2: st(0) = num2, st(1) = num1
    FADDP  st(1), st(0)        ; st(1) = st(1) + st(0), then pop: the sum is now in st(0)
    FSTP   DWORD PTR [result]  ; Store st(0) into result and pop the stack
    

    7. Precision and Rounding

    Floating-point numbers have finite precision, and therefore, errors can arise in calculations due to rounding. IEEE 754 defines several rounding modes:

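    • Round to nearest (the default): The result is rounded to the nearest representable number. When it lies exactly halfway between two, the one whose last mantissa bit is 0 is chosen (ties to even).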
    • Round toward zero: The result is rounded towards zero (truncated).
    • Round up (toward +∞): The result is rounded towards positive infinity.
    • Round down (toward -∞): The result is rounded towards negative infinity.
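
The four modes can be compared on a small integer model: dropping the low bits of a signed integer is the same decision a floating-point unit makes when a result has more mantissa bits than fit. The function `round_shift` and its mode names are hypothetical, chosen for this sketch.

```python
def round_shift(value, shift, mode):
    """Drop the low `shift` bits (shift >= 1) of a signed integer under a rounding mode."""
    down = value >> shift            # Python's >> floors: rounds toward -infinity
    up = -((-value) >> shift)        # ceiling: rounds toward +infinity
    if mode == "down":
        return down
    if mode == "up":
        return up
    if mode == "zero":
        return down if value >= 0 else up
    if mode == "nearest":
        remainder = value - (down << shift)   # the dropped bits, always >= 0
        half = 1 << (shift - 1)
        if remainder > half or (remainder == half and (down & 1) == 1):
            return down + 1                   # above halfway, or a tie with an odd result
        return down
    raise ValueError(f"unknown rounding mode: {mode}")

# 10 / 4 = 2.5 and -10 / 4 = -2.5 are exact ties when two bits are dropped.
for value in (10, -10):
    print(value, [round_shift(value, 2, m) for m in ("nearest", "zero", "up", "down")])
# 10 [2, 2, 3, 2]
# -10 [-2, -2, -2, -3]
```

Both ties go to the even neighbour under round to nearest, while the directed modes differ only in which side of zero they favour.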

    Conclusion

    Floating-point representation in computers allows for the efficient handling of real numbers, enabling complex calculations in scientific and engineering applications. The IEEE 754 standard ensures that floating-point arithmetic is consistent across platforms, although precision issues and rounding errors are inevitable due to finite storage. Understanding how floating-point numbers are represented and manipulated at the hardware level is essential for low-level programming and assembly language operations.
