Floating-point calculations shall be handled as described in IEC 60559:1989. This standard describes encoding of floating-point numbers, definitions of the basic operations and conversion, rounding control, and exception handling.
The standard defines special values, NaN (not a number), +infinity, and –infinity. These values are returned on overflow conditions. A general principle is that operations that have a value in the limit return an appropriate infinity while those that have no limiting value return NaN (see the standard for details).
[Note: The following examples show the most commonly encountered cases.
X rem 0 = NaN
0 * +infinity = 0 * -infinity = NaN
(X / 0) = +infinity, if X > 0
NaN, if X = 0
-infinity, if X < 0
NaN op X = X op NaN = NaN for all operations
(+infinity) + (+infinity) = (+infinity)
X / (+infinity) = 0
X mod (-infinity) = -X
(+infinity) - (+infinity) = NaN
This standard does not specify the behavior of arithmetic operations on denormalized floatingpoint numbers, nor does it specify when or whether such representations should be created. This is in keeping with IEC 60559:1989. In addition, this standard does not specify how to access the exact bit pattern of NaNs that are created, nor the behavior when converting a NaN between 32-bit and 64-bit representation. All of this behavior is deliberately left implementation-specific. end note]
For purposes of comparison, infinite values act like a number of the correct sign, but with a very large magnitude when compared with finite values. For comparison purposes, NaN is 'unordered' (see clt
, clt.un
).
While the IEC 60559:1989 standard also allows for exceptions to be thrown under unusual conditions (such as overflow and invalid operand), the CLI does not generate these exceptions. Instead, the CLI uses the NaN, +infinity, and –infinity return values and provides the instruction ckfinite
to allow users to generate an exception if a result is NaN, +infinity, or –infinity.
The rounding mode defined in IEC 60559:1989 shall be set by the CLI to "round to the nearest number," and neither the CIL nor the class library provide a mechanism for modifying this setting. Conforming implementations of the CLI need not be resilient to external interference with this setting. That is, they need not restore the mode prior to performing floating-point operations, but rather, can rely on it having been set as part of their initialization.
For conversion to integers, the default operation supplied by the CIL is "truncate towards zero". Class libraries are supplied to allow floating-point numbers to be converted to integers using any of the other three traditional operations (round to nearest integer, floor (truncate towards –infinity), ceiling (truncate towards +infinity)).
Storage locations for floating-point numbers (statics, array elements, and fields of classes) are of fixed size. The supported storage sizes are float32
and float64
. Everywhere else (on the evaluation stack, as arguments, as return types, and as local variables) floating-point numbers are represented using an internal floating-point type. In each such instance, the nominal type of the variable or expression is either float32
or float64
, but its value can be represented internally with additional range and/or precision. The size of the internal floating-point representation is implementation-dependent, can vary, and shall have precision at least as great as that of the variable or expression being represented. An implicit widening conversion to the internal representation from float32
or float64
is performed when those types are loaded from storage. The internal representation is typically the native size for the hardware, or as required for efficient implementation of an operation. The internal representation shall have the following characteristics:
-
The internal representation shall have precision and range greater than or equal to the nominal type.
-
Conversions to and from the internal representation shall preserve value.
[Note: This implies that an implicit widening conversion from float32
(or float64
) to the internal representation, followed by an explicit conversion from the internal representation to float32
(or float64
), will result in a value that is identical to the original float32
(or float64
) value. end note]
[Rationale: This design allows the CLI to choose a platform-specific high-performance representation for floating-point numbers until they are placed in storage locations. For example, it might be able to leave floating-point variables in hardware registers that provide more precision than a user has requested. At the same time, CIL generators can force operations to respect language-specific rules for representations through the use of conversion instructions. end rationale]
When a floating-point value whose internal representation has greater range and/or precision than its nominal type is put in a storage location, it is automatically coerced to the type of the storage location. This can involve a loss of precision or the creation of an out-of-range value (NaN, +infinity, or -infinity). However, the value might be retained in the internal representation for future use, if it is reloaded from the storage location without having been modified. It is the responsibility of the compiler to ensure that the retained value is still valid at the time of a subsequent load, taking into account the effects of aliasing and other execution threads (see memory model (§I.12.6)). This freedom to carry extra precision is not permitted, however, following the execution of an explicit conversion (conv.r4
or conv.r8
), at which time the internal representation must be exactly representable in the associated type.
[Note: To detect values that cannot be converted to a particular storage type, a conversion instruction (conv.r4
, or conv.r8
) can be used, followed by a check for a non-finite value using ckfinite. Underflow can be detected by converting to a particular storage type, comparing to zero before and after the conversion. end note]
[Note: The use of an internal representation that is wider than float32
or float64
can cause differences in computational results when a developer makes seemingly unrelated modifications to their code, the result of which can be that a value is spilled from the internal representation (e.g., in a register) to a location on the stack. end note]