Ensure that std::float16_t/std::bfloat16_t support is exact #300

lemire · 2025-02-07T01:36:57Z

However, issues remain:

At least one issue was identified. It is minor, so we still released, but it should be fixed. With float16_t, the smallest positive value (a subnormal) that can be represented is 2**-24. Consider 2.98023223876953125E-8, which is exactly 2**-25. This value is exactly the midpoint between the float16 zero and the smallest positive float16 value. It should parse to zero (with rounding to even), but it does not. GCC shares this issue, and it is quite minor. But there may be other similar issues, and this requires investigation. One cause of the issue is that our subnormal code assumes that there cannot be short strings requiring round-to-even; that assumption is mathematically proven for 32-bit and 64-bit floats. However, it does not hold in general.
More generally, we need to go through both float16_t and bfloat16_t and prove (mathematically) that all parameters are correct and optimal.
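
To make the expected result for the midpoint case concrete, here is a minimal reference sketch. It does not use fast_float or std::float16_t; the helper names `round_to_float16_subnormal` and `reference_float16_subnormal` are hypothetical and exist only for this illustration. It assumes the default rounding mode (round to nearest, ties to even) and inputs that are exactly representable as a double, which holds for 2**-25, written in decimal as 2.98023223876953125E-8. For other inputs, going through a double first could introduce double rounding, so this is not a general-purpose oracle.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>

// Hypothetical helper (not part of fast_float): rounds a non-negative double
// that lies in the float16 subnormal range to the nearest multiple of 2^-24,
// resolving ties toward the even multiple.
double round_to_float16_subnormal(double value) {
  // Only values below the smallest normal float16 (2^-14) are handled here.
  assert(value >= 0.0 && value < std::ldexp(1.0, -14));
  // Scaling by a power of two is exact: only the exponent changes.
  double scaled = std::ldexp(value, 24);
  // nearbyint follows the current rounding mode; the default mode rounds
  // to nearest with ties to even, so 0.5 becomes 0 and 1.5 becomes 2.
  double steps = std::nearbyint(scaled);
  return std::ldexp(steps, -24);
}

// Parses a decimal string with strtod, then rounds to the float16
// subnormal grid. Assumes the string is exactly representable as a double.
double reference_float16_subnormal(const char* text) {
  return round_to_float16_subnormal(std::strtod(text, nullptr));
}
```

Under these assumptions, `reference_float16_subnormal("2.98023223876953125E-8")` yields zero, which is the value a float16 parser is expected to produce for this input, while `reference_float16_subnormal("5.9604644775390625E-8")` yields 2**-24, the smallest positive float16 value.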
