In release 8.0.0, we support `float16_t` and `bfloat16_t` (thanks @dalle). We have reasonable testing, and our code is based on an implementation that has been publicly available since GCC 13 (thanks @jakubjelinek for providing support).

However, issues remain:

- At least one issue was identified. It is minor, so we still released, but it should be fixed. With `float16_t`, the smallest positive value (a subnormal) that can be represented is 2**-24. Consider 2.98023223876953125E-8, which is exactly 2**-25 (for reference, 2**-24 is 5.9604644775390625E-8). This value is exactly the midpoint between the float16 zero and the smallest positive float16 value. It should parse to zero (round to even), but it does not. GCC shares this issue, and it is quite minor, but there may be other similar issues, and this requires investigation. One cause is that our subnormal code assumes that there cannot be short strings requiring round-to-even. That assumption is mathematically proven for 32-bit and 64-bit floats, but it does not hold in general.
- More generally, we need to go through both `float16_t` and `bfloat16_t` and prove (mathematically) that all parameters are correct and optimal.
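The tie case can be checked independently of any parser with exact rational arithmetic. The sketch below is only an illustration, not the library's code: the helper name `round_to_float16_subnormal_grid` is hypothetical, and it assumes the input lies in the float16 subnormal range (magnitude below 2**-14), where representable values are the multiples of 2**-24. It uses the decimal expansions 2**-25 = 2.98023223876953125E-8 and 2**-24 = 5.9604644775390625E-8.

```python
from fractions import Fraction

# Spacing of the float16 subnormal grid; also the smallest positive float16.
FLOAT16_MIN_SUBNORMAL = Fraction(1, 2**24)


def round_to_float16_subnormal_grid(decimal_text: str) -> Fraction:
    """Round a decimal string to the nearest multiple of 2**-24, ties to even.

    Hypothetical helper: only meaningful for magnitudes below 2**-14,
    i.e. inside the float16 subnormal range.
    """
    exact = Fraction(decimal_text)  # exact rational value, no intermediate double
    # round() on a Fraction rounds half to even, which matches the
    # IEEE 754 default rounding mode for ties.
    steps = round(exact / FLOAT16_MIN_SUBNORMAL)
    return steps * FLOAT16_MIN_SUBNORMAL


# 2**-25 is the exact midpoint between 0 and 2**-24: the tie goes to the even side, 0.
print(round_to_float16_subnormal_grid("2.98023223876953125E-8"))  # 0
# 2**-24 is itself representable and must be returned unchanged.
print(round_to_float16_subnormal_grid("5.9604644775390625E-8") == FLOAT16_MIN_SUBNORMAL)  # True
```

A reference of this kind could be compared against the parser's output for short decimal strings near subnormal midpoints, which is the class of inputs the subnormal code currently assumes cannot occur.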