Ensure that std::float16_t/std::bfloat16_t support is exact #300

Open
lemire opened this issue Feb 7, 2025 · 0 comments

lemire commented Feb 7, 2025

In release 8.0.0, we support float16_t and bfloat16_t (thanks @dalle). We have reasonable testing and our code is based on an implementation publicly available since GCC 13 (thanks @jakubjelinek for providing support).

However, issues remain:

  1. At least one issue has been identified. It is minor, so we still released, but it should be fixed. The smallest positive value (a subnormal) that float16_t can represent is 2**-24. Consider 2.98023223876953125E-8, which is exactly 2**-25. This value is the exact midpoint between the float16 zero and the smallest positive float16 value, so it should parse to zero (round-to-even), but it does not. GCC shares this issue. There may be other similar issues, and this requires investigation. One cause is that our subnormal code assumes that short strings can never require round-to-even. That assumption is mathematically proven for 32-bit and 64-bit floats, but it does not hold in general.
  2. More generally, we need to go through both the float16_t and bfloat16_t support and prove (mathematically) that all parameters are correct and optimal.
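
To make the expected result in item 1 concrete, here is a minimal sketch of a reference check. It is not fast_float code: `float16_subnormal_units` is a hypothetical helper, and it assumes the default floating-point rounding mode (round-to-nearest, ties to even). It counts how many float16 subnormal units (multiples of 2**-24) a double rounds to, using only `double` arithmetic so it does not need `std::float16_t`.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical reference helper (not part of fast_float).
// For x in the float16 subnormal range [0, 2**-14], returns the number of
// subnormal units (multiples of 2**-24) that x rounds to, ties to even.
inline std::uint32_t float16_subnormal_units(double x) {
  // 2**-14 is the smallest normal float16 value; it corresponds to 1024 units.
  assert(x >= 0.0 && x <= std::ldexp(1.0, -14));
  // Scaling by a power of two is exact in double for this range.
  const double scaled = std::ldexp(x, 24);
  // std::nearbyint uses the current rounding mode; the default mode is
  // round-to-nearest with ties going to the even integer.
  return static_cast<std::uint32_t>(std::nearbyint(scaled));
}
```

With this helper, `float16_subnormal_units(std::ldexp(1.0, -25))` yields 0 units: the midpoint is 0.5 units, and the tie goes to the even neighbor, which is zero. A value just above the midpoint, such as `std::nextafter(std::ldexp(1.0, -25), 1.0)`, yields 1 unit. A parser that is exact for float16_t should agree with this on inputs in the subnormal range.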