CUDA: fix numerical issue in tile FA kernel #16540

JohannesGaessler · 2025-10-12T19:34:46Z

Fixes the issue described in #16528 (comment).

As far as I can tell, the problem is a numerical issue in the rescaling of the VKQ accumulators with the inverse of the KQ sum at the end of the kernel. Neither the input values in test-backend-ops nor the models I tested provoked this issue, so I did not detect it in #16492. The fix is simply to use FP32 arithmetic for that rescaling; the impact on performance is negligible since it is only done once per CUDA block.
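To illustrate why the precision of this final rescaling can matter, here is a minimal host-side C++ sketch. It is not the kernel code from this PR: `round_to_half`, `rescale_fp16_emulated`, and `rescale_fp32` are hypothetical helpers, FP16 is emulated in software rather than with CUDA's `__half`, and the assumption that the old path computed the inverse and the product in FP16 is a guess at one plausible failure mode, not a confirmed description of the bug.

```cpp
#include <cmath>
#include <limits>

// Round a float to the nearest IEEE binary16 (FP16) value, returned as a float.
// Software emulation for illustration only; real CUDA code would use __half.
float round_to_half(float x) {
    if (std::isnan(x) || std::isinf(x) || x == 0.0f) {
        return x;
    }
    const float ax = std::fabs(x);
    int e = 0;
    std::frexp(ax, &e);                 // ax = m * 2^e with m in [0.5, 1)
    int exp2 = e - 1;                   // ax lies in [2^exp2, 2^(exp2 + 1))
    if (exp2 < -14) {
        exp2 = -14;                     // subnormal range: fixed spacing of 2^-24
    }
    const float ulp = std::ldexp(1.0f, exp2 - 10);  // 10 explicit mantissa bits
    float r = std::nearbyint(ax / ulp) * ulp;       // ties-to-even in the default rounding mode
    if (r > 65504.0f) {                             // larger than the FP16 maximum
        r = std::numeric_limits<float>::infinity();
    }
    return std::copysign(r, x);
}

// Assumed "before" behaviour: inverse of the KQ sum and the product both rounded to FP16.
float rescale_fp16_emulated(float vkq_acc, float kq_sum) {
    const float acc_h = round_to_half(vkq_acc);
    const float inv_h = round_to_half(1.0f / kq_sum);
    return round_to_half(acc_h * inv_h);
}

// "After" behaviour as described in the PR text: do the rescaling in FP32.
float rescale_fp32(float vkq_acc, float kq_sum) {
    return vkq_acc * (1.0f / kq_sum);
}
```

With deliberately exaggerated inputs that are not taken from the PR, such as `vkq_acc = 1024.0f` and `kq_sum = 1.0e8f`, the inverse is smaller than half of the smallest FP16 subnormal (2^-24), so the emulated FP16 path rounds it to zero and returns 0, while the FP32 path returns a value close to 1.024e-5. With a milder `kq_sum = 40000.0f` the inverse lands in the FP16 subnormal range and the result only loses some precision.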

github-actions bot added the Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Oct 12, 2025

JohannesGaessler mentioned this pull request Oct 12, 2025

graph : support cacheless embeddings with FA and iSWA #16528

Open

CUDA: fix numerical issue in tile FA kernel

ba62ea9

JohannesGaessler force-pushed the cuda-fa-fix-numerics branch from 525916a to ba62ea9 Compare October 12, 2025 19:45
