
Conversation

@pxl-th (Member) commented Jan 2, 2026

For now CUDA.jl only. AMDGPU.jl will come next.

FP16, Flash attention FWD + BWD:

- before: 114.087 ms (1142 allocations: 33.73 KiB)
- now: 83.020 ms (1142 allocations: 33.73 KiB)
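For context, a minimal sketch of how a FWD + BWD timing like this could be collected with BenchmarkTools.jl and Zygote.jl; the `NNop.flash_attention` entry point, its argument layout, and the problem size are assumptions for illustration.

```julia
using BenchmarkTools
using CUDA
using Zygote
using NNop

# Assumed problem size: (embedding dim, sequence length, heads, batch).
E, L, H, B = 64, 4096, 4, 4
q = CuArray(rand(Float16, E, L, H, B))
k = CuArray(rand(Float16, E, L, H, B))
v = CuArray(rand(Float16, E, L, H, B))

# `NNop.flash_attention` is an assumed entry point; sum() gives a scalar loss
# so that Zygote can drive the backward pass through the kernel.
loss(q, k, v) = sum(NNop.flash_attention(q, k, v))

# FWD + BWD timing, synchronizing the GPU so the whole pass is measured.
@btime CUDA.@sync Zygote.gradient($loss, $q, $k, $v)
```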

@pxl-th (Member, Author) commented Jan 2, 2026

Tests pass locally; CI is broken.

pxl-th merged commit b6ea23c into master on Jan 2, 2026 (1 check failed).
pxl-th deleted the pxl-th/wmma branch on Jan 2, 2026, 17:22.
@AntonOresten (Contributor) commented:
Wow!! Terrific! Excellent work🙏

Would BFloat16 require JuliaGPU/CUDA.jl#1425?

@pxl-th (Member, Author) commented Jan 2, 2026

> Would BFloat16 require JuliaGPU/CUDA.jl#1425?

Most likely; I tried it without those changes and got a bunch of errors.

@AntonOresten (Contributor) commented Jan 3, 2026

JuliaGPU/CUDA.jl#3009 could be promising. Note that it requires Float32 accumulation.
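
As a rough illustration of what the Float32 accumulation requirement means at the WMMA level, here is a minimal single-warp sketch in the style of the CUDA.jl WMMA example: Float16 inputs with a Float32 accumulator on a 16×16×16 tile. In the BFloat16 case the `a`/`b` inputs would be BFloat16 while `c`/`d` stay Float32.

```julia
using CUDA

a = CuArray(rand(Float16, 16, 16))
b = CuArray(rand(Float16, 16, 16))
c = CuArray(zeros(Float32, 16, 16))   # accumulator is Float32
d = similar(c)

function wmma_kernel(a, b, c, d)
    # Float16 × Float16 matrix product accumulated into Float32.
    conf = WMMA.Config{16, 16, 16, Float32}

    a_frag = WMMA.load_a(pointer(a), 16, WMMA.ColMajor, conf)
    b_frag = WMMA.load_b(pointer(b), 16, WMMA.ColMajor, conf)
    c_frag = WMMA.load_c(pointer(c), 16, WMMA.ColMajor, conf)

    d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)

    WMMA.store_d(pointer(d), d_frag, 16, WMMA.ColMajor, conf)
    return
end

# One warp (32 threads) computes the 16×16×16 tile.
@cuda threads=32 wmma_kernel(a, b, c, d)
```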

Comparing ONIONop (no WMMA) to NNop master (with WMMA), with some buffer and type conversions to work around the accumulation requirement:

[benchmark comparison image: ONIONop (no WMMA) vs. NNop master (with WMMA)]
