feat(amd): first-class R9700 (gfx1201/RDNA4) support for Qwen3.6-27B#435

Open
DeanoC wants to merge 2 commits into Luce-Org:main from GeometricAGI:feat/r9700-gfx1201

@DeanoC DeanoC commented Jun 22, 2026

Summary

Makes the AMD Radeon AI PRO R9700 (gfx1201, RDNA4 / Navi 48) a first-class AMD target for the DFlash Qwen3.6-27B stack, and fixes a CMake-version build break that affects the cuda12 / rocm Docker images.

1. R9700 / gfx1201 support

The HIP build pipeline compiled RDNA4 as gfx1200 (Navi 44, RX 9060-class) and mislabeled it "RX 9070". The R9700 is gfx1201, which is not code-object compatible with gfx1200 — so the published :rocm image and the documented build produced no native R9700 kernels (ggml-hip fails to load / falls back badly on the R9700).

  • Add gfx1201 to the default fat-binary HIP arch list in docker-bake.hcl, Dockerfile.rocm, and the CI main/release matrix (.github/workflows/docker.yml). PR builds stay gfx1151-only for fast CI. Correct the RDNA4 labels (gfx1200 = RX 9060, gfx1201 = RX 9070 / R9700).
  • Build the rocWMMA flashprefill numerics test (test_flashprefill_kernels) under HIP too — it was CUDA-only, so there was previously no way to validate the rocWMMA path on AMD. The CUDA-spelled test compiles via the existing hip_compat/ shim.
  • Document the R9700 in the README "Tested Machines" table and the server/README.md AMD HIP section (gfx1201 build, --ddtree-budget=22, multi-GPU / Fedora-PIE build notes).

No kernel changes: ROCm 7.1's rocWMMA handles the gfx12 WMMA operand-format change internally.
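
As a rough illustration of what a fat-binary arch list with gfx1201 looks like at the CMake level, the fragment below sets a default target list. It is a sketch under assumptions: it uses CMake's standard CMAKE_HIP_ARCHITECTURES variable (available since CMake 3.21), which may not be the variable this repository's Dockerfiles actually pass, and it lists only the three architectures named in this PR, so the real default list may be longer.

```cmake
# Sketch only. Place before project() / enable_language(HIP) so the value is
# picked up when the HIP compiler is configured.
# Assumption: the build honors CMAKE_HIP_ARCHITECTURES; the project may use a
# different variable in docker-bake.hcl / Dockerfile.rocm.
if(NOT DEFINED CMAKE_HIP_ARCHITECTURES)
  # gfx1151 = the PR-CI target, gfx1200 = RX 9060-class, gfx1201 = RX 9070 / R9700
  set(CMAKE_HIP_ARCHITECTURES "gfx1151;gfx1200;gfx1201"
      CACHE STRING "HIP fat-binary targets")
endif()

message(STATUS "HIP architectures: ${CMAKE_HIP_ARCHITECTURES}")
```

A user building only for an R9700 could override the default on the command line with -DCMAKE_HIP_ARCHITECTURES=gfx1201, which shortens compile time at the cost of portability.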

2. Build fix: json FetchContent on CMake < 3.24

#433 added DOWNLOAD_EXTRACT_TIMESTAMP TRUE to the json FetchContent_Declare. That keyword (and policy CMP0135) only exist on CMake ≥ 3.24, but the cuda12 / rocm Docker base images ship CMake 3.22, where the unknown token is parsed as extra URL list entries and configure fails with "At least one entry of URL is a path (invalid in a list)". The option is now applied only on CMake ≥ 3.24 (where it still silences the dev warning); older CMake parses the declare correctly again. Reproduced and verified on CMake 3.22.6 and 4.3.
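
The version gate described above can be sketched as follows. This is a minimal illustration, not necessarily the exact code in the PR: the variable names JSON_ARCHIVE_URL and json_timestamp_args are hypothetical, and the URL is a placeholder that must be replaced with the archive URL the project already uses.

```cmake
include(FetchContent)

# Placeholder (hypothetical variable): substitute the project's real json archive URL.
set(JSON_ARCHIVE_URL "<json-archive-url>")

# DOWNLOAD_EXTRACT_TIMESTAMP is only understood by CMake >= 3.24. On older
# versions the list stays empty and expands to nothing in the call below,
# so the keyword never reaches the URL argument list.
set(json_timestamp_args "")
if(CMAKE_VERSION VERSION_GREATER_EQUAL "3.24")
  set(json_timestamp_args DOWNLOAD_EXTRACT_TIMESTAMP TRUE)
endif()

FetchContent_Declare(json
  URL ${JSON_ARCHIVE_URL}
  ${json_timestamp_args}
)
FetchContent_MakeAvailable(json)
```

Building the optional arguments in a list keeps a single FetchContent_Declare call instead of duplicating the whole declaration in both branches of the version check.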

Validation — real R9700 (gfx1201, ROCm 7.1.1)

Native gfx1201 build (Phase 1 + Phase 2 rocWMMA); the built libggml-hip.so contains native hipv4-amdgcn-amd-amdhsa--gfx1201 code objects.

| Check | Result |
| --- | --- |
| test_server_unit | 1984 assertions, 0 failures |
| test_flashprefill_kernels (rocWMMA) | PASS; max diff 5e-4; e2e flash_prefill_forward_bf16 @ S=8192 = 10.7 ms/iter |
| Qwen3.6-27B Q4_K_M + DFlash, --ddtree-budget=22 | 54.65 tok/s mean decode (bench_he.py --n-gen 256, AL 7.14, range 36.9–93.0 tok/s) |
| 16K-context generation | coherent |

Notes

  • Fedora's system ROCm lives under /usr and links PIE by default; the local build used -DCMAKE_HIP_COMPILER_ROCM_ROOT=/usr -DROCM_PATH=/usr -DCMAKE_EXE_LINKER_FLAGS=-no-pie (documented in server/README.md). The Ubuntu-based Dockerfile.rocm is unaffected.
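
For repeated Fedora builds, the three flags in the note above could be kept in a CMake initial-cache script instead of being retyped. The sketch below assumes a hypothetical file name, fedora-rocm.cmake, and covers only those three flags; any other options the HIP build needs (for example the ones documented in server/README.md) still have to be passed separately.

```cmake
# fedora-rocm.cmake (hypothetical file name)
# Usage: cmake -C fedora-rocm.cmake -S . -B build [other project options]

# Fedora installs the system ROCm under /usr rather than /opt/rocm.
set(CMAKE_HIP_COMPILER_ROCM_ROOT "/usr" CACHE PATH "ROCm root used by the HIP compiler")
set(ROCM_PATH "/usr" CACHE PATH "ROCm install prefix")

# Fedora links position-independent executables by default; opt out for this build.
set(CMAKE_EXE_LINKER_FLAGS "-no-pie" CACHE STRING "Extra flags for linking executables")
```

Entries loaded with -C populate the cache before the first configure, so they behave like the equivalent -D flags.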

DeanoC and others added 2 commits June 22, 2026 18:55
The HIP build pipeline compiled RDNA4 as gfx1200 (Navi 44, RX 9060) and
mislabeled it 'RX 9070'. The Radeon AI PRO R9700 is gfx1201 (Navi 48), which
is NOT code-object compatible with gfx1200 — so the published :rocm image and
the documented build shipped no native R9700 kernels.

- Add gfx1201 to the default fat-binary HIP arch list (docker-bake.hcl,
  Dockerfile.rocm, CI main/release matrix) and correct the RDNA4 labels.
- Build the rocWMMA flashprefill numerics test (test_flashprefill_kernels)
  under HIP too, not just CUDA, so the Phase 2 path can be validated on AMD.
- Document the R9700: gfx1201 build, --ddtree-budget=22, the multi-GPU /
  Fedora-PIE build notes, and benchmark numbers.

Validated on a real R9700 (gfx1201, ROCm 7.1.1):
- test_server_unit: 1984 assertions, 0 failures
- test_flashprefill_kernels (rocWMMA): PASS, max diff 5e-4
- Qwen3.6-27B Q4_K_M + DFlash, budget=22: 54.65 tok/s mean decode (AL 7.14)
- coherent 16K-context generation

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The json FetchContent_Declare gained DOWNLOAD_EXTRACT_TIMESTAMP TRUE in Luce-Org#433
to silence the CMP0135 dev warning. That keyword (and CMP0135) only exist on
CMake >= 3.24; on the 3.22 base used by the cuda12/rocm Docker builds the
unknown token is parsed as extra URL list entries, failing configure with
'At least one entry of URL is a path (invalid in a list)'. This broke the
Docker prebuilds on main, independent of the R9700 work.

Apply the option only on CMake >= 3.24 (where it still silences the warning);
older CMake parses the declare correctly again. Reproduced and verified on
CMake 3.22.6 and 4.3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 6 files
