
Conversation

ggerganov
Member

No description provided.

mgiessing and others added 30 commits September 5, 2025 12:53
…ama/15385)

* Added VSX intrinsics for Power9+ systems

Signed-off-by: mgiessing <[email protected]>

* Manual unrolling for minor perf improvement

Signed-off-by: mgiessing <[email protected]>

* Update src/ggml-cpu/arch/powerpc/quants.c

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Signed-off-by: mgiessing <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
* optimize rope ops

* amendment

* delete trailing whitespace

* change the variable name
* musa: fix build warnings

Signed-off-by: Xiaodong Ye <[email protected]>

* fix warning: comparison of integers of different signs: 'const int' and 'unsigned int' [-Wsign-compare]

Signed-off-by: Xiaodong Ye <[email protected]>

---------

Signed-off-by: Xiaodong Ye <[email protected]>
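
For reference, the -Wsign-compare warning above comes from comparing a signed and an unsigned integer, and the usual fix is to make the operand types agree. A minimal illustration with hypothetical variable names, not the actual MUSA code:

```cpp
#include <cstdio>

int main() {
    const int n_items  = 8;   // signed value, as in the warning
    unsigned  capacity = 16u; // unsigned value it is compared against

    // warning: comparison of integers of different signs:
    //     if (n_items < capacity) { ... }

    // fix: check for non-negativity, then cast so both sides are unsigned
    if (n_items >= 0 && static_cast<unsigned>(n_items) < capacity) {
        std::printf("fits\n");
    }
    return 0;
}
```
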
These detailed strings were causing increased build time on gcc.
* vulkan: Reuse conversion results in prealloc_y

Cache the pipeline and tensor that were most recently used to fill prealloc_y,
and skip the conversion if the current pipeline/tensor match.

* don't use shared pointer for prealloc_y_last_pipeline_used
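
A minimal sketch of the caching idea, using invented member and helper names rather than the real ggml-vulkan ones: remember which pipeline and source tensor last filled prealloc_y, skip the conversion when both match, and reset the tags whenever the buffer is clobbered.

```cpp
struct vk_pipeline;  // opaque stand-ins for the real Vulkan-backend types
struct ggml_tensor;

struct prealloc_y_cache_sketch {
    // plain pointers on purpose: they are identity tags, never dereferenced
    vk_pipeline *      last_pipeline_used = nullptr;
    const ggml_tensor *last_tensor_used   = nullptr;

    void convert_into_prealloc_y(vk_pipeline *to_fp16, const ggml_tensor *src) {
        if (to_fp16 == last_pipeline_used && src == last_tensor_used) {
            return; // prealloc_y already holds this conversion result
        }
        run_conversion(to_fp16, src); // hypothetical dispatch helper
        last_pipeline_used = to_fp16;
        last_tensor_used   = src;
    }

    void invalidate() { // call whenever prealloc_y is overwritten elsewhere
        last_pipeline_used = nullptr;
        last_tensor_used   = nullptr;
    }

    void run_conversion(vk_pipeline *, const ggml_tensor *) { /* ... */ }
};
```
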
* [CANN] Optimize RMS_NORM using cache

Signed-off-by: noemotiovon <[email protected]>

* fix typo

Signed-off-by: noemotiovon <[email protected]>

* fix review comment

Signed-off-by: noemotiovon <[email protected]>

* codestyle adjustment

Signed-off-by: noemotiovon <[email protected]>

---------

Signed-off-by: noemotiovon <[email protected]>
* ggml-cpu: initial q5_0 impl for s390x

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: updated q5_0 code for better performance

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: use optimised hsum for better performance

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: introduce q5_1 simd + refactor q5_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix incorrect return type vec_hsum

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: q5_0 incomplete refactor + table_b2b_0 activation

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: refactor q5_1

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: q5_1 update loop unroll to 4

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: update q5_0 unroll to 4

Signed-off-by: Aaron Teo <[email protected]>
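
The unroll-to-4 commits follow the usual pattern of keeping several independent accumulators so consecutive multiply-adds do not serialize on one register. A generic scalar sketch of the shape of the loop, not the actual s390x vector code:

```cpp
#include <cstddef>

// 4-way unrolled dot product: four independent sums break the dependency
// chain between consecutive fused multiply-adds.
static float vec_dot_unrolled4(const float *x, const float *y, size_t n) {
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i + 0] * y[i + 0];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; ++i) { // scalar tail
        s0 += x[i] * y[i];
    }
    return (s0 + s1) + (s2 + s3);
}
```
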

* ggml-cpu: update build-s390x docs

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: update unused variables q5_0

Signed-off-by: Aaron Teo <[email protected]>

* docs: update the last update date

Signed-off-by: Aaron Teo <[email protected]>

---------

Signed-off-by: Aaron Teo <[email protected]>
* Add Pad Reflect 1D CUDA support

* Update src/ggml-cuda/pad_reflect_1d.cu

Co-authored-by: Johannes Gäßler <[email protected]>

---------

Co-authored-by: Johannes Gäßler <[email protected]>
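
For context, reflect padding mirrors indices around the edges without repeating the border element. A host-side sketch of the index mapping such a kernel evaluates per output element (a simplification, not the actual CUDA code):

```cpp
#include <cassert>

// Map an output index of a 1D reflect-padded array to a source index.
// n: source length, p0: left padding (reflect requires padding < n).
static int reflect_index(int i_out, int n, int p0) {
    int i = i_out - p0;              // position relative to the source
    if (i < 0)  i = -i;              // mirror at the left edge, no border repeat
    if (i >= n) i = 2 * (n - 1) - i; // mirror at the right edge
    return i;
}

int main() {
    // src = {0,1,2,3}, pad 2 on each side -> indices {2,1,0,1,2,3,2,1}
    const int expect[8] = { 2, 1, 0, 1, 2, 3, 2, 1 };
    for (int i = 0; i < 8; ++i) {
        assert(reflect_index(i, /*n=*/4, /*p0=*/2) == expect[i]);
    }
    return 0;
}
```
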
* add conv3d

* bump GGML_OP_COUNT
* Begin work on set_rows

* Work on set rows

* Add error buffers for reporting unsupported SET_ROWS indices

* Remove extra comments

* Work on templating for different types in shaders

* Work on shader type generation

* Working q4_0 mul_mat and some templating for different types

* Add q4_0_f16 matmul and fix device init

* Add matmul support for basic quantization types

* Add q2_k and q3_k quantization

* Add rest of k-quants

* Get first i-quant working

* Closer to supporting all i-quants

* Support rest of i-quants

* Cleanup code

* Fix python formatting

* debug

* Bugfix for memset

* Add padding to end of buffers on creation

* Simplify bit-shifting

* Update usage of StringView
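
The bit-shifting referred to above unpacks two 4-bit quants per byte. The standard ggml q4_0 block is 32 values sharing one scale, with the low nibbles holding the first half of the block and the high nibbles the second half; a CPU-side sketch (the real format stores the scale as fp16, simplified to float here):

```cpp
#include <cstdint>
#include <cstddef>

constexpr int QK4_0 = 32;

struct block_q4_0_sketch {
    float   d;             // scale (fp16 in the real format)
    uint8_t qs[QK4_0 / 2]; // two 4-bit quants packed per byte
};

static void dequantize_q4_0(const block_q4_0_sketch *x, float *y, size_t nblocks) {
    for (size_t i = 0; i < nblocks; ++i) {
        const float d = x[i].d;
        for (int j = 0; j < QK4_0 / 2; ++j) {
            const int q_lo = (x[i].qs[j] & 0x0F) - 8; // low nibble, first half
            const int q_hi = (x[i].qs[j] >> 4)   - 8; // high nibble, second half
            y[i * QK4_0 + j]             = q_lo * d;
            y[i * QK4_0 + j + QK4_0 / 2] = q_hi * d;
        }
    }
}
```
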
…/15427)

- Spread the work across the whole workgroup. The benefit of using more
threads seems to far outweigh the synchronization overhead.
- Specialize the code for when the division is by a power of two.
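
When the divisor is a power of two, the division and modulo collapse to a shift and a mask; a shader would pick this specialization at pipeline-creation time, once the divisor is known. A minimal sketch of the idea:

```cpp
#include <cassert>
#include <cstdint>

static inline bool is_pow2(uint32_t d) { return d != 0 && (d & (d - 1)) == 0; }

// log2_d must be the exponent when d is a power of two.
static uint32_t div_maybe_pow2(uint32_t n, uint32_t d, uint32_t log2_d) {
    return is_pow2(d) ? (n >> log2_d) : (n / d);
}

static uint32_t mod_maybe_pow2(uint32_t n, uint32_t d) {
    return is_pow2(d) ? (n & (d - 1)) : (n % d);
}

int main() {
    assert(div_maybe_pow2(100, 16, 4) == 6);
    assert(mod_maybe_pow2(100, 16)    == 4);
    return 0;
}
```
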
* vulkan : support ggml_mean

* vulkan : support sum, sum_rows and mean with non-contiguous tensors

* vulkan : fix subbuffer size not accounting for misalign offset

* tests : add backend-op tests for non-contiguous sum_rows

* cuda : require contiguous src for SUM_ROWS, MEAN support
* sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support

* require ggml_contiguous_rows in supports_op and expect nb00=1 in the shader
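
Requiring contiguous rows is weaker than full contiguity: each row must be densely packed, while higher dimensions may still be strided. A sketch of the check with simplified fields standing in for ggml's ne/nb bookkeeping:

```cpp
#include <cstdint>
#include <cstddef>

struct tensor_sketch {
    int64_t ne[4];     // shape
    size_t  nb[4];     // strides in bytes
    size_t  type_size; // bytes per element
};

// Rows are contiguous iff the innermost stride equals the element size;
// this is what lets a shader assume nb00 == 1 element.
static bool has_contiguous_rows(const tensor_sketch &t) {
    return t.nb[0] == t.type_size;
}
```
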
…llama/15489)

Track a list of nodes that need synchronization, and only sync if the new node
depends on them (or overwrites them). This allows some overlap which can
improve performance, and centralizes a big chunk of the synchronization logic.

The remaining synchronization logic involves writes to memory other than the
nodes, e.g. for dequantization or split_k. Each of these allocations has a bool
indicating whether they were in use and need to be synced. This should be
checked before they are written to, and set to true after they are done being
consumed.
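
A minimal model of that tracking logic, with invented names: keep the nodes written since the last barrier in a list, flush when a newly submitted node reads or overwrites one of them, and add the new node's output to the pending list afterwards.

```cpp
#include <algorithm>
#include <vector>

struct node; // stand-in for a graph node

struct sync_tracker_sketch {
    std::vector<const node *> unsynced; // written since the last barrier

    bool needs_sync(const std::vector<const node *> &srcs, const node *dst) const {
        auto pending = [&](const node *n) {
            return std::find(unsynced.begin(), unsynced.end(), n) != unsynced.end();
        };
        return pending(dst) || std::any_of(srcs.begin(), srcs.end(), pending);
    }

    void on_submit(const std::vector<const node *> &srcs, const node *dst) {
        if (needs_sync(srcs, dst)) {
            // a real backend records a pipeline barrier here
            unsynced.clear();
        }
        unsynced.push_back(dst); // dst's output now awaits a barrier
    }
};
```
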
…le SMs (llama/15281)

* vulkan: optimize rms_norm, and allow the work to spread across multiple SMs

There are really two parts to this change:
(1) Some optimizations similar to what we have in soft_max, to unroll with
different numbers of iterations.
(2) A fusion optimization where we detect add followed by rms_norm, and make
the add shader atomically accumulate the values^2 into memory. Then the
rms_norm shader can just load that sum. This allows the rms_norm to be
parallelized across multiple workgroups; it just becomes a simple per-element
multiply.

The fusion optimization is currently only applied when the rms_norm is on a
single vector. This previously always ran on a single SM. It could apply more
broadly, but when there are other dimensions the work can already spread across
SMs, and there would be some complexity to tracking multiple atomic sums.

* Change add+rms_norm optimization to write out an array of partial sums
rather than using atomic add, to make it deterministic. The rms_norm
shader fetches a subgroup's worth in parallel and uses subgroupAdd to
add them up (a CPU-side model of this scheme follows below).

* complete rebase against fused adds - multi_add shader can also compute partial sums

* fix validation errors

* disable add_rms_fusion for Intel due to possible driver bug

* resolve against #15489, sync after clearing partial sums
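
A CPU-side model of the deterministic partial-sum scheme (assumed structure, not the shaders themselves): the fused add pass writes one partial sum of squares per workgroup-sized chunk, and the rms_norm pass reduces those partials instead of re-reading the whole vector.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Pass 1, modeling the fused add shader: z = a + b, plus one partial sum of
// z^2 per chunk. On the GPU each chunk is a workgroup; writing per-chunk
// partials instead of one atomically-updated float keeps results deterministic.
static void add_with_partials(const std::vector<float> &a, const std::vector<float> &b,
                              std::vector<float> &z, std::vector<float> &partials,
                              size_t chunk) {
    partials.assign((z.size() + chunk - 1) / chunk, 0.0f);
    for (size_t i = 0; i < z.size(); ++i) {
        z[i] = a[i] + b[i];
        partials[i / chunk] += z[i] * z[i];
    }
}

// Pass 2, modeling the rms_norm shader: reduce the partials (subgroupAdd on
// the GPU), then apply the simple per-element multiply.
static void rms_norm_from_partials(std::vector<float> &z,
                                   const std::vector<float> &partials, float eps) {
    float sum = 0.0f;
    for (float p : partials) sum += p;
    const float scale = 1.0f / std::sqrt(sum / (float) z.size() + eps);
    for (float &v : z) v *= scale;
}
```
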
* vulkan: workaround MoltenVK compile failure in multi_add

* Update src/ggml-vulkan/vulkan-shaders/multi_add.comp

Co-authored-by: 0cc4m <[email protected]>
The scalar FA shader already handled multiples of 8. The coopmat1 FA
shader assumed 16x16x16 and the shared memory allocations need the HSK
dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation
requires multiples of 16 for N and K, and needs the matrix dimensions
padded and loads clamped.

Store the FA pipelines in a map, indexed by the pipeline state.
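
Keying the pipeline cache on the state might look like the following sketch (field names invented; the real state also covers types and other variant knobs):

```cpp
#include <cstdint>
#include <map>
#include <tuple>

struct vk_pipeline_handle; // opaque stand-in for the real pipeline type

// State that selects an FA pipeline variant. Head sizes are padded to the
// granularity each path needs (multiples of 8 for scalar, 16 for coopmat)
// before the key is built.
struct fa_pipeline_state {
    uint32_t hsk, hsv; // padded K/V head sizes
    bool     small_rows;
    bool     aligned;

    bool operator<(const fa_pipeline_state &o) const {
        return std::tie(hsk, hsv, small_rows, aligned) <
               std::tie(o.hsk, o.hsv, o.small_rows, o.aligned);
    }
};

// One lazily created pipeline per distinct state.
static std::map<fa_pipeline_state, vk_pipeline_handle *> fa_pipelines;
```
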
… (llama/15524)

* vulkan: use subgroup function for mul_mat_id shader even without coopmat

* vulkan: fix compile warnings

* vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id

* vulkan: disable subgroup mul_mat_id on devices with subgroups < 16
s-goto-11 and others added 27 commits September 5, 2025 12:54
* SVE support for exponential functions

Add const notation to variable pg

* Update src/ggml-cpu/vec.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Add const

---------

Co-authored-by: Georgi Gerganov <[email protected]>
This is a missing interaction between #15546 and #15652
* vulkan: use memory budget extension to read memory usage

* fix: formatting and names

* formatting

* fix: detect and cache memory budget extension availability on init

* fix: read `budgetprops.heapBudget` instead of `heap.size` when memory budget extension is available

* style: lints
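
Reading the per-heap budget follows the standard Vulkan pattern of chaining the EXT struct into vkGetPhysicalDeviceMemoryProperties2. A sketch, assuming the availability check from the commits above already passed:

```cpp
#include <vulkan/vulkan.h>
#include <cstdio>

static void print_heap_budgets(VkPhysicalDevice phys) {
    VkPhysicalDeviceMemoryBudgetPropertiesEXT budget = {};
    budget.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_BUDGET_PROPERTIES_EXT;

    VkPhysicalDeviceMemoryProperties2 props2 = {};
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_PROPERTIES_2;
    props2.pNext = &budget; // chain the EXT struct in

    vkGetPhysicalDeviceMemoryProperties2(phys, &props2);

    for (uint32_t i = 0; i < props2.memoryProperties.memoryHeapCount; ++i) {
        // heapBudget: what this process may allocate; heap.size: raw heap size
        std::printf("heap %u: budget %llu / size %llu (usage %llu)\n", i,
                    (unsigned long long) budget.heapBudget[i],
                    (unsigned long long) props2.memoryProperties.memoryHeaps[i].size,
                    (unsigned long long) budget.heapUsage[i]);
    }
}
```
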
…/15712)

* [CANN] Support eager execution mode under ACL graph compilation

Add support for running operators in eager mode while ACL graph
compilation is enabled. This allows bypassing graph execution
and directly submitting ops, which is useful for debugging and
reducing graph build overhead in certain scenarios.

Signed-off-by: noemotiovon <[email protected]>

* fix typo

Signed-off-by: noemotiovon <[email protected]>

* rename to acl_graph_mode

Signed-off-by: noemotiovon <[email protected]>

---------

Signed-off-by: noemotiovon <[email protected]>
Previously, the slope tensor was set to fp16 to improve efficiency.
While this worked correctly in FA, it caused precision issues in soft_max.
This change applies different data types for different operators
to balance both accuracy and performance.
CANN currently does not support kernels larger than 255.
This change disables such cases.
* ggml-cpu : optimize rvv ggml_vec_dot_f32

* ggml-cpu : optimize 128-bit rvv ggml_vec_dot_q4_K_q8_K

* ggml-cpu : fix riscv arch flags

* ggml-cpu : add more rvv ops

* ggml-cpu : optimize rvv ggml_vec_dot_q4_K_q8_K

* ggml-cpu : optimize rvv ggml_vec_dot_q6_K_q8_K

* ggml-cpu : minor rvv adjustments

* ggml-cpu : fix riscv include
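
The f32 dot product maps naturally onto the RVV 1.0 intrinsics: strip-mine with vsetvl, accumulate with vfmacc, and do a single horizontal reduction at the end. A hedged sketch; the actual ggml kernel differs in unrolling and LMUL choice:

```cpp
#include <riscv_vector.h>
#include <cstddef>

static float vec_dot_f32_rvv(const float *x, const float *y, size_t n) {
    const size_t vlmax = __riscv_vsetvlmax_e32m8();
    vfloat32m8_t acc = __riscv_vfmv_v_f_f32m8(0.0f, vlmax);
    for (size_t i = 0; i < n;) {
        const size_t vl = __riscv_vsetvl_e32m8(n - i);
        vfloat32m8_t vx = __riscv_vle32_v_f32m8(x + i, vl);
        vfloat32m8_t vy = __riscv_vle32_v_f32m8(y + i, vl);
        // tail-undisturbed (_tu) so a short final iteration cannot clobber
        // the accumulator elements past vl
        acc = __riscv_vfmacc_vv_f32m8_tu(acc, vx, vy, vl);
        i += vl;
    }
    // horizontal sum of the whole m8 accumulator
    vfloat32m1_t zero = __riscv_vfmv_v_f_f32m1(0.0f, 1);
    vfloat32m1_t sum  = __riscv_vfredusum_vs_f32m8_f32m1(acc, zero, vlmax);
    return __riscv_vfmv_f_s_f32m1_f32(sum);
}
```
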
…-6% perf E2E (llama/15715)

* Add fastdiv, use it in modulo and use modulo in rms_norm_f32

Fastdiv is a much faster way to do integer division, which was identified
as a bottleneck in rms_norm_f32 (a portable sketch of the trick follows this
commit series).

* Support more `block_size` values in `rms_norm_f32`

This makes us more flexible in selecting the optimal number of threads w.r.t.
parallelizing across a column vs. the launch overhead of threads and MIO
throttles

* Update src/ggml-cuda/common.cuh

Co-authored-by: Johannes Gäßler <[email protected]>

* Replace modulo with fastmodulo in `rms_norm_f32`

* Use `BinPackArguments=true` for formatting function calls

Will file a separate PR to adjust .clang-format file

* Update src/ggml-cuda/common.cuh

Co-authored-by: Johannes Gäßler <[email protected]>

* Use uint3 for both `fastdiv` and `fastmodulo`

The compiler seems to reliably optimize away the unused .z component in
the fastdiv use-case, see https://godbolt.org/z/rx8KPrKr3

* More constrained type declarations

Co-authored-by: Johannes Gäßler <[email protected]>

* Rename fastdiv and fastmodulo variables to shared variable name

As suggested by JohannesGaessler, this increases clarity of the intended
use

* Pack fastdiv/fastmodulo constants into uint2/uint3 objects

By packing constants to be used together into a struct, we are less
likely to make errors.

* Rename function parameter of fastmodulo

`modulo_consts` is more fitting/descriptive

---------

Co-authored-by: Johannes Gäßler <[email protected]>
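
For reference, the trick precomputes a magic multiplier and shift on the host so the kernel replaces each division by a runtime-constant divisor with a multiply-high and a shift. A portable sketch of the same (mp, L, divisor) packing idea, with 64-bit host math standing in for CUDA's __umulhi:

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

struct fastdiv_consts { uint32_t mp, L, d; }; // plays the role of the packed uint3

// Granlund-Montgomery style precomputation: L with 2^L >= d, then
// mp = floor(2^32 * (2^L - d) / d) + 1.
static fastdiv_consts init_fastdiv(uint32_t d) {
    uint32_t L = 0;
    while (L < 32 && (uint64_t{1} << L) < d) ++L;
    const uint32_t mp =
        (uint32_t)(((uint64_t{1} << 32) * ((uint64_t{1} << L) - d)) / d + 1);
    return { mp, L, d };
}

// On the device this is __umulhi(n, c.mp) plus an add and shift; the add is
// widened to 64 bits here because hi + n can exceed 32 bits.
static uint32_t fastdiv(uint32_t n, fastdiv_consts c) {
    const uint32_t hi = (uint32_t)(((uint64_t) n * c.mp) >> 32);
    return (uint32_t)(((uint64_t) hi + n) >> c.L);
}

static uint32_t fastmodulo(uint32_t n, fastdiv_consts c) {
    return n - fastdiv(n, c) * c.d; // one multiply instead of a hardware %
}

int main() {
    for (uint32_t d : { 1u, 3u, 7u, 96u, 1000u }) {
        const fastdiv_consts c = init_fastdiv(d);
        for (uint32_t n : { 0u, 1u, 95u, 12345u, 4294967295u }) {
            assert(fastdiv(n, c) == n / d && fastmodulo(n, c) == n % d);
        }
    }
    return 0;
}
```
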
* vulkan : update ggml_vk_instance_validation_ext_available

This commit updates ggml_vk_instance_validation_ext_available() to
check for VK_EXT_validation_features instead of
VK_KHR_portability_enumeration.

Based on how the returned boolean is used later in the code (to enable
both the validation layer and the VK_EXT_validation_features extension),
it appears the function may have been intended to check for the
validation layer features extension.

* remove try/catch

This was a leftover from a previous iteration where I was explicitly
querying for a specific validation layer first, which would throw.

* update warning message about validation layers
…e (llama/15724)

* vulkan: don't use std::string in load_shaders, to improve compile time

* keep the string version for those calls that use it
Fixes #15330

Adjust the allocation size of acl_rstd. The parameter `dims` is set to 3 according to the CANN documentation.

Co-authored-by: Yuchuan <[email protected]>
* add conv3d support

* add ggml_pad_ext for cpu & cuda backend

* cuda/cpu: add im2col_3d support

* cuda: make im2col a little faster

* fix cuda pad/scale/im2col3d

* make im2col_3d faster

* gguf: support loading tensors which n_dims > GGML_MAX_DIMS

* fix cuda get_rows

* avoid ggml_conv_3d conflict

* correct GGML_OP_COUNT assertion

* avoid build failure

* avoid build failure on MacOS

* cuda: remove unnecessary MIN define

* fix cpu im2col_3d

* adjust the code style

* cuda: use simpler loop in get_rows

* add test_im2col_3d to test-backend-ops

* test-backend-ops.cpp: remove trailing whitespace

* cpu: im2col_3d support for non-contiguous src

Co-authored-by: Jeff Bolz <[email protected]>

* fix test_im2col_3d

* remove unused variables

* cuda: get_rows: dfloat2 -> float2

* add test_pad_ext to test-backend-ops.cpp

* add gguf_init_from_file_ext impl

* Revert "gguf: support loading tensors which n_dims > GGML_MAX_DIMS"

This reverts commit d8377a0a37f314bd3713fe043b4333ad661610c1.

* Revert "add gguf_init_from_file_ext impl"

This reverts commit d9f1d13208c68ef83b3538201ac7f31614fb1994.

* update ggml_backend_vk_device_supports_op

* fix ggml_backend_vk_device_supports_op

* update other backend supports op for ggml_pad_ext

* metal/opencl/sycl/vulkan: fix GGML_OP_PAD check in supports_op

---------

Co-authored-by: Jeff Bolz <[email protected]>
* CANN:Refactor ND to NZ workspace to be per-device in Ascend backend

- Replaced the previous single global ND→NZ workspace with a per-device
  cache using unordered_map keyed by device ID.
- Functions `release_nz_workspace`, `relloc_nz_workspace`, and
  `get_nz_workspace` now manage workspace independently for each device,
  preventing memory conflicts in multi-device / pipeline parallel scenarios.
- This change fixes potential precision issues caused by workspace
  overwrites when multiple devices perform ND→NZ conversions concurrently.

Co-authored-by: hipudding <[email protected]>

* refactor

Signed-off-by: noemotiovon <[email protected]>

* rename

Signed-off-by: noemotiovon <[email protected]>

* fix review comments

Signed-off-by: noemotiovon <[email protected]>

---------

Signed-off-by: noemotiovon <[email protected]>
Co-authored-by: hipudding <[email protected]>
…a/15799)

Branch: GGMLMetalNE20

Signed-off-by: Gabe Goodhart <[email protected]>
@ggerganov
Member Author

Hm, this error has never occurred before when doing rebase + merge:

[screenshot of the CI error omitted]

@ggerganov ggerganov merged commit 5fdc78f into master Sep 5, 2025
6 of 15 checks passed
@ggerganov ggerganov deleted the sync-llama.cpp-25-09-05 branch September 5, 2025 10:32