Merged

104 commits
b95e983
ggml-cpu: add mxfp4 VSX intrinsics for Power9+ (ppc64le) hardware (ll…
mgiessing Aug 19, 2025
168ed5d
musa: handle __hgt2_mask, available starting from MUSA SDK rc4.3.0 (l…
yeahdongcn Aug 19, 2025
55cf85f
CANN: optimize rope operator (llama/15335)
YangShuai52 Aug 19, 2025
90006fc
opencl: mark `argsort` unsupported if cols exceed workgroup limit (ll…
lhez Aug 19, 2025
d62570a
musa: fix build warnings (llama/15258)
yeahdongcn Aug 20, 2025
b70d703
vulkan: shorten pipeline name strings (llama/15431)
jeffbolznv Aug 20, 2025
29b7096
CUDA: replace GGML_CUDA_F16 with CUDA arch checks (llama/15433)
JohannesGaessler Aug 20, 2025
e6d142f
CUDA: refactor FA support/selection code (llama/15454)
JohannesGaessler Aug 20, 2025
d949a9b
sched : copy only the used experts when offloading prompt processing …
slaren Aug 20, 2025
44b2119
musa: add GGML_UNUSED_VARS (llama/15446)
yeahdongcn Aug 21, 2025
8132ca7
ggml : fix condition of im2col on Metal backend (llama/15460)
ngxson Aug 21, 2025
960f526
vulkan: Reuse conversion results in prealloc_y (llama/15410)
jeffbolznv Aug 21, 2025
783b3d7
vulkan: add exp operation (llama/15456)
ddwkim Aug 21, 2025
ac98307
vulkan : support conv_2d_dw with f16 weights (llama/15392)
Acly Aug 21, 2025
0c0e8ec
sched : fix possible use of wrong ids tensor when offloading moe prom…
slaren Aug 21, 2025
216254c
CANN: Optimize RMS_NORM using cache (llama/15419)
noemotiovon Aug 22, 2025
135ff30
ggml-cpu: Support Q5_0 and Q5_1 on s390x (llama/15486)
taronaeo Aug 22, 2025
07a7732
cuda : add Pad Reflect 1D support (llama/14659)
YavorGIvanov Aug 22, 2025
e18c249
ggml: add `conv3d` op (llama/15182)
rmatif Aug 22, 2025
294bc5f
ggml WebGPU: add support for quantization types (llama/15440)
reeselevine Aug 22, 2025
8998452
test-opt: allow slight inprecision (llama/15503)
JohannesGaessler Aug 22, 2025
b238604
vulkan: optimize mul_mat_id loading row ids into shared memory (llama…
jeffbolznv Aug 23, 2025
62e8d53
vulkan : support ggml_mean (llama/15393)
Acly Aug 23, 2025
0791daa
vulkan: Rewrite synchronization to allow some overlap between nodes (…
jeffbolznv Aug 23, 2025
4a0fa11
vulkan: optimize rms_norm, and allow the work to spread across multip…
jeffbolznv Aug 23, 2025
a61ecaa
CUDA: fix half2 -> half conversion for HIP (llama/15529)
JohannesGaessler Aug 23, 2025
051c0e7
vulkan: workaround MoltenVK compile failure in multi_add (llama/15506)
jeffbolznv Aug 24, 2025
0bbf8a2
vulkan: enable Conv2D for Apple after MoltenVK fixed the bug (llama/1…
0cc4m Aug 24, 2025
11333c2
vulkan: Support FA with any multiple of 8 head sizes (llama/15537)
jeffbolznv Aug 24, 2025
b7719fd
vulkan: apply MUL_MAT_ID subgroup optimization to non-coopmat devices…
0cc4m Aug 24, 2025
5725598
CANN: ROPE cache sin/cos repeat (llama/15501)
noemotiovon Aug 25, 2025
1eb4bc7
metal : add FA kernels for HS=40 (llama/15559)
ggerganov Aug 25, 2025
06034b3
CUDA: MoE helper in device code, better tile sizes (llama/15525)
JohannesGaessler Aug 25, 2025
c1cd573
metal: fix regression when no metal devices are present (llama/15531)
booxter Aug 25, 2025
559d573
tests: Generate unique input values for count_equal (llama/15487)
jeffbolznv Aug 25, 2025
0a2544f
vulkan: fix min subgroup 16 condition for mmid subgroup optimization …
0cc4m Aug 25, 2025
d9f431b
opencl: fix support ops condition for `rms_norm` (llama/15560)
lhez Aug 25, 2025
8e80c1d
CUDA: Accelerate MXFP4 table lookup using `__byte_perm` (llama/15451)
Qeeweew Aug 25, 2025
8d0b474
vulkan: Remove splitting for mul_mat_id (llama/15568)
jeffbolznv Aug 26, 2025
8f27427
Add a warning for special devices (llama/15563)
pt13762104 Aug 26, 2025
4cbe48d
metal : remove contiguous assertion for src0 in IM2COL (llama/15577)
CISC Aug 26, 2025
9c38cee
metal : improve `MUL_MAT_ID` (llama/15541)
ggerganov Aug 26, 2025
5179375
metal : optimize FA vec for large sequences and BS <= 8 (llama/15566)
ggerganov Aug 26, 2025
ba765b1
CUDA: return -1 for nonexistent compiled arch (llama/15587)
JohannesGaessler Aug 26, 2025
c64006b
llamafile: PowerPC Sgemm Optimization (llama/15558)
shalinib-ibm Aug 26, 2025
3eeaf08
tests: add performance test for mul mat id (llama/15543)
netrunnereve Aug 26, 2025
7436523
SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (…
qnixsynapse Aug 26, 2025
de5c944
tests : fix test-opt with GGML_BACKEND_DL (llama/15599)
slaren Aug 26, 2025
1e09ac9
OpenCL: add fused group_norm/norm, mul, add (llama/15314)
rmatif Aug 27, 2025
d3cff36
ggml-cpu : add basic RVV support for vector f32 ops (llama/15057)
xctan Aug 27, 2025
ce3d179
CANN: refactor mask handling and improve performance in FA (llama/15561)
noemotiovon Aug 27, 2025
02f5fdf
HIP: Enable support for ggml_backend_cuda_register_host_buffer (llama…
IMbackK Aug 27, 2025
1a88e98
cuda: Add cublasLt_static linking when GGML_STATIC is enabled (llama/…
matiaslin Aug 28, 2025
92b6797
kv-cache : remove LLAMA_SET_ROWS checks (llama/15505)
ggerganov Aug 28, 2025
e4556fe
ggml : fix SSM_SCAN for n_groups > 1 (llama/15625)
compilade Aug 28, 2025
741d8b7
ggml-cpu: fix invalid hsum build in debug s390x (llama/15634)
taronaeo Aug 28, 2025
70d536e
CUDA: add conv2d (llama/15635)
mnehete32 Aug 28, 2025
81415da
CUDA: fuse adds, fuse add with rms norm (llama/15631)
am17an Aug 29, 2025
e9c39f8
CUDA: fix bug in rms_norm fusion (llama/15660)
am17an Aug 29, 2025
2584f3b
CANN: FIx compiler warnings (llama/15661)
noemotiovon Aug 30, 2025
5c41cf0
vulkan: Skip syncing for prealloc_y when it is reused (llama/15544)
jeffbolznv Aug 30, 2025
0e4d68d
CUDA: use FP32 arithmetic for conv2d (llama/15683)
JohannesGaessler Aug 30, 2025
d0a45a6
llama: use FA + max. GPU layers by default (llama/15434)
JohannesGaessler Aug 30, 2025
49135c8
ggml: update kleidiai to v1.13.0 (llama/15663)
chaxu01 Aug 30, 2025
bb60a58
vulkan: clamp matmul and FA results to the max finite value (llama/15…
jeffbolznv Aug 31, 2025
4d0ed4f
vulkan: Allow fallback to sysmem memory when vidmem is full (llama/15…
jeffbolznv Aug 31, 2025
d58ab24
vulkan : remove unused portability_enumeration_ext variable (llama/15…
danbev Aug 31, 2025
26580b6
vulkan: mul_mat_id coopmat2 optimizations (llama/15546)
jeffbolznv Aug 31, 2025
cd2cdfd
vulkan: handle large sizes for get_rows (llama/15686)
jeffbolznv Aug 31, 2025
319bf93
llama : separate compute buffer reserve from fattn check (llama/15696)
slaren Aug 31, 2025
bcfb7b4
metal : fix checks for available FA kernels (llama/15700)
ggerganov Aug 31, 2025
79bfa5a
CANN: fix RoPE cache issue on multi-device (llama/15629)
hipudding Sep 1, 2025
886e6e6
CANN: Optimize MUL_MAT_ID (llama/15658)
hipudding Sep 1, 2025
a2a9beb
CUDA: fix build error from ambiguous __half conversions in conv2d (ll…
qnixsynapse Sep 1, 2025
eb50f30
ggml : WebGPU add TRANSPOSE and RESHAPE to supported ops (llama/15695)
danbev Sep 1, 2025
68fd8cc
Vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants …
0cc4m Sep 1, 2025
e260ebe
ggml: aarch64: Implement SVE F16 kernels for vector functions (llama/…
Vithulep Sep 1, 2025
bfd4fe3
ggml: SVE support for exponential functions (llama/15145)
s-goto-11 Sep 1, 2025
9c2f7b1
vulkan: disable large mmv subgroups on older Nvidia GPUs (llama/15717)
0cc4m Sep 1, 2025
2d57866
vulkan: add missing clamps in new mul_mat_id paths (llama/15702)
jeffbolznv Sep 1, 2025
cc62cd9
vulkan: use memory budget extension to read memory usage (llama/15545)
giladgd Sep 1, 2025
905a1e1
ggml-backend: raise GGML_MAX_SPLIT_INPUTS (llama/15722)
JohannesGaessler Sep 1, 2025
3664722
CANN: Support ext_factor in rope (llama/15710)
hipudding Sep 2, 2025
86c03b3
CANN: Support eager execution mode under ACL graph compilation (llama…
noemotiovon Sep 2, 2025
7beb9ad
opencl: add attn sinks support for FA kernels (llama/15706)
rmatif Sep 2, 2025
3f53dda
vulkan: Fix macro parameter order for f32 matmul shaders (llama/15716)
jeffbolznv Sep 2, 2025
0db3a9b
CANN: Resolve soft_max precision issue (llama/15730)
hipudding Sep 2, 2025
f71dd1f
vulkan: fix shaders gen when no integer dot is available (llama/15740)
0cc4m Sep 2, 2025
2d1041a
CANN: Fix type float_t to float (llama/15736)
noemotiovon Sep 3, 2025
0ba2424
CANN: Mask unsupported TRANSPOSE_1D operator (llama/15733)
hipudding Sep 3, 2025
608cdc0
ggml-cpu : optimize RVV kernels (llama/15720)
xctan Sep 3, 2025
ae537b1
CANN: Add RoPE contiguous check for 310I DUP device (llama/15735)
hipudding Sep 3, 2025
5845048
CUDA: Optimize `rms_norm_f32` kernel and its fused variants, giving 1…
ORippler Sep 3, 2025
7a27944
ggml vulkan: add hardsigmoid and hardswish operations (llama/15762)
relent95 Sep 3, 2025
cd5455e
vulkan : update ggml_vk_instance_validation_ext_available (llama/15666)
danbev Sep 3, 2025
2217478
vulkan: don't use std::string in load_shaders, to improve compile tim…
jeffbolznv Sep 3, 2025
2600727
vulkan: fix mmv subgroup16 selection (llama/15775)
0cc4m Sep 3, 2025
fd4a465
CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (llama/15760)
noemotiovon Sep 4, 2025
7d7e9a1
opencl: add hs=40 to FA (llama/15758)
rmatif Sep 4, 2025
3bc53e5
CANN: Fix precision issue on 310I DUO multi-devices (llama/15784)
hipudding Sep 4, 2025
0e2040f
ggml: add ops for WAN video model (cuda && cpu) (llama/15669)
leejet Sep 4, 2025
2df7473
CANN: Refactor ND to NZ workspace to be per-device (llama/15763)
noemotiovon Sep 4, 2025
5cee58c
metal : Add template specialization for mul_mm_id w/ ne20 == 10 (llam…
gabe-l-hart Sep 4, 2025
5fdc78f
sync : llama.cpp
ggerganov Sep 5, 2025
7 changes: 4 additions & 3 deletions CMakeLists.txt

@@ -1,5 +1,5 @@
 cmake_minimum_required(VERSION 3.14) # for add_link_options and implicit target directories.
-project("ggml" C CXX)
+project("ggml" C CXX ASM)
 include(CheckIncludeFileCXX)
 
 set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
@@ -129,7 +129,9 @@ endif()
 option(GGML_LASX "ggml: enable lasx" ON)
 option(GGML_LSX "ggml: enable lsx" ON)
 option(GGML_RVV "ggml: enable rvv" ON)
-option(GGML_RV_ZFH "ggml: enable riscv zfh" OFF)
+option(GGML_RV_ZFH "ggml: enable riscv zfh" ON)
+option(GGML_RV_ZVFH "ggml: enable riscv zvfh" ON)
+option(GGML_RV_ZICBOP "ggml: enable riscv zicbop" ON)
 option(GGML_XTHEADVECTOR "ggml: enable xtheadvector" OFF)
 option(GGML_VXE "ggml: enable vxe" ON)
 option(GGML_NNPA "ggml: enable nnpa" OFF) # temp disabled by default, see: https://github.com/ggml-org/llama.cpp/issues/14877
@@ -158,7 +160,6 @@ option(GGML_CUDA "ggml: use CUDA"
 option(GGML_MUSA "ggml: use MUSA" OFF)
 option(GGML_CUDA_FORCE_MMQ "ggml: use mmq kernels instead of cuBLAS" OFF)
 option(GGML_CUDA_FORCE_CUBLAS "ggml: always use cuBLAS instead of mmq kernels" OFF)
-option(GGML_CUDA_F16 "ggml: use 16 bit floats for some calculations" OFF)
 set (GGML_CUDA_PEER_MAX_BATCH_SIZE "128" CACHE STRING
     "ggml: max. batch size for using peer access")
 option(GGML_CUDA_NO_PEER_COPY "ggml: do not use peer to peer copies" OFF)
3 changes: 3 additions & 0 deletions include/ggml-backend.h

@@ -307,6 +307,9 @@ extern "C" {
     GGML_API void ggml_backend_sched_set_tensor_backend(ggml_backend_sched_t sched, struct ggml_tensor * node, ggml_backend_t backend);
     GGML_API ggml_backend_t ggml_backend_sched_get_tensor_backend(ggml_backend_sched_t sched, struct ggml_tensor * node);
 
+    // Split graph without allocating it
+    GGML_API void ggml_backend_sched_split_graph(ggml_backend_sched_t sched, struct ggml_cgraph * graph);
+
     // Allocate and compute graph on the backend scheduler
     GGML_API bool ggml_backend_sched_alloc_graph(ggml_backend_sched_t sched, struct ggml_cgraph * graph); // returns success
     GGML_API enum ggml_status ggml_backend_sched_graph_compute(ggml_backend_sched_t sched, struct ggml_cgraph * graph);
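
The new entry point exposes the scheduler's split step on its own, so a caller can partition a graph across backends before committing any compute-buffer memory. A minimal usage sketch (assuming `sched` and `graph` were created with the existing scheduler APIs; error handling omitted):

    // Partition the graph across backends without allocating compute buffers.
    ggml_backend_sched_split_graph(sched, graph);

    // Allocation and compute then proceed exactly as before.
    if (ggml_backend_sched_alloc_graph(sched, graph)) {
        ggml_backend_sched_graph_compute(sched, graph);
    }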
74 changes: 74 additions & 0 deletions include/ggml.h

@@ -244,6 +244,13 @@
 #define GGML_MROPE_SECTIONS 4
 
 #define GGML_UNUSED(x) (void)(x)
+#ifdef __CUDACC__
+template<typename... Args>
+__host__ __device__ constexpr inline void ggml_unused_vars_impl(Args&&...) noexcept {}
+#define GGML_UNUSED_VARS(...) ggml_unused_vars_impl(__VA_ARGS__)
+#else
+#define GGML_UNUSED_VARS(...) do { (void)sizeof((__VA_ARGS__, 0)); } while(0)
+#endif // __CUDACC__
 
 #define GGML_PAD(x, n) (((x) + (n) - 1) & ~((n) - 1))
 
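The macro marks several variables as intentionally unused in one statement: under NVCC it forwards them to a constexpr no-op function, elsewhere it folds them into an unevaluated sizeof expression. A hedged sketch of the intended use (the function, flag, and parameters below are hypothetical, not part of this PR):

    static void compute_stub(const float * src, float * dst, int n) {
    #ifdef MY_FEATURE_ENABLED // hypothetical configuration flag
        for (int i = 0; i < n; ++i) {
            dst[i] = src[i];
        }
    #else
        // one statement silences unused-parameter warnings for all three names
        GGML_UNUSED_VARS(src, dst, n);
    #endif
    }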
@@ -504,7 +511,9 @@ extern "C" {
     GGML_OP_CONV_TRANSPOSE_1D,
     GGML_OP_IM2COL,
     GGML_OP_IM2COL_BACK,
+    GGML_OP_IM2COL_3D,
     GGML_OP_CONV_2D,
+    GGML_OP_CONV_3D,
     GGML_OP_CONV_2D_DW,
     GGML_OP_CONV_TRANSPOSE_2D,
     GGML_OP_POOL_1D,
@@ -1862,6 +1871,41 @@ extern "C" {
         int d0, // dilation dimension 0
         int d1); // dilation dimension 1
 
+    GGML_API struct ggml_tensor * ggml_im2col_3d(
+        struct ggml_context * ctx,
+        struct ggml_tensor * a,
+        struct ggml_tensor * b,
+        int64_t IC,
+        int s0, // stride width
+        int s1, // stride height
+        int s2, // stride depth
+        int p0, // padding width
+        int p1, // padding height
+        int p2, // padding depth
+        int d0, // dilation width
+        int d1, // dilation height
+        int d2, // dilation depth
+        enum ggml_type dst_type);
+
+    // a: [OC*IC, KD, KH, KW]
+    // b: [N*IC, ID, IH, IW]
+    // result: [N*OC, OD, OH, OW]
+    GGML_API struct ggml_tensor * ggml_conv_3d(
+        struct ggml_context * ctx,
+        struct ggml_tensor * a,
+        struct ggml_tensor * b,
+        int64_t IC,
+        int s0, // stride width
+        int s1, // stride height
+        int s2, // stride depth
+        int p0, // padding width
+        int p1, // padding height
+        int p2, // padding depth
+        int d0, // dilation width
+        int d1, // dilation height
+        int d2  // dilation depth
+    );
+
     // kernel size is a->ne[0] x a->ne[1]
     // stride is equal to kernel size
     // padding is zero
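
A sketch of wiring up the new 3D convolution, following the shape comments in the hunk above. The concrete sizes are made up for illustration, `ctx` is assumed to be a valid `ggml_context`, and ggml's `ne` ordering puts the fastest-varying width dimension first:

    enum { N = 1, IC = 4, OC = 8, ID = 16, IH = 32, IW = 32, KD = 3, KH = 3, KW = 3 };

    // kernel: [OC*IC, KD, KH, KW], input: [N*IC, ID, IH, IW] (see comments above)
    struct ggml_tensor * kernel = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, KW, KH, KD, OC*IC);
    struct ggml_tensor * input  = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, IW, IH, ID, N*IC);

    // stride 1, zero padding, dilation 1 in all three spatial dimensions
    struct ggml_tensor * out = ggml_conv_3d(ctx, kernel, input, IC,
                                            1, 1, 1,   // s0, s1, s2
                                            0, 0, 0,   // p0, p1, p2
                                            1, 1, 1);  // d0, d1, d2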
@@ -1933,6 +1977,23 @@ extern "C" {
         int d0, // dilation dimension 0
         int d1); // dilation dimension 1
 
+    GGML_API struct ggml_tensor * ggml_conv_3d_direct(
+        struct ggml_context * ctx,
+        struct ggml_tensor * a, // kernel [KW, KH, KD, IC * OC]
+        struct ggml_tensor * b, // input [W, H, D, C * N]
+        int s0, // stride
+        int s1,
+        int s2,
+        int p0, // padding
+        int p1,
+        int p2,
+        int d0, // dilation
+        int d1,
+        int d2,
+        int n_channels,
+        int n_batch,
+        int n_channels_out);
+
     enum ggml_op_pool {
         GGML_OP_POOL_MAX,
         GGML_OP_POOL_AVG,
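
`ggml_conv_3d_direct` appears to mirror the existing `ggml_conv_2d_direct` path, computing the convolution without an im2col intermediate; the channel, batch, and output-channel counts are therefore passed explicitly rather than folded into the tensor shapes. A sketch reusing the hypothetical shapes from the previous example:

    struct ggml_tensor * out_direct = ggml_conv_3d_direct(ctx, kernel, input,
                                                          1, 1, 1,    // stride
                                                          0, 0, 0,    // padding
                                                          1, 1, 1,    // dilation
                                                          IC, N, OC); // channels, batch, out channels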
@@ -2023,6 +2084,19 @@ extern "C" {
         int p2,
         int p3);
 
+    GGML_API struct ggml_tensor * ggml_pad_ext(
+        struct ggml_context * ctx,
+        struct ggml_tensor * a,
+        int lp0,
+        int rp0,
+        int lp1,
+        int rp1,
+        int lp2,
+        int rp2,
+        int lp3,
+        int rp3
+    );
+
     // pad each dimension with reflection: [a, b, c, d] -> [b, a, b, c, d, c]
     GGML_API struct ggml_tensor * ggml_pad_reflect_1d(
         struct ggml_context * ctx,
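
`ggml_pad_ext` generalizes `ggml_pad` (which pads only at the upper end of each dimension) with an independent lower/upper padding pair per dimension. A small sketch, assuming `ctx` and some 1-D tensor `t`:

    // [a, b, c] -> [0, a, b, c, 0]: one zero on each side of dimension 0
    struct ggml_tensor * padded = ggml_pad_ext(ctx, t,
                                               1, 1,  // lp0, rp0
                                               0, 0,  // lp1, rp1
                                               0, 0,  // lp2, rp2
                                               0, 0); // lp3, rp3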
2 changes: 1 addition & 1 deletion scripts/sync-llama.last

@@ -1 +1 @@
-3007baf201e7ffcda17dbdb0335997fa50a6595b
+a81283820a466f2ace06ce4d4bc9512761f9365f