Name and Version
Current master, 80a02aa
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 4850 (ea002810)
built with cc (GCC) 14.2.1 20250128 for x86_64-pc-linux-gnu
Operating systems
Linux
GGML backends
CPU, CUDA
Hardware
NVIDIA RTX 3060 (12 GB VRAM) and an AMD Ryzen 9 7900
Models
Mistral Small 3, quantized to Q4_K_M
Problem description & steps to reproduce
llama-server segfaults in ggml_compute_forward_dup_same_cont (__memcpy_avx512_unaligned_erms) after a couple of concurrent inputs when --parallel 4 is passed. The crash does not happen when parallel processing is disabled (i.e., when the flag is removed).
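To make "a couple of concurrent inputs" concrete, here is a minimal repro sketch in C that fires four requests at the server in parallel. It assumes the server started with the gdb command in the log output below (port 8080); POST /completion with the prompt and n_predict fields is the stock llama-server API, while the prompt text and the count of four requests are arbitrary choices for this sketch.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

// send one completion request to the local llama-server and drain the reply
static void * fire(void * arg) {
    (void) arg;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return NULL; }

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(8080); // port used in the llama-server command
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) != 0) {
        perror("connect");
        close(fd);
        return NULL;
    }

    const char * body = "{\"prompt\": \"Write a haiku about segfaults.\", \"n_predict\": 64}";
    char req[512];
    snprintf(req, sizeof(req),
        "POST /completion HTTP/1.1\r\n"
        "Host: 127.0.0.1:8080\r\n"
        "Content-Type: application/json\r\n"
        "Content-Length: %zu\r\n"
        "Connection: close\r\n\r\n%s",
        strlen(body), body);
    if (write(fd, req, strlen(req)) < 0) perror("write");

    char buf[4096];
    while (read(fd, buf, sizeof(buf)) > 0) { /* drain the response */ }
    close(fd);
    return NULL;
}

int main(void) {
    // four concurrent clients, matching --parallel 4
    pthread_t th[4];
    for (int i = 0; i < 4; i++) pthread_create(&th[i], NULL, fire, NULL);
    for (int i = 0; i < 4; i++) pthread_join(th[i], NULL);
    return 0;
}

Build with cc -pthread repro.c -o repro and run it (possibly a few times) while the server is up; any HTTP client issuing parallel requests behaves the same.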
First Bad Commit
No response
Relevant log output
$ gdb --args ./build/bin/llama-server -m /ollama/data/ollama/models/blobs/sha256-102a747c137683e81d431dab05d8f2158df4ab6f162f8f9019425a43d51e0e9f --port 8080 -ngl 30 --temp 0.15 -c 20000 -ctk q4_0 -ctv q4_0 -t 12 --batch-size 512 -fa --grammar-file grammar.gbnf --n-predict 100 --no-context-shift --parallel 4
__memcpy_avx512_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:461
461 VMOVU -VEC_SIZE(%rsi, %rdx), %VMM(5)
(gdb) bt
#0 __memcpy_avx512_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:461
#1 0x00007ffff78ef70e in ggml_compute_forward_dup_same_cont (params=0x7ffefffd77b0, dst=0x55555a175410)
at /home/tmp/llamacpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:3120
#2 0x00007ffff78f40d8 in ggml_compute_forward_dup_bytes (params=0x7ffefffd77b0, dst=0x55555a175410)
at /home/tmp/llamacpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:4067
#3 0x00007ffff78f501e in ggml_compute_forward_dup (params=0x7ffefffd77b0, dst=0x55555a175410)
at /home/tmp/llamacpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:4260
#4 0x00007ffff790c3e8 in ggml_compute_forward_cpy (params=0x7ffefffd77b0, dst=0x55555a175410)
at /home/tmp/llamacpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:9611
#5 0x00007ffff791f93a in ggml_compute_forward (params=0x7ffefffd77b0, tensor=0x55555a175410)
at /home/tmp/llamacpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:14195
#6 0x00007ffff79215d8 in ggml_graph_compute_thread (data=0x55555d8445f0)
at /home/tmp/llamacpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:15203
#7 0x00007ffff7921f0c in ggml_graph_compute._omp_fn.0(void) ()
at /home/tmp/llamacpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:15478
#8 0x00007ffff7f0b637 in gomp_thread_start (xdata=<optimized out>) at /usr/src/debug/gcc/gcc/libgomp/team.c:129
#9 0x00007ffff40a370a in start_thread (arg=<optimized out>) at pthread_create.c:448
#10 0x00007ffff4127aac in __GI___clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
The faulting lines: llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c, lines 3120 to 3123 in 80a02aa.
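For context, here is a sketch of ggml_compute_forward_dup_same_cont reconstructed from current llama.cpp sources; the exact contents of lines 3120 to 3123 at 80a02aa may differ slightly, but they correspond to the memcpy call at the end of the function.

// Sketch of ggml_compute_forward_dup_same_cont() from ggml/src/ggml-cpu/ggml-cpu.c,
// reconstructed from current sources (exact contents at 80a02aa may differ).
static void ggml_compute_forward_dup_same_cont(
        const struct ggml_compute_params * params,
        struct ggml_tensor * dst) {

    const struct ggml_tensor * src0 = dst->src[0];

    GGML_ASSERT(ggml_nelements(dst) == ggml_nelements(src0));
    GGML_ASSERT(ggml_is_contiguous(dst) && ggml_is_contiguous(src0));
    GGML_ASSERT(src0->type == dst->type);

    const size_t nb0 = ggml_type_size(src0->type); // bytes per block

    const int ith = params->ith; // thread index
    const int nth = params->nth; // number of threads

    // parallelize by blocks: each thread copies one contiguous slice
    const int nk = ggml_nelements(src0)/ggml_blck_size(src0->type);
    const int dr = (nk + nth - 1) / nth; // blocks per thread, rounded up
    const int k0 = dr * ith;
    const int k1 = MIN(k0 + dr, nk);

    if (k0 < k1) {
        // frame #1 of the backtrace faults inside this memcpy
        memcpy(
            ((char *)  dst->data + k0*nb0),
            ((char *) src0->data + k0*nb0),
            (k1 - k0) * nb0);
    }
}

If this matches 80a02aa, a fault inside that memcpy means src0->data or dst->data does not cover (k1 - k0)*nb0 valid bytes for this thread's slice; given -ctk/-ctv q4_0, this is presumably a copy into or out of the quantized KV cache.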