Eval bug: Segfault in ggml_compute_forward_dup_bytes #12354

Open
cpg314 opened this issue Mar 12, 2025 · 3 comments


cpg314 commented Mar 12, 2025

Name and Version

Current master, 80a02aa

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 4850 (ea002810)
built with cc (GCC) 14.2.1 20250128 for x86_64-pc-linux-gnu

Operating systems

Linux

GGML backends

CPU, CUDA

Hardware

NVIDIA RTX 3060 12 GB VRAM and an AMD Ryzen 9 7900

Models

Mistral Small 3, quantized to Q4_K_M

Problem description & steps to reproduce

llama-server segfaults in ggml_compute_forward_dup_same_cont (__memcpy_avx512_unaligned_erms) after a couple of concurrent inputs when --parallel 4 is passed. This does not happen when parallel processing is disabled (i.e. when the flag is removed).

First Bad Commit

No response

Relevant log output

$ gdb --args ./build/bin/llama-server -m /ollama/data/ollama/models/blobs/sha256-102a747c137683e81d431dab05d8f2158df4ab6f162f8f9019425a43d51e0e9f --port 8080 -ngl 30  --temp 0.15 -c 20000  -ctk q4_0 -ctv q4_0 -t 12 --batch-size 512 -fa --grammar-file grammar.gbnf --n-predict 100 --no-context-shift  --parallel 4
__memcpy_avx512_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:461
461             VMOVU   -VEC_SIZE(%rsi, %rdx), %VMM(5)
(gdb) bt
#0  __memcpy_avx512_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:461
#1  0x00007ffff78ef70e in ggml_compute_forward_dup_same_cont (params=0x7ffefffd77b0, dst=0x55555a175410)
    at /home/tmp/llamacpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:3120
#2  0x00007ffff78f40d8 in ggml_compute_forward_dup_bytes (params=0x7ffefffd77b0, dst=0x55555a175410)
    at /home/tmp/llamacpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:4067
#3  0x00007ffff78f501e in ggml_compute_forward_dup (params=0x7ffefffd77b0, dst=0x55555a175410)
    at /home/tmp/llamacpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:4260
#4  0x00007ffff790c3e8 in ggml_compute_forward_cpy (params=0x7ffefffd77b0, dst=0x55555a175410)
    at /home/tmp/llamacpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:9611
#5  0x00007ffff791f93a in ggml_compute_forward (params=0x7ffefffd77b0, tensor=0x55555a175410)
    at /home/tmp/llamacpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:14195
#6  0x00007ffff79215d8 in ggml_graph_compute_thread (data=0x55555d8445f0)
    at /home/tmp/llamacpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:15203
#7  0x00007ffff7921f0c in ggml_graph_compute._omp_fn.0(void) ()
    at /home/tmp/llamacpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:15478
#8  0x00007ffff7f0b637 in gomp_thread_start (xdata=<optimized out>) at /usr/src/debug/gcc/gcc/libgomp/team.c:129
#9  0x00007ffff40a370a in start_thread (arg=<optimized out>) at pthread_create.c:448
#10 0x00007ffff4127aac in __GI___clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

The faulting lines (ggml-cpu.c:3120, frame #1 in the backtrace):

memcpy(
    ((char *) dst->data + ie0*nb0),
    ((char *) src0->data + ie0*nb0),
    (ie1 - ie0) * nb0);
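
For context, here is a minimal paraphrase of the per-thread partitioning that feeds this memcpy (reconstructed from ggml-cpu.c around this revision; variable names and integer types are approximate, not verbatim):

// Paraphrased sketch of ggml_compute_forward_dup_same_cont: each thread
// copies its share of elements with a single memcpy, assuming src0 and dst
// are contiguous and have identical type and element count.
const struct ggml_tensor * src0 = dst->src[0];

const size_t nb0 = ggml_type_size(src0->type);      // bytes per element/block
const int    ith = params->ith;                     // current thread index
const int    nth = params->nth;                     // number of threads

const int64_t ne  = ggml_nelements(dst);            // total elements to copy
const int64_t dr  = (ne + nth - 1) / nth;           // elements per thread
const int64_t ie0 = dr * ith;                       // this thread's first element
const int64_t ie1 = ie0 + dr < ne ? ie0 + dr : ne;  // one past its last element

if (ie0 < ie1) {
    // if src0 is in fact smaller than dst (e.g. mismatched type or shape),
    // src0->data + ie1*nb0 can point past the end of src0's buffer
    memcpy((char *) dst->data  + ie0*nb0,
           (char *) src0->data + ie0*nb0,
           (ie1 - ie0) * nb0);
}

One plausible way this memcpy can fault is if the two tensors do not actually match in size: the slice is computed from dst's element count, so a smaller src0 would be read past its end, consistent with the unaligned-memcpy crash in the backtrace above.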


cpg314 commented Mar 12, 2025

It runs fine when skipping what I suppose is a fast path taken when source and destination are contiguous:

diff --git a/ggml/src/ggml-cpu/ggml-cpu.c b/ggml/src/ggml-cpu/ggml-cpu.c
index f2ab4c5d..72d3ba3c 100644
--- a/ggml/src/ggml-cpu/ggml-cpu.c
+++ b/ggml/src/ggml-cpu/ggml-cpu.c
@@ -4063,11 +4063,6 @@ static void ggml_compute_forward_dup_bytes(

     GGML_TENSOR_UNARY_OP_LOCALS;

-    if (ggml_is_contiguous(src0) && ggml_is_contiguous(dst)) {
-        ggml_compute_forward_dup_same_cont(params, dst);
-        return;
-    }
-
     const size_t type_size = ggml_type_size(src0->type);
     const int ith = params->ith; // thread index
     const int nth = params->nth; // number of threads
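
The patch above simply disables the fast path altogether. For illustration only (this is not necessarily what the referenced PR does), a stricter guard could keep the fast path while ruling out mismatched tensors:

// Hypothetical stricter guard (sketch): only take the single-memcpy fast
// path when both tensors are contiguous, share a type, and span the same
// number of bytes -- the assumptions ggml_compute_forward_dup_same_cont
// makes when it copies one thread-sized slice at a time.
if (ggml_is_contiguous(src0) && ggml_is_contiguous(dst) &&
    src0->type == dst->type &&
    ggml_nbytes(src0) == ggml_nbytes(dst)) {
    ggml_compute_forward_dup_same_cont(params, dst);
    return;
}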

@ggerganov
Member

Could you test this PR: #12310 and see if it fixes the issue?


cpg314 commented Mar 13, 2025

@ggerganov I confirm that the segfault does not happen with this PR, thanks :)
