Closed as duplicate of #12354
Description
Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 4850 (ea002810)
built with cc (GCC) 14.2.1 20250128 for x86_64-pc-linux-gnu
Operating systems
Linux
GGML backends
CUDA, CPU
Hardware
NVIDIA RTX 3060 (12 GB VRAM) and an AMD Ryzen 9 7900
Models
Mistral Small 24B 2501 with Q4_K_M quantization and q4_0 KV cache quantization
Problem description & steps to reproduce
- A long message is passed to llama-server (started with -c 2048), producing a long response.
- The cache reaches the size of the context.
- There is a context shift.
- There is a defrag.
- The server segfaults (see logs below).
The segfault does NOT happen:
- with -c 1024 or with -c 2056
- when I disable defragmentation with -dt 0
First Bad Commit
No response
Relevant log output
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 2048, n_past = 2047, n_cache_tokens = 2047, truncated = 1
srv update_slots: decoding batch, n_tokens = 1
slot process_toke: id 0 | task 0 | n_decoded = 363, n_remaining = -1, next token: 4440 ' document'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 366
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 367, front = 0
slot update_slots: id 0 | task 0 | slot context shift, n_keep = 1, n_left = 2046, n_discard = 1023
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 2048, n_past = 1025, n_cache_tokens = 1025, truncated = 1
srv update_slots: decoding batch, n_tokens = 1
llama_decode_impl: fragmentation: 0.37 - requesting defrag
slot process_toke: id 0 | task 0 | n_decoded = 364, n_remaining = -1, next token: 1395 ' is'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 367
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 368, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 2048, n_past = 1026, n_cache_tokens = 1026, truncated = 1
srv update_slots: decoding batch, n_tokens = 1
fish: Job 1, './build/bin/llama-server -m /ho…' terminated by signal SIGSEGV (Address boundary error)
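
The numbers in the "slot context shift" log line above can be reproduced with a few lines of arithmetic. This is only a minimal sketch, assuming the server discards half of the tokens that follow the first n_keep tokens; the function name context_shift is hypothetical and is not part of llama.cpp.

```python
def context_shift(n_past: int, n_keep: int) -> tuple[int, int, int]:
    """Return (n_left, n_discard, n_past_after) for one context shift.

    Assumption: half of the tokens after the kept prefix are discarded.
    """
    n_left = n_past - n_keep     # tokens eligible for discarding
    n_discard = n_left // 2      # assumed policy: drop half of them
    n_past_after = n_past - n_discard
    return n_left, n_discard, n_past_after


# Values taken from the log: n_past = 2047 just before the shift, n_keep = 1.
n_left, n_discard, n_past_after = context_shift(n_past=2047, n_keep=1)
print(n_left, n_discard, n_past_after)  # 2046 1023 1024
```

Under that assumption, decoding one more token after the shift gives n_past = 1025, which matches the next "slot decode token" line. The defrag request and the crash both occur within the two decode steps that follow the shift.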