Bug: Assertion '__n < this->size()' failed. #9636


Closed
Luke100000 opened this issue Sep 25, 2024 · 1 comment
Labels: bug-unconfirmed, high severity (used to report high severity bugs in llama.cpp where a malfunction hinders an important workflow), stale

What happened?

When using an embedding model via Ollama's API, llama.cpp crashes with an assertion error: Assertion '__n < this->size()' failed.

I tried nomic-embed-text-v1.5 and all-minilm.
It works fine when the model runs 100% on the CPU.

#7592 could be related
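
For reference, a minimal sketch of the kind of request that triggers it for me, assuming Ollama's /api/embed endpoint (the same endpoint that shows up in the log below); the host, port, and model name here are placeholders:

# Hypothetical client-side sketch, not part of the original setup.
# Assumes Ollama is listening on its default port and the model has been pulled.
import requests

resp = requests.post(
    "http://127.0.0.1:11434/api/embed",
    json={"model": "nomic-embed-text", "input": "hello world"},
    timeout=60,
)
# With GPU offload enabled this comes back as HTTP 500 once the runner aborts;
# on CPU the embeddings are returned as expected.
print(resp.status_code, resp.text[:200])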

Name and Version

ollama-cuda 0.3.6 from the AUR; I was not able to determine which llama.cpp version it uses (the log below reports build 3535, commit 1e6f6554a).

What operating system are you seeing the problem on?

Linux

Relevant log output

Sep 25 15:40:18 hostname ollama[268657]: time=2024-09-25T15:40:18.840+02:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/var/lib/ollama/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 gpu=GPU-e919e64e-b05e-1b0e-79fe-4d6f163c34c8 parallel=4 available=11899699200 required="1.0 GiB"
Sep 25 15:40:18 hostname ollama[268657]: time=2024-09-25T15:40:18.840+02:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=13 layers.offload=13 layers.split="" memory.available="[11.1 GiB]" memory.required.full="1.0 GiB" memory.required.partial="1.0 GiB" memory.required.kv="96.0 MiB" memory.required.allocations="[1.0 GiB]" memory.weights.total="312.1 MiB" memory.weights.repeating="267.4 MiB" memory.weights.nonrepeating="44.7 MiB" memory.graph.full="192.0 MiB" memory.graph.partial="192.0 MiB"
Sep 25 15:40:18 hostname ollama[268657]: time=2024-09-25T15:40:18.842+02:00 level=INFO source=server.go:393 msg="starting llama server" cmd="/tmp/ollama140604727/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 --ctx-size 32768 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --parallel 4 --port 35395"
Sep 25 15:40:18 hostname ollama[268657]: time=2024-09-25T15:40:18.842+02:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
Sep 25 15:40:18 hostname ollama[268657]: time=2024-09-25T15:40:18.842+02:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
Sep 25 15:40:18 hostname ollama[268657]: time=2024-09-25T15:40:18.842+02:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
Sep 25 15:40:18 hostname ollama[269097]: INFO [main] build info | build=3535 commit="1e6f6554a" tid="140699372605440" timestamp=1727271618
Sep 25 15:40:18 hostname ollama[269097]: INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140699372605440" timestamp=1727271618 total_threads=12
Sep 25 15:40:18 hostname ollama[269097]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="35395" tid="140699372605440" timestamp=1727271618
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: loaded meta data with 24 key-value pairs and 112 tensors from /var/lib/ollama/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 (version GGUF V3 (latest))
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv   1:                               general.name str              = nomic-embed-text-v1.5
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv   2:                     nomic-bert.block_count u32              = 12
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv   3:                  nomic-bert.context_length u32              = 2048
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv   4:                nomic-bert.embedding_length u32              = 768
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv   5:             nomic-bert.feed_forward_length u32              = 3072
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv   6:            nomic-bert.attention.head_count u32              = 12
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv   7:    nomic-bert.attention.layer_norm_epsilon f32              = 0.000000
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv   8:                          general.file_type u32              = 1
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv   9:                nomic-bert.attention.causal bool             = false
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv  10:                    nomic-bert.pooling_type u32              = 1
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv  11:                  nomic-bert.rope.freq_base f32              = 1000.000000
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv  12:            tokenizer.ggml.token_type_count u32              = 2
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 101
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 102
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = bert
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 100
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv  20:          tokenizer.ggml.seperator_token_id u32              = 102
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv  22:                tokenizer.ggml.cls_token_id u32              = 101
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv  23:               tokenizer.ggml.mask_token_id u32              = 103
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - type  f32:   51 tensors
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - type  f16:   61 tensors
Sep 25 15:40:18 hostname ollama[268657]: llm_load_vocab: special tokens cache size = 5
Sep 25 15:40:18 hostname ollama[268657]: llm_load_vocab: token to piece cache size = 0.2032 MB
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: arch             = nomic-bert
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: vocab type       = WPM
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_vocab          = 30522
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_merges         = 0
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: vocab_only       = 0
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_ctx_train      = 2048
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_embd           = 768
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_layer          = 12
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_head           = 12
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_head_kv        = 12
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_rot            = 64
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_swa            = 0
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_embd_head_k    = 64
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_embd_head_v    = 64
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_gqa            = 1
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_embd_k_gqa     = 768
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_embd_v_gqa     = 768
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: f_norm_eps       = 1.0e-12
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_ff             = 3072
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_expert         = 0
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_expert_used    = 0
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: causal attn      = 0
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: pooling type     = 1
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: rope type        = 2
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: rope scaling     = linear
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: freq_base_train  = 1000.0
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: freq_scale_train = 1
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_ctx_orig_yarn  = 2048
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: rope_finetuned   = unknown
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: ssm_d_conv       = 0
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: ssm_d_inner      = 0
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: ssm_d_state      = 0
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: model type       = 137M
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: model ftype      = F16
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: model params     = 136.73 M
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: model size       = 260.86 MiB (16.00 BPW)
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: general.name     = nomic-embed-text-v1.5
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: BOS token        = 101 '[CLS]'
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: EOS token        = 102 '[SEP]'
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: UNK token        = 100 '[UNK]'
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: SEP token        = 102 '[SEP]'
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: PAD token        = 0 '[PAD]'
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: CLS token        = 101 '[CLS]'
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: MASK token       = 103 '[MASK]'
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: LF token         = 0 '[PAD]'
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: max token length = 21
Sep 25 15:40:18 hostname ollama[268657]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 25 15:40:18 hostname ollama[268657]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 25 15:40:18 hostname ollama[268657]: ggml_cuda_init: found 1 CUDA devices:
Sep 25 15:40:18 hostname ollama[268657]:   Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
Sep 25 15:40:18 hostname ollama[268657]: llm_load_tensors: ggml ctx size =    0.10 MiB
Sep 25 15:40:19 hostname ollama[268657]: time=2024-09-25T15:40:19.093+02:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
Sep 25 15:40:19 hostname ollama[268657]: llm_load_tensors: offloading 12 repeating layers to GPU
Sep 25 15:40:19 hostname ollama[268657]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 25 15:40:19 hostname ollama[268657]: llm_load_tensors: offloaded 13/13 layers to GPU
Sep 25 15:40:19 hostname ollama[268657]: llm_load_tensors:        CPU buffer size =    44.72 MiB
Sep 25 15:40:19 hostname ollama[268657]: llm_load_tensors:      CUDA0 buffer size =   216.15 MiB
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model: n_ctx      = 32768
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model: n_batch    = 512
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model: n_ubatch   = 512
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model: flash_attn = 0
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model: freq_base  = 1000.0
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model: freq_scale = 1
Sep 25 15:40:19 hostname ollama[268657]: llama_kv_cache_init:      CUDA0 KV buffer size =  1152.00 MiB
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model: KV self size  = 1152.00 MiB, K (f16):  576.00 MiB, V (f16):  576.00 MiB
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model:      CUDA0 compute buffer size =    22.01 MiB
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model:  CUDA_Host compute buffer size =     2.51 MiB
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model: graph nodes  = 453
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model: graph splits = 2
Sep 25 15:40:19 hostname ollama[269097]: [1727271619] warming up the model with an empty run
Sep 25 15:40:19 hostname ollama[268657]: /usr/include/c++/14.2.1/bits/stl_vector.h:1130: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = long unsigned int; _Alloc = std::allocator<long unsigned int>; reference = long unsigned int&; size_type = long unsigned int]: Assertion '__n < this->size()' failed.
Sep 25 15:40:20 hostname ollama[268657]: time=2024-09-25T15:40:20.297+02:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server not responding"
Sep 25 15:40:21 hostname ollama[268657]: time=2024-09-25T15:40:21.187+02:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
Sep 25 15:40:21 hostname ollama[268657]: time=2024-09-25T15:40:21.438+02:00 level=ERROR source=sched.go:451 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped)"
Sep 25 15:40:21 hostname ollama[268657]: [GIN] 2024/09/25 - 15:40:21 | 500 |  2.676901285s |       127.0.0.1 | POST     "/api/embed"
Sep 25 15:40:26 hostname ollama[268657]: time=2024-09-25T15:40:26.512+02:00 level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=5.07381477 model=/var/lib/ollama/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
Sep 25 15:40:26 hostname ollama[268657]: time=2024-09-25T15:40:26.761+02:00 level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=5.323291756 model=/var/lib/ollama/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
Sep 25 15:40:27 hostname ollama[268657]: time=2024-09-25T15:40:27.012+02:00 level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=5.573759559 model=/var/lib/ollama/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
Luke100000 added the bug-unconfirmed and high severity labels on Sep 25, 2024
github-actions bot added the stale label on Oct 26, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
