
Eval bug: Gemma3 <unused32> spam #12433

Closed
mattjcly opened this issue Mar 17, 2025 · 6 comments · Fixed by #12615
Labels
bug Something isn't working

Comments

mattjcly (Contributor) commented Mar 17, 2025

Name and Version

> llama-gemma3-cli.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
version: 4902 (cf2270e4)
built with MSVC 19.29.30158.0 for

Operating systems

Windows

GGML backends

CUDA

Hardware

AMD Ryzen 7 5800X 8-Core
NVIDIA GeForce RTX 3090 Ti
NVIDIA GeForce RTX 4060 Ti

Models

gemma-3-4b-it-GGUF/gemma-3-4b-it-Q4_K_M.gguf + gemma-3-4b-it-GGUF/mmproj-model-f16.gguf

https://huggingface.co/ggml-org/gemma-3-4b-it-GGUF

Problem description & steps to reproduce

llama-gemma3-cli outputs <unused32> endlessly in certain situations.

Reproduction:

  1. Load the model as follows:
>  llama-gemma3-cli.exe -m gemma-3-4b-it-GGUF\gemma-3-4b-it-Q4_K_M.gguf --mmproj gemma-3-4b-it-GGUF\mmproj-model-f16.gguf -ngl 99 --temp 0.0 --seed 0 -c 4096
  2. Have the model first generate a story, to completion, with the prompt "Tell me a long story":
<snipped>
 Running in chat mode, available commands:
   /image <path>    load an image
   /clear           clear the chat history
   /quit or /exit   exit the program
> Tell me a long story
  3. Have the model evaluate the following image:

[image: dice.jpg]

<snipped>
*   Change the ending?
*   Tell a different story altogether?
> /image dice.jpg
  4. Ask the model what they are:
> /image dice.jpg
Encoding image dice.jpg
Image encoded in 589 ms
Image decoded in 295 ms
> What are these?
  5. Observe the <unused32> spam:
> What are these?
<unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32>...

First Bad Commit

No response

Relevant log output

C:\Users\User>  C:\Users\User\Downloads\llama-b4902-bin-win-cuda-cu11.7-x64\llama-gemma3-cli.exe -m C:\Users\User\.cache\lm-studio\models\ggml-org\gemma-3-4b-it-GGUF\gemma-3-4b-it-Q4_K_M.gguf --mmproj C:\Users\User\.cache\lm-studio\models\ggml-org\gemma-3-4b-it-GGUF\mmproj-model-f16.gguf -ngl 99 --temp 0.0 --seed 0 -c 4096
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
build: 4902 (cf2270e4) with MSVC 19.29.30158.0 for
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090 Ti) - 23267 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4060 Ti) - 15209 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 444 tensors from C:\Users\User\.cache\lm-studio\models\ggml-org\gemma-3-4b-it-GGUF\gemma-3-4b-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 3 4b It
llama_model_loader: - kv   3:                           general.finetune str              = it
llama_model_loader: - kv   4:                           general.basename str              = gemma-3
llama_model_loader: - kv   5:                         general.size_label str              = 4B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Gemma 3 4b Pt
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Google
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/google/gemma-3...
llama_model_loader: - kv  11:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv  12:                      gemma3.context_length u32              = 131072
llama_model_loader: - kv  13:                    gemma3.embedding_length u32              = 2560
llama_model_loader: - kv  14:                         gemma3.block_count u32              = 34
llama_model_loader: - kv  15:                 gemma3.feed_forward_length u32              = 10240
llama_model_loader: - kv  16:                gemma3.attention.head_count u32              = 8
llama_model_loader: - kv  17:    gemma3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  18:                gemma3.attention.key_length u32              = 256
llama_model_loader: - kv  19:              gemma3.attention.value_length u32              = 256
llama_model_loader: - kv  20:                      gemma3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:            gemma3.attention.sliding_window u32              = 1024
llama_model_loader: - kv  22:             gemma3.attention.head_count_kv u32              = 4
llama_model_loader: - kv  23:                   gemma3.rope.scaling.type str              = linear
llama_model_loader: - kv  24:                 gemma3.rope.scaling.factor f32              = 8.000000
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,262144]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  28:                      tokenizer.ggml.scores arr[f32,262144]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,262144]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  32:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  37:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  38:               general.quantization_version u32              = 2
llama_model_loader: - kv  39:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  205 tensors
llama_model_loader: - type q4_K:  204 tensors
llama_model_loader: - type q6_K:   35 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 2.31 GiB (5.12 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 6414
load: token to piece cache size = 1.9446 MB
print_info: arch             = gemma3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 2560
print_info: n_layer          = 34
print_info: n_head           = 8
print_info: n_head_kv        = 4
print_info: n_rot            = 256
print_info: n_swa            = 1024
print_info: n_swa_pattern    = 6
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 6.2e-02
print_info: n_ff             = 10240
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 4B
print_info: model params     = 3.88 B
print_info: general.name     = Gemma 3 4b It
print_info: vocab type       = SPM
print_info: n_vocab          = 262144
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 1 '<eos>'
print_info: EOT token        = 106 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 248 '<0x0A>'
print_info: EOG token        = 1 '<eos>'
print_info: EOG token        = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 34 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 35/35 layers to GPU
load_tensors:        CUDA0 model buffer size =  1185.55 MiB
load_tensors:        CUDA1 model buffer size =  1182.63 MiB
load_tensors:   CPU_Mapped model buffer size =   525.00 MiB
................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     1.00 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 34, can_shift = 1
init:      CUDA0 KV buffer size =   352.00 MiB
init:      CUDA1 KV buffer size =   192.00 MiB
llama_context: KV self size  =  544.00 MiB, K (f16):  272.00 MiB, V (f16):  272.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context:      CUDA0 compute buffer size =   166.01 MiB
llama_context:      CUDA1 compute buffer size =   601.02 MiB
llama_context:  CUDA_Host compute buffer size =    69.02 MiB
llama_context: graph nodes  = 1367
llama_context: graph splits = 3
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
clip_init: loaded meta data with 16 key-value pairs and 439 tensors from C:\Users\User\.cache\lm-studio\models\ggml-org\gemma-3-4b-it-GGUF\mmproj-model-f16.gguf
clip_init: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_init: - kv   0:                       general.architecture str              = clip
clip_init: - kv   1:                        clip.projector_type str              = gemma3
clip_init: - kv   2:                      clip.has_text_encoder bool             = false
clip_init: - kv   3:                    clip.has_vision_encoder bool             = true
clip_init: - kv   4:                   clip.has_llava_projector bool             = false
clip_init: - kv   5:                     clip.vision.image_size u32              = 896
clip_init: - kv   6:                     clip.vision.patch_size u32              = 14
clip_init: - kv   7:               clip.vision.embedding_length u32              = 1152
clip_init: - kv   8:            clip.vision.feed_forward_length u32              = 4304
clip_init: - kv   9:                 clip.vision.projection_dim u32              = 2560
clip_init: - kv  10:                    clip.vision.block_count u32              = 27
clip_init: - kv  11:           clip.vision.attention.head_count u32              = 16
clip_init: - kv  12:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_init: - kv  13:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_init: - kv  14:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_init: - kv  15:                              clip.use_gelu bool             = true
clip_init: - type  f32:  276 tensors
clip_init: - type  f16:  163 tensors
clip_ctx: CLIP using CUDA0 backend
key clip.use_silu not found in file
clip_init: params backend buffer size =  811.79 MB (439 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.feature_layer not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_init:      CUDA0 compute buffer size =  1128.81 MiB
clip_init:        CPU compute buffer size =     9.19 MiB
main: C:\Users\User\.cache\lm-studio\models\ggml-org\gemma-3-4b-it-GGUF\gemma-3-4b-it-Q4_K_M.gguf
 Running in chat mode, available commands:
   /image <path>    load an image
   /clear           clear the chat history
   /quit or /exit   exit the program
> Tell me a long story
Okay, settle in. This is a story about a lighthouse keeper named Silas Blackwood, a forgotten melody, and a storm that whispered secrets.
Silas Blackwood wasn’t a man of many words. He’d been the keeper of the North Point Lighthouse for thirty-seven years, a solitary existence on the jagged, windswept coast of the Isle of Skye in Scotland. The lighthouse, a towering granite sentinel named “The Serpent’s Tooth” by the locals, was his entire world. He’d inherited the post from his father, and his father before him, a lineage of men bound to the rhythm of the sea and the insistent blink of the light.
Silas wasn’t a romantic. He didn’t read poetry or dream of grand adventures. He simply maintained the light, polished the brass, checked the lenses, and listened. He listened to the wind, the waves, the cries of the gulls, and, most importantly, he listened to the silence.
The silence was broken, occasionally, by the arrival of supply ships – gruff men with weathered faces and a cargo of oil and provisions. They’d exchange a few terse words with Silas, a nod of acknowledgement, and then be gone, swallowed by the mist.  Silas rarely spoke to anyone. He’d found a strange comfort in the isolation, a quiet that allowed him to hear things others couldn’t.
One particularly bleak autumn evening, as a low, bruised purple sky threatened a storm, Silas discovered something unusual. He was meticulously cleaning the lens of the lamp when he heard it – a faint, almost ethereal melody. It wasn’t a song he recognized, not a traditional Scottish tune, not a sea shanty. It was…older.  It seemed to emanate from the very stone of the lighthouse, a delicate, mournful tune played on what sounded like a distant, forgotten flute.
He stopped his work, his hand frozen mid-polish. He listened intently, straining his ears against the rising wind. The melody persisted, weaving itself into the roar of the waves. It was beautiful, heartbreakingly so, and utterly perplexing.
Silas, a man of logic and routine, found himself captivated. He began to investigate, meticulously checking the structure of the lighthouse, searching for any sign of a hidden mechanism, a secret chamber. He found nothing. The melody continued, growing slightly louder as the storm gathered.
As the storm broke, unleashing a furious torrent of rain and wind, the melody intensified. It was accompanied by a strange, unsettling feeling – a sense of being watched, of something ancient and powerful stirring within the stone.  He realized, with a chilling certainty, that the melody wasn't just *in* the lighthouse; it was *of* the lighthouse.
He began to research the history of The Serpent’s Tooth. He poured over old maps, local legends, and the sparse records kept by his ancestors. He discovered that the lighthouse wasn't built on solid rock. It was constructed on a small, submerged island, a place known as “An Cailleach’s Tear” – the Witch’s Tear – named after a legendary Celtic sorceress who was said to have drowned on the island centuries ago.
The legends spoke of An Cailleach’s grief, a sorrow so profound that it had become trapped within the stone, manifesting as the melody.  They said she was searching for her lost love, a fisherman named Eamon, who had vanished during a particularly violent storm.
Silas, driven by a compulsion he couldn’t explain, began to spend his nights listening to the melody, trying to decipher its meaning. He realized that the tune wasn’t just a lament; it was a *search*. It was a desperate, echoing plea.
Then, one night, as the storm raged with particular ferocity, he noticed something new. The melody shifted, subtly altering its notes. It seemed to be responding to the wind, to the waves, to the very storm itself. And then, he heard a voice, faint and ghostly, interwoven with the music.
It was a man’s voice, weathered and weary, repeating a single name: “Eamon…”
Silas, armed with this knowledge, began to act. He started leaving small offerings at the base of the lighthouse – polished stones, wildflowers, a hand-carved wooden flute – hoping to appease the spirit, to guide her. He spent hours studying the tides, the currents, trying to understand the patterns of the storm.
Finally, after weeks of relentless listening and searching, he realized the key. The melody wasn’t just about finding Eamon; it was about *returning* him. The storm, he realized, wasn’t just a force of nature; it was a conduit, a pathway through time.
On the night of the next full moon, as the storm reached its peak, Silas climbed to the top of the lighthouse, the wind whipping around him, the rain stinging his face. He closed his eyes, focused on the melody, and began to hum, matching its rhythm, echoing its plea.
Suddenly, the light of the lighthouse pulsed with an unnatural intensity. The air shimmered, and a swirling vortex of mist formed around the base of the tower.  And then, he saw him.
Not a ghost, not a phantom, but a man, drenched and shivering, standing on the small island, his eyes filled with a lifetime of sorrow. It was Eamon, miraculously returned from the depths of time.
Eamon didn’t speak. He simply looked at Silas, a profound understanding passing between them. Then, he reached out, took Silas’s hand, and with a final, heartbreaking smile, he faded away, dissolving back into the mist.
The melody ceased. The storm began to subside.
Silas Blackwood remained the keeper of The Serpent’s Tooth, but he was no longer the same man. He had listened to the secrets of the sea, he had answered a forgotten plea, and he had, in a way, brought peace to a restless spirit.
He continued to polish the lens, check the lights, and listen to the silence. But now, he knew that the silence wasn't empty. It was filled with the echoes of a love lost and found, a testament to the enduring power of music, and a reminder that even the most solitary of lives could be touched by the magic of the past.
And sometimes, on the quietest nights, if you listen very carefully, you can still hear the faintest whisper of a flute, playing a mournful, beautiful tune, carried on the wind.
---
Would you like me to:
*   Expand on a particular part of the story?
*   Change the ending?
*   Tell a different story altogether?
> /image C:\Users\User\Documents\random-images\dice.jpg
Encoding image C:\Users\User\Documents\random-images\dice.jpg
Image encoded in 589 ms
Image decoded in 295 ms
> What are these?
<unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32><unused32>
FDoKE commented Mar 18, 2025

Same here: at some point in a dialogue with an image in the context, the model will only spit out <unusedNN> tokens.

ngxson (Collaborator) commented Mar 18, 2025

Question: is this bug still there if we don't use vision? (i.e. text-only)

I'm trying to narrow down the scope of this bug; it could fall into one of these categories:
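To help answer the text-only question, a minimal check could reuse the reporter's flags but drop --mmproj. This is only a sketch, assuming the same model file as above; -cnv puts llama-cli into chat mode:

llama-cli.exe -m gemma-3-4b-it-GGUF\gemma-3-4b-it-Q4_K_M.gguf -ngl 99 --temp 0.0 --seed 0 -c 4096 -cnv

If <unused32> still appears after a long chat in this setup, the bug is in the text path rather than in the image encoder.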

ngxson added the bug label and removed the bug-unconfirmed label on Mar 18, 2025
Mozer commented Mar 19, 2025

To add some data points:
I observed this bug only when using images.
I observed this or a similar bug in LM Studio and ollama.
I use CUDA. I also tried running CPU-only inference in LM Studio and got the same error.

I did not observe this bug in koboldcpp_cu12.exe when chatting with images, which is interesting given that it uses llama.cpp.

In the ollama logs I see the following error (though it may be a different bug); it just crashes after some chatting with images:

[GIN] 2025/03/14 - 20:24:02 | 200 | 6.0961414s | 127.0.0.1 | POST "/api/chat"
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\im2col.cu:72: GGML_ASSERT(src1->type == GGML_TYPE_F32) failed
[GIN] 2025/03/14 - 20:24:10 | 200 | 1.1510863s | 127.0.0.1 | POST "/api/chat"
time=2025-03-14T20:24:10.253+03:00 level=ERROR source=server.go:449 msg="llama runner terminated" error="exit status 0xc0000409"
time=2025-03-14T20:29:15.237+03:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0550955 model=sha256-377655e65351a68cddfbd69b7c8dc60c1890466254628c3e494661a52c2c5ada

Mozer commented Mar 19, 2025

Update: I tried the recent build https://github.com/ggml-org/llama.cpp/releases/tag/b4924. Using llama-gemma3-cli.exe I cannot reproduce this bug anymore; I tried several images in a row. Can anyone confirm?

Update 2: No, I still got the same bug after some time and 6 images:
> describe this image
<unused32><unused32><unused32>

Mozer commented Mar 21, 2025

If I offload 34 of 35 layers of gemma-3-4b to the GPU, then after a while and 4 images I get a similar error. Notice that the tokens are different this time: <unused23><unused22><mask><unused11><unused33><unused29><bos><unk><unused29><unused5><unused2><unused32><mask><unused1><unused23><unused2><unused17><unused1><bos>

llama-gemma3-cli.exe -c 8192 -ngl 34 --main-gpu 0 --split-mode none -fa -m C:\DATA\LLM\models\Google\gemma-3-4b-it-GGUF\gemma-3-4b-it-Q4_K_M.gguf --mmproj C:\DATA\LLM\models\Google\gemma-3-4b-it-GGUF\mmproj-model-f16.gguf

caith-h commented Mar 27, 2025

> Question: is this bug still there if we don't use vision? (i.e. text-only)

Yes. In long-context situations (50–100k tokens) the <unused32> token appears more and more often, regardless of precision. (It also happens at fp16, fully loaded to GPU, with the context set to 128k tokens.)

vLLM does not output the <unused32> token for the same prompts.

I am using gemma3-27b-it.
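A text-only long-context repro along these lines might look like the following. This is only a sketch: long_prompt.txt is a hypothetical file holding a 50k+ token prompt, and the model path assumes a local Q4_K_M quant of the 27B weights:

llama-cli.exe -m gemma-3-27b-it-Q4_K_M.gguf -ngl 99 -c 131072 -f long_prompt.txt

-f reads the prompt from a file and -c 131072 requests the full trained context; per the observation above, <unused32> should start creeping into the output as generation runs long.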
