Need help related to T5EncoderModel #10735
-
So I was getting this warning during inference and figured it had to do with CLIP, so I should be using the T5 text encoder.
Even after switching to T5 I still get the same warning, which suggests T5 is not being used. Any guidance on what is wrong in my code?
If I set text_encoder=None in the pipe, it fails with a NoneType error and doesn't work at all.
-
Hi, that's not an error, it's a warning telling you that the prompt for the CLIP model will be truncated, since that model only supports up to 77 tokens. This applies only to the CLIP model, not to T5; the T5 will use the full prompt. You can search for a workaround, but it's a really common and well-known issue that we've discussed multiple times in different issues and discussions. Since I've answered this multiple times, the short answer is: if you want to use more tokens, you will need some kind of strategy to pass the tokens to the model. I recommend a library called sd_embed for this.
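To make the behavior above concrete, here is a toy simulation (not the real diffusers API; the tokenizer and the limits are stand-ins) of how a dual-encoder pipeline feeds the same prompt to both text encoders. CLIP truncates at its fixed 77-token context and triggers the warning, while T5 keeps the full prompt:

```python
# Toy simulation of a dual-encoder pipeline: the SAME prompt goes to both
# text encoders, each with its own token limit. Names and limits here are
# illustrative, not the real diffusers API.

CLIP_MAX_TOKENS = 77   # CLIP's fixed context length
T5_MAX_TOKENS = 512    # a typical max length used for T5 in these pipelines

def tokenize(prompt: str, max_tokens: int):
    """Whitespace 'tokenizer' standing in for the real one.
    Returns (tokens, truncated); the pipeline's warning corresponds to
    truncated being True for the CLIP tokenizer."""
    tokens = prompt.split()
    truncated = len(tokens) > max_tokens
    return tokens[:max_tokens], truncated

long_prompt = " ".join(f"word{i}" for i in range(100))  # 100 "tokens"

clip_tokens, clip_truncated = tokenize(long_prompt, CLIP_MAX_TOKENS)
t5_tokens, t5_truncated = tokenize(long_prompt, T5_MAX_TOKENS)

# CLIP drops everything past token 77 and warns; T5 sees the full prompt.
print(len(clip_tokens), clip_truncated)  # 77 True
print(len(t5_tokens), t5_truncated)      # 100 False
```

So the warning keeps appearing even when T5 is loaded correctly: it is about the CLIP branch, which runs regardless.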
-
My understanding is that if I use T5 (text_encoder_2), this warning should not appear, and that's what I added in my code, so in this case T5 should be used. Is my understanding wrong? How do I force T5 and ignore CLIP? I read your response again, so this is a known issue that needs to be fixed in diffusers.
text_encoder_2 = T5EncoderModel.from_pretrained(
I am looking at https://github.com/xhinker/sd_embed, thanks for the repo link.
-
The solution is already posted here. But the question remains: using T5 should remove the 77-token limit, but it doesn't. Why?
See the answer above.
Also, even though that's a solution, compel has very basic support for long prompts and weightings; it's really not easy to use the prompts shared by other users in other UIs (you need to convert them), and it hasn't been updated in a long time. I don't even know if it supports Flux or the newer models that still use the CLIP models.
That's a really old solution and not the best one; maybe we should update it so people use sd_embed instead. What do you think @stevhliu?