Description
Describe the bug
The HunyuanVideoImageToVideoPipeline pipeline fails with the latest combination of the diffusers and transformers libraries.
First, a minor issue with offloading: this snippet updates pipeline_hunyuan_video_image2video.py to create the appended tensors on an explicit device, so the two torch.cat operations do not fail.
if last_double_return_token_indices.shape[0] == 3:
    # in case the prompt is too long
    last_double_return_token_indices = torch.cat(
        (last_double_return_token_indices, torch.tensor([text_input_ids.shape[-1]], device=last_double_return_token_indices.device))
    )
    batch_indices = torch.cat((batch_indices, torch.tensor([0], device=batch_indices.device)))
The bigger issue is that transformers changed how image embeds work in LlavaForConditionalGeneration, so the function _get_llama_prompt_embeds in HunyuanVideoImageToVideoPipeline needs an update (the last version of transformers that works is transformers==4.47.1).
Specifically, it returns prompt_embeds and prompt_attention_mask that don't have the same sequence length because of the way cropping is implemented, so they cannot later be combined in HunyuanVideoTokenRefiner (see the error in the logs below).
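To make the length mismatch easier to spot before it reaches HunyuanVideoTokenRefiner, here is a minimal sketch of a shape check. The helper name check_prompt_lengths is hypothetical and not part of diffusers. It assumes both inputs expose a .shape whose second dimension is the sequence length. The sizes 177 and 429 are taken from the error in the logs, and the stand-in objects replace real tensors only to keep the sketch dependency-free.

```python
from types import SimpleNamespace


def check_prompt_lengths(prompt_embeds, prompt_attention_mask):
    """Hypothetical helper: fail early if embeds and mask disagree on length.

    Assumes prompt_embeds has shape (batch, seq_len, dim) and
    prompt_attention_mask has shape (batch, seq_len). Works with any
    object that has a .shape attribute, including torch tensors.
    """
    embeds_len = prompt_embeds.shape[1]
    mask_len = prompt_attention_mask.shape[1]
    if embeds_len != mask_len:
        raise ValueError(
            f"prompt_embeds has sequence length {embeds_len} but "
            f"prompt_attention_mask has sequence length {mask_len}"
        )
    return embeds_len


if __name__ == "__main__":
    # Stand-ins for tensors; the hidden size 8 is arbitrary.
    embeds = SimpleNamespace(shape=(1, 177, 8))
    mask = SimpleNamespace(shape=(1, 429))
    try:
        check_prompt_lengths(embeds, mask)
    except ValueError as err:
        print(f"mismatch detected: {err}")
```

A check like this, called on the return values of _get_llama_prompt_embeds, would report the problem where it originates instead of as a broadcasting error inside the transformer. It only detects the mismatch and does not fix the cropping itself.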
Reproduction
see #10983 for simple example
Logs
│ /home/vlado/dev/sdnext/venv/lib/python3.12/site-packages/diffusers/models/transformers/transformer_hunyuan_video.py:312 in forward │
│ │
│ 311 │ │ │ mask_float = attention_mask.float().unsqueeze(-1) │
│ ❱ 312 │ │ │ pooled_projections = (hidden_states * mask_float).sum(dim=1) / mask_float.sum(dim=1) │
│ 313 │ │ │ pooled_projections = pooled_projections.to(original_dtype) │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: The size of tensor a (177) must match the size of tensor b (429) at non-singleton dimension 1
System Info
diffusers==main
transformers==4.49.0
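Until the pipeline is updated, a caller could guard against the affected versions. The following is a rough sketch that only encodes what this report states, namely that transformers==4.47.1 is the last version known to work. The names LAST_KNOWN_GOOD, parse_release, is_known_good and installed_transformers_is_known_good are hypothetical, and the parser handles only simple dotted release numbers.

```python
import re
from importlib import metadata

# Assumption taken from this report: 4.47.1 is the last working release.
LAST_KNOWN_GOOD = (4, 47, 1)


def parse_release(version):
    """Extract (major, minor, patch) from a version string like '4.49.0'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def is_known_good(version):
    """True if the version is at or below the last known working release."""
    return parse_release(version) <= LAST_KNOWN_GOOD


def installed_transformers_is_known_good():
    """Check the installed transformers; None if it is not installed."""
    try:
        return is_known_good(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_transformers_is_known_good())
```

This is a temporary workaround rather than a fix: it says nothing about why newer versions fail and should be dropped once _get_llama_prompt_embeds is updated.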
Who can help?