
HunyuanVideoImageToVideoPipeline failures #11118

@vladmandic

Description


Describe the bug

The pipeline HunyuanVideoImageToVideoPipeline fails with the latest combination of the diffusers and transformers libraries.

First, a minor issue with offloading: the snippet below updates pipeline_hunyuan_video_image2video.py to create the new tensors with an explicit device so the two torch.cat operations do not fail.

            if last_double_return_token_indices.shape[0] == 3:
                # in case the prompt is too long
                last_double_return_token_indices = torch.cat(
                    (
                        last_double_return_token_indices,
                        torch.tensor([text_input_ids.shape[-1]], device=last_double_return_token_indices.device),
                    )
                )
                batch_indices = torch.cat(
                    (batch_indices, torch.tensor([0], device=batch_indices.device))
                )

The bigger issue is that transformers changed how image embeddings are handled in LlavaForConditionalGeneration,
so the function _get_llama_prompt_embeds in HunyuanVideoImageToVideoPipeline needs an update
(the last version of transformers that works is transformers==4.47.1).

Specifically, it returns prompt_embeds and prompt_attention_mask that no longer have the same sequence length, due to the way cropping is implemented, so they cannot later be combined in HunyuanVideoTokenRefiner (see the traceback under Logs).
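For illustration only, a minimal sketch of the invariant that breaks; align_mask_to_embeds is a hypothetical helper showing one possible workaround (crop or pad the mask to the embedding length), not the actual fix:

    import torch

    def align_mask_to_embeds(prompt_embeds: torch.Tensor, prompt_attention_mask: torch.Tensor):
        # hypothetical helper: prompt_embeds is (batch, seq_a, hidden),
        # prompt_attention_mask is (batch, seq_b); dim 1 must match downstream
        seq_len = prompt_embeds.shape[1]
        if prompt_attention_mask.shape[1] > seq_len:
            # crop the mask to the embedding length
            prompt_attention_mask = prompt_attention_mask[:, :seq_len]
        elif prompt_attention_mask.shape[1] < seq_len:
            # pad the mask with ones so the extra positions are attended to
            pad = torch.ones(
                prompt_attention_mask.shape[0],
                seq_len - prompt_attention_mask.shape[1],
                dtype=prompt_attention_mask.dtype,
                device=prompt_attention_mask.device,
            )
            prompt_attention_mask = torch.cat((prompt_attention_mask, pad), dim=1)
        return prompt_embeds, prompt_attention_mask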

Reproduction

See #10983 for a simple example.
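For convenience, a repro along those lines; the model id and call arguments here are assumptions (the community HunyuanVideo-I2V conversion), not copied from #10983:

    import torch
    from diffusers import HunyuanVideoImageToVideoPipeline
    from diffusers.utils import load_image

    pipe = HunyuanVideoImageToVideoPipeline.from_pretrained(
        "hunyuanvideo-community/HunyuanVideo-I2V", torch_dtype=torch.bfloat16
    )
    pipe.enable_model_cpu_offload()  # offloading path that hits the torch.cat device mismatch

    image = load_image("input.png")  # any start frame
    video = pipe(image=image, prompt="a cat walks on the grass").frames[0]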

Logs

    /home/vlado/dev/sdnext/venv/lib/python3.12/site-packages/diffusers/models/transformers/transformer_hunyuan_video.py:312 in forward

      311         mask_float = attention_mask.float().unsqueeze(-1)
    ❱ 312         pooled_projections = (hidden_states * mask_float).sum(dim=1) / mask_float.sum(dim=1)
      313         pooled_projections = pooled_projections.to(original_dtype)

    RuntimeError: The size of tensor a (177) must match the size of tensor b (429) at non-singleton dimension 1
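
The failing line is a masked mean pool over the sequence dimension, so the mask and the hidden states must agree on dim 1. A shape-only reproduction of the error (the hidden size 4096 is an assumption; only the sequence lengths matter):

    import torch

    hidden_states = torch.randn(1, 177, 4096)             # prompt_embeds length
    mask_float = torch.ones(1, 429).float().unsqueeze(-1)  # prompt_attention_mask length

    (hidden_states * mask_float).sum(dim=1) / mask_float.sum(dim=1)
    # RuntimeError: The size of tensor a (177) must match the size of tensor b (429)
    # at non-singleton dimension 1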

System Info

diffusers==main
transformers==4.49.0

Who can help?

@DN6 @a-r-r-o-w

Metadata

Labels

bug (Something isn't working)

Status

Done