
HunyuanVideoImageToVideoPipeline failures #11118

@vladmandic

Description


Describe the bug

The pipeline HunyuanVideoImageToVideoPipeline fails with the latest combination of the diffusers and transformers libraries.

First, a minor issue with offloading: the snippet below updates pipeline_hunyuan_video_image2video.py to create the new tensors with an explicit device so the two torch.cat operations do not fail.

            if last_double_return_token_indices.shape[0] == 3:
                # in case the prompt is too long
                last_double_return_token_indices = torch.cat(
                    (
                        last_double_return_token_indices,
                        torch.tensor([text_input_ids.shape[-1]], device=last_double_return_token_indices.device),
                    )
                )
                batch_indices = torch.cat(
                    (batch_indices, torch.tensor([0], device=batch_indices.device))
                )

The bigger issue is that transformers changed how image embeddings are handled in LlavaForConditionalGeneration,
so the function _get_llama_prompt_embeds in HunyuanVideoImageToVideoPipeline needs an update
(the last version of transformers that works is transformers==4.47.1).

Specifically, it returns prompt_embeds and prompt_attention_mask that no longer have the same sequence length, due to the way cropping is implemented, so they cannot later be combined in HunyuanVideoTokenRefiner (see the traceback under Logs).
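For illustration only, a minimal sketch of the invariant that breaks; align_mask_to_embeds is a hypothetical helper showing one possible workaround (crop or pad the mask to the embedding length), not the actual fix:

    import torch

    def align_mask_to_embeds(prompt_embeds: torch.Tensor, prompt_attention_mask: torch.Tensor):
        # hypothetical helper: prompt_embeds is (batch, seq_a, hidden),
        # prompt_attention_mask is (batch, seq_b); dim 1 must match downstream
        seq_len = prompt_embeds.shape[1]
        if prompt_attention_mask.shape[1] > seq_len:
            # crop the mask to the embedding length
            prompt_attention_mask = prompt_attention_mask[:, :seq_len]
        elif prompt_attention_mask.shape[1] < seq_len:
            # pad the mask with ones so the extra positions are attended to
            pad = torch.ones(
                prompt_attention_mask.shape[0],
                seq_len - prompt_attention_mask.shape[1],
                dtype=prompt_attention_mask.dtype,
                device=prompt_attention_mask.device,
            )
            prompt_attention_mask = torch.cat((prompt_attention_mask, pad), dim=1)
        return prompt_embeds, prompt_attention_mask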

Reproduction

See #10983 for a simple example.
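For convenience, a repro along those lines; the model id and call arguments here are assumptions (the community HunyuanVideo-I2V conversion), not copied from #10983:

    import torch
    from diffusers import HunyuanVideoImageToVideoPipeline
    from diffusers.utils import load_image

    pipe = HunyuanVideoImageToVideoPipeline.from_pretrained(
        "hunyuanvideo-community/HunyuanVideo-I2V", torch_dtype=torch.bfloat16
    )
    pipe.enable_model_cpu_offload()  # offloading path that hits the torch.cat device mismatch

    image = load_image("input.png")  # any start frame
    video = pipe(image=image, prompt="a cat walks on the grass").frames[0]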

Logs

    /home/vlado/dev/sdnext/venv/lib/python3.12/site-packages/diffusers/models/transformers/transformer_hunyuan_video.py:312 in forward

      311         mask_float = attention_mask.float().unsqueeze(-1)
    ❱ 312         pooled_projections = (hidden_states * mask_float).sum(dim=1) / mask_float.sum(dim=1)
      313         pooled_projections = pooled_projections.to(original_dtype)

    RuntimeError: The size of tensor a (177) must match the size of tensor b (429) at non-singleton dimension 1
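
The failing line is a masked mean pool over the sequence dimension, so the mask and the hidden states must agree on dim 1. A shape-only reproduction of the error (the hidden size 4096 is an assumption; only the sequence lengths matter):

    import torch

    hidden_states = torch.randn(1, 177, 4096)             # prompt_embeds length
    mask_float = torch.ones(1, 429).float().unsqueeze(-1)  # prompt_attention_mask length

    (hidden_states * mask_float).sum(dim=1) / mask_float.sum(dim=1)
    # RuntimeError: The size of tensor a (177) must match the size of tensor b (429)
    # at non-singleton dimension 1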

System Info

diffusers==main
transformers==4.49.0

Who can help?

@DN6 @a-r-r-o-w

Metadata

Labels

bug (Something isn't working)

Status

Done