Standardization of additional token identifiers across pipelines #11334

Open
@sayakpaul

Description

FluxPipeline has utilities that give us img_ids and txt_ids:

```python
def _prepare_latent_image_ids(batch_size, height, width, device, dtype):
    ...
```

```python
text_ids = torch.zeros(prompt_embeds.shape[1], 3).to(device=device, dtype=dtype)
```
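
For reference, here is a minimal sketch of what such a helper computes: a `(height * width, 3)` tensor whose last two channels hold each latent position's row/column index. This is an approximation for illustration, not the exact diffusers implementation:

```python
import torch

def _prepare_latent_image_ids(batch_size, height, width, device, dtype):
    # Each latent position gets a 3-vector id: (0, row_index, col_index).
    latent_image_ids = torch.zeros(height, width, 3)
    latent_image_ids[..., 1] += torch.arange(height)[:, None]
    latent_image_ids[..., 2] += torch.arange(width)[None, :]
    # Flatten the spatial grid to (height * width, 3) before handing it to the transformer.
    return latent_image_ids.reshape(height * width, 3).to(device=device, dtype=dtype)
```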

As such, these ids are created in the pipeline and not inside the transformer class.

In HiDream, by contrast, we have something different.

text_ids are created inside the transformer class.

img_ids are overwritten:
https://github.com/huggingface/diffusers/blob/ce1063acfa0cbc2168a7e9dddd4282ab8013b810/src/diffusers/models/transformers/transformer_hidream_image.py#L771C13-L771C20 (probably intentional because it's conditioned)

Then the entire computation below happens inside the pipeline's `__call__()`:

```python
if latents.shape[-2] != latents.shape[-1]:
    B, C, H, W = latents.shape
    pH, pW = H // self.transformer.config.patch_size, W // self.transformer.config.patch_size
    img_sizes = torch.tensor([pH, pW], dtype=torch.int64).reshape(-1)
    img_ids = torch.zeros(pH, pW, 3)
    img_ids[..., 1] = img_ids[..., 1] + torch.arange(pH)[:, None]
    img_ids[..., 2] = img_ids[..., 2] + torch.arange(pW)[None, :]
    img_ids = img_ids.reshape(pH * pW, -1)
    img_ids_pad = torch.zeros(self.transformer.max_seq, 3)
    img_ids_pad[: pH * pW, :] = img_ids
    img_sizes = img_sizes.unsqueeze(0).to(latents.device)
    img_ids = img_ids_pad.unsqueeze(0).to(latents.device)
    if self.do_classifier_free_guidance:
        img_sizes = img_sizes.repeat(2 * B, 1)
        img_ids = img_ids.repeat(2 * B, 1, 1)
else:
    img_sizes = img_ids = None
```

Maybe this could be factored into a helper method, similar to how FluxPipeline does it?
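
For illustration, such a helper could look roughly like the sketch below. The `_prepare_img_sizes_and_ids` name, signature, and free-function form are hypothetical; in practice it would presumably be a static method on the HiDream pipeline, analogous to `FluxPipeline._prepare_latent_image_ids`:

```python
import torch

def _prepare_img_sizes_and_ids(latents, patch_size, max_seq, do_classifier_free_guidance):
    """Hypothetical helper that factors the img_sizes/img_ids computation out of __call__."""
    # HiDream only builds explicit sizes/ids for non-square latents.
    if latents.shape[-2] == latents.shape[-1]:
        return None, None

    B, C, H, W = latents.shape
    pH, pW = H // patch_size, W // patch_size

    img_sizes = torch.tensor([pH, pW], dtype=torch.int64).reshape(-1)

    # Same (0, row, col) position ids as in Flux, but padded out to max_seq.
    img_ids = torch.zeros(pH, pW, 3)
    img_ids[..., 1] += torch.arange(pH)[:, None]
    img_ids[..., 2] += torch.arange(pW)[None, :]
    img_ids = img_ids.reshape(pH * pW, -1)

    img_ids_pad = torch.zeros(max_seq, 3)
    img_ids_pad[: pH * pW, :] = img_ids

    img_sizes = img_sizes.unsqueeze(0).to(latents.device)
    img_ids = img_ids_pad.unsqueeze(0).to(latents.device)

    if do_classifier_free_guidance:
        img_sizes = img_sizes.repeat(2 * B, 1)
        img_ids = img_ids.repeat(2 * B, 1, 1)

    return img_sizes, img_ids
```

`__call__()` would then reduce to something like `img_sizes, img_ids = self._prepare_img_sizes_and_ids(latents, self.transformer.config.patch_size, self.transformer.max_seq, self.do_classifier_free_guidance)`, which is closer to how FluxPipeline prepares its latent image ids.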

In general, the way these ids are prepared could be standardized a bit across pipelines.

Cc: @yiyixuxu @a-r-r-o-w
